
Author Topic: Reading a tab-delimited text file into a two-dimensional array  (Read 46793 times)

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #15 on: April 29, 2014, 06:42:45 AM »
http://msdn.microsoft.com/en-us/library/kt0etdcs.aspx
Quote
Remarks
The fread function reads up to count items of size bytes from the input stream and stores them in buffer. The file pointer associated with stream (if there is one) is increased by the number of bytes actually read. If the given stream is opened in text mode, carriage return–linefeed pairs are replaced with single linefeed characters. The replacement has no effect on the file pointer or the return value.
May the source be with you

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #16 on: April 29, 2014, 07:17:23 AM »
Yes, that's perfectly correct. It's a feature, not a bug. If they had been in a good mood, or if they had been forced to by a judge, they would have added: "the useless bytes at the end of the buffer are not zeroed out, and nobody will tell you where the new content ends".

Never mind, "rb" works correctly.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #17 on: April 29, 2014, 08:33:25 AM »
Use the return value of fread and put the terminating zero there.
Code: [Select]
...
size_t len = fread(buf, sizeof(char), sizeof(buf) - 1, fp); // leave one byte for the terminator
buf[len] = 0;
...
May the source be with you

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #18 on: April 29, 2014, 09:34:38 AM »
Quote
Use the return value of fread and put the terminating zero there.
Code: [Select]
...
size_t len = fread(buf, sizeof(char), sizeof(buf) - 1, fp); // leave one byte for the terminator
buf[len] = 0;
...

(MSDN): "The replacement has no effect on .. the return value."

Plain wrong, because the return value changes to "bytes stored" instead of "bytes read". Which is why your solution works. But in any case, using "rb" is a much better solution - replacing CrLf with LF costs lots of cycles, especially for larger files where the cache becomes important.
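
A minimal sketch of the "rb" approach combined with the return-value trick discussed above. The helper name read_whole_file and its out_len parameter are hypothetical, not something posted in this thread; the point is only that in binary mode the count returned by fread is also the number of bytes stored, so it can be used directly as the terminator position.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the thread): load a whole file in binary
   mode and zero-terminate it at the position fread() reports. */
char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");            /* "rb": no CrLf translation */
    if (fp == NULL)
        return NULL;

    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);                   /* file length in bytes */
    if (size < 0 || fseek(fp, 0, SEEK_SET) != 0) { fclose(fp); return NULL; }

    char *buf = malloc((size_t)size + 1);    /* +1 for the terminator */
    if (buf == NULL) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);  /* bytes actually stored */
    fclose(fp);

    buf[got] = '\0';                         /* terminate where the data ends */
    if (out_len != NULL)
        *out_len = got;
    return buf;                              /* caller releases it with free() */
}
```

Because the terminator goes to buf[got] rather than buf[size], the same code would also stay correct if the file were opened in text mode, where fewer bytes are stored than the file size suggests.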

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #19 on: April 29, 2014, 09:44:43 AM »
If you want to be portable (DOS, Unix, Mac line breaks), the text (not binary) read is easier to handle, in my opinion.
Btw., the _fread in Pelles C behaves like the MS variant: you have to use the return value to find the end of the data in the buffer.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #20 on: April 29, 2014, 12:07:29 PM »
Quote
If you want to be portable (DOS, Unix, Mac line breaks), the text (not binary) read is easier to handle, in my opinion.
Not a big problem, actually, and performance-wise the binary read is better.

Quote
Btw., the _fread in Pelles C behaves like the MS variant: you have to use the return value to find the end of the data in the buffer.
In PellesC, the buffer end is zeroed, which is OK.
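
To illustrate why the different line breaks are "not a big problem" after a binary read, here is a small sketch that rewrites CrLf (DOS) and lone Cr (old Mac) as Lf in place. The function name normalize_newlines is made up for this example, and it assumes the buffer is already zero-terminated.

```c
#include <stddef.h>

/* Hypothetical helper: rewrite CrLf and lone Cr as Lf, in place.
   Returns the new length; the buffer stays zero-terminated. */
size_t normalize_newlines(char *buf)
{
    char *src = buf;   /* read position */
    char *dst = buf;   /* write position, never ahead of src */

    while (*src != '\0') {
        if (*src == '\r') {
            *dst++ = '\n';        /* Cr becomes Lf */
            src++;
            if (*src == '\n')     /* CrLf: skip the Lf half */
                src++;
        } else {
            *dst++ = *src++;      /* copy everything else unchanged */
        }
    }
    *dst = '\0';
    return (size_t)(dst - buf);
}
```

This is a single pass over the buffer, so the cost is of the same order as the translation the text-mode read performs; the difference is that the caller decides whether it is needed at all.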

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #21 on: April 30, 2014, 01:12:29 AM »
Here is my "final" version. I will try to extend it with your algos in order to get timings. So far only my library algo is in, for comparison.

I've put a testfile here - a "real" statistical database from the UN in tab-delimited format with CrLf line ends and some typical problems, such as isolated linefeeds. It is about 10 MB in size and loads in about half a second with the C algo.
« Last Edit: April 30, 2014, 01:14:11 AM by jj2007 »

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #22 on: April 30, 2014, 09:55:29 AM »
Quote
Here's something I have used for that type of thing:

Code: [Select]
char** Split(char *Input, char *Delim, char ***List, int *TokenCount)

Thanks, DMac. Is there a simple solution for this error?

    while ((Position = strstr(Remain, Delim)) != NULL)   //error #2168: Operands of '=' have incompatible types 'char *' and 'int'

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #23 on: April 30, 2014, 11:11:32 AM »
Quote
    while ((Position = strstr(Remain, Delim)) != NULL)   //error #2168: Operands of '=' have incompatible types 'char *' and 'int'
I don't have this error!

Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #24 on: April 30, 2014, 02:35:10 PM »
Quote
I don't have this error!

Strange. I am still on XP with version 7, could that be the reason?

In the meantime, I've prepared the source to incorporate other algos. Example:

  MbTimer();  // start the timer

  ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);  // load the text file into an array

  secs=MbTimer()/1e6;
  printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);


As it stands, it also compiles fine in VS Express 2010. While Pelles C is a better product, I find it important to be able to tell another coder "if you don't have Pelles C, just compile it in VS". People stay away from exotic compilers (and we must admit that Pelles C is a bit exotic...) because they believe they are wasting time on incompatible tools.
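
MbTimer() in the snippet above is not a standard C function, so readers who want to add their own algos to the comparison need a substitute. A rough sketch using only clock() from <time.h> follows; seconds_since is a hypothetical name, and clock() measures processor time on most platforms, so the figures are not directly comparable with a high-resolution timer.

```c
#include <time.h>

/* Hypothetical stand-in for the timer used above: seconds elapsed
   since `start`, where `start` is a value returned by clock(). */
double seconds_since(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Usage pattern, mirroring the snippet above:
 *
 *   clock_t start = clock();                  // start the timer
 *   arr = LoadSomething(fname, &rows, &cols); // the loader being measured
 *   printf("Loading %i rows took %.4f seconds\n", rows, seconds_since(start));
 */
```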

Offline DMac

  • Member
  • *
  • Posts: 272
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #25 on: April 30, 2014, 07:05:35 PM »
Quote
Thanks, DMac. Is there a simple solution for this error?

    while ((Position = strstr(Remain, Delim)) != NULL)   //error #2168: Operands of '=' have incompatible types 'char *' and 'int'

Did you modify the code of the function, adapting it to your application?  If so, make sure that the variable Position is declared as char *. The return type of strstr() is char *, and that mismatch seems to be what the compiler is complaining about.

I just tested my example on Win64 and Win32 and was not able to reproduce this error unless I changed the declaration of the var Position.
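
One more cause worth ruling out, offered as an assumption since the failing build was never reproduced here: if <string.h> is not included, an older C compiler may fall back to an implicit declaration in which strstr() returns int, and that gives exactly this kind of 'char *' versus 'int' mismatch. The sketch below shows the first pass of the loop with the header and the pointer type in place; count_delims is a hypothetical name used only for this illustration.

```c
#include <string.h>   /* declares char *strstr(const char *, const char *) */

/* Hypothetical helper: count how often delim occurs in input,
   the same way the first pass of Split() does. */
int count_delims(const char *input, const char *delim)
{
    size_t delim_len = strlen(delim);
    int found = 0;
    const char *position;          /* pointer type, matching strstr() */

    if (delim_len == 0)
        return 0;                  /* an empty delimiter would never advance */

    while ((position = strstr(input, delim)) != NULL) {
        found++;
        input = position + delim_len;   /* continue after the delimiter */
    }
    return found;
}
```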

Quote
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

I'm not quite sure what you mean by this statement.  It splits the whole buffer in two passes, producing the array of strings.  However, whatever you did with the resulting strings could be considered line by line, I suppose.
No one cares how much you know,
until they know how much you care.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #26 on: May 01, 2014, 10:04:25 AM »
Quote
Did you modify the code of the function, adapting it to your application?

I used it without modifications, but now (on my machine at home) I can't reproduce the error. Anyway, here is a complete working example:
Code: [Select]
#include <stdio.h>
#include <stdlib.h>   // calloc, malloc, free
#include <string.h>   // strstr, strlen, strncpy
#include <windows.h>
#include <conio.h> // for _getch()

#pragma warn(disable:2216)    // retval never used
#pragma warn(disable:2007)    // assembly not portable
#pragma warn(disable:2118)    // para not referenced
// #pragma warn(disable:2215)    // conversion ... loss of data

char** Split(char *Input, char *Delim, char ***List, int *TokenCount);
void FreeSplitList(char **List);

int main(int argc, char* argv[]) {
  char **outArray = NULL;
  int count;
  long len;
  char *buffer;
  FILE *fp;
  fp = fopen("C:\\TEMP\\Database.tab", "rb");
  if (!fp) {
    printf("\nfile not found");
    return 1;
  }
  fseek(fp, 0, SEEK_END); // go to end
  len = ftell(fp);        // get position at end (length)
  fseek(fp, 0, SEEK_SET); // back to start
  if (len < 0) {          // ftell failed
    fclose(fp);
    return 1;
  }
  buffer = calloc(len + 4, 1); // calloc buffer (zeroed, a few bytes longer than the file)
  if (!buffer) {
    fclose(fp);
    return 1;
  }
  fread(buffer, 1, len, fp); // read file into buffer; the zeroed tail terminates it
  fclose(fp);

  char **rtn = Split(buffer, "\t", &outArray, &count);

  if (NULL != rtn) {
    char *lineOfText;
    for (int i = 0; i < count && i < 200; ++i) {
      lineOfText = outArray[i];
      printf("%s ", lineOfText); // print some lines

      // do something useful with line of text
    }
    FreeSplitList(outArray);
  }
  free(buffer); // Split() made its own copies of the strings
  puts("\n--- ok ---");
//   _getch(); // if your IDE closes the console window
  return 0;
} // end main

char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
{
    int Found;
    int Length;
    int DelimLen;
    char* Remain;
    char* Position;

    if ((List == NULL) || (Input == NULL) || (Delim == NULL))
    {
        *TokenCount=-1;
        return NULL;
    }

    DelimLen = strlen(Delim); // only safe after the NULL check above
    Found = 0;
    Remain = Input;

    //first pass -- count number of delimiters
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Found++;
        Remain = Position + DelimLen;
    }

    Found++; // increment one more time for last data chunk

    //create array based on number of delimiters
    *List = (char **)malloc((Found+1) * sizeof(char *));

    Found = 0;
    Remain = Input;

    //second pass -- populate array
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Length = Position - Remain;
        (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
        strncpy((*List)[Found], Remain, Length);
        (*List)[Found++][Length] = 0;
        Remain = Position + DelimLen;
    }

    Length = strlen(Remain);
    (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
    strncpy((*List)[Found], Remain, Length);
    (*List)[Found++][Length] = 0;
    (*List)[Found] = NULL;

    *TokenCount = Found;

    return *List;
} /* Split() */

/* Destroys the array of strings structure returned by Split() */
void FreeSplitList(char **List) {
    int Count=0;
    while(List[Count] != NULL)
        free(List[Count++]);
    free(List);
} /* FreeSplitList() */

Quote
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

It does, it does. However, it uses only one delimiter, which is fine for tab but clearly poses a problem with CrLf.

Here is my inner loop as posted above:

      while (c>1) {
            c=psRight[0];
            if (c==9 || c==nl) {               // tab or Cr or Lf
                  pRC[rowOffset+col]=psLeft;   // put the address of a string into the RowCol matrix of pointers
                  col++;
                  if (col>maxcol) {
                        printf("Too many columns: %i>%i\n", col, maxcol); // unlikely case
                        row=0;                 // flag failure
                        goto TooManyColumns;
                  }
                  psRight[0]=0;                // replace \t or \n with \0
                  if (c==nl) {                 // Cr or Lf or CrLf
                        if (psRight[1]==10) {
                              psRight++;       // CrLf needs one more
                        }
                        c=1;                   // flag newline
                  }
                  psLeft=psRight+1;
            }
            psRight++;                         // tab is one byte
      }
      while (col<maxcol) {
            pRC[rowOffset+col]="";             // care for empty cells
            col++;
      }
      row++;


With "real" spreadsheets, the tricky thing is finding the "right" number of columns, and dealing with varying #columns.
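
One way to settle the number of columns before allocating the matrix is a counting pass over the buffer. The sketch below is only an illustration under simple assumptions: count_rows_cols is a hypothetical name, every Cr, Lf or CrLf is treated as a row break (so an isolated linefeed inside a cell would be counted as a new row), and the buffer is zero-terminated.

```c
#include <stddef.h>

/* Hypothetical helper: scan a zero-terminated buffer once and report the
   number of rows and the widest row. Row breaks: Cr, Lf or CrLf. */
void count_rows_cols(const char *buf, size_t *rows, size_t *max_cols)
{
    size_t r = 0, widest = 0, cols = 1;
    int row_has_data = 0;          /* avoids counting a trailing newline as a row */

    for (const char *p = buf; *p != '\0'; p++) {
        if (*p == '\t') {
            cols++;                /* each tab starts another column */
            row_has_data = 1;
        } else if (*p == '\r' || *p == '\n') {
            if (*p == '\r' && p[1] == '\n')
                p++;               /* CrLf counts as one row break */
            r++;
            if (cols > widest) widest = cols;
            cols = 1;
            row_has_data = 0;
        } else {
            row_has_data = 1;
        }
    }
    if (row_has_data) {            /* last row without a final newline */
        r++;
        if (cols > widest) widest = cols;
    }
    *rows = r;
    *max_cols = widest;
}
```

With the widest row known, shorter rows can be padded with empty cells, as the second while loop above does.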

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #27 on: May 01, 2014, 11:40:00 AM »
New version with improved performance. With the default 10 MB test file (available here), it takes 0.2 seconds on my trusty Celeron to translate 43,000 rows. The assembler version is still faster, of course, but it would be interesting to see what Pelles C 64-bit can do.

The attachment includes a second source with DMac's code, slightly adjusted to make it compatible with VS.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #28 on: May 01, 2014, 01:10:49 PM »
Your program:
Loading 43123 rows took 0.0158 seconds with Recall
Loading 43123 rows took 0.1241 seconds with LoadTabFileJJ
- hit return -

my test program posix and pointers:

32-bit
68 ms
43138 rows

64-bit
101 ms
43138 rows
May the source be with you

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #29 on: May 01, 2014, 01:30:00 PM »


Quote
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

Quote
I'm not quite sure what you mean by this statement.  It splits the whole buffer in two passes, producing the array of strings.  However, whatever you did with the resulting strings could be considered line by line, I suppose.
You have only one delimiter (\t in our case), not two (\t and \n). So the last string of one row and the first string of the next row end up in a single array element, separated only by a \n.

To see this, change one line in jj2007's output loop:
Code: [Select]
      printf("%d %s ", i ,lineOfText); // print some lines
« Last Edit: May 01, 2014, 01:57:59 PM by czerny »