Pelles C forum

C language => Beginner questions => Topic started by: jj2007 on April 27, 2014, 11:51:56 AM

Title: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 27, 2014, 11:51:56 AM
I am thoroughly stuck and too tired to RTFM, so here is some code hoping that somebody can teach me the basics... ;-)

The idea is simple:
- load a tab-delimited text file into a buffer
- create an array of (rows, columns) pointers
- scan the buffer with char c=buffer[currentposition]
- if c = \tab, isolate the string (i.e. replace tab with \0) and put its address into matrix[row][col], then col++
- if c = \newline, isolate the string and put its address into matrix[row][0], then row++, col=0;

So far, so simple. The good news: It works, and it's by far the fastest method to read a tab-delimited text into a matrix of pointers.

The bad news: It works only in assembler - in Pelles C, I seem to stumble over very basic problems... can somebody help?

EDIT: Version 3 works. The [r,c] assignment is OK now, but I know there should be a sizeof int in malloc().

One problem was this:
  char *tmpc;
  tmpc=(char*)buffer;
  // mov eax, [ebp-0C]
  // mov [ebp-3C], eax
  printf("tmp=%X\n", tmpc);   // tmp=4101AC, address

  tmpc=(char*)buffer[0];
  // mov eax, [ebp-0C]
  // mov eax, [eax]
  // mov [ebp-3C], eax
  printf("tmp=%X\n", tmpc);   // tmp=73726946, content


Code: [Select]
#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()

#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216)    // retval never used
#pragma warn(disable:2007)    // assembly not portable
#pragma warn(disable:2118)    // para not referenced
// #pragma warn(disable:2215)    // conversion ... loss of data

int main(int argc, char* argv[]) {
  #define cols 6
  FILE *fp = fopen("Database.tab",  "r");
// Name FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Hollyword

  fseek(fp, 0, SEEK_END); // go to end
  long len=ftell(fp); // get position at end (length)
  fseek(fp, 0, SEEK_SET); // back to start
  char *psRight=malloc(len); // malloc buffer
  char *psLeft=psRight; // get a copy pointer to the content
  fread(psRight, len, 1, fp); // read file into buffer
  fclose(fp);
  int **rows=malloc(len/cols/4+100); // rough estimate of required #rows
  int row=0, col, i, j;
  byte c=99; // some value different from zero

  while (c) {
rows[row]=malloc(cols*4); // reserve memory for one row of pointers
// would like to preset matrix[r,c] to a nullstring but this sets bytes only...
// memset(rows[row], 0, 1);
col=0;
while (c && c!=10) {
c=psRight[0];
if (c<=10) { // tab or linefeed
  // put the address of a string into the matrix of pointers
  rows[row][col]=(int)psLeft;
  col++;
    // replace \t or \n with \0
  psRight[0]=0;
    psLeft=psRight+1;
  }
psRight++; // tab is one byte
  }
psRight++; // CrLf is 2 bytes
row++;
c=psRight[0];
  }
  for (i=0; i<row; i++) {
  printf("\n");
  for (j=0; j<cols; j++) {
    printf("%s\t", (char*) rows[i][j]);
    }
    }
}

Attached code compiles fine and is almost working.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 27, 2014, 03:02:42 PM
Why do you work with an int-buffer?
You should use an char-buffer or an WCHAR-buffer (unsigned short) in case of an unicode file.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 27, 2014, 03:22:14 PM
Right, that was corrected in version 2.

Version 3 (above) fixes also the malloc() bug, so everything works fine now.
Is there a "pointer size" memset?

Thanks for the feedback, czerny.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 27, 2014, 03:48:37 PM
Right, that was corrected in version 2.
You still have   int **rows, why int? What's your idea behind int?
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 27, 2014, 03:52:50 PM
You still have   int **rows, why int? What's your idea behind int?

For the brave assembler programmer, everything is a DWORD ;-)
Besides, if I use char**, it throws a nasty exception. What would be the correct way to do it?
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 27, 2014, 03:56:27 PM
Hope this helps:
Code: [Select]
int main(int argc, char* argv[]) {
  #define cols 6
#define rows 3
char **array; // This is a one-dimensional array of char-pointers, but we treat it as a two-dim. array
array=malloc(rows*cols*sizeof(char *));
array[0*cols+0] = "Maria";
array[2*cols+5] = "Susi";

printf("%s %s\n",array[0*cols+0],array[2*cols+5]);
return 0;
}
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 27, 2014, 04:18:27 PM
Here a second, but minor compact version:
Code: [Select]
int main(int argc, char* argv[]) {
  #define ncols 6
#define nrows 3
char ***rows; // This is an array of pointers to arrays of char-pointers.
rows=malloc(sizeof(char *)*nrows);
for (int i=0; i<nrows; i++)
rows[i]=malloc(sizeof(char *)*ncols);
rows[0][0] = "Maria";
rows[2][5] = "Susi";
printf("%s %s\n",rows[0][0],rows[2][5]);
return 0;

Edit: It is not common in csv-data to append an tab char at end of line. So you should'nt  do that to be more compatible.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 27, 2014, 04:43:04 PM
Thanks, czerny. Interesting, I had never seen a char *** ;-)

BTW, csv means comma-separated values, which is a different animal (with escapes and quotes and all that mess).

In my assembly version (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1172), I use a different strategy: The string array is one-dimensional under the hood, but if the user specifies a column, the algo returns the partial string between the n-th and the (n+1)th tab character. Superfast for loading, sufficiently fast for getting single cells.

So here is the final version, with active support from czerny - thanxalot.

The #columns is fixed for now, one might add a commandline option, or check the first 100 lines or so for the max #columns.

Code: [Select]
#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()

#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216)    // retval never used
// #pragma warn(disable:2007)    // assembly not portable
// #pragma warn(disable:2118)    // para not referenced

#define cols 6  // #columns known at compile time
// Name FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Hollyword


int main(int argc, char* argv[]) {
  char *fname=argv[0];
  if (!strstr(fname, ".tab")) fname="Database.tab"; // default file for testing
  FILE *fp = fopen(fname,  "r");
  fseek(fp, 0, SEEK_END); // go to end
  long len=ftell(fp); // get position at end (length)
  fseek(fp, 0, SEEK_SET); // back to start
  char *psRight=malloc(len); // malloc buffer
  char *psLeft=psRight; // get a copy pointer to the content
  fread(psRight, len, 1, fp); // read file into buffer
  fclose(fp);
  char ***rows=malloc(len/cols/4+100); // rough estimate of required #rows
  int row=0, col, i, j;
  byte c=99; // some value different from zero

  while (c) {
rows[row]=malloc(cols*sizeof(int)); // reserve memory for one row of pointers
// would like to preset matrix[r,c] to a nullstring but this sets bytes only...
// memset(rows[row], 0, 1);
col=0;
while (c && c!=10) {
c=psRight[0];
if (c<=10) { // tab or linefeed
  // put the address of a string into the matrix of pointers
  rows[row][col]=psLeft;
  col++;
    // replace \t or \n with \0
  psRight[0]=0;
    psLeft=psRight+1;
  }
psRight++; // tab is one byte
  }
while (col<cols) {
rows[row][col]=""; // care for empty cells
col++;
}
psRight++; // CrLf is 2 bytes
row++;
c=psRight[0];
  }
  for (i=0; i<row; i++) {
  printf("\n");
  for (j=0; j<cols; j++) {
    printf("%s\t", rows[i][j]);
    }
    }
}
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 27, 2014, 05:50:47 PM
Code: [Select]
  char ***rows=malloc(len/cols/4+100); // rough estimate of required #rows
Why not count your '\n' chars in your buffer? If you count (n), you have (n+1) or (n) rows, depending on the special structer of the last line.
You should free the (rows+1) arrays afterwards!
You can use calloc() to initialize your arrays.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 27, 2014, 10:02:48 PM
Why not count your '\n' chars in your buffer? ... You can use calloc() to initialize your arrays.

It's a performance issue: The scanning is very slow anyway (at least compared to hand-crafted assembler), but doing it twice would slow down the load considerably, especially if the file is big enough to affect the cache.

Re calloc, yes I could do that but all elements of the matrix will be filled anyway, so zeroing is just a waste of time.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on April 28, 2014, 09:55:55 AM
Nice exercise, my spoon in porridge.
Code: [Select]
#include <io.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>

#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216 2118) // retval never used, para not referenced

int main(int argc, char *argv[])
{
//#define COLS 6
#define ROWS 10
int fh = _open("Database.tsv", _O_BINARY | _O_RDONLY);
// FirstName    FamilyName  Age Profession  Street  City     
// Bill Watson  55  lawyer  Main Street, 12 London   
// John Doe 33  coder   Small lane, 22  Edinburgh   
// Will Smith   44  actor   Catwalk Holyword

long len = _filelength(fh);
char *buffer = (char *)malloc(len); // malloc buffer
_read(fh, buffer, len); // read file into buffer
_close(fh);
char ***prows = NULL; // = (char ***)malloc(ROWS * sizeof(char *));
//memset(prows, 0, 4 * sizeof(char *));
int row = 0, col = 0, pos = 0, posold, cols, rows;
unsigned char c = 10; // to alloc cols
posold = 0;
cols = 0;
rows = 0;
do { // count cols from first line
c = buffer[pos++];
if (c == 9 || c == 10)
cols++;
} while (c && c!= 10);
pos = 0;
while (c)
{
if (!rows || row > rows) {
prows = (char ***)realloc(prows, ROWS * sizeof(char *));
rows += ROWS;
}
prows[row] = (char **)malloc(cols * sizeof(char *)); // reserve memory for one row of pointers
col = 0;
do {
do {
c = buffer[pos];
if (c <= 10)
break;
if (c == 13)
buffer[pos] = 0; // CR off
pos++;
} while (c > 10);
if (!c)
break;
if (c <= 10)
{ // tab or linefeed
prows[row][col] = &buffer[posold];
buffer[pos++] = 0; // advance after zero
posold = pos;
col++;
}
} while (c < 10);
while (col < cols) // missing cols
prows[row][col++] = 0;
if (c == 10)
row++;
}
rows = row; // total count of rows
for (row = 0; row < rows; row++)
{
printf("\n%s '%s'", prows[0][0], prows[row][0]);
printf("\t%s '%s'", prows[0][1], prows[row][1]);
printf("\t%s '%s'", prows[0][2], prows[row][2]);
}
printf("\n");
return 0;
}
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 28, 2014, 04:13:58 PM
Ok, my two cents! ;D
Should work with dos, unix and mac linebreaks, but not with trailing tabs.
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
  #define ncols 6

long i, k, start, cols=0, rows=0;
char *buffer, **a;
FILE *fp = fopen("Database.tab",  "r");

fseek(fp, 0, SEEK_END);
long len=ftell(fp);
fseek(fp, 0, SEEK_SET);
buffer = malloc(len+1);
  len=fread(buffer, len, 1, fp);
buffer[len] = '\n';
  fclose(fp);

for (i=0; i<len; i++)
switch (buffer[i]) {
case '\r': ;
case '\n': rows++;
if (++cols != ncols)
printf("%d columns in line %d!\n", cols, rows);
cols=-1;
case '\t': cols++;
buffer[i] = '\0';
}

a = malloc(rows*ncols*sizeof(char *));
k = 0;
start = 1;

for (i=0; i<len; i++) {
if (start) a[k++] = &buffer[i];
start = ('\0' == buffer[i]);
}

for (i=0; i<rows; i++) {
for (k=0; k<ncols; k++)
printf("%s ",a[i*ncols+k]);
puts("");
}

return 0;
}
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 28, 2014, 05:11:36 PM
Hi czerny & Timo,

Nice to see that you are having fun, thanks ;-)

I am still chasing a bug, and desperately trying to "port" it to Visual Studio, but soon you'll get my "final" version, too.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: DMac on April 28, 2014, 05:46:18 PM
Here's something I have used for that type of thing:
Code: [Select]
char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
{
    int Found;
    int Length;
    int DelimLen;
    char* Remain;
    char* Position;

    DelimLen = strlen(Delim);
    Found = 0;
    Remain = Input;

    if ((List == NULL) || (Input == NULL) || (Delim == NULL))
    {
        *TokenCount=-1;
        return NULL;
    }

    //first pass -- count number of delimiters
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Found++;
        Remain = Position + DelimLen;
    }

    Found++; // increment one more time for last data chunk

    //create array based on number of delimiters
    *List = (char **)malloc((Found+1) * sizeof(char *));

    Found = 0;
    Remain = Input;

    //second pass -- populate array
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Length = Position - Remain;
        (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
        strncpy((*List)[Found], Remain, Length);
        (*List)[Found++][Length] = 0;
        Remain = Position + DelimLen;
    }

    Length = strlen(Remain);
    (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
    strncpy((*List)[Found], Remain, Length);
    (*List)[Found++][Length] = 0;
    (*List)[Found] = NULL;

    *TokenCount = Found;

    return *List;
} /* Split() */

/* Destroys the array of strings structure returned by Split() */
void FreeSplitList(char **List) {
    int Count;

    Count = 0;
    while(List[Count] != NULL)
    free(List[Count++]);
    free(List);

} /* FreeSplitList() */

Use it like so:
Code: [Select]
char **outArray = NULL;
int count;

char **rtn = Split(bufferOfTabDeliminatedText, "\t", &outArray, &count);

if(NULL != rtn)
{
     char *lineOfText;
     for(int i = 0; i < count; ++i)
    {
          lineOfText = outArray[i];

          //do something usefull with line of text
     }
     FreeSplitList(outArray);
}
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 29, 2014, 03:32:23 AM
Thanks a lot to everybody - I will try to put together a testbed for timing the algos.

In the meantime, I have resolved the mystery of my bug, i.e. I found out why on Visual Studio Express I got hundred of lines more.

It's hilarious, and Microsoft specific - Pelles C is not affected. Consider something really harmless like this:

  fp = fopen(fname,  "r");
..
  fread(buffer, len, 1, fp);   // read file into buffer

Guess what it does? Yeah, it reads the content for fname into the buffer. But not only:

1. Since there are intelligent people in Redmond, they convert CrLf to Lf only - WOW!

2. Since Microsoft employs only true geniuses, they do the conversion in-place, i.e. they use the original buffer for doing so. Isn't that SUPER CLEVER??

There is a minor inconvenience, though: when scanning the buffer, you won't find a zero byte at the end of the converted content, so the buffer=calloc(len+4, 1); is pretty useless ;-)

Oh Redmond 8)
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on April 29, 2014, 06:42:45 AM
http://msdn.microsoft.com/en-us/library/kt0etdcs.aspx
Quote
Remarks
The fread function reads up to count items of size bytes from the input stream and stores them in buffer. The file pointer associated with stream (if there is one) is increased by the number of bytes actually read. If the given stream is opened in text mode, carriage return–linefeed pairs are replaced with single linefeed characters. The replacement has no effect on the file pointer or the return value.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 29, 2014, 07:17:23 AM
Yes, that's perfectly correct. It's a feature, not a bug. If they had been in a good mood, or if they had been forced to by a judge, they would have added, " the useless bytes at the end of the buffer are not zeroed out, and nobody will tell you where the new content ends".

Never mind, "rb" works correctly.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on April 29, 2014, 08:33:25 AM
Use fread return value and put zero to there.
Code: [Select]
...
size_t len = fread(buf, sizeof(char), sizeof(buf), fp);
buf[len] = 0;
...
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 29, 2014, 09:34:38 AM
Use fread return value and put zero to there.
Code: [Select]
...
size_t len = fread(buf, sizeof(char), sizeof(buf), fp);
buf[len] = 0;
...

(MSDN): "The replacement has no effect on .. the return value."

Plain wrong, because the return value changes to "bytes stored" instead of "bytes read". Which is why your solution works. But in any case, using "rb" is a much better solution - replacing CrLf with LF costs lots of cycles, especially for larger files where the cache becomes important.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 29, 2014, 09:44:43 AM
If you like to be portable (DOS, Unix, Mac linebreaks), the text (not binary) read is easier to handle in my opinion.
Btw. the _fread in Pelles C behaves like the MS variant. You have to use the return value to find the buffer end.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 29, 2014, 12:07:29 PM
If you like to be portable (DOS, Unix, Mac linebreaks), the text (not binary) read is easier to handle in my opinion.
Not a big problem, actually, and performance-wise the binary read is better.

Quote
Btw. the _fread in Pelles C behaves like the MS variant. You have to use the return value to find the buffer end.
In PellesC, the buffer end is zeroed, which is OK.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 30, 2014, 01:12:29 AM
Here is my "final" version. I will try to extend it with your algos in order to get timings. So far only my library algo is in, for comparison.

I've put a testfile here (http://www.webalice.it/jj2006/Cruscotto/pics/Database.zip) - a "real" statistical database from the UN in tabbed format with CrLf and some typical problems, such as isolated linefeeds. It has about 10Mb and loads in about half a second with the C algo.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 30, 2014, 09:55:29 AM
Here's something I have used for that type of thing:

Code: [Select]
char** Split(char *Input, char *Delim, char ***List, int *TokenCount)

Thanks, DMac. Is there a simple solution for this error?

    while ((Position = strstr(Remain, Delim)) != NULL)   //error #2168: Operands of '=' have incompatible types 'char *' and 'int'
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on April 30, 2014, 11:11:32 AM
    while ((Position = strstr(Remain, Delim)) != NULL)   //error #2168: Operands of '=' have incompatible types 'char *' and 'int'
I don't have this error!

Btw. DMacs code can not deal with the whole buffer, it must be used line by line.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on April 30, 2014, 02:35:10 PM
I don't have this error!

Strange. I am still on XP with version 7, could that be the reason?

In the meantime, I've prepared the source to incorporate other algos. Example:

  MbTimer();  // start the timer

  ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);  // load the text file into an array

  secs=MbTimer()/1e6;
  printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);


As it stands, it compiles fine also in VS Express 2010. While Pelles C is a better product, I find it important to be able to tell another coder "if you don't have Pelles C, just compile it in VS". People stay away from exotic compilers (and we must admit that it is a bit exotic...) because they believe they are wasting time on incompatible tools.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: DMac on April 30, 2014, 07:05:35 PM
Thanks, DMac. Is there a simple solution for this error?

    while ((Position = strstr(Remain, Delim)) != NULL)   //error #2168: Operands of '=' have incompatible types 'char *' and 'int'

Did you modify the code of the function, adapting it to your application?  If so you want to make sure that the var Position is declared char * the return type of strstr() is char * and that seems to be what the compiler is complaining about.

I just tested my example on Win64 and Win32 and was not able to reproduce this error unless I changed the declaration of the var Position.

Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

I'm not quite sure what you mean by this statement.  It splits out the whole buffer in two goes producing the array of strings.  However what ever you did with the resulting strings could be considered line by line I suppose.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 01, 2014, 10:04:25 AM
Did you modify the code of the function, adapting it to your application?

I used it without modifications, but now (on my machine at home) I can't reproduce the error. Anyway, here is a complete working example:
Code: [Select]
#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()

#pragma warn(disable:2216)    // retval never used
#pragma warn(disable:2007)    // assembly not portable
#pragma warn(disable:2118)    // para not referenced
// #pragma warn(disable:2215)    // conversion ... loss of data

char** Split(char *Input, char *Delim, char ***List, int *TokenCount);
void FreeSplitList(char **List);

int main(int argc, char* argv[]) {
  char **outArray = NULL;
  int count, lines, cols, len;
  char *buffer;
  FILE *fp;
  fp = fopen("C:\\TEMP\\Database.tab",  "rb");
  if (!fp) {
  printf("\nfile not found");
  lines=0;
  cols=0;
  return 0;
   }
  fseek(fp, 0, SEEK_END); // go to end
  len=ftell(fp);    // get position at end (length)
  fseek(fp, 0, SEEK_SET); // back to start
  buffer=calloc(len+4, 1); // calloc buffer (zeroed, at least two bytes longer)
  fread(buffer, 1, len, fp); // read file into buffer
  fclose(fp);

  char **rtn = Split(buffer, "\t", &outArray, &count);

  if(NULL != rtn) {
char *lineOfText;
for(int i = 0; i < count && i<200; ++i)
{
      lineOfText = outArray[i];
      printf("%s ", lineOfText); // print some lines

      //do something usefull with line of text
}
FreeSplitList(outArray);
  }
  puts("\n--- ok ---");
//   _getch(); // if your IDE closes the console window
} // end main

char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
{
    int Found;
    int Length;
    int DelimLen;
    char* Remain;
    char* Position;

    DelimLen = strlen(Delim);
    Found = 0;
    Remain = Input;

    if ((List == NULL) || (Input == NULL) || (Delim == NULL))
    {
        *TokenCount=-1;
        return NULL;
    }

    //first pass -- count number of delimiters
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Found++;
        Remain = Position + DelimLen;
    }

    Found++; // increment one more time for last data chunk

    //create array based on number of delimiters
    *List = (char **)malloc((Found+1) * sizeof(char *));

    Found = 0;
    Remain = Input;

    //second pass -- populate array
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Length = Position - Remain;
        (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
        strncpy((*List)[Found], Remain, Length);
        (*List)[Found++][Length] = 0;
        Remain = Position + DelimLen;
    }

    Length = strlen(Remain);
    (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
    strncpy((*List)[Found], Remain, Length);
    (*List)[Found++][Length] = 0;
    (*List)[Found] = NULL;

    *TokenCount = Found;

    return *List;
} /* Split() */

/* Destroys the array of strings structure returned by Split() */
void FreeSplitList(char **List) {
    int Count=0;
    while(List[Count] != NULL)
    free(List[Count++]);
    free(List);
} /* FreeSplitList() */

Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

It does, it does. However, it uses only one delimiter, which is fine for tab but clearly poses a problem with CrLf.

Here is my inner loop as posted above:

      while (c>1) {
            c=psRight[0];
            if (c==9 || c==nl) {      // tab or Cr or Lf
              pRC[rowOffset+col]=psLeft;  // put the address of a string into the RowCol matrix of pointers
              col++;
              if (col>maxcol) {
                    printf("Too many columns: %i>%i\n", col, maxcol); // unlikely case
                  row=0;      // flag failure
                    goto TooManyColumns;
                    }
              psRight[0]=0;       // replace \t or \n with \0
              if (c==nl) {            // Cr or Lf or CrLf
                    if (psRight[1]==10){
                        psRight++;       // CrLf needs one more
                          }
                    c=1;                  // flag newline
                    }
               psLeft=psRight+1;
            }
            psRight++;                  // tab is one byte
        }
      while (col<maxcol) {
            pRC[rowOffset+col]="";      // care for empty cells
            col++;
      }
      row++;


With "real" spreadsheets, the tricky thing is finding the "right" number of columns, and dealing with varying #columns.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 01, 2014, 11:40:00 AM
New version with improved performance. With the default 10 MB test file (available here (http://www.webalice.it/jj2006/Cruscotto/pics/Database.zip)), it takes 0.2 seconds on my trusty Celeron to translate 43,000 rows. The assembler version is still faster, of course, but it would be interesting to see what Pelles C 64-bit can do.

The attachment includes a second source with DMac's code, slightly adjusted to make it compatible with VS.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 01, 2014, 01:10:49 PM
Your program:
Loading 43123 rows took 0.0158 seconds with Recall
Loading 43123 rows took 0.1241 seconds with LoadTabFileJJ
- hit return -

my test program posix and pointers:

32-bit
68 ms
43138 rows

64-bit
101 ms
43138 rows
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 01, 2014, 01:30:00 PM


Btw. DMacs code can not deal with the whole buffer, it must be used line by line.

I'm not quite sure what you mean by this statement.  It splits out the whole buffer in two goes producing the array of strings.  However what ever you did with the resulting strings could be considered line by line I suppose.
You have one delimiter, \t in our case, not two (\t and \n). So the last string per row and the first string of the next row are only one arrayelement delimited bei an \n.

To see this, change one line in jj2007's output loop:
Code: [Select]
      printf("%d %s ", i ,lineOfText); // print some lines
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 01, 2014, 02:25:26 PM
jj2007: change
Code: [Select]
  if (col>maxcol) maxcol=col+1;to
Code: [Select]
  if (col>=maxcol) maxcol=col+1;
You have to free buffer!

Not sure what your 3+ cols should be?

If you really have more cols, how will you test? Check all your 3*rows for NULL? And what then?
What to do if you have 4+ cols?
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 01, 2014, 02:38:04 PM
Loading 43123 rows took 0.1241 seconds with LoadTabFileJJ
..
my test program posix and pointers:

32-bit
68 ms
43138 rows
68 vs 124 is very good indeed, but have a look at the difference in row counts. The database has some rows with isolated LFs, a frequent case you can sometimes see in MS Excel.

What do you mean with posix?

@czerny:
if (col>=maxcol) maxcol=col+1;

Can be done but the maxcol=maxcol+3; has the same effect. The maxcol loop at the beginning checks 50...1000 lines (depending on #columns), and while it has no measurable influence on the timings, it should detect the maximum number of columns.  Further down is
if (col>maxcol) {
but that should be triggered only if you have really "malformed" sources.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 01, 2014, 07:26:34 PM
68 vs 124 is very good indeed, but have a look at the difference in row counts. The database has some rows with isolated LFs, a frequent case you can sometimes see in MS Excel.
why  ???, in that example file there is 3 tables and in first table there is 40357 rows.
What do you mean with posix?
i was using functions open,close,read,filelength (http://en.wikipedia.org/wiki/POSIX)
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 01, 2014, 09:15:17 PM
in that example file there is 3 tables and in first table there is 40357 rows.

Open in Excel and check row 41243 (Eurostat) and 43024 (prepaid) to see linefeeds.
Last line is 43123 when opened in Excel. BTW I didn't cook up this file, it comes directly from the UN. (http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv)

EDIT: I attach a sample file with linefeeds. This kind of problem comes from pasting cells with linefeeds from MS Word tables into single Excel cells.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 12:30:17 AM
How accurate are your time measurements?

I have here big variability, about 30%.

What are you using? QueryPerformanceCounter? _rdtsc? Other?
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 01:15:48 AM
I have here big variability, about 30%.

For me, typically within a few % with MbTimer(). Under the hood it's QPC.
Don't forget the first load is always way off, but once the file is in the disk cache, timings should be very consistent. Which timer are you using?

Here is an example using a for loop:
Loading 43123 rows took 0.0259 seconds with Recall
Loading 43123 rows took 0.0274 seconds with Recall
Loading 43123 rows took 0.0268 seconds with Recall
Loading 43123 rows took 0.0269 seconds with Recall
Loading 43123 rows took 0.0263 seconds with Recall
Loading 43123 rows took 0.0264 seconds with Recall
Loading 43123 rows took 0.0259 seconds with Recall
Loading 43123 rows took 0.0261 seconds with Recall
Loading 43123 rows took 0.0257 seconds with Recall
Loading 43123 rows took 0.0258 seconds with Recall
Loading 43123 rows took 0.0252 seconds with Recall
Loading 43123 rows took 0.0261 seconds with Recall
Loading 43123 rows took 0.0249 seconds with Recall
Loading 43123 rows took 0.0247 seconds with Recall
Loading 43123 rows took 0.0246 seconds with Recall
Loading 43123 rows took 0.0246 seconds with Recall
Loading 43123 rows took 0.0242 seconds with Recall
Loading 43123 rows took 0.0243 seconds with Recall
Loading 43123 rows took 0.0239 seconds with Recall
Loading 43123 rows took 0.0239 seconds with Recall
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 02, 2014, 09:04:16 AM
Confused ???

LibreOffice 4.2, 43121 lines.

First table 1 - 40357
Second 40359 - 43112
Third 43114 - 43121

Excel 2010 Finnish version can't handle that .csv nor that .tab:(

Excel 2003 list separator: here (http://office.microsoft.com/en-us/excel-help/import-or-export-text-files-HP001130599.aspx)

I know that UN file have 43123 rows.

EDIT:
There is a trick to how to use .csv file in no-US version of Excel: here (http://superuser.com/questions/606272/how-to-get-excel-to-interpret-the-comma-as-a-default-delimiter-in-csv-files)
just insert "sep=," as first line and Excel 2010 use that separator.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: Robert on May 02, 2014, 09:51:17 AM
Confused ???

LibreOffice 4.2, 43121 lines.

First table 1 - 40357
Second 40359 - 43112
Third 43114 - 43121

Excel can't handle that csv without changing local settings (list separator) :(

Using the original U.N. database

Microsoft Excel 2003, 43123 lines

First table 1 - 40357 including header
Second 40359 - 43114 including header
Third 43116 - 43123 including header

There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 09:55:11 AM
I have here big variability, about 30%.

For me, typically within a few % with MbTimer(). Under the hood it's QPC.
Which timer are you using?
I have compared both: QueryPerformanceCounter and  _rdtsc. They have both a very high variability and they differ from each other.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 09:58:09 AM
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
But JJ2007's test database is a tab delimited file!

JJ2007: Do you have made the conversation from csv to tab?
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: Robert on May 02, 2014, 10:31:09 AM
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
But JJ2007's test database is a tab delimited file!

JJ2007: Do you have made the conversation from csv to tab?

Using JJ2007's database.tab file in Excel 2003 displays the same as the original U.N. file.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 12:01:09 PM
I have compared both: QueryPerformanceCounter and  _rdtsc. They have both a very high variability and they differ from each other.

Strange. With rdtsc, it could be an issue with core switching, but QPC should be immune against that. What do you get with this loop (exe attached)?

  for (int i=0; i<30; i++) {
   MbTimer();
   ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);
   secs=MbTimer()/1e6;
   printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
  }

P.S.: Yes, I did the csv to tab conversion, but as Robert (thanks!) wrote, they are identical - just tabs instead of commas.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 01:11:34 PM
I have compared both: QueryPerformanceCounter and  _rdtsc. They have both a very high variability and they differ from each other.

Strange. With rdtsc, it could be an issue with core switching, but QPC should be immune against that. What do you get with this loop (exe attached)?

  for (int i=0; i<30; i++) {
   MbTimer();
   ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);
   secs=MbTimer()/1e6;
   printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
  }
I do not have the code available in the moment. I will do some more tests at evening.
P.S.: Yes, I did the csv to tab conversion, but as Robert (thanks!) wrote, they are identical - just tabs instead of commas.
What about the quotes? It looks as you have deleted them sometimes but not allways.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 01:55:15 PM
I do not have the code available in the moment. I will do some more tests at evening.

The MbTimer aka QPC version is attached above.

Quote
What about the quotes? It looks as you have deleted them sometimes but not allways.

Yes, I saw that also, thanks for reminding me. The conversion is bad for "Test","There are, commas", "here". I will fix that asap, but it doesn't influence the row count.

The database file is tricky - see e.g. footnotes 152 and 663, no solution for those! - but on the other hand it's a good test case because it's a real example of a malformed spreadsheet coming from an official source of statistics.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 02, 2014, 02:11:38 PM
Timer functions in C.
A lib of these is useful for console programs.
Code: [Select]
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

LONGLONG __cdecl StartTimer(void)
{
LONGLONG t1; // ticks
// start timer
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
return t1;
}

LONGLONG __cdecl StopTimer(LONGLONG t1)
{
LONGLONG t2;
LONGLONG frequency; // ticks per second
// stop timer
QueryPerformanceCounter((LARGE_INTEGER *) & t2);
// get ticks per second
QueryPerformanceFrequency((LARGE_INTEGER *) & frequency);
// compute and print the elapsed time in millisec
LONGLONG elapsedTime = (t2 - t1) * 1000 / frequency;
return (LONGLONG)elapsedTime;
}
definitions for console program:
Code: [Select]
long long __cdecl StartTimer(void);
long long __cdecl StopTimer(long long t1);
usage:
Code: [Select]
long long llStart = StartTimer();
// do something or nothing
long long llTime = StopTimer(llStart);
printf("time: %d ms\n", (int)llTime);
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 02, 2014, 02:41:53 PM
StopTimer returns elapsedTime in msecs.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 03:03:35 PM
OK, got it:
Loading 43123 rows took 0.3145 seconds
Loading 43123 rows took 0.3132 seconds
Loading 43123 rows took 0.3124 seconds
Loading 43123 rows took 0.3103 seconds
Loading 43123 rows took 0.3123 seconds
Loading 43123 rows took 309 milliseconds
Loading 43123 rows took 313 milliseconds
Loading 43123 rows took 310 milliseconds
Loading 43123 rows took 310 milliseconds
Loading 43123 rows took 313 milliseconds


No big variation here...
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 05:12:12 PM
I have for example the following timings:
Code: [Select]
for (i=0; i<10; i++)
{
  LoadTabFileJJ();
  Czernys();
  LoadTabFileJJ();
}

158.8260519144 ms       180.9129372588 ms       160.1033092195 ms
162.1560840833 ms       181.4037817656 ms       163.1542556386 ms
224.5581491502 ms       205.7775499400 ms       207.7685851135 ms
210.3840013186 ms       227.3836479217 ms       242.3699101422 ms
193.2228816791 ms       222.2067329786 ms       178.7003655493 ms
177.3895844304 ms       342.9254276731 ms       188.9189827199 ms
180.6846959600 ms       280.0210641297 ms       245.3736438570 ms
189.7260685366 ms       221.5546948006 ms       182.0309564484 ms
178.0321241946 ms       213.9358493887 ms       244.7769199717 ms
242.6847546266 ms       228.9170830371 ms       242.0028243813 ms

Hier with Timos code:

160 ms  180 ms  162 ms
252 ms  249 ms  211 ms
207 ms  237 ms  180 ms
181 ms  218 ms  181 ms
245 ms  223 ms  208 ms
207 ms  281 ms  196 ms
183 ms  224 ms  287 ms
236 ms  1373 ms 454 ms
182 ms  1248 ms 486 ms
180 ms  2177 ms 492 ms

The last three lines may be caused by swapping?

Here is Timos and my timer in parallel:

Code: [Select]
for (i=0; i<10; i++)
{
   LoadTabFileJJ();
}

163 ms  163.1665477037 ms
220 ms  220.7501232699 ms
192 ms  192.2875672746 ms
165 ms  165.9534940893 ms
220 ms  220.4213105297 ms
197 ms  197.9391235478 ms
211 ms  211.4291062132 ms
2579 ms 2579.4286704036 ms
2684 ms 2684.3026138797 ms
3112 ms 3112.7481286029 ms

Here both timers a consistent. But what about the variations?
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 05:35:15 PM
Swapping is somewhat unlikely, for the handful of bytes we are dealing with. Here are my timings for the old Celeron notebook with a lousy 2GB of RAM, tested with MbTimer() and Start/StopTimer():

Loading 43123 rows took 0.203 seconds with LoadTabFileJJ
Loading 43123 rows took 0.196 seconds with LoadTabFileJJ
Loading 43123 rows took 0.200 seconds with LoadTabFileJJ
Loading 43123 rows took 0.203 seconds with LoadTabFileJJ
Loading 43123 rows took 0.200 seconds with LoadTabFileJJ
Loading 43123 rows took 196 milliseconds with LoadTabFileJJ
Loading 43123 rows took 197 milliseconds with LoadTabFileJJ
Loading 43123 rows took 197 milliseconds with LoadTabFileJJ
Loading 43123 rows took 200 milliseconds with LoadTabFileJJ
Loading 43123 rows took 198 milliseconds with LoadTabFileJJ
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 05:39:12 PM
Here is the 2-timer test above with an somewhat quicker machine:

131 ms  131.1705817359 ms
143 ms  143.6193706183 ms
130 ms  130.6548737340 ms
131 ms  131.1560547500 ms
147 ms  147.2851996553 ms
131 ms  131.0951531549 ms
130 ms  130.4649054559 ms
130 ms  130.3785816354 ms
130 ms  130.7406388242 ms
131 ms  131.2108103125 ms
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 05:50:23 PM
Hey JJ2007, I got thouse 4 digit measurements not any more after freeing your buffer!

255 ms  255.4034102100 ms
223 ms  223.6789871338 ms
197 ms  197.7851933695 ms
215 ms  215.1318114453 ms
222 ms  222.2874694968 ms
197 ms  197.4552631689 ms
210 ms  210.9949728248 ms
235 ms  235.9663029798 ms
183 ms  183.3612931252 ms
192 ms  192.9044054482 ms

Here 5 timers: Timo, GetTickCount, timeGetTime, _rdtsc, my


590 ms  594 ms  591 ms  1414755126 cycles       590.9614210745 ms
167 ms  172 ms  168 ms  400770772 cycles        167.4084276074 ms
166 ms  156 ms  167 ms  398143312 cycles        166.3121988968 ms
167 ms  156 ms  168 ms  401748552 cycles        167.8168594053 ms
203 ms  203 ms  203 ms  485991200 cycles        203.0087622868 ms
197 ms  188 ms  197 ms  472113756 cycles        197.2110980586 ms
245 ms  235 ms  245 ms  586719924 cycles        245.0811485817 ms
200 ms  203 ms  202 ms  480600504 cycles        200.7559620008 ms
276 ms  281 ms  277 ms  663128552 cycles        276.9974954917 ms
2083 ms 2078 ms 2083 ms 693108120 cycles        2083.9235154189 ms

Look at the last two lines. There is a big difference in ms but a small difference in cycles?
And, JJ2007, there is a 4-digit time again! :-(
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 02, 2014, 07:04:28 PM
As i'm hobbyist, i shall share this test project too.

32-bit:
Process: 58 ms
43123 nRows

64-bit:
Process: 71 ms
43123 nRows

Bit slow, but who cares :D
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 07:52:03 PM
Bit slow, but who cares :D

Practically on par with my algo (MbRecall is fast but doesn't count in this context).
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 02, 2014, 07:54:01 PM
Hey JJ2007, I got thouse 4 digit measurements not any more after freeing your buffer!
Which buffer?

Quote
Look at the last two lines. There is a big difference in ms but a small difference in cycles?
And, JJ2007, there is a 4-digit time again! :-(

Mysterious. Doesn't make any sense 8)
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: TimoVJL on May 02, 2014, 09:04:52 PM
Thank's Jochen for that BrainTrainer material.
That database.tab was good learning material for seeing some weakness of Excel and LibreOffice.
I hope that beginners enjoy this journey in those csv/tbs/tab files :)

Thank's to Pelle for PellesC  to make this exercise.

Thank's Lara Fabian's CD LARA FABIAN to be good mood when programming :).

When i go to my daily walk into fields, Paulina Rubio and Laura Pausini puts my legs in right beat :D
And i have Patricia Kaas and Middle of The Road (if sun shines Soley Soley) in my MP3 player too  8)
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: czerny on May 02, 2014, 09:46:11 PM
Here is my last version: I had a lot of fun with this little exercise, too.
Title: Re: Reading a tab-delimited text file into a two-dimensional array
Post by: jj2007 on May 03, 2014, 12:25:47 AM
Thanks a lot to you, Timo and Czerny and Robert, for your help. I've learnt some new tricks ;-)

And you forced me to update my assembler library, because it didn't treat quotes and commas correctly. It's ok now, although not yet online.

BTW in case you wondered how MbRecall can be 5* as fast: It always reads whole lines and stores only the beginning and len. Then, if you ask for element[5,9], it goes to row 5 and extracts the substring between the 5th and the 6th tab.

@Timo: Soley Soley - reminds me of the good times when I hung around in discos. You might like The Rain, the Park and Other Things (http://www.youtube.com/watch?v=X6uh7zXw-WI) ;-)

P.S.: New test files here (http://www.webalice.it/jj2006/Cruscotto/pics/Database.zip), with:
- original csv file
- tab file converted by Excel 2010
- tab file produced by my own converter

When testing csv or tab readers, pay attention to the treatment of commas and quotes inside cells. I attach a file in *.xls format that has some goodies for testing.
Title: Spreadsheet sorting
Post by: jj2007 on March 12, 2016, 10:36:42 AM
Just an update: I have added a "sort by numerical value of column n" function to my Recall() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1172) and QSort() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1175) procedures, see here (http://masm32.com/board/index.php?topic=94.msg55979#msg55979). It's assembler, but if anybody needs a Pelles C version, let me know.