
Author Topic: Reading a tab-delimited text file into a two-dimensional array  (Read 46809 times)

Offline jj2007

  • Member
  • *
  • Posts: 536
I am thoroughly stuck and too tired to RTFM, so here is some code hoping that somebody can teach me the basics... ;-)

The idea is simple:
- load a tab-delimited text file into a buffer
- create an array of (rows, columns) pointers
- scan the buffer with char c=buffer[currentposition]
- if c = \tab, isolate the string (i.e. replace tab with \0) and put its address into matrix[row][col], then col++
- if c = \newline, isolate the string and put its address into matrix[row][0], then row++, col=0;
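
The steps above can be sketched in portable C as follows. This is only a minimal sketch, not the code from the attachment: the name split_tab_buffer and its parameters are made up for illustration, the buffer is assumed to be NUL-terminated, and the cells go into one flat array indexed as row * ncols + col.

```c
#include <stdlib.h>

/* Hypothetical helper, not from the thread: splits a NUL-terminated,
   tab-delimited buffer in place. Each tab or newline is overwritten with
   '\0' and the start of each cell is stored at cells[row * ncols + col].
   Returns the number of rows, or -1 if memory runs out. */
long split_tab_buffer(char *buf, int ncols, char ***out_cells)
{
    size_t cap = 16, row = 0;            /* allocated rows, used rows */
    int col = 0;
    char *left = buf;                    /* start of the current cell */
    char **cells = malloc(cap * ncols * sizeof *cells);
    if (!cells) return -1;

    for (char *p = buf; ; p++) {
        char c = *p;
        if (c != '\t' && c != '\n' && c != '\0')
            continue;                    /* ordinary character */
        if (c == '\0' && col == 0 && p == left)
            break;                       /* buffer ends right after a newline */
        if (c == '\n' && p > left && p[-1] == '\r')
            p[-1] = '\0';                /* drop the CR of a CRLF pair */
        if (col < ncols)
            cells[row * ncols + col++] = left;
        *p = '\0';                       /* isolate the cell */
        left = p + 1;
        if (c == '\t')
            continue;
        while (col < ncols)              /* short row: pad with empty strings */
            cells[row * ncols + col++] = p;
        row++;
        col = 0;
        if (c == '\0')
            break;
        if (row == cap) {                /* grow the table before the next row */
            cap *= 2;
            char **tmp = realloc(cells, cap * ncols * sizeof *cells);
            if (!tmp) { free(cells); return -1; }
            cells = tmp;
        }
    }
    *out_cells = cells;
    return (long)row;
}
```

The caller owns both the buffer and the returned pointer array; the cells themselves are not copied, so the buffer must stay alive as long as the array is in use.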

So far, so simple. The good news: It works, and it's by far the fastest method to read a tab-delimited text into a matrix of pointers.

The bad news: It works only in assembler - in Pelles C, I seem to stumble over very basic problems... can somebody help?

EDIT: Version 3 works. The [r,c] assignment is OK now, but I know there should be a sizeof int in malloc().

One problem was this:
  char *tmpc;
  tmpc=(char*)buffer;
  // mov eax, [ebp-0C]
  // mov [ebp-3C], eax
  printf("tmp=%X\n", tmpc);   // tmp=4101AC, address

  tmpc=(char*)buffer[0];
  // mov eax, [ebp-0C]
  // mov eax, [eax]
  // mov [ebp-3C], eax
  printf("tmp=%X\n", tmpc);   // tmp=73726946, content
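
The difference between the two casts is a matter of precedence: indexing binds tighter than a cast, so (char*)buffer[0] first loads the element buffer[0] and only then converts that value to a pointer. A small sketch of the two readings, assuming buffer was an int* at that point (the printed value 73726946 would then be the bytes "Firs" on a little-endian machine); first_char and first_word are illustrative names only:

```c
#include <string.h>

/* Cast the pointer first, index afterwards: yields one character. */
char first_char(const void *buffer)
{
    const char *text = (const char *)buffer;   /* same address, new type */
    return text[0];
}

/* What indexing an int buffer does: load the first few bytes as one number.
   Converting that number to a pointer gives a bogus address. */
unsigned first_word(const void *buffer)
{
    unsigned word;
    memcpy(&word, buffer, sizeof word);         /* raw bytes of the text */
    return word;
}
```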


Code: [Select]
#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()

#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216)    // retval never used
#pragma warn(disable:2007)    // assembly not portable
#pragma warn(disable:2118)    // para not referenced
// #pragma warn(disable:2215)    // conversion ... loss of data

int main(int argc, char* argv[]) {
  #define cols 6
  FILE *fp = fopen("Database.tab",  "r");
// Name FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Hollyword

  fseek(fp, 0, SEEK_END); // go to end
  long len=ftell(fp); // get position at end (length)
  fseek(fp, 0, SEEK_SET); // back to start
  char *psRight=malloc(len); // malloc buffer
  char *psLeft=psRight; // get a copy pointer to the content
  fread(psRight, len, 1, fp); // read file into buffer
  fclose(fp);
  int **rows=malloc(len/cols/4+100); // rough estimate of required #rows
  int row=0, col, i, j;
  byte c=99; // some value different from zero

  while (c) {
    rows[row]=malloc(cols*4); // reserve memory for one row of pointers
    // would like to preset matrix[r,c] to a nullstring but this sets bytes only...
    // memset(rows[row], 0, 1);
    col=0;
    while (c && c!=10) {
      c=psRight[0];
      if (c<=10) { // tab or linefeed
        // put the address of a string into the matrix of pointers
        rows[row][col]=(int)psLeft;
        col++;
        // replace \t or \n with \0
        psRight[0]=0;
        psLeft=psRight+1;
      }
      psRight++; // tab is one byte
    }
    psRight++; // CrLf is 2 bytes
    row++;
    c=psRight[0];
  }
  for (i=0; i<row; i++) {
    printf("\n");
    for (j=0; j<cols; j++) {
      printf("%s\t", (char*) rows[i][j]);
    }
  }
}

Attached code compiles fine and is almost working.
« Last Edit: April 27, 2014, 03:25:23 PM by jj2007 »

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #1 on: April 27, 2014, 03:02:42 PM »
Why do you work with an int-buffer?
You should use a char buffer, or a WCHAR buffer (unsigned short) in case of a Unicode file.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #2 on: April 27, 2014, 03:22:14 PM »
Right, that was corrected in version 2.

Version 3 (above) fixes also the malloc() bug, so everything works fine now.
Is there a "pointer size" memset?

Thanks for the feedback, czerny.
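
On the "pointer size" memset question: standard C has no such function, because memset() writes single bytes only. A plain loop does the job. A minimal sketch, with fill_cells as a made-up name:

```c
#include <stddef.h>

/* Hypothetical helper: stores the same string pointer in every slot. */
void fill_cells(char **cells, size_t count, char *value)
{
    for (size_t i = 0; i < count; i++)
        cells[i] = value;
}
```

Calling fill_cells(rows[row], cols, "") would preset one row of the matrix to empty strings, so that printing an unfilled cell with %s is harmless.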

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #3 on: April 27, 2014, 03:48:37 PM »
Quote from jj2007: "Right, that was corrected in version 2."
You still have   int **rows, why int? What's your idea behind int?

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #4 on: April 27, 2014, 03:52:50 PM »
Quote from czerny: "You still have   int **rows, why int? What's your idea behind int?"

For the brave assembler programmer, everything is a DWORD ;-)
Besides, if I use char**, it throws a nasty exception. What would be the correct way to do it?

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #5 on: April 27, 2014, 03:56:27 PM »
Hope this helps:
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
  #define cols 6
  #define rows 3
  char **array; // This is a one-dimensional array of char-pointers, but we treat it as a two-dim. array
  array=malloc(rows*cols*sizeof(char *));
  array[0*cols+0] = "Maria";
  array[2*cols+5] = "Susi";

  printf("%s %s\n",array[0*cols+0],array[2*cols+5]);
  free(array);
  return 0;
}

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #6 on: April 27, 2014, 04:18:27 PM »
Here is a second, slightly less compact version:
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
  #define ncols 6
  #define nrows 3
  char ***rows; // This is an array of pointers to arrays of char-pointers.
  rows=malloc(sizeof(char **)*nrows); // one char** per row
  for (int i=0; i<nrows; i++)
    rows[i]=malloc(sizeof(char *)*ncols); // one char* per cell
  rows[0][0] = "Maria";
  rows[2][5] = "Susi";
  printf("%s %s\n",rows[0][0],rows[2][5]);
  return 0;
}

Edit: It is not common in csv data to append a tab char at the end of a line, so you shouldn't do that if you want to stay compatible.
« Last Edit: April 27, 2014, 04:24:50 PM by czerny »

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #7 on: April 27, 2014, 04:43:04 PM »
Thanks, czerny. Interesting, I had never seen a char *** ;-)

BTW, csv means comma-separated values, which is a different animal (with escapes and quotes and all that mess).

In my assembly version, I use a different strategy: The string array is one-dimensional under the hood, but if the user specifies a column, the algo returns the partial string between the n-th and the (n+1)th tab character. Superfast for loading, sufficiently fast for getting single cells.
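
The strategy described above (keep whole lines, find a cell only when it is asked for) might look like this in C. It is a rough sketch, not the assembly algorithm itself: get_cell is a hypothetical name, and each line is assumed to be a separate NUL-terminated string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds cell number col (0-based) in one tab-delimited
   line. Returns a pointer to the first character of the cell and stores its
   length in *len; returns NULL if the line has fewer cells.
   The line itself is not modified. */
const char *get_cell(const char *line, int col, size_t *len)
{
    const char *start = line;
    while (col-- > 0) {
        start = strchr(start, '\t');     /* jump to the next tab */
        if (!start) return NULL;
        start++;                         /* skip the tab itself */
    }
    *len = strcspn(start, "\t\r\n");     /* cell ends at tab, line end or NUL */
    return start;
}
```

Since the cell is not NUL-terminated, it can be printed with printf("%.*s", (int)len, cell). Loading stays cheap because only the line starts have to be recorded; the price is a short scan on every cell access.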

So here is the final version, with active support from czerny - thanxalot.

The #columns is fixed for now, one might add a commandline option, or check the first 100 lines or so for the max #columns.
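
Checking the first lines for the maximum number of columns could be done with a short pre-scan. A minimal sketch under the assumption that lines end in '\n'; max_columns and its parameters are illustrative names, not part of the version below:

```c
#include <stddef.h>

/* Hypothetical helper: scans at most max_lines lines of buf (len bytes)
   and returns the largest number of cells found in any of them. */
int max_columns(const char *buf, size_t len, int max_lines)
{
    int best = 0, cols = 1, lines = 0, pending = 0;
    for (size_t i = 0; i < len && lines < max_lines; i++) {
        if (buf[i] == '\n') {            /* line complete: keep the maximum */
            if (cols > best) best = cols;
            cols = 1;
            pending = 0;
            lines++;
        } else {
            pending = 1;                 /* current line has content */
            if (buf[i] == '\t') cols++;
        }
    }
    if (pending && cols > best)          /* last line without a newline */
        best = cols;
    return best;
}
```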

Code: [Select]
#include <stdio.h>
#include <stdlib.h> // for malloc()
#include <string.h> // for strstr()
#include <windows.h>
#include <conio.h> // for _getch()

#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216)    // retval never used
// #pragma warn(disable:2007)    // assembly not portable
// #pragma warn(disable:2118)    // para not referenced

#define cols 6  // #columns known at compile time
// Name FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Hollyword


int main(int argc, char* argv[]) {
  char *fname=(argc>1) ? argv[1] : ""; // argv[0] is the program itself; the file name is the first argument
  if (!strstr(fname, ".tab")) fname="Database.tab"; // default file for testing
  FILE *fp = fopen(fname,  "r");
  fseek(fp, 0, SEEK_END); // go to end
  long len=ftell(fp); // get position at end (length)
  fseek(fp, 0, SEEK_SET); // back to start
  char *psRight=malloc(len); // malloc buffer
  char *psLeft=psRight; // get a copy pointer to the content
  fread(psRight, len, 1, fp); // read file into buffer
  fclose(fp);
  char ***rows=malloc(len/cols/4+100); // rough estimate of required #rows
  int row=0, col, i, j;
  byte c=99; // some value different from zero

  while (c) {
    rows[row]=malloc(cols*sizeof(char *)); // reserve memory for one row of pointers
    // would like to preset matrix[r,c] to a nullstring but this sets bytes only...
    // memset(rows[row], 0, 1);
    col=0;
    while (c && c!=10) {
      c=psRight[0];
      if (c<=10) { // tab or linefeed
        // put the address of a string into the matrix of pointers
        rows[row][col]=psLeft;
        col++;
        // replace \t or \n with \0
        psRight[0]=0;
        psLeft=psRight+1;
      }
      psRight++; // tab is one byte
    }
    while (col<cols) {
      rows[row][col]=""; // care for empty cells
      col++;
    }
    psRight++; // CrLf is 2 bytes
    row++;
    c=psRight[0];
  }
  for (i=0; i<row; i++) {
    printf("\n");
    for (j=0; j<cols; j++) {
      printf("%s\t", rows[i][j]);
    }
  }
}
« Last Edit: April 27, 2014, 05:31:31 PM by jj2007 »

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #8 on: April 27, 2014, 05:50:47 PM »
Code: [Select]
  char ***rows=malloc(len/cols/4+100); // rough estimate of required #rows
Why not count the '\n' chars in your buffer? If you count n of them, you have n+1 or n rows, depending on the structure of the last line.
You should free the (rows+1) arrays afterwards!
You can use calloc() to initialize your arrays.
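
The three suggestions above (count the newlines, initialise with calloc(), free everything afterwards) can be sketched like this. The names count_rows, alloc_table and free_table are invented for the example, and the sketch assumes at least one column:

```c
#include <stdlib.h>
#include <string.h>

/* One row per '\n', plus one if the last line has no trailing newline. */
size_t count_rows(const char *buf, size_t len)
{
    size_t rows = 0;
    const char *p = buf, *end = buf + len;
    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        rows++;
        p++;                             /* continue after the newline */
    }
    if (len > 0 && buf[len - 1] != '\n')
        rows++;
    return rows;
}

/* Zero-initialised table: nrows row pointers, each row with ncols cells. */
char ***alloc_table(size_t nrows, size_t ncols)
{
    char ***rows = calloc(nrows ? nrows : 1, sizeof *rows);
    if (!rows) return NULL;
    for (size_t i = 0; i < nrows; i++) {
        rows[i] = calloc(ncols, sizeof *rows[i]);
        if (!rows[i]) {                  /* undo the partial allocation */
            while (i > 0) free(rows[--i]);
            free(rows);
            return NULL;
        }
    }
    return rows;
}

/* Frees every row array and then the row table itself. */
void free_table(char ***rows, size_t nrows)
{
    if (!rows) return;
    for (size_t i = 0; i < nrows; i++)
        free(rows[i]);
    free(rows);
}
```

The extra pass over the buffer is the cost of knowing the exact row count up front; whether that matters for a given file size is something only a measurement can tell.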
« Last Edit: April 27, 2014, 06:17:35 PM by czerny »

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #9 on: April 27, 2014, 10:02:48 PM »
Quote from czerny: "Why not count your '\n' chars in your buffer? ... You can use calloc() to initialize your arrays."

It's a performance issue: The scanning is very slow anyway (at least compared to hand-crafted assembler), but doing it twice would slow down the load considerably, especially if the file is big enough to affect the cache.

Re calloc, yes I could do that but all elements of the matrix will be filled anyway, so zeroing is just a waste of time.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #10 on: April 28, 2014, 09:55:55 AM »
Nice exercise, my spoon in porridge.
Code: [Select]
#include <io.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>

#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216 2118) // retval never used, para not referenced

int main(int argc, char *argv[])
{
//#define COLS 6
#define ROWS 10
int fh = _open("Database.tsv", _O_BINARY | _O_RDONLY);
// FirstName    FamilyName  Age Profession  Street  City     
// Bill Watson  55  lawyer  Main Street, 12 London   
// John Doe 33  coder   Small lane, 22  Edinburgh   
// Will Smith   44  actor   Catwalk Holyword

long len = _filelength(fh);
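char *buffer = (char *)malloc(len + 1); // malloc buffer, plus one byte for a terminating zero
_read(fh, buffer, len); // read file into buffer
buffer[len] = 0; // the scan loops below rely on a zero byte at the end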
_close(fh);
char ***prows = NULL; // = (char ***)malloc(ROWS * sizeof(char *));
//memset(prows, 0, 4 * sizeof(char *));
int row = 0, col = 0, pos = 0, posold, cols, rows;
unsigned char c = 10; // to alloc cols
posold = 0;
cols = 0;
rows = 0;
do { // count cols from first line
c = buffer[pos++];
if (c == 9 || c == 10)
cols++;
} while (c && c!= 10);
pos = 0;
while (c)
{
if (row >= rows) { // row table full: grow it by another ROWS entries
rows += ROWS;
prows = (char ***)realloc(prows, rows * sizeof(char **));
}
prows[row] = (char **)malloc(cols * sizeof(char *)); // reserve memory for one row of pointers
col = 0;
do {
do {
c = buffer[pos];
if (c <= 10)
break;
if (c == 13)
buffer[pos] = 0; // CR off
pos++;
} while (c > 10);
if (!c)
break;
if (c <= 10)
{ // tab or linefeed
prows[row][col] = &buffer[posold];
buffer[pos++] = 0; // advance after zero
posold = pos;
col++;
}
} while (c < 10);
while (col < cols) // missing cols
prows[row][col++] = 0;
if (c == 10)
row++;
}
rows = row; // total count of rows
for (row = 0; row < rows; row++)
{
printf("\n%s '%s'", prows[0][0], prows[row][0]);
printf("\t%s '%s'", prows[0][1], prows[row][1]);
printf("\t%s '%s'", prows[0][2], prows[row][2]);
}
printf("\n");
return 0;
}
« Last Edit: April 28, 2014, 11:30:20 AM by TimoVJL »
May the source be with you

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #11 on: April 28, 2014, 04:13:58 PM »
Ok, my two cents! ;D
Should work with dos, unix and mac linebreaks, but not with trailing tabs.
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
  #define ncols 6

long i, k, start, cols=0, rows=0;
char *buffer, **a;
FILE *fp = fopen("Database.tab",  "r");

fseek(fp, 0, SEEK_END);
long len=ftell(fp);
fseek(fp, 0, SEEK_SET);
buffer = malloc(len+1);
  len=fread(buffer, 1, len, fp); // element size 1, so the return value is the number of bytes read
buffer[len] = '\n';
  fclose(fp);

for (i=0; i<len; i++)
switch (buffer[i]) {
case '\r': ;
case '\n': rows++;
if (++cols != ncols)
printf("%ld columns in line %ld!\n", cols, rows);
cols=-1;
case '\t': cols++;
buffer[i] = '\0';
}

a = malloc(rows*ncols*sizeof(char *));
k = 0;
start = 1;

for (i=0; i<len; i++) {
if (start) a[k++] = &buffer[i];
start = ('\0' == buffer[i]);
}

for (i=0; i<rows; i++) {
for (k=0; k<ncols; k++)
printf("%s ",a[i*ncols+k]);
puts("");
}

return 0;
}
« Last Edit: April 28, 2014, 05:00:45 PM by czerny »
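
For files that may carry DOS, Unix or old Mac line breaks, one option is to normalise the buffer before splitting it. This is a sketch of that alternative, not what the code above does; normalize_newlines is a made-up name:

```c
#include <stddef.h>

/* Hypothetical helper: rewrites "\r\n" and lone '\r' as '\n' in place.
   Returns the new length, which is never larger than the old one. */
size_t normalize_newlines(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n')
                i++;                     /* swallow the LF of a CRLF pair */
            c = '\n';
        }
        buf[out++] = c;
    }
    return out;
}
```

After the call, the caller should terminate the buffer again at the returned length, since the text may have become shorter.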

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #12 on: April 28, 2014, 05:11:36 PM »
Hi czerny & Timo,

Nice to see that you are having fun, thanks ;-)

I am still chasing a bug, and desperately trying to "port" it to Visual Studio, but soon you'll get my "final" version, too.

Offline DMac

  • Member
  • *
  • Posts: 272
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #13 on: April 28, 2014, 05:46:18 PM »
Here's something I have used for that type of thing:
Code: [Select]
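#include <stdlib.h>
#include <string.h>

char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
{
    int Found;
    int Length;
    int DelimLen;
    char* Remain;
    char* Position;

    // validate the arguments before any of them is dereferenced;
    // an empty delimiter would make the loops below run forever
    if ((List == NULL) || (Input == NULL) || (Delim == NULL) || (*Delim == '\0'))
    {
        *TokenCount=-1;
        return NULL;
    }

    DelimLen = strlen(Delim);
    Found = 0;
    Remain = Input;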

    //first pass -- count number of delimiters
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Found++;
        Remain = Position + DelimLen;
    }

    Found++; // increment one more time for last data chunk

    //create array based on number of delimiters
    *List = (char **)malloc((Found+1) * sizeof(char *));

    Found = 0;
    Remain = Input;

    //second pass -- populate array
    while ((Position = strstr(Remain, Delim)) != NULL)
    {
        Length = Position - Remain;
        (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
        strncpy((*List)[Found], Remain, Length);
        (*List)[Found++][Length] = 0;
        Remain = Position + DelimLen;
    }

    Length = strlen(Remain);
    (*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
    strncpy((*List)[Found], Remain, Length);
    (*List)[Found++][Length] = 0;
    (*List)[Found] = NULL;

    *TokenCount = Found;

    return *List;
} /* Split() */

/* Destroys the array of strings structure returned by Split() */
void FreeSplitList(char **List) {
    int Count;

    Count = 0;
    while(List[Count] != NULL)
        free(List[Count++]);
    free(List);

} /* FreeSplitList() */

Use it like so:
Code: [Select]
char **outArray = NULL;
int count;

char **rtn = Split(bufferOfTabDeliminatedText, "\t", &outArray, &count);

if(NULL != rtn)
{
     char *lineOfText;
     for(int i = 0; i < count; ++i)
    {
          lineOfText = outArray[i];

          //do something useful with line of text
     }
     FreeSplitList(outArray);
}
No one cares how much you know,
until they know how much you care.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #14 on: April 29, 2014, 03:32:23 AM »
Thanks a lot to everybody - I will try to put together a testbed for timing the algos.

In the meantime, I have resolved the mystery of my bug, i.e. I found out why on Visual Studio Express I got hundreds of extra lines.

It's hilarious, and Microsoft specific - Pelles C is not affected. Consider something really harmless like this:

  fp = fopen(fname,  "r");
..
  fread(buffer, len, 1, fp);   // read file into buffer

Guess what it does? Yeah, it reads the content of fname into the buffer. But not only that:

1. Since there are intelligent people in Redmond, they convert CrLf to Lf only - WOW!

2. Since Microsoft employs only true geniuses, they do the conversion in-place, i.e. they use the original buffer for doing so. Isn't that SUPER CLEVER??

There is a minor inconvenience, though: when scanning the buffer, you won't find a zero byte at the end of the converted content, so the buffer=calloc(len+4, 1); is pretty useless ;-)

Oh Redmond 8)
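
A way to sidestep the problem described above is to open the file in binary mode and to terminate the buffer at the byte count that fread() actually returns, rather than at the length reported by ftell(). A minimal sketch; read_whole_file is a hypothetical helper, not code from any of the versions in this thread:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file into a freshly allocated,
   NUL-terminated buffer. Returns NULL on failure; the caller frees the
   result. *out_len receives the number of bytes actually stored. */
char *read_whole_file(const char *fname, size_t *out_len)
{
    FILE *fp = fopen(fname, "rb");       /* binary mode: no CR/LF translation */
    if (!fp) return NULL;

    char *buf = NULL;
    if (fseek(fp, 0, SEEK_END) == 0) {
        long len = ftell(fp);
        if (len >= 0 && fseek(fp, 0, SEEK_SET) == 0
            && (buf = malloc((size_t)len + 1)) != NULL) {
            /* element size 1, so the result is a byte count */
            size_t got = fread(buf, 1, (size_t)len, fp);
            buf[got] = '\0';             /* terminate after what was really read */
            if (out_len) *out_len = got;
        }
    }
    fclose(fp);
    return buf;
}
```

With the file opened in text mode ("r") the same termination trick still works, because the returned count reflects the translated content; the CR characters then simply never reach the buffer. In binary mode they do, so the scanner has to skip or strip them itself.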