Pelles C forum
C language => Beginner questions => Topic started by: jj2007 on April 27, 2014, 11:51:56 AM
-
I am thoroughly stuck and too tired to RTFM, so here is some code hoping that somebody can teach me the basics... ;-)
The idea is simple:
- load a tab-delimited text file into a buffer
- create an array of (rows, columns) pointers
- scan the buffer with char c=buffer[currentposition]
- if c = \tab, isolate the string (i.e. replace tab with \0) and put its address into matrix[row][col], then col++
- if c = \newline, isolate the string and put its address into matrix[row][0], then row++, col=0;
So far, so simple. The good news: It works, and it's by far the fastest method to read a tab-delimited text into a matrix of pointers.
The bad news: It works only in assembler - in Pelles C, I seem to stumble over very basic problems... can somebody help?
EDIT: Version 3 works. The [r,c] assignment is OK now, but I know there should be a sizeof int in malloc().
One problem was this:
char *tmpc;
tmpc=(char*)buffer;
// mov eax, [ebp-0C]
// mov [ebp-3C], eax
printf("tmp=%X\n", tmpc); // tmp=4101AC, address
tmpc=(char*)buffer[0];
// mov eax, [ebp-0C]
// mov eax, [eax]
// mov [ebp-3C], eax
printf("tmp=%X\n", tmpc); // tmp=73726946, content
#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()
#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216) // retval never used
#pragma warn(disable:2007) // assembly not portable
#pragma warn(disable:2118) // para not referenced
// #pragma warn(disable:2215) // conversion ... loss of data
int main(int argc, char* argv[]) {
#define cols 6
FILE *fp = fopen("Database.tab", "r");
// Name FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Hollyword
fseek(fp, 0, SEEK_END); // go to end
long len=ftell(fp); // get position at end (length)
fseek(fp, 0, SEEK_SET); // back to start
char *psRight=malloc(len); // malloc buffer
char *psLeft=psRight; // get a copy pointer to the content
fread(psRight, len, 1, fp); // read file into buffer
fclose(fp);
int **rows=malloc(len/cols/4+100); // rough estimate of required #rows
int row=0, col, i, j;
byte c=99; // some value different from zero
while (c) {
rows[row]=malloc(cols*4); // reserve memory for one row of pointers
// would like to preset matrix[r,c] to a nullstring but this sets bytes only...
// memset(rows[row], 0, 1);
col=0;
while (c && c!=10) {
c=psRight[0];
if (c<=10) { // tab or linefeed
// put the address of a string into the matrix of pointers
rows[row][col]=(int)psLeft;
col++;
// replace \t or \n with \0
psRight[0]=0;
psLeft=psRight+1;
}
psRight++; // tab is one byte
}
psRight++; // CrLf is 2 bytes
row++;
c=psRight[0];
}
for (i=0; i<row; i++) {
printf("\n");
for (j=0; j<cols; j++) {
printf("%s\t", (char*) rows[i][j]);
}
}
}
Attached code compiles fine and is almost working.
-
Why do you work with an int-buffer?
You should use an char-buffer or an WCHAR-buffer (unsigned short) in case of an unicode file.
-
Right, that was corrected in version 2.
Version 3 (above) fixes also the malloc() bug, so everything works fine now.
Is there a "pointer size" memset?
Thanks for the feedback, czerny.
-
Right, that was corrected in version 2.
You still have int **rows, why int? What's your idea behind int?
-
You still have int **rows, why int? What's your idea behind int?
For the brave assembler programmer, everything is a DWORD ;-)
Besides, if I use char**, it throws a nasty exception. What would be the correct way to do it?
-
Hope this helps:
int main(int argc, char* argv[]) {
#define cols 6
#define rows 3
char **array; // This is a one-dimensional array of char-pointers, but we treat it as a two-dim. array
array=malloc(rows*cols*sizeof(char *));
array[0*cols+0] = "Maria";
array[2*cols+5] = "Susi";
printf("%s %s\n",array[0*cols+0],array[2*cols+5]);
return 0;
}
-
Here a second, but minor compact version:
int main(int argc, char* argv[]) {
#define ncols 6
#define nrows 3
char ***rows; // This is an array of pointers to arrays of char-pointers.
rows=malloc(sizeof(char *)*nrows);
for (int i=0; i<nrows; i++)
rows[i]=malloc(sizeof(char *)*ncols);
rows[0][0] = "Maria";
rows[2][5] = "Susi";
printf("%s %s\n",rows[0][0],rows[2][5]);
return 0;
Edit: It is not common in csv-data to append an tab char at end of line. So you should'nt do that to be more compatible.
-
Thanks, czerny. Interesting, I had never seen a char *** ;-)
BTW, csv means comma-separated values, which is a different animal (with escapes and quotes and all that mess).
In my assembly version (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1172), I use a different strategy: The string array is one-dimensional under the hood, but if the user specifies a column, the algo returns the partial string between the n-th and the (n+1)th tab character. Superfast for loading, sufficiently fast for getting single cells.
So here is the final version, with active support from czerny - thanxalot.
The #columns is fixed for now, one might add a commandline option, or check the first 100 lines or so for the max #columns.
#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()
#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216) // retval never used
// #pragma warn(disable:2007) // assembly not portable
// #pragma warn(disable:2118) // para not referenced
#define cols 6 // #columns known at compile time
// Name FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Hollyword
int main(int argc, char* argv[]) {
char *fname=argv[0];
if (!strstr(fname, ".tab")) fname="Database.tab"; // default file for testing
FILE *fp = fopen(fname, "r");
fseek(fp, 0, SEEK_END); // go to end
long len=ftell(fp); // get position at end (length)
fseek(fp, 0, SEEK_SET); // back to start
char *psRight=malloc(len); // malloc buffer
char *psLeft=psRight; // get a copy pointer to the content
fread(psRight, len, 1, fp); // read file into buffer
fclose(fp);
char ***rows=malloc(len/cols/4+100); // rough estimate of required #rows
int row=0, col, i, j;
byte c=99; // some value different from zero
while (c) {
rows[row]=malloc(cols*sizeof(int)); // reserve memory for one row of pointers
// would like to preset matrix[r,c] to a nullstring but this sets bytes only...
// memset(rows[row], 0, 1);
col=0;
while (c && c!=10) {
c=psRight[0];
if (c<=10) { // tab or linefeed
// put the address of a string into the matrix of pointers
rows[row][col]=psLeft;
col++;
// replace \t or \n with \0
psRight[0]=0;
psLeft=psRight+1;
}
psRight++; // tab is one byte
}
while (col<cols) {
rows[row][col]=""; // care for empty cells
col++;
}
psRight++; // CrLf is 2 bytes
row++;
c=psRight[0];
}
for (i=0; i<row; i++) {
printf("\n");
for (j=0; j<cols; j++) {
printf("%s\t", rows[i][j]);
}
}
}
-
char ***rows=malloc(len/cols/4+100); // rough estimate of required #rows
Why not count your '\n' chars in your buffer? If you count (n), you have (n+1) or (n) rows, depending on the special structer of the last line.
You should free the (rows+1) arrays afterwards!
You can use calloc() to initialize your arrays.
-
Why not count your '\n' chars in your buffer? ... You can use calloc() to initialize your arrays.
It's a performance issue: The scanning is very slow anyway (at least compared to hand-crafted assembler), but doing it twice would slow down the load considerably, especially if the file is big enough to affect the cache.
Re calloc, yes I could do that but all elements of the matrix will be filled anyway, so zeroing is just a waste of time.
-
Nice exercise, my spoon in porridge.
#include <io.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#pragma comment(linker, "-subsystem:console")
#pragma warn(disable:2216 2118) // retval never used, para not referenced
int main(int argc, char *argv[])
{
//#define COLS 6
#define ROWS 10
int fh = _open("Database.tsv", _O_BINARY | _O_RDONLY);
// FirstName FamilyName Age Profession Street City
// Bill Watson 55 lawyer Main Street, 12 London
// John Doe 33 coder Small lane, 22 Edinburgh
// Will Smith 44 actor Catwalk Holyword
long len = _filelength(fh);
char *buffer = (char *)malloc(len); // malloc buffer
_read(fh, buffer, len); // read file into buffer
_close(fh);
char ***prows = NULL; // = (char ***)malloc(ROWS * sizeof(char *));
//memset(prows, 0, 4 * sizeof(char *));
int row = 0, col = 0, pos = 0, posold, cols, rows;
unsigned char c = 10; // to alloc cols
posold = 0;
cols = 0;
rows = 0;
do { // count cols from first line
c = buffer[pos++];
if (c == 9 || c == 10)
cols++;
} while (c && c!= 10);
pos = 0;
while (c)
{
if (!rows || row > rows) {
prows = (char ***)realloc(prows, ROWS * sizeof(char *));
rows += ROWS;
}
prows[row] = (char **)malloc(cols * sizeof(char *)); // reserve memory for one row of pointers
col = 0;
do {
do {
c = buffer[pos];
if (c <= 10)
break;
if (c == 13)
buffer[pos] = 0; // CR off
pos++;
} while (c > 10);
if (!c)
break;
if (c <= 10)
{ // tab or linefeed
prows[row][col] = &buffer[posold];
buffer[pos++] = 0; // advance after zero
posold = pos;
col++;
}
} while (c < 10);
while (col < cols) // missing cols
prows[row][col++] = 0;
if (c == 10)
row++;
}
rows = row; // total count of rows
for (row = 0; row < rows; row++)
{
printf("\n%s '%s'", prows[0][0], prows[row][0]);
printf("\t%s '%s'", prows[0][1], prows[row][1]);
printf("\t%s '%s'", prows[0][2], prows[row][2]);
}
printf("\n");
return 0;
}
-
Ok, my two cents! ;D
Should work with dos, unix and mac linebreaks, but not with trailing tabs.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
#define ncols 6
long i, k, start, cols=0, rows=0;
char *buffer, **a;
FILE *fp = fopen("Database.tab", "r");
fseek(fp, 0, SEEK_END);
long len=ftell(fp);
fseek(fp, 0, SEEK_SET);
buffer = malloc(len+1);
len=fread(buffer, len, 1, fp);
buffer[len] = '\n';
fclose(fp);
for (i=0; i<len; i++)
switch (buffer[i]) {
case '\r': ;
case '\n': rows++;
if (++cols != ncols)
printf("%d columns in line %d!\n", cols, rows);
cols=-1;
case '\t': cols++;
buffer[i] = '\0';
}
a = malloc(rows*ncols*sizeof(char *));
k = 0;
start = 1;
for (i=0; i<len; i++) {
if (start) a[k++] = &buffer[i];
start = ('\0' == buffer[i]);
}
for (i=0; i<rows; i++) {
for (k=0; k<ncols; k++)
printf("%s ",a[i*ncols+k]);
puts("");
}
return 0;
}
-
Hi czerny & Timo,
Nice to see that you are having fun, thanks ;-)
I am still chasing a bug, and desperately trying to "port" it to Visual Studio, but soon you'll get my "final" version, too.
-
Here's something I have used for that type of thing:
char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
{
int Found;
int Length;
int DelimLen;
char* Remain;
char* Position;
DelimLen = strlen(Delim);
Found = 0;
Remain = Input;
if ((List == NULL) || (Input == NULL) || (Delim == NULL))
{
*TokenCount=-1;
return NULL;
}
//first pass -- count number of delimiters
while ((Position = strstr(Remain, Delim)) != NULL)
{
Found++;
Remain = Position + DelimLen;
}
Found++; // increment one more time for last data chunk
//create array based on number of delimiters
*List = (char **)malloc((Found+1) * sizeof(char *));
Found = 0;
Remain = Input;
//second pass -- populate array
while ((Position = strstr(Remain, Delim)) != NULL)
{
Length = Position - Remain;
(*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
strncpy((*List)[Found], Remain, Length);
(*List)[Found++][Length] = 0;
Remain = Position + DelimLen;
}
Length = strlen(Remain);
(*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
strncpy((*List)[Found], Remain, Length);
(*List)[Found++][Length] = 0;
(*List)[Found] = NULL;
*TokenCount = Found;
return *List;
} /* Split() */
/* Destroys the array of strings structure returned by Split() */
void FreeSplitList(char **List) {
int Count;
Count = 0;
while(List[Count] != NULL)
free(List[Count++]);
free(List);
} /* FreeSplitList() */
Use it like so:
char **outArray = NULL;
int count;
char **rtn = Split(bufferOfTabDeliminatedText, "\t", &outArray, &count);
if(NULL != rtn)
{
char *lineOfText;
for(int i = 0; i < count; ++i)
{
lineOfText = outArray[i];
//do something usefull with line of text
}
FreeSplitList(outArray);
}
-
Thanks a lot to everybody - I will try to put together a testbed for timing the algos.
In the meantime, I have resolved the mystery of my bug, i.e. I found out why on Visual Studio Express I got hundred of lines more.
It's hilarious, and Microsoft specific - Pelles C is not affected. Consider something really harmless like this:
fp = fopen(fname, "r");
..
fread(buffer, len, 1, fp); // read file into buffer
Guess what it does? Yeah, it reads the content for fname into the buffer. But not only:
1. Since there are intelligent people in Redmond, they convert CrLf to Lf only - WOW!
2. Since Microsoft employs only true geniuses, they do the conversion in-place, i.e. they use the original buffer for doing so. Isn't that SUPER CLEVER??
There is a minor inconvenience, though: when scanning the buffer, you won't find a zero byte at the end of the converted content, so the buffer=calloc(len+4, 1); is pretty useless ;-)
Oh Redmond 8)
-
http://msdn.microsoft.com/en-us/library/kt0etdcs.aspx
Remarks
The fread function reads up to count items of size bytes from the input stream and stores them in buffer. The file pointer associated with stream (if there is one) is increased by the number of bytes actually read. If the given stream is opened in text mode, carriage return–linefeed pairs are replaced with single linefeed characters. The replacement has no effect on the file pointer or the return value.
-
Yes, that's perfectly correct. It's a feature, not a bug. If they had been in a good mood, or if they had been forced to by a judge, they would have added, " the useless bytes at the end of the buffer are not zeroed out, and nobody will tell you where the new content ends".
Never mind, "rb" works correctly.
-
Use fread return value and put zero to there.
...
size_t len = fread(buf, sizeof(char), sizeof(buf), fp);
buf[len] = 0;
...
-
Use fread return value and put zero to there....
size_t len = fread(buf, sizeof(char), sizeof(buf), fp);
buf[len] = 0;
...
(MSDN): "The replacement has no effect on .. the return value."
Plain wrong, because the return value changes to "bytes stored" instead of "bytes read". Which is why your solution works. But in any case, using "rb" is a much better solution - replacing CrLf with LF costs lots of cycles, especially for larger files where the cache becomes important.
-
If you like to be portable (DOS, Unix, Mac linebreaks), the text (not binary) read is easier to handle in my opinion.
Btw. the _fread in Pelles C behaves like the MS variant. You have to use the return value to find the buffer end.
-
If you like to be portable (DOS, Unix, Mac linebreaks), the text (not binary) read is easier to handle in my opinion.
Not a big problem, actually, and performance-wise the binary read is better.
Btw. the _fread in Pelles C behaves like the MS variant. You have to use the return value to find the buffer end.
In PellesC, the buffer end is zeroed, which is OK.
-
Here is my "final" version. I will try to extend it with your algos in order to get timings. So far only my library algo is in, for comparison.
I've put a testfile here (http://www.webalice.it/jj2006/Cruscotto/pics/Database.zip) - a "real" statistical database from the UN in tabbed format with CrLf and some typical problems, such as isolated linefeeds. It has about 10Mb and loads in about half a second with the C algo.
-
Here's something I have used for that type of thing:
char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
Thanks, DMac. Is there a simple solution for this error?
while ((Position = strstr(Remain, Delim)) != NULL) //error #2168: Operands of '=' have incompatible types 'char *' and 'int'
-
while ((Position = strstr(Remain, Delim)) != NULL) //error #2168: Operands of '=' have incompatible types 'char *' and 'int'
I don't have this error!
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.
-
I don't have this error!
Strange. I am still on XP with version 7, could that be the reason?
In the meantime, I've prepared the source to incorporate other algos. Example:
MbTimer(); // start the timer
ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol); // load the text file into an array
secs=MbTimer()/1e6;
printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
As it stands, it compiles fine also in VS Express 2010. While Pelles C is a better product, I find it important to be able to tell another coder "if you don't have Pelles C, just compile it in VS". People stay away from exotic compilers (and we must admit that it is a bit exotic...) because they believe they are wasting time on incompatible tools.
-
Thanks, DMac. Is there a simple solution for this error?
while ((Position = strstr(Remain, Delim)) != NULL) //error #2168: Operands of '=' have incompatible types 'char *' and 'int'
Did you modify the code of the function, adapting it to your application? If so you want to make sure that the var Position is declared char * the return type of strstr() is char * and that seems to be what the compiler is complaining about.
I just tested my example on Win64 and Win32 and was not able to reproduce this error unless I changed the declaration of the var Position.
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.
I'm not quite sure what you mean by this statement. It splits out the whole buffer in two goes producing the array of strings. However what ever you did with the resulting strings could be considered line by line I suppose.
-
Did you modify the code of the function, adapting it to your application?
I used it without modifications, but now (on my machine at home) I can't reproduce the error. Anyway, here is a complete working example:#include <stdio.h>
#include <windows.h>
#include <conio.h> // for _getch()
#pragma warn(disable:2216) // retval never used
#pragma warn(disable:2007) // assembly not portable
#pragma warn(disable:2118) // para not referenced
// #pragma warn(disable:2215) // conversion ... loss of data
char** Split(char *Input, char *Delim, char ***List, int *TokenCount);
void FreeSplitList(char **List);
int main(int argc, char* argv[]) {
char **outArray = NULL;
int count, lines, cols, len;
char *buffer;
FILE *fp;
fp = fopen("C:\\TEMP\\Database.tab", "rb");
if (!fp) {
printf("\nfile not found");
lines=0;
cols=0;
return 0;
}
fseek(fp, 0, SEEK_END); // go to end
len=ftell(fp); // get position at end (length)
fseek(fp, 0, SEEK_SET); // back to start
buffer=calloc(len+4, 1); // calloc buffer (zeroed, at least two bytes longer)
fread(buffer, 1, len, fp); // read file into buffer
fclose(fp);
char **rtn = Split(buffer, "\t", &outArray, &count);
if(NULL != rtn) {
char *lineOfText;
for(int i = 0; i < count && i<200; ++i)
{
lineOfText = outArray[i];
printf("%s ", lineOfText); // print some lines
//do something usefull with line of text
}
FreeSplitList(outArray);
}
puts("\n--- ok ---");
// _getch(); // if your IDE closes the console window
} // end main
char** Split(char *Input, char *Delim, char ***List, int *TokenCount)
{
int Found;
int Length;
int DelimLen;
char* Remain;
char* Position;
DelimLen = strlen(Delim);
Found = 0;
Remain = Input;
if ((List == NULL) || (Input == NULL) || (Delim == NULL))
{
*TokenCount=-1;
return NULL;
}
//first pass -- count number of delimiters
while ((Position = strstr(Remain, Delim)) != NULL)
{
Found++;
Remain = Position + DelimLen;
}
Found++; // increment one more time for last data chunk
//create array based on number of delimiters
*List = (char **)malloc((Found+1) * sizeof(char *));
Found = 0;
Remain = Input;
//second pass -- populate array
while ((Position = strstr(Remain, Delim)) != NULL)
{
Length = Position - Remain;
(*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
strncpy((*List)[Found], Remain, Length);
(*List)[Found++][Length] = 0;
Remain = Position + DelimLen;
}
Length = strlen(Remain);
(*List)[Found] = (char *)malloc(sizeof(char)*(Length+1));
strncpy((*List)[Found], Remain, Length);
(*List)[Found++][Length] = 0;
(*List)[Found] = NULL;
*TokenCount = Found;
return *List;
} /* Split() */
/* Destroys the array of strings structure returned by Split() */
void FreeSplitList(char **List) {
int Count=0;
while(List[Count] != NULL)
free(List[Count++]);
free(List);
} /* FreeSplitList() */
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.
It does, it does. However, it uses only one delimiter, which is fine for tab but clearly poses a problem with CrLf.
Here is my inner loop as posted above:
while (c>1) {
c=psRight[0];
if (c==9 || c==nl) { // tab or Cr or Lf
pRC[rowOffset+col]=psLeft; // put the address of a string into the RowCol matrix of pointers
col++;
if (col>maxcol) {
printf("Too many columns: %i>%i\n", col, maxcol); // unlikely case
row=0; // flag failure
goto TooManyColumns;
}
psRight[0]=0; // replace \t or \n with \0
if (c==nl) { // Cr or Lf or CrLf
if (psRight[1]==10){
psRight++; // CrLf needs one more
}
c=1; // flag newline
}
psLeft=psRight+1;
}
psRight++; // tab is one byte
}
while (col<maxcol) {
pRC[rowOffset+col]=""; // care for empty cells
col++;
}
row++;
With "real" spreadsheets, the tricky thing is finding the "right" number of columns, and dealing with varying #columns.
-
New version with improved performance. With the default 10 MB test file (available here (http://www.webalice.it/jj2006/Cruscotto/pics/Database.zip)), it takes 0.2 seconds on my trusty Celeron to translate 43,000 rows. The assembler version is still faster, of course, but it would be interesting to see what Pelles C 64-bit can do.
The attachment includes a second source with DMac's code, slightly adjusted to make it compatible with VS.
-
Your program:
Loading 43123 rows took 0.0158 seconds with Recall
Loading 43123 rows took 0.1241 seconds with LoadTabFileJJ
- hit return -
my test program posix and pointers:
32-bit
68 ms
43138 rows
64-bit
101 ms
43138 rows
-
Btw. DMacs code can not deal with the whole buffer, it must be used line by line.
I'm not quite sure what you mean by this statement. It splits out the whole buffer in two goes producing the array of strings. However what ever you did with the resulting strings could be considered line by line I suppose.
You have one delimiter, \t in our case, not two (\t and \n). So the last string per row and the first string of the next row are only one arrayelement delimited bei an \n.
To see this, change one line in jj2007's output loop:
printf("%d %s ", i ,lineOfText); // print some lines
-
jj2007: change
if (col>maxcol) maxcol=col+1;
to
if (col>=maxcol) maxcol=col+1;
You have to free buffer!
Not sure what your 3+ cols should be?
If you really have more cols, how will you test? Check all your 3*rows for NULL? And what then?
What to do if you have 4+ cols?
-
Loading 43123 rows took 0.1241 seconds with LoadTabFileJJ
..
my test program posix and pointers:
32-bit
68 ms
43138 rows
68 vs 124 is very good indeed, but have a look at the difference in row counts. The database has some rows with isolated LFs, a frequent case you can sometimes see in MS Excel.
What do you mean with posix?
@czerny:
if (col>=maxcol) maxcol=col+1;
Can be done but the maxcol=maxcol+3; has the same effect. The maxcol loop at the beginning checks 50...1000 lines (depending on #columns), and while it has no measurable influence on the timings, it should detect the maximum number of columns. Further down is
if (col>maxcol) {
but that should be triggered only if you have really "malformed" sources.
-
68 vs 124 is very good indeed, but have a look at the difference in row counts. The database has some rows with isolated LFs, a frequent case you can sometimes see in MS Excel.
why ???, in that example file there is 3 tables and in first table there is 40357 rows.
What do you mean with posix?
i was using functions open,close,read,filelength (http://en.wikipedia.org/wiki/POSIX)
-
in that example file there is 3 tables and in first table there is 40357 rows.
Open in Excel and check row 41243 (Eurostat) and 43024 (prepaid) to see linefeeds.
Last line is 43123 when opened in Excel. BTW I didn't cook up this file, it comes directly from the UN. (http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv)
EDIT: I attach a sample file with linefeeds. This kind of problem comes from pasting cells with linefeeds from MS Word tables into single Excel cells.
-
How accurate are your time measurements?
I have here big variability, about 30%.
What are you using? QueryPerformanceCounter? _rdtsc? Other?
-
I have here big variability, about 30%.
For me, typically within a few % with MbTimer(). Under the hood it's QPC.
Don't forget the first load is always way off, but once the file is in the disk cache, timings should be very consistent. Which timer are you using?
Here is an example using a for loop:
Loading 43123 rows took 0.0259 seconds with Recall
Loading 43123 rows took 0.0274 seconds with Recall
Loading 43123 rows took 0.0268 seconds with Recall
Loading 43123 rows took 0.0269 seconds with Recall
Loading 43123 rows took 0.0263 seconds with Recall
Loading 43123 rows took 0.0264 seconds with Recall
Loading 43123 rows took 0.0259 seconds with Recall
Loading 43123 rows took 0.0261 seconds with Recall
Loading 43123 rows took 0.0257 seconds with Recall
Loading 43123 rows took 0.0258 seconds with Recall
Loading 43123 rows took 0.0252 seconds with Recall
Loading 43123 rows took 0.0261 seconds with Recall
Loading 43123 rows took 0.0249 seconds with Recall
Loading 43123 rows took 0.0247 seconds with Recall
Loading 43123 rows took 0.0246 seconds with Recall
Loading 43123 rows took 0.0246 seconds with Recall
Loading 43123 rows took 0.0242 seconds with Recall
Loading 43123 rows took 0.0243 seconds with Recall
Loading 43123 rows took 0.0239 seconds with Recall
Loading 43123 rows took 0.0239 seconds with Recall
-
Confused ???
LibreOffice 4.2, 43121 lines.
First table 1 - 40357
Second 40359 - 43112
Third 43114 - 43121
Excel 2010 Finnish version can't handle that .csv nor that .tab:(
Excel 2003 list separator: here (http://office.microsoft.com/en-us/excel-help/import-or-export-text-files-HP001130599.aspx)
I know that UN file have 43123 rows.
EDIT:
There is a trick to how to use .csv file in no-US version of Excel: here (http://superuser.com/questions/606272/how-to-get-excel-to-interpret-the-comma-as-a-default-delimiter-in-csv-files)
just insert "sep=," as first line and Excel 2010 use that separator.
-
Confused ???
LibreOffice 4.2, 43121 lines.
First table 1 - 40357
Second 40359 - 43112
Third 43114 - 43121
Excel can't handle that csv without changing local settings (list separator) :(
Using the original U.N. database
Microsoft Excel 2003, 43123 lines
First table 1 - 40357 including header
Second 40359 - 43114 including header
Third 43116 - 43123 including header
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
-
I have here big variability, about 30%.
For me, typically within a few % with MbTimer(). Under the hood it's QPC.
Which timer are you using?
I have compared both: QueryPerformanceCounter and _rdtsc. They have both a very high variability and they differ from each other.
-
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
But JJ2007's test database is a tab delimited file!
JJ2007: Do you have made the conversation from csv to tab?
-
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
But JJ2007's test database is a tab delimited file!
JJ2007: Do you have made the conversation from csv to tab?
Using JJ2007's database.tab file in Excel 2003 displays the same as the original U.N. file.
-
I have compared both: QueryPerformanceCounter and _rdtsc. They have both a very high variability and they differ from each other.
Strange. With rdtsc, it could be an issue with core switching, but QPC should be immune against that. What do you get with this loop (exe attached)?
for (int i=0; i<30; i++) {
MbTimer();
ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);
secs=MbTimer()/1e6;
printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
}
P.S.: Yes, I did the csv to tab conversion, but as Robert (thanks!) wrote, they are identical - just tabs instead of commas.
-
I have compared both: QueryPerformanceCounter and _rdtsc. They have both a very high variability and they differ from each other.
Strange. With rdtsc, it could be an issue with core switching, but QPC should be immune against that. What do you get with this loop (exe attached)?
for (int i=0; i<30; i++) {
MbTimer();
ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);
secs=MbTimer()/1e6;
printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
}
I do not have the code available in the moment. I will do some more tests at evening.
P.S.: Yes, I did the csv to tab conversion, but as Robert (thanks!) wrote, they are identical - just tabs instead of commas.
What about the quotes? It looks as you have deleted them sometimes but not allways.
-
I do not have the code available in the moment. I will do some more tests at evening.
The MbTimer aka QPC version is attached above.
What about the quotes? It looks as you have deleted them sometimes but not allways.
Yes, I saw that also, thanks for reminding me. The conversion is bad for "Test","There are, commas", "here". I will fix that asap, but it doesn't influence the row count.
The database file is tricky - see e.g. footnotes 152 and 663, no solution for those! - but on the other hand it's a good test case because it's a real example of a malformed spreadsheet coming from an official source of statistics.
-
Timer functions in C.
A lib of these is useful for console programs.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
LONGLONG __cdecl StartTimer(void)
{
LONGLONG t1; // ticks
// start timer
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
return t1;
}
LONGLONG __cdecl StopTimer(LONGLONG t1)
{
LONGLONG t2;
LONGLONG frequency; // ticks per second
// stop timer
QueryPerformanceCounter((LARGE_INTEGER *) & t2);
// get ticks per second
QueryPerformanceFrequency((LARGE_INTEGER *) & frequency);
// compute and print the elapsed time in millisec
LONGLONG elapsedTime = (t2 - t1) * 1000 / frequency;
return (LONGLONG)elapsedTime;
}
definitions for console program:long long __cdecl StartTimer(void);
long long __cdecl StopTimer(long long t1);
usage:long long llStart = StartTimer();
// do something or nothing
long long llTime = StopTimer(llStart);
printf("time: %d ms\n", (int)llTime);
-
StopTimer returns elapsedTime in msecs.
-
OK, got it:
Loading 43123 rows took 0.3145 seconds
Loading 43123 rows took 0.3132 seconds
Loading 43123 rows took 0.3124 seconds
Loading 43123 rows took 0.3103 seconds
Loading 43123 rows took 0.3123 seconds
Loading 43123 rows took 309 milliseconds
Loading 43123 rows took 313 milliseconds
Loading 43123 rows took 310 milliseconds
Loading 43123 rows took 310 milliseconds
Loading 43123 rows took 313 milliseconds
No big variation here...
-
I have for example the following timings:
for (i=0; i<10; i++)
{
LoadTabFileJJ();
Czernys();
LoadTabFileJJ();
}
158.8260519144 ms 180.9129372588 ms 160.1033092195 ms
162.1560840833 ms 181.4037817656 ms 163.1542556386 ms
224.5581491502 ms 205.7775499400 ms 207.7685851135 ms
210.3840013186 ms 227.3836479217 ms 242.3699101422 ms
193.2228816791 ms 222.2067329786 ms 178.7003655493 ms
177.3895844304 ms 342.9254276731 ms 188.9189827199 ms
180.6846959600 ms 280.0210641297 ms 245.3736438570 ms
189.7260685366 ms 221.5546948006 ms 182.0309564484 ms
178.0321241946 ms 213.9358493887 ms 244.7769199717 ms
242.6847546266 ms 228.9170830371 ms 242.0028243813 ms
Hier with Timos code:
160 ms 180 ms 162 ms
252 ms 249 ms 211 ms
207 ms 237 ms 180 ms
181 ms 218 ms 181 ms
245 ms 223 ms 208 ms
207 ms 281 ms 196 ms
183 ms 224 ms 287 ms
236 ms 1373 ms 454 ms
182 ms 1248 ms 486 ms
180 ms 2177 ms 492 ms
The last three lines may be caused by swapping?
Here is Timos and my timer in parallel:
for (i=0; i<10; i++)
{
LoadTabFileJJ();
}
163 ms 163.1665477037 ms
220 ms 220.7501232699 ms
192 ms 192.2875672746 ms
165 ms 165.9534940893 ms
220 ms 220.4213105297 ms
197 ms 197.9391235478 ms
211 ms 211.4291062132 ms
2579 ms 2579.4286704036 ms
2684 ms 2684.3026138797 ms
3112 ms 3112.7481286029 ms
Here both timers a consistent. But what about the variations?
-
Swapping is somewhat unlikely, for the handful of bytes we are dealing with. Here are my timings for the old Celeron notebook with a lousy 2GB of RAM, tested with MbTimer() and Start/StopTimer():
Loading 43123 rows took 0.203 seconds with LoadTabFileJJ
Loading 43123 rows took 0.196 seconds with LoadTabFileJJ
Loading 43123 rows took 0.200 seconds with LoadTabFileJJ
Loading 43123 rows took 0.203 seconds with LoadTabFileJJ
Loading 43123 rows took 0.200 seconds with LoadTabFileJJ
Loading 43123 rows took 196 milliseconds with LoadTabFileJJ
Loading 43123 rows took 197 milliseconds with LoadTabFileJJ
Loading 43123 rows took 197 milliseconds with LoadTabFileJJ
Loading 43123 rows took 200 milliseconds with LoadTabFileJJ
Loading 43123 rows took 198 milliseconds with LoadTabFileJJ
-
Here is the 2-timer test above with an somewhat quicker machine:
131 ms 131.1705817359 ms
143 ms 143.6193706183 ms
130 ms 130.6548737340 ms
131 ms 131.1560547500 ms
147 ms 147.2851996553 ms
131 ms 131.0951531549 ms
130 ms 130.4649054559 ms
130 ms 130.3785816354 ms
130 ms 130.7406388242 ms
131 ms 131.2108103125 ms
-
Hey JJ2007, I got thouse 4 digit measurements not any more after freeing your buffer!
255 ms 255.4034102100 ms
223 ms 223.6789871338 ms
197 ms 197.7851933695 ms
215 ms 215.1318114453 ms
222 ms 222.2874694968 ms
197 ms 197.4552631689 ms
210 ms 210.9949728248 ms
235 ms 235.9663029798 ms
183 ms 183.3612931252 ms
192 ms 192.9044054482 ms
Here 5 timers: Timo, GetTickCount, timeGetTime, _rdtsc, my
590 ms 594 ms 591 ms 1414755126 cycles 590.9614210745 ms
167 ms 172 ms 168 ms 400770772 cycles 167.4084276074 ms
166 ms 156 ms 167 ms 398143312 cycles 166.3121988968 ms
167 ms 156 ms 168 ms 401748552 cycles 167.8168594053 ms
203 ms 203 ms 203 ms 485991200 cycles 203.0087622868 ms
197 ms 188 ms 197 ms 472113756 cycles 197.2110980586 ms
245 ms 235 ms 245 ms 586719924 cycles 245.0811485817 ms
200 ms 203 ms 202 ms 480600504 cycles 200.7559620008 ms
276 ms 281 ms 277 ms 663128552 cycles 276.9974954917 ms
2083 ms 2078 ms 2083 ms 693108120 cycles 2083.9235154189 ms
Look at the last two lines. There is a big difference in ms but a small difference in cycles?
And, JJ2007, there is a 4-digit time again! :-(
-
As i'm hobbyist, i shall share this test project too.
32-bit:
Process: 58 ms
43123 nRows
64-bit:
Process: 71 ms
43123 nRows
Bit slow, but who cares :D
-
Bit slow, but who cares :D
Practically on par with my algo (MbRecall is fast but doesn't count in this context).
-
Hey JJ2007, I got thouse 4 digit measurements not any more after freeing your buffer!
Which buffer?
Look at the last two lines. There is a big difference in ms but a small difference in cycles?
And, JJ2007, there is a 4-digit time again! :-(
Mysterious. Doesn't make any sense 8)
-
Thank's Jochen for that BrainTrainer material.
That database.tab was good learning material for seeing some weakness of Excel and LibreOffice.
I hope that beginners enjoy this journey in those csv/tbs/tab files :)
Thank's to Pelle for PellesC to make this exercise.
Thank's Lara Fabian's CD LARA FABIAN to be good mood when programming :).
When i go to my daily walk into fields, Paulina Rubio and Laura Pausini puts my legs in right beat :D
And i have Patricia Kaas and Middle of The Road (if sun shines Soley Soley) in my MP3 player too 8)
-
Here is my last version: I had a lot of fun with this little exercise, too.
-
Thanks a lot to you, Timo and Czerny and Robert, for your help. I've learnt some new tricks ;-)
And you forced me to update my assembler library, because it didn't treat quotes and commas correctly. It's ok now, although not yet online.
BTW in case you wondered how MbRecall can be 5* as fast: It always reads whole lines and stores only the beginning and len. Then, if you ask for element[5,9], it goes to row 5 and extracts the substring between the 5th and the 6th tab.
@Timo: Soley Soley - reminds me of the good times when I hung around in discos. You might like The Rain, the Park and Other Things (http://www.youtube.com/watch?v=X6uh7zXw-WI) ;-)
P.S.: New test files here (http://www.webalice.it/jj2006/Cruscotto/pics/Database.zip), with:
- original csv file
- tab file converted by Excel 2010
- tab file produced by my own converter
When testing csv or tab readers, pay attention to the treatment of commas and quotes inside cells. I attach a file in *.xls format that has some goodies for testing.
-
Just an update: I have added a "sort by numerical value of column n" function to my Recall() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1172) and QSort() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1175) procedures, see here (http://masm32.com/board/index.php?topic=94.msg55979#msg55979). It's assembler, but if anybody needs a Pelles C version, let me know.