Author Topic: Reading a tab-delimited text file into a two-dimensional array  (Read 46803 times)

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #30 on: May 01, 2014, 02:25:26 PM »
jj2007: change
Code: [Select]
  if (col>maxcol) maxcol=col+1;
to
Code: [Select]
  if (col>=maxcol) maxcol=col+1;
You have to free buffer!
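
A minimal sketch of why the `>=` matters, assuming `col` is the zero-based index of the current cell and `maxcol` is a column count. The loader discussed here is not reproduced in this part of the thread, so `count_columns` is a hypothetical stand-in, not the original code.

```c
#include <stddef.h>

/* Hypothetical helper: returns the largest number of tab-separated
   cells found in any line of a NUL-terminated buffer. */
int count_columns(const char *buf)
{
    int maxcol = 0;   /* a count: columns seen so far          */
    int col = 0;      /* zero-based index of the current cell  */

    for (const char *p = buf; *p != '\0'; p++) {
        /* With '>' instead of '>=', a row with cells 0..2 would
           leave maxcol at 2, one short of the real count of 3. */
        if (col >= maxcol)
            maxcol = col + 1;

        if (*p == '\t')
            col++;        /* next cell in the same row */
        else if (*p == '\n')
            col = 0;      /* next row starts at cell 0 */
    }
    return maxcol;
}
```

Called on "a\tb\tc\n", this returns 3. Whatever buffer the real loader allocates for the file still has to be released with free() once the cells have been copied or are no longer needed.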

Not sure what your 3+ cols should be?

If you really have more cols, how will you test? Check all your 3*rows for NULL? And what then?
What to do if you have 4+ cols?
« Last Edit: May 01, 2014, 02:42:50 PM by czerny »

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #31 on: May 01, 2014, 02:38:04 PM »
Loading 43123 rows took 0.1241 seconds with LoadTabFileJJ
..
my test program posix and pointers:

32-bit
68 ms
43138 rows
68 vs 124 is very good indeed, but have a look at the difference in row counts. The database has some rows with isolated LFs, a frequent case you can sometimes see in MS Excel.

What do you mean by posix?

@czerny:
if (col>=maxcol) maxcol=col+1;

Can be done but the maxcol=maxcol+3; has the same effect. The maxcol loop at the beginning checks 50...1000 lines (depending on #columns), and while it has no measurable influence on the timings, it should detect the maximum number of columns.  Further down is
if (col>maxcol) {
but that should be triggered only if you have really "malformed" sources.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #32 on: May 01, 2014, 07:26:34 PM »
68 vs 124 is very good indeed, but have a look at the difference in row counts. The database has some rows with isolated LFs, a frequent case you can sometimes see in MS Excel.
Why ??? In that example file there are 3 tables, and the first table has 40357 rows.
What do you mean by posix?
I was using the functions open, close, read and filelength (http://en.wikipedia.org/wiki/POSIX)
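
The test program itself is not shown in the thread, so the following is only a sketch of reading a whole file with the POSIX calls named above. It uses `fstat` to get the size instead of `filelength`, which is a C runtime extension and may not be available on every POSIX system. The name `read_whole_file` is hypothetical.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads a whole file into a malloc'ed,
   NUL-terminated buffer. Returns NULL on failure.
   The caller must free() the result. */
char *read_whole_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < 0) {
        close(fd);
        return NULL;
    }

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);   /* +1 for the terminating NUL */
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    /* read() may return fewer bytes than requested, so loop */
    size_t got = 0;
    while (got < size) {
        ssize_t n = read(fd, buf + got, size - got);
        if (n <= 0)
            break;                  /* error or unexpected end of file */
        got += (size_t)n;
    }
    close(fd);

    buf[got] = '\0';
    if (len_out != NULL)
        *len_out = got;
    return buf;
}
```

On Windows C runtimes the file would probably have to be opened in binary mode (for example with an O_BINARY flag) so that line endings are not translated and the byte count matches the file size.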
May the source be with you

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #33 on: May 01, 2014, 09:15:17 PM »
In that example file there are 3 tables, and the first table has 40357 rows.

Open in Excel and check row 41243 (Eurostat) and 43024 (prepaid) to see linefeeds.
Last line is 43123 when opened in Excel. BTW I didn't cook up this file, it comes directly from the UN.

EDIT: I attach a sample file with linefeeds. This kind of problem comes from pasting cells with linefeeds from MS Word tables into single Excel cells.
« Last Edit: May 01, 2014, 10:19:26 PM by jj2007 »

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #34 on: May 02, 2014, 12:30:17 AM »
How accurate are your time measurements?

I see a big variability here, about 30%.

What are you using? QueryPerformanceCounter? _rdtsc? Other?
« Last Edit: May 02, 2014, 01:16:14 AM by czerny »

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #35 on: May 02, 2014, 01:15:48 AM »
I see a big variability here, about 30%.

For me, typically within a few % with MbTimer(). Under the hood it's QPC.
Don't forget the first load is always way off, but once the file is in the disk cache, timings should be very consistent. Which timer are you using?

Here is an example using a for loop:
Loading 43123 rows took 0.0259 seconds with Recall
Loading 43123 rows took 0.0274 seconds with Recall
Loading 43123 rows took 0.0268 seconds with Recall
Loading 43123 rows took 0.0269 seconds with Recall
Loading 43123 rows took 0.0263 seconds with Recall
Loading 43123 rows took 0.0264 seconds with Recall
Loading 43123 rows took 0.0259 seconds with Recall
Loading 43123 rows took 0.0261 seconds with Recall
Loading 43123 rows took 0.0257 seconds with Recall
Loading 43123 rows took 0.0258 seconds with Recall
Loading 43123 rows took 0.0252 seconds with Recall
Loading 43123 rows took 0.0261 seconds with Recall
Loading 43123 rows took 0.0249 seconds with Recall
Loading 43123 rows took 0.0247 seconds with Recall
Loading 43123 rows took 0.0246 seconds with Recall
Loading 43123 rows took 0.0246 seconds with Recall
Loading 43123 rows took 0.0242 seconds with Recall
Loading 43123 rows took 0.0243 seconds with Recall
Loading 43123 rows took 0.0239 seconds with Recall
Loading 43123 rows took 0.0239 seconds with Recall
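
A minimal sketch of how a series of timings like the one above could be summarized while leaving out the first, cold-cache run. `TimingStats` and `summarize_timings` are hypothetical names, and the timings are assumed to be collected elsewhere (for example with a QPC-based timer).

```c
#include <stddef.h>

typedef struct {
    double min;    /* fastest run, in seconds */
    double max;    /* slowest run, in seconds */
    double mean;   /* average of the runs     */
} TimingStats;

/* Hypothetical helper: summarizes n timings, ignoring the first
   `skip` runs (e.g. skip = 1 to drop the load that fills the disk
   cache). Returns 0 on success, -1 if no samples remain. */
int summarize_timings(const double *secs, size_t n, size_t skip,
                      TimingStats *out)
{
    if (secs == NULL || out == NULL || skip >= n)
        return -1;

    double min = secs[skip];
    double max = secs[skip];
    double sum = 0.0;

    for (size_t i = skip; i < n; i++) {
        if (secs[i] < min) min = secs[i];
        if (secs[i] > max) max = secs[i];
        sum += secs[i];
    }

    out->min = min;
    out->max = max;
    out->mean = sum / (double)(n - skip);
    return 0;
}
```

The relative spread, (max - min) / min, is then a single number that can be compared between machines or timers.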
« Last Edit: May 02, 2014, 01:20:38 AM by jj2007 »

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #36 on: May 02, 2014, 09:04:16 AM »
Confused ???

LibreOffice 4.2, 43121 lines.

First table 1 - 40357
Second 40359 - 43112
Third 43114 - 43121

The Finnish version of Excel 2010 can't handle that .csv or that .tab :(

Excel 2003 list separator: here

I know that the UN file has 43123 rows.

EDIT:
There is a trick for using a .csv file in a non-US version of Excel: here
Just insert "sep=," as the first line, and Excel 2010 uses that separator.
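
A small sketch of automating that trick: copy an existing .csv to a new file with the "sep=" line in front. `prepend_sep_hint` is a hypothetical helper, and ending the hint line with a plain LF is an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: copies the CSV at src_path to dst_path and
   puts "sep=<sep>" in front as the new first line.
   Returns 0 on success, -1 on error. */
int prepend_sep_hint(const char *src_path, const char *dst_path, char sep)
{
    FILE *in = fopen(src_path, "rb");
    if (in == NULL)
        return -1;

    FILE *out = fopen(dst_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    /* the separator hint line described in the post above */
    int ok = fprintf(out, "sep=%c\n", sep) > 0;

    /* copy the original file unchanged, chunk by chunk */
    char chunk[4096];
    size_t n;
    while (ok && (n = fread(chunk, 1, sizeof chunk, in)) > 0)
        ok = (fwrite(chunk, 1, n, out) == n);

    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

For a comma-separated file the call would be prepend_sep_hint("in.csv", "out.csv", ','); both file names are placeholders.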
« Last Edit: May 02, 2014, 12:34:58 PM by TimoVJL »
May the source be with you

Offline Robert

  • Member
  • *
  • Posts: 247
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #37 on: May 02, 2014, 09:51:17 AM »
Confused ???

LibreOffice 4.2, 43121 lines.

First table 1 - 40357
Second 40359 - 43112
Third 43114 - 43121

Excel can't handle that csv without changing local settings (list separator) :(

Using the original U.N. database

Microsoft Excel 2003, 43123 lines

First table 1 - 40357 including header
Second 40359 - 43114 including header
Third 43116 - 43123 including header

There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
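
One plausible reason for two loaders reporting different row counts is whether a linefeed inside double quotes ends a row. A minimal sketch of quote-aware counting, assuming quotes inside a cell are escaped by doubling them; `count_logical_rows` is a hypothetical helper and not taken from any loader in this thread.

```c
#include <stddef.h>

/* Hypothetical helper: counts logical rows in a buffer, treating
   linefeeds between double quotes as part of a cell. */
size_t count_logical_rows(const char *buf, size_t len)
{
    size_t rows = 0;
    int in_quotes = 0;      /* inside a quoted cell?                  */
    int row_has_data = 0;   /* anything seen since the last row end?  */

    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c == '"') {
            /* a doubled "" toggles twice, so the state is unchanged */
            in_quotes = !in_quotes;
            row_has_data = 1;
        } else if (c == '\n' && !in_quotes) {
            rows++;             /* a real end of row */
            row_has_data = 0;
        } else {
            row_has_data = 1;   /* includes linefeeds inside quotes */
        }
    }
    if (row_has_data)
        rows++;                 /* last row without a trailing linefeed */
    return rows;
}
```

A loader that splits on every LF would count the buffer "a\tb\n\"x\ny\"\tc\n" as three rows, while this sketch counts two.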

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #38 on: May 02, 2014, 09:55:11 AM »
I see a big variability here, about 30%.

For me, typically within a few % with MbTimer(). Under the hood it's QPC.
Which timer are you using?
I have compared both QueryPerformanceCounter and _rdtsc. Both show very high variability, and they differ from each other.

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #39 on: May 02, 2014, 09:58:09 AM »
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
But JJ2007's test database is a tab delimited file!

JJ2007: Did you do the conversion from csv to tab?

Offline Robert

  • Member
  • *
  • Posts: 247
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #40 on: May 02, 2014, 10:31:09 AM »
There are several instances of embedded linefeed characters in the second table. Excel handles them as expected because they are in between double quotes.
But JJ2007's test database is a tab delimited file!

JJ2007: Did you do the conversion from csv to tab?

Using JJ2007's database.tab file in Excel 2003 displays the same as the original U.N. file.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #41 on: May 02, 2014, 12:01:09 PM »
I have compared both QueryPerformanceCounter and _rdtsc. Both show very high variability, and they differ from each other.

Strange. With rdtsc, it could be an issue with core switching, but QPC should be immune against that. What do you get with this loop (exe attached)?

  for (int i=0; i<30; i++) {
   MbTimer();
   ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);
   secs=MbTimer()/1e6;
   printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
  }

P.S.: Yes, I did the csv to tab conversion, but as Robert (thanks!) wrote, they are identical - just tabs instead of commas.

czerny

  • Guest
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #42 on: May 02, 2014, 01:11:34 PM »
I have compared both QueryPerformanceCounter and _rdtsc. Both show very high variability, and they differ from each other.

Strange. With rdtsc, it could be an issue with core switching, but QPC should be immune against that. What do you get with this loop (exe attached)?

  for (int i=0; i<30; i++) {
   MbTimer();
   ArrJJ=LoadTabFileJJ(fname, &totalRowsJJ, &maxcol);
   secs=MbTimer()/1e6;
   printf("Loading %i rows took %.4f seconds with LoadTabFileJJ\n", totalRowsJJ, secs);
  }
I do not have the code available at the moment. I will do some more tests this evening.
P.S.: Yes, I did the csv to tab conversion, but as Robert (thanks!) wrote, they are identical - just tabs instead of commas.
What about the quotes? It looks as if you have deleted them sometimes but not always.

Offline jj2007

  • Member
  • *
  • Posts: 536
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #43 on: May 02, 2014, 01:55:15 PM »
I do not have the code available at the moment. I will do some more tests this evening.

The MbTimer aka QPC version is attached above.

Quote
What about the quotes? It looks as if you have deleted them sometimes but not always.

Yes, I saw that also, thanks for reminding me. The conversion is bad for "Test","There are, commas", "here". I will fix that asap, but it doesn't influence the row count.
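
A minimal sketch of a csv-to-tab conversion that keeps commas inside quoted cells, assuming quotes inside a cell are escaped by doubling them. `csv_to_tab` is a hypothetical helper, not the converter that produced the test file.

```c
#include <stddef.h>

/* Hypothetical helper: converts one CSV record to tab-delimited text.
   Commas inside double quotes are kept, the enclosing quotes are
   dropped, and a doubled "" inside a quoted cell becomes a single ".
   Returns the output length, or (size_t)-1 if dst is too small. */
size_t csv_to_tab(const char *src, char *dst, size_t dst_size)
{
    size_t o = 0;
    int in_quotes = 0;

    for (const char *p = src; *p != '\0'; p++) {
        char c = *p;
        if (in_quotes) {
            if (c == '"') {
                if (p[1] == '"') {
                    p++;              /* escaped quote: emit one " */
                } else {
                    in_quotes = 0;    /* closing quote: emit nothing */
                    continue;
                }
            }
        } else if (c == '"') {
            in_quotes = 1;            /* opening quote: emit nothing */
            continue;
        } else if (c == ',') {
            c = '\t';                 /* cell separator outside quotes */
        }

        if (o + 1 >= dst_size)        /* keep room for the final NUL */
            return (size_t)-1;
        dst[o++] = c;
    }

    if (o >= dst_size)
        return (size_t)-1;
    dst[o] = '\0';
    return o;
}
```

For the line "Test","There are, commas", "here" this yields three cells, with the comma after "are" preserved. Dropping the quotes is only safe when no cell contains a tab or a linefeed; cells with embedded linefeeds would have to stay quoted.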

The database file is tricky - see e.g. footnotes 152 and 663, no solution for those! - but on the other hand it's a good test case because it's a real example of a malformed spreadsheet coming from an official source of statistics.
« Last Edit: May 02, 2014, 02:13:18 PM by jj2007 »

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2115
Re: Reading a tab-delimited text file into a two-dimensional array
« Reply #44 on: May 02, 2014, 02:11:38 PM »
Timer functions in C.
A lib of these is useful for console programs.
Code: [Select]
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

// Returns the current performance-counter value in ticks.
LONGLONG __cdecl StartTimer(void)
{
    LARGE_INTEGER t1;
    QueryPerformanceCounter(&t1);
    return t1.QuadPart;
}

// Returns the time elapsed since t1 (a value from StartTimer) in milliseconds.
LONGLONG __cdecl StopTimer(LONGLONG t1)
{
    LARGE_INTEGER t2;        // ticks at stop
    LARGE_INTEGER frequency; // ticks per second
    QueryPerformanceCounter(&t2);
    QueryPerformanceFrequency(&frequency);
    // convert the tick difference to whole milliseconds
    return (t2.QuadPart - t1) * 1000 / frequency.QuadPart;
}
Declarations for a console program:
Code: [Select]
long long __cdecl StartTimer(void);
long long __cdecl StopTimer(long long t1);
usage:
Code: [Select]
long long llStart = StartTimer();
// do something or nothing
long long llTime = StopTimer(llStart);
printf("time: %d ms\n", (int)llTime);
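
StopTimer returns whole milliseconds, which is coarse for loads that take only a few tens of milliseconds. A minimal sketch of converting a tick difference to seconds as a double instead; `ticks_to_seconds` is a hypothetical helper, and the tick and frequency values are assumed to come from QueryPerformanceCounter and QueryPerformanceFrequency.

```c
/* Hypothetical helper: converts a tick difference to seconds.
   start and stop are counter readings, frequency is ticks per second.
   Returns 0.0 if the frequency is not positive. */
double ticks_to_seconds(long long start, long long stop, long long frequency)
{
    if (frequency <= 0)
        return 0.0;
    return (double)(stop - start) / (double)frequency;
}
```

With a frequency of 10000000 ticks per second, a difference of 2500 ticks is 0.00025 seconds, which the integer millisecond formula (2500 * 1000 / 10000000) rounds down to 0.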
« Last Edit: May 02, 2014, 02:47:49 PM by TimoVJL »
May the source be with you