Author Topic: Searching a string for a single character  (Read 1018 times)

Offline John Z

  • Member
  • *
  • Posts: 796
Searching a string for a single character
« on: January 14, 2023, 04:49:35 PM »
So... depending on what preceded my need to search a string for a specific character, sometimes I used strchr and sometimes strstr. This is not an error, of course; one can certainly search for a single character as a string, "B" for example. After noticing my inconsistency I decided to time searching with each method. Timing with ticks is always tricky because of other system activity, but if you repeat it enough times you get a sense of which procedure is faster.
Probably not a surprise to anyone: strchr has a slight edge over strstr, maybe less than 3% faster.

Then I figured, why not test wcschr and wcsstr while I was at it? Since I was using "B", I just used L"B". Again wcschr was faster by a small margin, maybe < 2%. Now here comes the reason for the post -

I decided to try wcsstr vs wcschr with a character outside the first 256 characters. Here is where it gets very interesting.
Searching for 号 showed that wcschr was still a bit faster than wcsstr looking for the character. BUT wait, there is more:
The wcschr results for 号 appeared to be far, far faster than searching with wcschr for a wide B char.

So another test: just wcschr for 号 vs. for B. In repeated testing, finding 号 was about 200x faster than finding B !!
Here is the basic code. Of course you won't have my Write_To_Log procedure, but it is enough if you want to try this yourself. Yes, it ran 3 times at 9 million searches each time, and I also checked that it was actually found at the end...
Of course if you ran it just once you'd never notice the difference.....

Code:

 long long loup;
 int oloup;
 wchar_t tmp[2000], *p_tmp = tmp;      /* 称 repeated, then 号 */
 wchar_t tmp3[2000], *p_tmp3 = tmp3;   /* A repeated, then B */
 wchar_t *p_found = NULL;

 /* 1001 filler characters, then the target, then the terminator */
 for (loup = 0; loup < 1001; loup++)
    { p_tmp[loup] = L'称'; }
 p_tmp[1001] = L'号'; p_tmp[1002] = 0;

 for (loup = 0; loup < 1001; loup++)
    { p_tmp3[loup] = L'A'; }
 p_tmp3[1001] = L'B'; p_tmp3[1002] = 0;   /* terminator for tmp3 */

 for (oloup = 0; oloup < 3; oloup++)
 {
    /* time 9 million searches for the wide B */
    p_found = NULL;
    Write_To_Log("C://temp//trace.log","Start wcschr uni B",TRUE,0);
    for (loup = 0; loup < 9000000; loup++)
      { p_found = wcschr(p_tmp3, L'B'); }
    Write_To_Log("C://temp//trace.log","End wcschr uni B",TRUE,0);
    if (p_found == NULL)
      { Write_To_Log("C://temp//trace.log","wcschr uni not found",FALSE,0); }

    /* time 9 million searches for 号 */
    p_found = NULL;
    Write_To_Log("C://temp//trace.log","Start wcschr uni 号",TRUE,0);
    for (loup = 0; loup < 9000000; loup++)
      { p_found = wcschr(p_tmp, L'号'); }
    Write_To_Log("C://temp//trace.log","End wcschr uni 号",TRUE,0);
    if (p_found == NULL)
      { Write_To_Log("C://temp//trace.log","wcschr uni not found",FALSE,0); }
 }

Why is wcschr for 号 so much faster than wcschr for B ??
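
One thing the NULL check above does not show is where in the buffer the match landed. wcschr returns the first occurrence, so if the filler and target literals somehow ended up with the same value after compilation (pure speculation here, for instance if the compiler decoded the source file differently than intended), the search would stop at index 0 and look very fast. A minimal sketch of checking the match offset follows; match_offset and build_haystack are hypothetical helpers, not part of the original test.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Returns the index of the first occurrence of ch in s, or -1 if absent.
   Hypothetical helper, not part of the original test code. */
static long match_offset(const wchar_t *s, wchar_t ch)
{
    assert(s != NULL);
    const wchar_t *hit = wcschr(s, ch);
    return hit ? (long)(hit - s) : -1L;
}

/* Fills buf with `count` copies of `fill`, then `last`, then a terminator.
   buf must have room for count + 2 elements. */
static void build_haystack(wchar_t *buf, size_t count, wchar_t fill, wchar_t last)
{
    assert(buf != NULL);
    for (size_t i = 0; i < count; i++)
        buf[i] = fill;
    buf[count] = last;
    buf[count + 1] = L'\0';
}
```

With the buffers from the post, match_offset(p_tmp3, L'B') and match_offset(p_tmp, L'号') should both give 1001 if each search really walks the whole string. Logging (unsigned)L'称' and (unsigned)L'号' next to the offsets would show which values the compiler actually produced for the two literals.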

ANYWAY - maybe this will stir up the forum  ;D

Happy New Year,
John Z

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2096
Re: Searching a string for a single character
« Reply #1 on: January 21, 2023, 12:05:26 PM »
I imagine that, because the Latin 'B' character has many representations in Unicode, the search must compare against different encodings, while the Chinese character has no representation other than the Chinese encoding.
It is better to be hated for what you are than to be loved for what you are not. - Andre Gide

Offline John Z

  • Member
  • *
  • Posts: 796
Re: Searching a string for a single character
« Reply #2 on: January 22, 2023, 11:06:06 AM »
I imagine that, because the Latin 'B' character has many representations in Unicode, the search must compare against different encodings, while the Chinese character has no representation other than the Chinese encoding.

This is a very good hypothesis; I like it. So for the next test I should ensure that both are encoded the same way, for example as UTF-16.
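
One way to take the source file's encoding out of the picture is to write the wide literals as universal character names (C99 \uXXXX escapes), so the compiler never has to decode the CJK characters from the file itself. A minimal sketch, assuming a C99 compiler; the constant names and check_literals are made up for illustration.

```c
#include <assert.h>
#include <wchar.h>

/* Code points written as universal character names, so their values do not
   depend on how the compiler decodes the source file.
   U+79F0 is 称 and U+53F7 is 号. Names are illustrative only. */
static const wchar_t FILL_CJK     = L'\u79F0';
static const wchar_t TARGET_CJK   = L'\u53F7';
static const wchar_t FILL_LATIN   = L'A';
static const wchar_t TARGET_LATIN = L'B';

/* Sanity check: filler and target must differ, otherwise a search for the
   target would stop at the very first filler character. */
static void check_literals(void)
{
    assert(FILL_CJK != TARGET_CJK);
    assert(FILL_LATIN != TARGET_LATIN);
}
```

These constants can stand in for L'称', L'号', L'A' and L'B' in the test loop, whatever encoding the .c file is saved in.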

Thanks frankie!

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2096
Re: Searching a string for a single character
« Reply #3 on: January 22, 2023, 11:17:21 AM »
There is also 'collation' and other nice tricky things that come into play when using Unicode (e.g. see this for collation).
« Last Edit: January 23, 2023, 10:44:54 PM by frankie »

Offline John Z

  • Member
  • *
  • Posts: 796
Re: Searching a string for a single character
« Reply #4 on: January 22, 2023, 01:31:33 PM »
Great call frankie!

I added a UTF-16LE page and repeated the testing 8 times. With UTF-16LE, and all characters in the same code group,
there was for all practical purposes no difference in search speed. The "B" case 'won' 5 out of 8 times while '号' won 3,
and the differences were tiny in all cases.
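
For anyone who wants to repeat the comparison without the Write_To_Log procedure, here is a minimal sketch of a timing helper built on the standard clock() function. time_wcschr is a hypothetical name, and clock() measures CPU time on some systems and elapsed time on others, so treat the numbers as rough.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <wchar.h>

/* Runs wcschr(s, ch) `reps` times and returns the seconds reported by clock().
   The last result is stored through `out` so the caller can verify the match.
   Hypothetical helper, not the original logging-based timing. */
static double time_wcschr(const wchar_t *s, wchar_t ch, long reps,
                          const wchar_t **out)
{
    assert(s != NULL && out != NULL);

    /* Reading the pointer through a volatile object on every pass is meant
       to discourage the optimizer from hoisting the call out of the loop. */
    const wchar_t *volatile src = s;
    const wchar_t *found = NULL;

    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        found = wcschr(src, ch);
    clock_t end = clock();

    *out = found;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling it once per buffer with 9000000 repetitions, and repeating the pair several times, mirrors the test described in this thread. Checking that *out points at index 1001 confirms each search ran the full length of the string.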

An interesting diversion.. :)
John Z