
Author Topic: How to find string in utf-8 charset?  (Read 3921 times)

Offline bitcoin

  • Member
  • *
  • Posts: 179
How to find string in utf-8 charset?
« on: October 23, 2019, 02:09:55 PM »
I want to find strings in files. The code is:
... read file data (skipped) ...
lpFdata  = MapViewOfFile..
checkLen = 32

Next, I try to detect the charset:
Code:
if (IsTextUnicode(lpFdata, checkLen, 0))
{
    wprintf(L"File %s is Unicode\n", lpFileName);
    if (StrStrW((PCWSTR)lpFdata, L"Фаерфокс"))
    {
        wprintf(L"Matched unicode\n");
    }
}
else
{
    wprintf(L"File %s is ascii\n", lpFileName);
    if (StrStrA(lpFdata, "Фаерфокс"))
    {
        wprintf(L"Matched asci\n");
    }

    if (_mbsstr(lpFdata, "Фаерфокс"))
    {
        wprintf(L"Matched utf-8\n");
    }
}

It works very well for UTF-16 files and for English letters. But it does not work with Russian text if the file encoding is NOT UTF-16. Why? How do I search for UTF-8 strings in C or with the WinAPI?

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2114
Re: How to find string in utf-8 charset?
« Reply #1 on: October 23, 2019, 06:01:32 PM »
The wide-string functions work only with wide characters, which on Windows are 16-bit UTF-16 code units. They don't work with the 8-bit encoding, UTF-8.
In any case, wide chars and UTF-16 aren't exactly the same thing; in a few limited cases they are inconsistent, and this can make functions fail.
To compare UTF-8 strings you can use the standard byte-string ("ascii") functions such as strstr, as long as the string you search for is itself UTF-8 encoded. The only problem you may run into is that the same character can be encoded in more than one way (yes, the same symbol can be encoded differently, e.g. precomposed versus base letter plus combining mark).
Or you need to use some library for UTF strings.
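A minimal sketch of the byte-wise approach, assuming the file really is UTF-8 and is accessed through a mapped view that has no terminating NUL (so strstr cannot be used on it safely). find_bytes and needle_utf8 are hypothetical names, not CRT or WinAPI functions. The needle is written with hex escapes so that it does not depend on how the source file is saved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a CRT or WinAPI function): find the first
   occurrence of needle[0..nlen) inside hay[0..hlen).
   It compares raw bytes and takes explicit lengths, so the buffer does
   not need a terminating NUL. Returns NULL when there is no match. */
static const unsigned char *find_bytes(const void *hay, size_t hlen,
                                       const void *needle, size_t nlen)
{
    const unsigned char *h = hay;
    const unsigned char *n = needle;

    assert(hay != NULL && needle != NULL);
    if (nlen == 0) return h;          /* empty needle matches at the start */
    if (nlen > hlen) return NULL;     /* needle cannot fit */

    for (size_t i = 0; i + nlen <= hlen; i++) {
        /* cheap first-byte test before the full comparison */
        if (h[i] == n[0] && memcmp(h + i, n, nlen) == 0)
            return h + i;
    }
    return NULL;
}

/* "Фаерфокс" spelled out as UTF-8 bytes (16 bytes plus the NUL). */
static const char needle_utf8[] =
    "\xD0\xA4\xD0\xB0\xD0\xB5\xD1\x80\xD1\x84\xD0\xBE\xD0\xBA\xD1\x81";
```

With the variables from the first post this might be called as find_bytes(lpFdata, fileSize, needle_utf8, sizeof needle_utf8 - 1), where fileSize is an assumed name for the length of the mapped view (for example obtained from GetFileSize).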
"It is better to be hated for what you are than to be loved for what you are not." - Andre Gide

Offline bitcoin

« Reply #2 on: October 23, 2019, 07:15:59 PM »
Quote from: frankie
"To compare UTF-8 strings you can use standard ascii functions. The only problem that you can experience is due to different encoding of the same character (yes you can encode differently the same symbol).
Or you need to use some library for UTF strings."

The ascii functions don't work with Russian symbols :(
Can you recommend a UTF-8 library? In pure C / Masm.

Offline frankie

« Reply #3 on: October 24, 2019, 09:33:56 AM »
Quote from: bitcoin
"Ascii functions don't works with russian symbols
can you tell me some utf-8 library? In pure C / Masm."

Are you sure that the strings you're comparing are really UTF-8 encoded?
I'll look for a library and let you know.
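One way to answer that question in code is to test whether a buffer is well-formed UTF-8. The following is only a sketch: is_valid_utf8 is a hypothetical helper, and the test is a heuristic, because pure ASCII text also passes. Russian text in a single-byte code page such as Windows-1251 almost never passes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
   0 otherwise. Rejects truncated sequences, overlong forms, surrogates
   (U+D800..U+DFFF) and values above U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t n;           /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */

        if (c < 0x80) { i++; continue; }              /* ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return 0;      /* stray continuation byte or invalid lead */

        if (n > len - i - 1) return 0;                /* truncated */
        for (size_t k = 1; k <= n; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;        /* not 10xxxxxx */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000))
            return 0;                                 /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        if (cp > 0x10FFFF) return 0;                  /* out of range */
        i += n + 1;
    }
    return 1;
}
```

A check like this could run on the first part of the mapped file after IsTextUnicode has returned FALSE, to decide between "UTF-8" and "some code page".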

Offline bitcoin

« Reply #4 on: October 25, 2019, 08:34:08 AM »
Quote from: frankie
"Are you sure that the strings you're comparing are really UTF-8 coded?"

See the attached file.

It has ASCII strings (which I can find) and the string "Фаерфокс", which I can't find with any method.

I cannot convert the encoding of the file, because that is very slow.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2144
« Reply #5 on: October 25, 2019, 09:59:09 AM »
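Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* "Фаерфокс" as explicit UTF-8 bytes, NUL-terminated */
char scyr[] = {0xD0,0xA4,0xD0,0xB0,0xD0,0xB5,0xD1,0x80,0xD1,0x84,0xD0,0xBE,0xD0,0xBA,0xD1,0x81,0};
int __cdecl main(void)
{
    puts(scyr);
    FILE *fp = fopen("ajax.txt", "rb");  /* binary mode, so ftell() is the byte count */
    if (fp) {
        fseek(fp, 0, SEEK_END);
        long len = ftell(fp);
        fseek(fp, 0, SEEK_SET);
        char *pbuf = (len >= 0) ? malloc(len + 1) : NULL;  /* +1 for the NUL strstr() needs */
        if (pbuf) {
            size_t got = fread(pbuf, 1, len, fp);
            pbuf[got] = 0;
            char *p = strstr(pbuf, u8"Фаерфокс");  /* matches only if this source is saved as UTF-8 */
            char *p1 = strstr(pbuf, scyr);
            if (p) puts(p);
            if (p1) puts(p1);
            free(pbuf);
        }
        fclose(fp);
    }
    return 0;
}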
EDIT: this version prints a warning if the source file was saved in ANSI format:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* the same text twice: as explicit UTF-8 bytes and as a u8 literal */
char scyr1[] = {0xD0,0xA4,0xD0,0xB0,0xD0,0xB5,0xD1,0x80,0xD1,0x84,0xD0,0xBE,0xD0,0xBA,0xD1,0x81,0};
char scyr2[] = u8"Фаерфокс";

int __cdecl main(void)
{
    puts(scyr1);
    puts(scyr2);
    /* the two differ when the compiler read the literal from a non-UTF-8 source */
    if (strcmp(scyr1, scyr2)) puts("strings are not same, source in ANSI format ?");
    FILE *fp = fopen("ajax.txt", "rb");  /* binary mode, so ftell() is the byte count */
    if (fp) {
        fseek(fp, 0, SEEK_END);
        long len = ftell(fp);
        fseek(fp, 0, SEEK_SET);
        char *pbuf = (len >= 0) ? malloc(len + 1) : NULL;  /* +1 for the NUL strstr() needs */
        if (pbuf) {
            size_t got = fread(pbuf, 1, len, fp);
            pbuf[got] = 0;
            char *p1 = strstr(pbuf, scyr1);
            char *p2 = strstr(pbuf, scyr2);
            if (p1) puts(p1);
            if (p2) puts(p2);
            free(pbuf);
        }
        fclose(fp);
    }
    return 0;
}
« Last Edit: October 28, 2019, 11:16:41 AM by TimoVJL »
May the source be with you

Offline frankie

« Reply #6 on: October 25, 2019, 01:07:40 PM »
I think that some clarifications are required here.
Unicode is a code set covering the symbols of all existing languages. Because it is so wide, its code points range up to 0x10FFFF (21 bits), so in practice a fixed-width representation of the whole code set needs 32 bits.
A Unicode code point can be encoded in different ways, using a base unit of 8 bits in UTF-8 or of 16 bits in UTF-16. A single symbol can need more than one base unit: UTF-8 uses from 1 to 4 bytes, UTF-16 uses 1 or 2 16-bit words (e.g. the symbol '1', ANSI value 0x31, is encoded in UTF-8 as the single byte 0x31).
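To make the 1-to-4-byte rule concrete, here is a small sketch of a UTF-8 encoder for a single code point. utf8_encode is a hypothetical name chosen for illustration; it is not taken from any library mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes to out (which must have room for 4) and returns
   the number of bytes written, or 0 if cp is a surrogate
   (U+D800..U+DFFF) or lies above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For example, U+0031 ('1') becomes the single byte 0x31, while U+0424 ('Ф') becomes the two bytes 0xD0 0xA4, which is exactly the start of the byte array in Timo's example.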
Microsoft originally (read: DOS) used the code page mechanism for internationalization: a table that remaps the 256 values of a single byte to the symbols of a specific language. They had a lot of problems with languages such as Chinese, where a single range of 256 values was not enough to represent the whole character set.
Then Microsoft decided to expand its international coding using a 16-bit word and created the wide-char set, which coincides with Unicode in the ranges 0x0000 to 0xD7FF and 0xE000 to 0xFFFF. The range 0xD800 to 0xDFFF is different: in Unicode these values are not characters but are reserved for the high and low surrogates of UTF-16, so code that treats every 16-bit wide char as one complete character handles them wrongly.
The MS console still uses code pages for I/O.

Now, it being clear that you cannot compare apples and pears: if you compare file contents that really are UTF-8 encoded (possibly with a BOM) against a user-entered string that is code page encoded, the result will always be a failure.
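Since converting the whole file is too slow, one option is to convert only the search string, once. The sketch below assumes the code page is Windows-1251 and handles only ASCII plus the letters А..я (bytes 0xC0..0xFF); cp1251_to_utf8 is a hypothetical helper written for this illustration. On Windows the usual general route would be MultiByteToWideChar followed by WideCharToMultiByte with CP_UTF8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated Windows-1251 string to
   UTF-8. ASSUMPTION: only ASCII and the basic Cyrillic letters
   0xC0..0xFF (U+0410..U+044F) are supported; any other byte fails.
   Returns the number of bytes written (excluding the NUL), or 0 on
   failure or when dst is too small. An empty input also yields 0. */
static size_t cp1251_to_utf8(const char *src, char *dst, size_t dstcap)
{
    size_t o = 0;

    assert(src != NULL && dst != NULL);
    if (dstcap == 0) return 0;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (c < 0x80) {                         /* ASCII: copied unchanged */
            if (o + 1 >= dstcap) return 0;
            dst[o++] = (char)c;
        } else if (c >= 0xC0) {                 /* А..я: two UTF-8 bytes */
            unsigned cp = 0x0410u + (c - 0xC0u);
            if (o + 2 >= dstcap) return 0;
            dst[o++] = (char)(0xC0 | (cp >> 6));
            dst[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            return 0;                           /* byte not covered by this sketch */
        }
    }
    dst[o] = '\0';
    return o;
}
```

In Windows-1251 the word "Фаерфокс" is the 8 bytes D4 E0 E5 F0 F4 EE EA F1. After conversion it becomes the 16 bytes D0 A4 D0 B0 D0 B5 D1 80 D1 84 D0 BE D0 BA D1 81, which can then be searched for in the UTF-8 file byte by byte.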

Timo showed you that the standard comparison functions work perfectly with UTF-8.

EDIT: Maybe you need libiconv
« Last Edit: October 25, 2019, 04:50:42 PM by frankie »

Offline jj2007

  • Member
  • *
  • Posts: 536
« Reply #7 on: October 28, 2019, 09:07:05 AM »
Quote from: bitcoin
"In pure C / Masm."

Using a Masm library:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  if 1
        Let esi=FileRead$("ajax.txt")   ; read the file into a string
  else
        Let esi=FileRead$(CL$())        ; activate in case you want the filename via the commandline
  endif
  .if Instr_(esi, "Фаерфокс")   ; find the string
        Inkey eax
  .else
        Inkey "Not found: Фаерфокс"
  .endif
EndOfCode

Output:
Фаерфокс
Cookie: PHPSESSID=ata9d3d5hr4f7urdeu923ogit7


Note that your editor or IDE must support UTF-8 and save the source file as UTF-8; otherwise neither C nor Asm will produce useful output.
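A quick way to check what the editor and compiler actually stored for a literal is to dump its bytes in hex. This is only a diagnostic sketch; hex_dump is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of the NUL-terminated string s
   into out as uppercase hex pairs separated by single spaces.
   Stops early when out (capacity cap) is full.
   Returns the length of the text written, excluding the NUL. */
static size_t hex_dump(const char *s, char *out, size_t cap)
{
    size_t o = 0;

    assert(s != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    for (; *s; s++) {
        if (o + 4 > cap) break;     /* need room for "XX " plus the NUL */
        o += (size_t)snprintf(out + o, cap - o, "%02X ",
                              (unsigned)(unsigned char)*s);
    }
    if (o > 0) out[--o] = '\0';     /* drop the trailing space */
    return o;
}
```

Passing the literal "Фаерфокс" from your own source file to this function should produce "D0 A4 D0 B0 D0 B5 D1 80 D1 84 D0 BE D0 BA D1 81" when the literal was stored as UTF-8. Eight bytes such as "D4 E0 E5 F0 F4 EE EA F1" would indicate that the source was saved in Windows-1251 instead.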

Offline bitcoin

« Reply #8 on: October 31, 2019, 11:41:07 PM »
Thank you all, it works.