How to find string in utf-8 charset?

bitcoin · October 23, 2019, 02:09:55 PM

I want to find strings in files. The code is:
...read file data, skip it..
lpFdata = MapViewOfFile..
checkLen = 32

Next, I try to detect the charset:

if (IsTextUnicode(lpFdata, checkLen, 0))
{
	wprintf(L"File %s is Unicode\n", lpFileName);
	if (StrStrW((PCWSTR)lpFdata, L"Фаерфокс"))
	{
		wprintf(L"Matched unicode\n");
	}

}
else
{
	wprintf(L"File %s is ascii\n", lpFileName);
	if (StrStrA(lpFdata, "Фаерфокс"))
	{
		wprintf(L"Matched ascii\n");
	}

	if (_mbsstr(lpFdata, "Фаерфокс"))
	{
		wprintf(L"Matched utf-8\n");
	}
}

It works very well for Unicode (UTF-16) files and for English letters, but it does not work with Russian text when the encoding is not UTF-16. Why? How can I search for UTF-8 strings in C or with the WinAPI?

frankie · October 23, 2019, 06:01:32 PM

The wide-string functions work only with wide characters, which are 16-bit units in a variant of UTF encoding (known as UTF-16). They don't work with the 8-bit encoding (known as UTF-8).
In any case, wide chars and UTF-16 aren't exactly the same; in some limited cases they are inconsistent, and this can lead to function failures.
To compare UTF-8 strings you can use the standard ASCII (byte-string) functions. The only problem you may run into is that the same character can be encoded in more than one way (for example, a precomposed letter versus a base letter plus a combining mark), so the bytes differ.
Otherwise you need a library for UTF strings.
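
Since the data in the question comes from MapViewOfFile, it is not guaranteed to end with a NUL byte, so a length-bounded byte search may be safer than strstr. The following is only a minimal sketch: find_bytes and needle_utf8 are hypothetical names, and the needle is written as explicit UTF-8 bytes so that it does not depend on how the source file is saved.

```c
#include <stddef.h>
#include <string.h>

/* "Фаерфокс" spelled out as UTF-8 bytes (assumption: the file to search
   is UTF-8 encoded). */
static const char needle_utf8[] =
    "\xD0\xA4\xD0\xB0\xD0\xB5\xD1\x80\xD1\x84\xD0\xBE\xD0\xBA\xD1\x81";

/* Hypothetical helper: find the first occurrence of needle (nlen bytes)
   in hay (hlen bytes). Returns a pointer into hay, or NULL if absent.
   Neither buffer has to be NUL-terminated. */
static const char *find_bytes(const char *hay, size_t hlen,
                              const char *needle, size_t nlen)
{
    if (nlen == 0) return hay;
    if (hlen < nlen) return NULL;
    const char *end = hay + (hlen - nlen);   /* last possible start position */
    for (const char *p = hay; p <= end; ++p) {
        /* jump to the next byte equal to the first needle byte */
        p = memchr(p, needle[0], (size_t)(end - p) + 1);
        if (p == NULL) return NULL;
        if (memcmp(p, needle, nlen) == 0) return p;
    }
    return NULL;
}
```

With a mapped view, a call could look like find_bytes(lpFdata, fileSize, needle_utf8, sizeof needle_utf8 - 1), where fileSize is whatever length was obtained when the file was opened.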

bitcoin · October 23, 2019, 07:15:59 PM

Quote from: frankie on October 23, 2019, 06:01:32 PM
To compare UTF-8 strings you can use the standard ASCII (byte-string) functions. The only problem you may run into is that the same character can be encoded in more than one way (for example, a precomposed letter versus a base letter plus a combining mark), so the bytes differ.
Otherwise you need a library for UTF strings.

The ASCII functions don't work with Russian characters :(
Can you recommend a UTF-8 library? In pure C / MASM.

frankie · October 24, 2019, 09:33:56 AM

Quote from: bitcoin on October 23, 2019, 07:15:59 PM
The ASCII functions don't work with Russian characters
Can you recommend a UTF-8 library? In pure C / MASM.

Are you sure that the strings you're comparing are really UTF-8 coded?
I'll look for a library and let you know.
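
One way to check that question before searching is to test whether the buffer is well-formed UTF-8 at all. This is a rough sketch, not a complete detector: is_valid_utf8 and has_utf8_bom are hypothetical helpers, and a file in a single-byte code page could in rare cases still pass the check by coincidence.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8, else 0.
   Rejects overlong forms, surrogates (U+D800..U+DFFF) and values above
   U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n;            /* number of continuation bytes expected */
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed for this length */

        if (b < 0x80)                { n = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < n) return 0;           /* sequence is truncated */
        for (size_t k = 1; k <= n; ++k) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return 0;    /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        i += n + 1;
    }
    return 1;
}

/* Returns 1 if the buffer starts with the UTF-8 byte order mark EF BB BF. */
static int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```

For example, the bytes D0 A4 D0 B0 ("Фа" in UTF-8) pass, while D4 E0 (the same two letters in Windows-1251) fail, because E0 is not a valid continuation byte.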

bitcoin · October 25, 2019, 08:34:08 AM

Quote from: frankie on October 24, 2019, 09:33:56 AM
Are you sure that the strings you're comparing are really UTF-8 coded?

See the attached file.

It contains ASCII strings (which I can find) and the string "Фаерфокс", which I can't find with any method.

I cannot convert the encoding of the file, because that is very slow.

TimoVJL · October 25, 2019, 09:59:09 AM

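#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* "Фаерфокс" as raw UTF-8 bytes, independent of the source file encoding */
char scyr[] = {0xD0,0xA4,0xD0,0xB0,0xD0,0xB5,0xD1,0x80,0xD1,0x84,0xD0,0xBE,0xD0,0xBA,0xD1,0x81,0};
int __cdecl main(void)
{
	puts(scyr);
	FILE *fp = fopen("ajax.txt", "rb");	/* binary mode: no newline translation */
	if (fp) {
		fseek(fp, 0, SEEK_END);
		long len = ftell(fp);
		fseek(fp, 0, SEEK_SET);
		/* +1 for the terminating NUL that strstr needs */
		char *pbuf = (len >= 0) ? malloc((size_t)len + 1) : NULL;
		if (pbuf) {
			size_t got = fread(pbuf, 1, (size_t)len, fp);
			pbuf[got] = 0;
			/* the u8 literal is only correct if this source file is saved as UTF-8 */
			char *p = strstr(pbuf, u8"Фаерфокс");
			char *p1 = strstr(pbuf, scyr);
			if (p) puts(p);
			if (p1) puts(p1);
			free(pbuf);
		}
		fclose(fp);
	}
	return 0;
}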

EDIT: this version prints a warning if the source file is saved in ANSI format

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* scyr1: raw UTF-8 bytes; scyr2: depends on the encoding of this source file */
char scyr1[] = {0xD0,0xA4,0xD0,0xB0,0xD0,0xB5,0xD1,0x80,0xD1,0x84,0xD0,0xBE,0xD0,0xBA,0xD1,0x81,0};
char scyr2[] = u8"Фаерфокс";

int __cdecl main(void)
{
	puts(scyr1);
	puts(scyr2);
	if (strcmp(scyr1, scyr2)) puts("strings are not the same, source in ANSI format?");
	FILE *fp = fopen("ajax.txt", "rb");	/* binary mode: no newline translation */
	if (fp) {
		fseek(fp, 0, SEEK_END);
		long len = ftell(fp);
		fseek(fp, 0, SEEK_SET);
		/* +1 for the terminating NUL that strstr needs */
		char *pbuf = (len >= 0) ? malloc((size_t)len + 1) : NULL;
		if (pbuf) {
			size_t got = fread(pbuf, 1, (size_t)len, fp);
			pbuf[got] = 0;
			char *p1 = strstr(pbuf, scyr1);
			char *p2 = strstr(pbuf, scyr2);
			if (p1) puts(p1);
			if (p2) puts(p2);
			free(pbuf);
		}
		fclose(fp);
	}
	return 0;
}

frankie · October 25, 2019, 01:07:40 PM

I think some clarification is required here.
Unicode is a character set covering the symbols of all existing languages. Because it is so large, a full code point (up to 0x10FFFF) does not fit in 16 bits and is usually held in a 32-bit value.
A Unicode code point can be encoded in different ways, using code units of 8 bits in UTF-8 or 16 bits in UTF-16. A single character can need more than one code unit: UTF-8 uses 1 to 4 bytes, and UTF-16 uses 1 or 2 16-bit words (e.g. the symbol '1', ANSI value 0x31, is encoded in UTF-8 as the single byte 0x31).
Microsoft originally (in DOS) used the code page mechanism for internationalization: a table that remaps the 256 values of a single byte to the symbols of a specific language. This caused many problems with languages such as Chinese, where 256 values are not enough to represent the whole character set.
Microsoft then decided to extend its international coding to a 16-bit word and created the wide-char set, which coincides with Unicode in the ranges 0x0001 to 0xD7FF and 0xE000 to 0xFFFF but differs in the range 0xD800 to 0xDFFF: the wide-char functions treat these as ordinary 16-bit values, while in Unicode they are reserved for encoding high and low surrogates.
The MS console still uses code pages for I/O.
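
To make the difference between the two encodings concrete, here is a small sketch that encodes one code point both ways. The names encode_utf8 and encode_utf16 are hypothetical helpers written for this illustration, not functions from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out.
   Returns the number of bytes written (1..4), or 0 if cp is not encodable. */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates are reserved */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode code point cp as UTF-16 into out.
   Returns the number of 16-bit units written (1 or 2), or 0 on error. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates are reserved */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
        return 2;
    }
    return 0;
}
```

For U+0424 ('Ф') this gives the two bytes D0 A4 in UTF-8 and the single unit 0x0424 in UTF-16, which is why a search for one form never matches the other.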

Now, since you cannot compare apples and pears: if you compare file contents that really are UTF-8 encoded (possibly with a BOM) against a user-supplied string in code page format, the comparison of non-ASCII text will always fail.
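
Instead of converting the whole file, it is usually enough to convert the short search string once. The sketch below assumes the needle is in Windows-1251 (the thread does not say which code page is in use) and handles only ASCII plus the letters 0xC0..0xFF, which map linearly to U+0410..U+044F. The name cp1251_to_utf8 is a hypothetical helper; a real program would more likely use a conversion API or a library such as libiconv.

```c
#include <stddef.h>

/* Hypothetical helper: convert a Windows-1251 string (ASCII plus the letters
   0xC0..0xFF, i.e. U+0410..U+044F) to NUL-terminated UTF-8.
   Returns the number of bytes written, not counting the NUL, or 0 if a byte
   is outside that subset or the output buffer is too small. */
static size_t cp1251_to_utf8(const unsigned char *in, size_t inlen,
                             unsigned char *out, size_t outcap)
{
    size_t o = 0;
    for (size_t i = 0; i < inlen; ++i) {
        unsigned char b = in[i];
        if (b < 0x80) {
            if (o + 1 >= outcap) return 0;       /* keep room for the NUL */
            out[o++] = b;
        } else if (b >= 0xC0) {
            unsigned cp = 0x0410u + (b - 0xC0u); /* linear part of the table */
            if (o + 2 >= outcap) return 0;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            return 0;   /* 0x80..0xBF are not covered by this sketch */
        }
    }
    if (o >= outcap) return 0;
    out[o] = 0;
    return o;
}
```

Feeding it the eight Windows-1251 bytes D4 E0 E5 F0 F4 EE EA F1 ("Фаерфокс") yields the same 16 UTF-8 bytes as the scyr array in Timo's example, and that result can then be searched for with the ordinary byte-string functions.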

Timo showed you that the standard comparison functions work perfectly with UTF-8.

EDIT: Maybe you need libiconv

jj2007 · October 28, 2019, 09:07:05 AM

Quote from: bitcoin on October 23, 2019, 07:15:59 PM
In pure C / Masm.

Using a Masm library:

include \masm32\MasmBasic\MasmBasic.inc ; download
Init
if 1
Let esi=FileRead$("ajax.txt") ; read the file into a string
else
Let esi=FileRead$(CL$()) ; activate in case you want the filename via the commandline
endif
.if Instr_(esi, "Фаерфокс") ; find the string
Inkey eax
.else
Inkey "Not found: Фаерфокс"
.endif
EndOfCode

Output:
Фаерфокс
Cookie: PHPSESSID=ata9d3d5hr4f7urdeu923ogit7

Note that your editor or IDE must support UTF-8; otherwise neither C nor Asm will produce useful output.

bitcoin · October 31, 2019, 11:41:07 PM

Thank you all, it works.
