UTF16 reading with fgetwc and fgetws

nsp · December 21, 2016, 08:38:25 AM

I tried to read a Unicode file (UTF-16LE) from a console application, but both fgetws and fgetwc seem to read only one char (one byte) at a time.
Since I want to convert to ANSI using wctob, the result is wrong.
Same code compiled with gcc gives expected result.

Code Select

#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

int main(int argc, char *argv[])
{
	int i;
	for (i=1;i<argc;i++){
		printf("%s\n",argv[i]);
		FILE *o=fopen(argv[i],"r");
		FILE *f;
		if (fgetc(o)!=0xFF){
			return -1;	
		} else {
			f=fopen(argv[i],"rt,ccs=UNICODE");
		}
		fclose(o);
		char towrite;
		wint_t c=fgetwc(f);
		while (c!=WEOF) {
			printf("%c",wctob(c));
			c=fgetwc(f);
		}
		printf("---------> %s\n",strerror(GetLastError()));
		fclose(f);
	}
	return 0;
}

What did I do wrong?

TimoVJL · December 21, 2016, 02:36:13 PM

Even though

Code Select

		if (fwide(f, 0) > 0) printf("f wide\n");
		else printf("f byte\n");

prints "f wide", the stream is still read byte by byte. Something is wrong with the CRT.

frankie · December 21, 2016, 03:12:38 PM

Quote from: nsp on December 21, 2016, 08:38:25 AM
What did I do wrong?

Follow MS.

OK, it's a joke.

Where to begin?

Code Select

f=fopen(argv[i],"rt,ccs=UNICODE");

Pelles C, like standard C, does not include any native support for UNICODE. The ccs=UNICODE mode flag is an MS extension, so don't expect it to work under Pelles C.
ANSI-compliant compilers offer only some support for multibyte or, more technically correct, UTF-8 UNICODE encoding.
A common mistake is to confuse the UNICODE character set with the encodings of that set. The UNICODE set (since 2.0) requires 21 bits to represent every character it includes, which means that the plain, unencoded format uses 32-bit words on commonly available machines, whose word sizes are multiples of 8 bits.
MS, in the excitement over the first version of UNICODE, was too quick to define WCHAR as a 16-bit integer. The result is that a single standard wchar cannot represent all the symbols of the UNICODE 2.0 set. I recommend reading this page to understand the basics.
The standard C runtime function fgetwc reads any UTF-8 encoded UNICODE symbol from an open stream. The routine reads from 1 to 4 bytes, following the encoding, and gives back a symbol expressed as a 16-bit wchar. A symbol that cannot be represented in 16 bits is replaced by a substitute symbol or generates an error, depending on the function you use.
Using fgetwc on a UTF-16LE stream will, in almost all cases, give back one byte at a time: an ASCII value for the first byte of the wchar, and a 0 for the other one. Try to read a UTF-16LE file containing symbols whose values require more than 8 bits and you'll see really strange behavior from your code.
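To make the 16-bit limitation concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units (a surrogate pair) and how the pair is put back together. The names utf16_encode and utf16_decode_pair are made up for this illustration and are not part of any CRT.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out[] and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
	assert(out != 0);
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;                  /* fits in one unit */
		return 1;
	}
	cp -= 0x10000;                              /* 20 bits remain */
	out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
	out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
	return 2;
}

/* Combine a high and a low surrogate back into a code point. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
	assert(hi >= 0xD800 && hi <= 0xDBFF);
	assert(lo >= 0xDC00 && lo <= 0xDFFF);
	return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

For example, U+1F34C does not fit in 16 bits and becomes the two units 0xD83C 0xDF4C, so code that treats every 16-bit value as a complete character will mishandle it.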

How can you handle a UTF-16LE file using an ANSI-compliant compiler?
This way:

Code Select


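#include <stdio.h>
#include <wchar.h>
#include <string.h>
#include <errno.h>

#define BOM 0xFEFF

int main(int argc, char *argv[])
{
	int i;
	for (i = 1; i < argc; i++)
	{
		printf("%s\n", argv[i]);
		FILE *fp = fopen(argv[i], "rb");	//Binary mode: the CRT does no translation
		if (fp == NULL)
		{
			printf("---------> %s\n", strerror(errno));
			continue;
		}

		//One UTF-16 code unit. Reading it with fread assumes a little-endian host.
		unsigned short c = 0;

		//Read BOM
		if (fread(&c, 2, 1, fp) != 1 || c != BOM)
		{
			fclose(fp);
			return -1;
		}

		while (fread(&c, 2, 1, fp) == 1)
		{
			//The following test emulates the filtering of CR in Text files
			if (c == '\r')	//You can avoid this test if you don't care for CR
				continue;

			int b = wctob(c);	//EOF when c has no single-byte equivalent
			putchar(b != EOF ? b : '?');
		}
		//strerror() expects an errno value, not a GetLastError() code
		printf("---------> %s\n", strerror(errno));
		fclose(fp);
	}
	return 0;
}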
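Reading two bytes straight into an integer with fread only gives the right value on a little-endian machine. As a hedged alternative, the sketch below builds each code unit from its two bytes explicitly, so it does not depend on the host byte order. The helper name read_utf16le_unit is hypothetical, and the stream is assumed to have been opened in binary mode ("rb").

```c
#include <assert.h>
#include <stdio.h>

/* Read one UTF-16LE code unit from fp.
   Returns the unit (0..0xFFFF), or -1 on end of file or a truncated unit. */
long read_utf16le_unit(FILE *fp)
{
	assert(fp != NULL);
	int lo = fgetc(fp);          /* little-endian: low byte comes first */
	if (lo == EOF)
		return -1;
	int hi = fgetc(fp);          /* then the high byte */
	if (hi == EOF)
		return -1;
	return ((long)hi << 8) | lo;
}
```

The first unit returned for a file with a byte order mark should be 0xFEFF; after that, the loop can test the result against -1 instead of relying on the return value of fread.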

Your last question:

Quote from: nsp on December 21, 2016, 08:38:25 AM
Same code compiled with gcc gives expected result.

GCC ports for Windows use the MSVC runtime, so they are absolutely compliant with ... MS extensions!

Merry Christmas

nsp · December 21, 2016, 07:40:45 PM

Many thanks Frankie, it is already Christmas!

Great explanation of the MS logic! Reading with fread in binary mode seems to be the only compliant way to do it. Now I can get the code working as expected!

I was just hoping to get this working with Enable Microsoft Extension in the Pelles C project, but it seems that the full MSVC runtime is not "implemented".

Merry Christmas!

frankie · December 22, 2016, 01:56:20 PM

Timo, fwide() seems to work only for the standard streams.
See:

Code Select


#include <stdio.h>
#include <wchar.h>

const char *Orientation(int i)
{
	if (i > 0)
		return "Wide";

	if (i < 0)
		return "Byte";

	return "No Orientation";
}

int main(int argc, char *argv[])
{
	int in, out, err;
	in  = fwide(stdin, 0);
	out = fwide(stdout, 0);
	err = fwide(stderr, 0);
	printf("orientations: stdin=\"%s\", stdout=\"%s\", stderr=\"%s\"\n",
						Orientation(in), Orientation(out), Orientation(err));
	fwprintf(stderr, L"Press return ...");
	fgetwc(stdin);
	in  = fwide(stdin, 0);
	out = fwide(stdout, 0);
	err = fwide(stderr, 0);
	printf("orientations: stdin=\"%s\", stdout=\"%s\", stderr=\"%s\"\n",
						Orientation(in), Orientation(out), Orientation(err));
	return 0;
}

And I'm not even sure it works for stdin, because with a western language and keyboard it is difficult to check that the value returned by getwc() is really a wide char...
The implementation for disk I/O seems broken...

But this is an already known bug.
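For comparison, this is a small sketch of a probe that can be pointed at a disk file instead of the standard streams. According to the C standard, a freshly opened stream has no orientation, and the first wide I/O call makes it wide-oriented. The function name probe_orientation is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Report the orientation of fp before and after one wide read.
   *before and *after receive the fwide(fp, 0) results:
   > 0 wide, < 0 byte, 0 no orientation yet. */
void probe_orientation(FILE *fp, int *before, int *after)
{
	assert(fp != NULL && before != NULL && after != NULL);
	*before = fwide(fp, 0);   /* query only, does not change the stream */
	(void)fgetwc(fp);         /* first wide operation on the stream */
	*after = fwide(fp, 0);
}
```

On a conforming runtime, a stream on which nothing has been done yet should report before == 0 and after > 0. Any other result on a disk file points to the kind of CRT problem discussed here.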

TimoVJL · December 24, 2016, 07:01:27 PM

fgetwc() uses mbtowc().
CRT support for this function is only partial.

Code Select

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <stdlib.h>
#include <wchar.h>
// http://en.cppreference.com/w/c/string/multibyte/mbtowc
// print multibyte string to wide-oriented stdout
// equivalent to wprintf(L"%s\n", ptr);
void print_mb(const char* ptr)
{
    mbtowc(NULL, 0, 0); // reset the conversion state
    const char* end = ptr + strlen(ptr);
    int ret;
    for (wchar_t wc; (ret = mbtowc(&wc, ptr, end-ptr)) > 0; ptr+=ret) {
        wprintf(L"%lc", wc);
    }
    wprintf(L"\n");
}
 
int main(void)
{
    setlocale(LC_ALL, "en_US.utf8");
    // UTF-8 narrow multibyte encoding
    print_mb(u8"z\u00df\u6c34\U0001F34C");
}

So it is better to avoid those mb??? functions when UTF encodings are used.
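If the mb functions cannot be trusted, one option is to decode UTF-8 by hand, which needs no locale at all. The following is only a sketch of that idea; utf8_decode is a hypothetical helper, not a CRT function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most n bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	assert(s != NULL && cp != NULL);
	if (n == 0)
		return 0;

	size_t len;
	uint32_t c = s[0], min;
	if (c < 0x80)                { *cp = c; return 1; }   /* plain ASCII */
	else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
	else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
	else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
	else return 0;               /* stray continuation or invalid lead byte */

	if (n < len)
		return 0;                /* sequence is cut short */
	for (size_t i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;            /* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates and values past U+10FFFF. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return len;
}
```

Called in a loop over the bytes of the test string above, it should consume 1, 2, 3 and 4 bytes for the four characters. Note that the last code point, U+1F34C, still does not fit in a single 16-bit wchar_t.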
