
Author Topic: UTF16 reading with fgetwc and fgetws  (Read 7795 times)

Offline nsp

  • Member
  • *
  • Posts: 15
UTF16 reading with fgetwc and fgetws
« on: December 21, 2016, 08:38:25 AM »
I tried to read a Unicode file (UTF-16LE) from a console application, but both fgetws and fgetwc seem to read only one char (byte) at a time.
As I want to convert to ANSI using wctob, the result is wrong.
The same code compiled with gcc gives the expected result.
Code: [Select]
#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int i;
    for (i=1;i<argc;i++){
        printf("%s\n",argv[i]);
        FILE *o=fopen(argv[i],"r");
        FILE *f;
        if (fgetc(o)!=0xFF){
            return -1;
        } else {
            f=fopen(argv[i],"rt,ccs=UNICODE");
        }
        fclose(o);
        char towrite;
        wint_t c=fgetwc(f);
        while (c!=WEOF) {
            printf("%c",wctob(c));
            c=fgetwc(f);

        }
        printf("---------> %s\n",strerror(GetLastError()));
        fclose(f);
    }
    return 0;
}

What did I do wrong?

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2091
Re: UTF16 reading with fgetwc and fgetws
« Reply #1 on: December 21, 2016, 02:36:13 PM »
Even
Code: [Select]
if (fwide(f, 0) > 0) printf("f wide\n");
else printf("f byte\n");
prints "f wide", yet the stream still reads bytes. Something is wrong with the CRT. :(
May the source be with you

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2096
Re: UTF16 reading with fgetwc and fgetws
« Reply #2 on: December 21, 2016, 03:12:38 PM »
What did I do wrong?
You followed MS.  ;D
OK, it's a joke  :D
Where to begin?  :P
Code: [Select]
f=fopen(argv[i],"rt,ccs=UNICODE");
Pelles C, like standard C, doesn't include any native support for Unicode. The ccs= flag is an MS extension, so don't expect it to work under Pelles C.
ANSI-compliant compilers offer only some support for multibyte encodings (which one depends on the locale), in practice the UTF-8 encoding of Unicode.
A common mistake is to confuse the Unicode character set with the encodings of that set. The Unicode set (since 2.0) needs 21 bits to represent all of its characters, so the plain, unencoded form uses 32-bit words on the commonly available machines, whose words are multiples of 8 bits.
MS, in the excitement over the first version of Unicode, was too quick to define WCHAR as a 16-bit integer. The result is that a single standard wchar cannot represent every symbol in the Unicode 2.0 set. I recommend reading this page to understand the basics.
The standard C runtime function fgetwc reads one multibyte-encoded symbol (UTF-8 in this case) from an open stream. The routine reads from 1 to 4 bytes, following the encoding, and gives back the symbol as a 16-bit wchar. A symbol that cannot be represented in 16 bits is replaced by a substitute symbol or generates an error, depending on the function you use.
Using fgetwc on a UTF-16LE stream will, in almost all cases, give back one byte at a time: an ASCII value for the first byte of the wchar, and a 0 for the other part. Try to read a UTF-16LE file containing symbols whose values need more than 8 bits and you'll see really strange behavior from your code  8).

How can you handle a UTF-16LE file using an ANSI-compliant compiler?
This way:
Code: [Select]
#include <stdio.h>
#include <stdint.h>
#include <wchar.h>
#include <string.h>
#include <errno.h>

#define BOM 0xFEFF

int main(int argc, char *argv[])
{
    int i;
    for (i = 1; i < argc; i++)
    {
        printf("%s\n", argv[i]);
        FILE *fp = fopen(argv[i], "rb");
        uint16_t c = 0; //Exactly one UTF-16 code unit; assumes a little-endian host

        if (fp == NULL)
        {
            printf("---------> %s\n", strerror(errno));
            return -1;
        }

        //Read BOM
        if (fread(&c, 2, 1, fp) != 1 || c != BOM)
        {
            fclose(fp);
            return -1;
        }

        while (fread(&c, 2, 1, fp) == 1)
        {
            //The following test emulates the filtering of CR in Text files
            if (c == '\r') //You can avoid this test if you don't care for CR
                continue;

            int b = wctob(c); //EOF if c has no single-byte equivalent
            printf("%c", b != EOF ? b : '?');
        }
        printf("---------> %s\n", ferror(fp) ? "read error" : "no error");
        fclose(fp);
    }
    return 0;
}

On your last point:
The same code compiled with gcc gives the expected result.
The GCC ports for Windows use the MSVC runtime, so they are absolutely compliant with ... MS extensions  8)

Merry Christmas
« Last Edit: December 21, 2016, 03:25:41 PM by frankie »
It is better to be hated for what you are than to be loved for what you are not. - Andre Gide

Offline nsp

  • Member
  • *
  • Posts: 15
Re: UTF16 reading with fgetwc and fgetws
« Reply #3 on: December 21, 2016, 07:40:45 PM »
Many thanks Frankie, it is already Xmas!

Great explanation of the MS logic! Using fread in binary mode seems to be the only compliant way to do it. Now I can get the code working as expected!

I was just hoping to get this working with Enable Microsoft Extensions in the Pelles C project, but it seems that the full MSVC runtime is not "implemented" :(

Merry Xmas!

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2096
Re: UTF16 reading with fgetwc and fgetws
« Reply #4 on: December 22, 2016, 01:56:20 PM »
Timo, fwide() seems to work only for the standard I/O streams.
See:
Code: [Select]
#include <stdio.h>
#include <wchar.h>

const char *Orientation(int i)
{
    if (i > 0)
        return "Wide";

    if (i < 0)
        return "Byte";

    return "No Orientation";
}

int main(int argc, char *argv[])
{
    int in, out, err;
    in  = fwide(stdin, 0);
    out = fwide(stdout, 0);
    err = fwide(stderr, 0);
    printf("orientations: stdin=\"%s\", stdout=\"%s\", stderr=\"%s\"\n",
        Orientation(in), Orientation(out), Orientation(err));
    fwprintf(stderr, L"Press return ...");
    fgetwc(stdin);
    in  = fwide(stdin, 0);
    out = fwide(stdout, 0);
    err = fwide(stderr, 0);
    printf("orientations: stdin=\"%s\", stdout=\"%s\", stderr=\"%s\"\n",
        Orientation(in), Orientation(out), Orientation(err));
    return 0;
}
And I'm not even sure it works for stdin, because it is difficult to check that the value returned by getwc() is really a wide char when using a western language and keyboard...
The implementation for disk I/O seems broken...  :( But this is an already known bug.

Offline TimoVJL

  • Global Moderator
  • Member
  • *****
  • Posts: 2091
Re: UTF16 reading with fgetwc and fgetws
« Reply #5 on: December 24, 2016, 07:01:27 PM »
fgetwc() uses mbtowc(), and the CRT's support for that function is only partial.
Code: [Select]
#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <stdlib.h>
#include <wchar.h>
// http://en.cppreference.com/w/c/string/multibyte/mbtowc
// print multibyte string to wide-oriented stdout
// equivalent to wprintf(L"%s\n", ptr);
void print_mb(const char* ptr)
{
    mbtowc(NULL, 0, 0); // reset the conversion state
    const char* end = ptr + strlen(ptr);
    int ret;
    for (wchar_t wc; (ret = mbtowc(&wc, ptr, end-ptr)) > 0; ptr+=ret) {
        wprintf(L"%lc", wc);
    }
    wprintf(L"\n");
}
 
int main(void)
{
    setlocale(LC_ALL, "en_US.utf8");
    // UTF-8 narrow multibyte encoding
    print_mb(u8"z\u00df\u6c34\U0001F34C");
}
So it is better to avoid those mb??? functions if UTF is used.