Print Page - Problem with C11 unicode support

Title: Problem with C11 unicode support
Post by: migf1 on June 20, 2012, 09:32:02 PM

Quote from: Pelle on June 17, 2012, 03:05:37 PM

Did someone actually test any of the new C11 features? This was the main point of the original release candidate, you know...

/Pelle

I just started playing around with some of the C11 Unicode additions. The following code fails to compile on Win XP Pro SP3, 32-bit, with "fatal error #1065: Failed converting input using codepage 65001"...



#include <stdlib.h>
#include <stdio.h>
//#include <uchar.h>

#define pressENTER()                                \
    do{                                    \
        int mYcHAr;    /* int, not char, so EOF stays distinguishable */    \
        printf( u8"πατήστε ENTER..." );                    \
        while ( (mYcHAr=getchar()) != '\n' && mYcHAr != EOF )        \
            ;                            \
    }while(0)


/*****************************************/
int main( void )
{
#if defined(__STDC_UTF_16__)
    puts( "utf16 enabled" );
#endif

#if defined(__STDC_UTF_32__)
    puts( "utf32 enabled" );
#endif

    char u8str[] = u8"αβγδ";   // this is "abcd" in Greek

    printf( "%s\n", u8str );

    pressENTER();
    exit(0);
}

Project options:

CCFLAGS: -std:C11 -Tx86-coff -Ot -Ob1 -fp:precise -W1 -Gd
ASFLAGS: -AIA32 -Gd
LINKFLAGS: -subsystem:console -machine:x86 kernel32.lib advapi32.lib delayimp.lib

The source file encoding doesn't seem to make any difference (I tried saving it in all available encodings).

FYI, it did compile with mingw32-gcc-4.7.0 (which btw seems to provide only partial C11 support, missing lots of header files, uchar.h included). On the UTF-8 aware mintty (http://code.google.com/p/mintty/) console it produces the following output...

(http://img196.imageshack.us/img196/3310/c11minttygcc.jpg)

However, on the native cmd.exe switched to cp 65001 (this is a handicapped UTF-8 codepage) it doesn't output the Greek characters...

(http://img41.imageshack.us/img41/8772/c11cmd65001gcc.jpg)

After switching cmd.exe to cp 1253 (the "good" Greek ANSI codepage), it outputs Greek characters, but scrambled, as expected.

(http://img585.imageshack.us/img585/2074/c11cmd1253gcc.jpg)

PS. Tomorrow I will also test it on Win7 Home 64-bit and I'll let you guys know.

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 12:36:00 AM

Thanks for moving it to the correct section (sorry for the inconvenience).

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 11:28:26 AM

It compiles fine on Win7 Home 64-bit. I tried it both as a 32bit and as a 64bit project.

Here is the output on the UTF-8 mintty console...

(http://img684.imageshack.us/img684/5317/c11pelle7x64mintty.jpg)

And here is the same output on the native cmd.exe of Win 7, switched to codepage 65001...

(http://img856.imageshack.us/img856/5196/c11pelle7x64cmdexe.jpg)

Here is the cmd.exe cp 65001 output from the executable produced by MinGW32 ...

(http://img268.imageshack.us/img268/6784/c11mingw32win7cmdexe.jpg)

The first observation is that Pelles C doesn't seem to define __STDC_UTF_32__ (it doesn't output the "utf32 enabled" string on the screen). MinGW32 outputs it on both XP 32bit and 7 64bit.

The second observation is that, contrary to XP 32bit, on Win7 64bit MinGW32 prints the ?-symbol char for Greek characters (on XP it was not printing anything... but that could be due to different implementations of the Lucida Console font on the 2 platforms). For Pelles C I don't have a point of reference, since it doesn't compile the code on XP, but here on 7 it seems to at least print correct Greek chars instead of marking them as unknown (although it seems to output some extra ones too... maybe some of them occupy more than 1 byte, but I haven't checked with the Unicode table).

I don't know if it is important, but the XP used yesterday is an English version with the regional settings set to Greek, while Win7 is a Greek version.
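On the "more than 1 byte" guess above: in UTF-8 each of the Greek letters αβγδ takes two bytes, so a console that treats every byte as one character will show more glyphs than there are letters. A minimal sketch of the difference between byte count and character count follows; the function name utf8_codepoints is made up for illustration, and the code assumes its input is already valid UTF-8.

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string.
   Continuation bytes have the bit pattern 10xxxxxx, so every byte
   that does NOT match that pattern starts a new code point.
   Assumption: the input is valid UTF-8 (no validation is done). */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the string "\xCE\xB1\xCE\xB2\xCE\xB3\xCE\xB4" (the UTF-8 bytes of αβγδ) strlen() reports 8, while utf8_codepoints() reports 4.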

Title: Re: Problem with C11 unicode support
Post by: CommonTater on June 21, 2012, 12:53:44 PM

The problem you're having here may not be Pelles C... Windows is internally UTF-16LE Unicode. It doesn't know UTF-8 natively and it doesn't know UTF-32 at all. The console has two modes depending upon the first string output to it... either OEM or UTF-16LE... Once you give it a Unicode string it won't display anything else.

You appear to have discovered the problem of lag time... right now C11 does this stuff... but Windows does not. Thus it's useful for file storage, networking and communications, but it has to be converted for the display...

Try your same program in utf16le (WCHAR or wchar_t) screen output and see what happens...

Try sending utf8 and utf32 outputs to a disk file and examine them in a hex editor... there you will see what works and what doesn't.
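If no hex editor is at hand, the same inspection can be done from within the program. The sketch below formats a buffer as hex pairs; hex_dump is a hypothetical helper name, and the caller is assumed to supply an output buffer that is large enough.

```c
#include <stdio.h>
#include <stddef.h>

/* Format len bytes of buf as space-separated uppercase hex pairs.
   Assumption: out has room for 3 * len chars (at least 1 when len is 0).
   Returns out so the call can be used directly as an argument. */
char *hex_dump(const void *buf, size_t len, char *out)
{
    const unsigned char *p = buf;
    char *w = out;

    for (size_t i = 0; i < len; ++i)
        w += sprintf(w, i ? " %02X" : "%02X", p[i]);
    *w = '\0';
    return out;
}
```

Dumping the first four bytes of a u8"αβγδ" literal should give CE B1 CE B2 if the compiler really stored UTF-8; anything else points at a conversion problem before the text ever reaches the console.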

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 01:24:33 PM

Thanks for the answer tater, but the problem is not the console (I'm aware of its limitations, that's why I have also demonstrated the output on a proper utf-8 enabled console, namely mintty).

The problem is that Pelles C cannot compile the code on XP SP3 (I only have it in 32bit version).

As for __STDC_UTF_32__, the fact that Pelles C does not define it means, according to the ISO C11 standard, that values of type char32_t (e.g. U"string-literal") are not guaranteed to be encoded as UTF-32 by the compiler. I guess this has something to do with the fact that on Windows platforms the wchar_t type occupies 2 bytes (instead of 4 on most other platforms).
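A small sketch of how the two macros can be queried at compile time. C11 defines each macro as 1 only when the corresponding type is UTF-16 or UTF-32 encoded; when a macro is absent, the encoding is implementation-defined. The function names are invented for this example.

```c
#include <stddef.h>

/* What the compiler promises about char16_t values. */
const char *utf16_status(void)
{
#if defined(__STDC_UTF_16__)
    return "UTF-16";
#else
    return "implementation-defined";
#endif
}

/* What the compiler promises about char32_t values. */
const char *utf32_status(void)
{
#if defined(__STDC_UTF_32__)
    return "UTF-32";
#else
    return "implementation-defined";
#endif
}
```

Going by the reports in this thread, utf32_status() would return "implementation-defined" when built with Pelles C and "UTF-32" when built with MinGW32.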

I mentioned it because I found it a bit odd that mingw32 defines it (so it treats U"blabla" as a string-literal with 4-byte chars encoded in UTF-32, although it is supposed to use MS runtime libs... btw, there's no uchar.h header file in mingw-gcc 4.7.0, so char32_t is not defined... but it does understand __CHAR16_TYPE__ and __CHAR32_TYPE__ since c99 I think... a complete chaos! ).

Title: Re: Problem with C11 unicode support
Post by: CommonTater on June 21, 2012, 01:35:10 PM

So do you have access to Windows 7 Professional... even through a friend or computer store? Win7 Home lacks a lot of the Pro version's "language" support

What you can do is upload the smallest Pelles C project that demonstrates the problem and I'll give it a try for you on my systems... XP x64 and Win7 Pro x64... Just use the Project->ZipFiles option from the menu and upload the zip. I'm sure Pelle will ask you for the same thing so you might as well get it uploaded...

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 01:40:07 PM

Quote from: CommonTater on June 21, 2012, 01:35:10 PM
So do you have access to Windows 7 Professional... even through a friend or computer store? Win7 Home lacks a lot of the Pro version's "language" support

What you can do is upload the smallest Pelles C project that demonstrates the problem and I'll give it a try for you on my systems... XP x64 and Win7 Pro x64... Just use the Project->ZipFiles option from the menu and upload the zip. I'm sure Pelle will ask you for the same thing so you might as well get it uploaded...

Thanks, zips attached (c11 is the 64bit).

PS. They compile and work fine on Win7 Home 64bit... the problem is with XP SP3 32bit (it does not even compile the code... mingw compiles it fine).

Title: Re: Problem with C11 unicode support
Post by: CommonTater on June 21, 2012, 01:59:07 PM

You're quick... so I'll be quick too :D

I ran into a build error in the x86 version (screen snip below). When I changed __CHAR32_TYPE__ to the more standard char32_t it compiled...

I also got a "malicious code" warning from someplace in the bowels of Win7 which would not allow me to unpack the EXEs in the project. This does not surprise me, though, since Windows does not support 32-bit characters...

The results are in the attachments below...

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 02:04:07 PM

Ooops, the line: __CHAR32_TYPE__ c32; was not supposed to be there at all (leftover from my experiments with mingw32).
The produced output looks just fine, I get the same on Win7 Home x64.

Does the code compile on your XP x64?

PS. I'm sorry about the malicious code, I have no idea why you got that warning.

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 02:06:11 PM

I have to go now, please let us know what happens on Win XP x64 when you get some time to play with it, thanks.

Title: Re: Problem with C11 unicode support
Post by: CommonTater on June 21, 2012, 02:11:31 PM

Quote from: migf1 on June 21, 2012, 02:04:07 PM
Does the code compile on your XP x64?

Yes... with the same result, but different garbage characters.

Quote
PS. I'm sorry about the malicious code, I have no idea why you got that warning.

Not to worry... Like I said it's probably because of the 32bit characters...

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 06:39:07 PM

Thank you, tater!

Unfortunately, it still does not compile on XP 32bit. I just downloaded the c11x86.zip from the previous post (it was made on Win7 64bit) and tried it on XP 32bit... same error.

I think Pelle should test it too.

Title: Re: Problem with C11 unicode support
Post by: CommonTater on June 21, 2012, 08:56:35 PM

Quote from: migf1 on June 21, 2012, 06:39:07 PM
Thank you, tater!

Unfortunately, it still does not compile on XP 32bit. I just downloaded the c11x86.zip from the previous post (it was made on Win7 64bit) and tried it on XP 32bit... same error.

I think Pelle should test it too.

Try this...



#include <stdlib.h>
#include <stdio.h>
#include <uchar.h>
 
#define pressENTER()       \
    do{         \
        int mYcHAr;    /* int, not char, so EOF stays distinguishable */ \
        printf( u8"πατήστε ENTER..." );     \
        while ( (mYcHAr=getchar()) != '\n' && mYcHAr != EOF )  \
            ;        \
    }while(0)

/*****************************************/
int main( void )
{
#if defined(__STDC_UTF_16__)
 puts( "utf16 enabled" );
#endif
#if defined(__STDC_UTF_32__)
 puts( "utf32 enabled" );
#endif
 char u8str[] = u8"αβγδ";   // this is "abcd" in Greek
 char32_t  c32;
 printf( "%s\n", u8str );
 pressENTER();
 exit(0);
}

And yes, I agree, Pelle should take a look....

Title: Re: Problem with C11 unicode support
Post by: migf1 on June 21, 2012, 09:22:15 PM

Same error.

Title: Re: Problem with C11 unicode support
Post by: Pelle on July 08, 2012, 04:20:25 PM

I use as much Unicode support from Windows as I can - almost everything. It was UCS-2 in early NT days, and UTF-16 later (I don't remember the Windows version, but something was improved in this area at some point - in or after XP). I currently don't define __STDC_UTF_32__ since I'm not yet convinced the UTF-16 <-> UTF-32 conversions are 100% correct. I think the Unicode support works well enough for English and Swedish, which is really my main priority.
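For reference, the UTF-16 <-> UTF-32 conversion mentioned above comes down to surrogate-pair arithmetic for code points beyond U+FFFF. The following is only a sketch of that arithmetic, not the Pelles C implementation; both function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2),
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate  */
    return 2;
}

/* Decode one code point from n available UTF-16 units.
   Returns the number of units consumed (1 or 2),
   or 0 if the input is empty, truncated or an unpaired surrogate. */
int utf16_to_utf32(const uint16_t *in, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;
    uint16_t hi = in[0];
    if (hi < 0xD800 || hi > 0xDFFF) {             /* plain BMP unit */
        *cp = hi;
        return 1;
    }
    if (hi > 0xDBFF || n < 2)                     /* lone low surrogate, or cut off */
        return 0;
    uint16_t lo = in[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    *cp = 0x10000 + (((uint32_t)hi - 0xD800) << 10) + ((uint32_t)lo - 0xDC00);
    return 2;
}
```

English, Swedish and Greek text all stay inside the BMP, so they only exercise the one-unit branches; U+1F600, for example, encodes as the pair D83D DE00.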

Title: Re: Problem with C11 unicode support
Post by: CommonTater on July 08, 2012, 05:12:54 PM

Hi Pelle.... Like you I use the API's unicode functions, not those in C and have had no problems at all with it.

I think the big problem with Unicode support is that it's constantly changing and almost impossible to keep up with. If they would settle on one standard (I'd recommend UTF-8) and develop it fully things would get a lot easier than having this huge proliferation of standards that are obviously implemented differently from one platform to the next. This was supposed to enable better interchange... I think it's made it worse.

ANSI, OEM, utf-8, utf-16le, utf-16be, utf-32le, utf-32be... how many code pages? Ridiculous.

Of course the problem is that (as we all know) if you develop one way of doing things then make a leaps and bounds improvement, you can never change entirely over to the new way of doing it... you end up supporting both methods whether you want to or not.

I wish they could standardize a universal character set (64bits if need be) and implement it into utf-8's extensible character architecture. Build all future compilers and OSs to be compatible with this one standard... life gets a lot easier!

Title: Re: Problem with C11 unicode support
Post by: Stefan Pendl on July 08, 2012, 10:01:29 PM

Quote from: CommonTater on July 08, 2012, 05:12:54 PM
I wish they could standardize a universal character set (64bits if need be) and implement it into utf-8's extensible character architecture. Build all future compilers and OSs to be compatible with this one standard... life gets a lot easier!

I second that, since all the hassle around multi-byte character sets is driving me nuts.

Title: Re: Problem with C11 unicode support
Post by: CommonTater on July 08, 2012, 10:13:09 PM

Quote from: Stefan Pendl on July 08, 2012, 10:01:29 PM
I second that, since all the hassle around multi-byte character sets is driving me nuts.

These days I just define UNICODE and _UNICODE at the top of every file. I use WCHAR from windows and wchar_t from Pelles. I don't do anything in ANSI anymore. Things that get written to disk are either written as wide characters or converted to UTF8 when saving.
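On Windows the wide-to-UTF-8 step is commonly done with WideCharToMultiByte. For readers without that API, here is a portable, hand-written sketch of what the encoding does for a single code point; utf8_encode is a hypothetical name and this is not the code used in the program described above.

```c
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8.
   Returns the number of bytes written (1 to 4),
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Greek alpha (U+03B1) comes out as the two bytes CE B1, which is why Greek text roughly doubles in size compared with a single-byte ANSI code page.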

Opening an unknown file is a massive pain... here's what I use to open playlists in one of my programs...



// parse input file to strings
void CopyWChar(PWCHAR Buf)
  { PWCHAR tok;           // token string
    PWCHAR nt;            // next token
    WCHAR  fp[MAX_PATH];  // line buffer
    nt = wcstok(Buf,L"\r\n",&tok);
    while(nt)
      { // ignore comments and urls
        if ((nt[0] != '#') && (!PathIsURL(nt)))
          { // test for relative paths 
            if (PathIsRelative( nt ))
              { wcscpy(fp,FilePath);
                wcscat(fp,nt); }
            else
              wcscpy(fp,nt);
            // test for folders
           if (PathFileExists( fp ))
              { if (PathIsDirectory( fp ))
                  ExpandFolder( fp );
                else
                  AddLine( fp ); } }
        nt = wcstok(tok,L"\r\n",&tok); }  
    // randomize here
    ShuffleList();
    SavePlayerFile(); }

// convert mbyte to utf16le for parser
void CopyMByte(PBYTE Buf, DWORD Bytes)
  { PWCHAR ut = calloc(Bytes + 1,sizeof(WCHAR));     // unicode buffer
    try
      { if (MultiByteToWideChar(CP_UTF8,0,(PCHAR)Buf,Bytes,ut,Bytes) < 1)   // last argument counts WCHARs, not bytes
          Exception(0xE0640006);
        CopyWChar( ut ); }    
    finally
      { free (ut); } }
 
// convert UTF-16 byte order
void FlipEndian(PBYTE Buf, DWORD Bytes)
  { BYTE t; // temp for swaps
    for (INT i = 0; i < Bytes; i += 2)
      { t = Buf[i];
        Buf[i] = Buf[i + 1];
        Buf[i + 1] = t; } }
 
// open and translate file
BOOL M3ULaunch(PWCHAR FileName)
  { PBYTE  rf;      // raw file data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle 
      DWORD  fs;    // file size
      // get path to file
      wcsncpy(FilePath,FileName,MAX_PATH);
      PathRemoveFileSpec(FilePath);
      wcscat(FilePath,L"\\");
      // open the file
      pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if (pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc(fs + 2, sizeof(BYTE));
      if (! ReadFile(pl, rf, fs, &br, NULL))
        Exception(GetLastError());
      CloseHandle(pl);  
      if (br != fs)
        Exception(0xE0640007); } 
    try                                   
     { DWORD bom = *(DWORD*)rf;
       if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le bom  
         Exception(0xE0640002);                         // utf32be bom  
       else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
         { FlipEndian(rf,br);
           CopyWChar((PWCHAR) rf + 1); }
       else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
         CopyWChar((PWCHAR) rf + 1);  
       else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
         CopyMByte(rf + 3, br - 3);
       else                                             // no known bom, probe the file
         { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
             CopyMByte(rf,br);                          // ansi / utf8 no bom
           else 
            { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
              if (!lf) 
                Exception(0xE0640003);
              if ((!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||    //utf32be no bom
                   (!(*(DWORD*)lf & 0xFFFFFF00)))           //utf32le no bom
                 Exception(0xE0640002);    
              if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                FlipEndian(rf,br);                      // utf16be no bom  
              CopyWChar((PWCHAR) rf);  } } }            // utf16le no bom
     finally  
      { free(rf); }
    return 1; }

You either have to have the patience of Job or really like pain to think that's OK.
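The BOM checks in the listing above read the first four bytes as a DWORD, which ties the constants to a little-endian machine and reads past the data when a file is shorter than four bytes. A byte-wise variant sidesteps both. This is only a sketch with made-up names (bom_encoding, detect_bom); it covers the BOM cases and leaves out the no-BOM probing.

```c
#include <stddef.h>

typedef enum {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
} bom_encoding;

/* Identify a byte order mark at the start of buf.
   *bom_len receives the BOM length in bytes (0 when none is found).
   UTF-32LE is tested before UTF-16LE because FF FE 00 00 also starts
   with FF FE; a UTF-16LE file whose first character is U+0000 would
   therefore be reported as UTF-32LE, which is an inherent ambiguity. */
bom_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    *bom_len = 0;
    return ENC_UNKNOWN;
}
```

The caller can then skip bom_len bytes and hand the rest to whichever converter matches the returned encoding.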

Pelles C forum

Pelles C => Bug reports => Topic started by: migf1 on June 20, 2012, 09:32:02 PM