
Author Topic: Why does pellesc cast hex char to compile-time codepage?  (Read 12019 times)

douyuan

  • Guest
Why does pellesc cast hex char to compile-time codepage?
« on: July 29, 2012, 05:38:09 AM »
my environment: 32bit Windows XP SP3, codepage 936.
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

char *s1 = "\xfa\xfb\xfc\xfd\xfe\xff";
char s2[] = {'\xfa', '\xfb', '\xfc', '\xfd', '\xfe', '\xff'};
char s3[] = {0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff};

int main(int argc, char *argv[])
{
    printf("s1: %x,%x,%x,%x,%x,%x\n", *(s1 + 0), *(s1 + 1), *(s1 + 2), *(s1 + 3), *(s1 + 4), *(s1 + 5));
    printf("s2: %x,%x,%x,%x,%x,%x\n", s2[0], s2[1], s2[2], s2[3], s2[4], s2[5]);
    printf("s3: %x,%x,%x,%x,%x,%x\n", s3[0], s3[1], s3[2], s3[3], s3[4], s3[5]);
}
compile warning:
Quote
E:\test\pellesc>cc /J chartest.c
chartest.c
chartest.c(4): warning #2223: Unable to convert character '\u00fb' to codepage 936; using default character.
chartest.c(4): warning #2223: Unable to convert character '\u00fd' to codepage 936; using default character.
chartest.c(4): warning #2223: Unable to convert character '\u00fe' to codepage 936; using default character.
chartest.c(4): warning #2223: Unable to convert character '\u00ff' to codepage 936; using default character.
chartest.c(5): warning #2055: Excess characters in 'char' character literal ignored.
chartest.c(5): warning #2223: Unable to convert character '\u00fb' to codepage 936; using default character.
chartest.c(5): warning #2055: Excess characters in 'char' character literal ignored.
chartest.c(5): warning #2223: Unable to convert character '\u00fd' to codepage 936; using default character.
chartest.c(5): warning #2223: Unable to convert character '\u00fe' to codepage 936; using default character.
chartest.c(5): warning #2223: Unable to convert character '\u00ff' to codepage 936; using default character.
polink.exe chartest.obj
output of chartest.exe:
Quote
s1: a8,b2,3f,a8,b9,3f
s2: a8,3f,a8,3f,3f,3f
s3: fa,fb,fc,fd,fe,ff
s1/s2/s3 are completely different! :o

I've searched this forum for the warning message and found the following links:
It looks like Pelles C first converts hex char/string literals to Unicode, then converts the Unicode to the compile-time codepage. I don't think gcc/Visual C++ does this.

my questions:
  • Is there something wrong with my code?
  • Does the C11 standard require this conversion?
  • Is it a Pelles C custom extension?
  • How can I avoid this, if possible?

Thanks.

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #1 on: July 29, 2012, 05:56:26 AM »
You can view the mappings for code page 936  HERE
 
I'm not an expert on this so I might be wrong... but it may be that you have to use the wide-character functions and wchar_t to make this work.
 
A char is only 8 bits... and there are far more than 256 characters in that codepage.

 
 

douyuan

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #2 on: July 29, 2012, 08:05:38 PM »
Thanks for your reply.

I want a byte string instead of a character string. A byte string need not be a valid character string, and the code page should only affect character strings. I don't know how to get a byte string in Pelles C except in the s3 form.

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #3 on: July 29, 2012, 08:47:22 PM »
Quote
Thanks for your reply.

I want a byte string instead of a character string. A byte string need not be a valid character string, and the code page should only affect character strings. I don't know how to get a byte string in Pelles C except in the s3 form.

A byte array would be created with something like unsigned char bytearray[256]; but don't be surprised when it gets scrambled during printing... if you need an array of bytes, treat it as such... don't try to display it as a string.
 
It is being mangled because you are trying to assign the values as strings. Try assigning them as bytes, the way you did in your s3 example in your first message... i.e. don't use string functions.

If you are working on text, instead of arrays of values, this will still not hold your character set...
What I do in my code is to work internally with wchar_t, which on Windows is the internal UTF-16LE character set. That way it's compatible with most functions. I also use #define UNICODE, #define _UNICODE and #include <wchar.h>, using the wide-character versions of the string functions (wcscpy(), wprintf(), etc.) and prefixing string literals with L for most of my programs. 
 
If I need something to be presented outside my code in a specific manner (e.g. UTF-8, for networking) then I will convert it using Windows API calls such as WideCharToMultiByte() to create the necessary external coding. 
 
One of the traps is that just because C11 supports UTF-32 (etc.) does not mean Windows does... So what you end up doing is converting external strings to and from formats that Windows knows what to do with.
 
I should think that, working in Simplified Chinese, everything will have to be done in wchar_t with the Unicode defines, since Windows won't handle it any other way.

And yes... Unicode is a pain where you sit.
 
 
« Last Edit: July 29, 2012, 09:18:15 PM by CommonTater »

Offline Bitbeisser

  • Global Moderator
  • Member
  • *****
  • Posts: 772
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #4 on: July 29, 2012, 09:51:43 PM »
Quote
Thanks for your reply.

I want a byte string instead of a character string. A byte string need not be a valid character string, and the code page should only affect character strings. I don't know how to get a byte string in Pelles C except in the s3 form.

A byte array would be created with something like unsigned char bytearray[256]; but don't be surprised when it gets scrambled during printing... if you need an array of bytes, treat it as such... don't try to display it as a string.
Well, it might just be a matter of semantics, but one problem is that C (unlike Pascal, for example) doesn't have a specific "byte" data type; everything is a "char", even if the char "string", as our friend described it, isn't even remotely intended to be "displayed" at all...

Ralf

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #5 on: July 30, 2012, 03:12:38 AM »
Quote
Thanks for your reply.

I want a byte string instead of a character string. A byte string need not be a valid character string, and the code page should only affect character strings. I don't know how to get a byte string in Pelles C except in the s3 form.

A byte array would be created with something like unsigned char bytearray[256]; but don't be surprised when it gets scrambled during printing... if you need an array of bytes, treat it as such... don't try to display it as a string.

Well, it might just be a matter of semantics, but one problem is that C (unlike Pascal, for example) doesn't have a specific "byte" data type; everything is a "char", even if the char "string", as our friend described it, isn't even remotely intended to be "displayed" at all...

Ralf

True. However... simply placing a value in a char array does not result in translation by code page; that's done by C's various string-oriented library functions. Check the original message: while the first two examples using string and character literals were mangled, the last one using numerical assignment was not. It's all in how you stuff the array.



Offline Bitbeisser

  • Global Moderator
  • Member
  • *****
  • Posts: 772
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #6 on: July 31, 2012, 03:30:40 AM »
Quote
True. However... simply placing a value in a char array does not result in translation by code page; that's done by C's various string-oriented library functions. Check the original message: while the first two examples using string and character literals were mangled, the last one using numerical assignment was not. It's all in how you stuff the array.
Sorry, my bad...   :-X

I just saw for myself that the error comes only from the first two lines, which indeed use "chars" to fill the array, while there is no error on the third line, which uses "byte" values. At first I thought I saw the error on all three lines; that's what threw me off... (Note to self: increase the coffee-to-water ratio when getting up after 20h days... :o )

Ralf

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #7 on: July 31, 2012, 04:57:21 AM »
Quote
(Note to self: increase the coffee-to-water ratio when getting up after 20h days... :o )

LOL... No worries.


douyuan

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #8 on: December 11, 2012, 10:52:46 AM »
I still don't understand the reason for doing this. Any explanations?

output of the same program, compiled by gcc/visual c++/lcc-win32:
Quote
s1: fffffffa,fffffffb,fffffffc,fffffffd,fffffffe,ffffffff
s2: fffffffa,fffffffb,fffffffc,fffffffd,fffffffe,ffffffff
s3: fffffffa,fffffffb,fffffffc,fffffffd,fffffffe,ffffffff

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #9 on: December 11, 2012, 01:25:17 PM »
One of the reasons is that you are using signed characters (-128 to 127) and putting in unsigned values of 128 or greater, so the sign bit gets set and your array is thus filled with negative numbers.

The array itself is probably filled with the correct values, but you are tripping over Pelles C localization when displaying them as character strings. 

As we said earlier... if you are working with bytes, treat them as bytes... don't use string functions.

douyuan

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #10 on: December 11, 2012, 02:15:57 PM »
Quote
One of the reasons is that you are using signed characters (-128 to 127) and putting in unsigned values of 128 or greater, so the sign bit gets set and your array is thus filled with negative numbers.

The array itself is probably filled with the correct values, but you are tripping over Pelles C localization when displaying them as character strings. 

As we said earlier... if you are working with bytes, treat them as bytes... don't use string functions.
I do not think it is a signed/unsigned character problem. If you compare the two attachments below (output of "pocc /J /Tx86-asm chartest.c" and "pocc /Tx86-asm chartest.c"), you will find the only difference is movzx/movsx. The array itself is filled with the wrong values: the encoding conversion of the string literal occurred at compile time. This is fairly rare behavior; I have never seen another compiler do this, so I want to know why Pelles C does.
« Last Edit: December 11, 2012, 02:30:31 PM by douyuan »

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #11 on: December 11, 2012, 08:26:23 PM »
Hmmm... about the only answer I can give you is "Because that's how Pelle decided to do it"....

You probably should send him a PM and ask him directly....

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2096
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #12 on: December 12, 2012, 12:16:45 PM »
You are defining a so-called 'escape sequence', which is, in plain words, a way to denote a character belonging to the character set, not an arbitrary hexadecimal value.
While it is a little bit annoying, you are not allowed to put in any value you like.
The C99 standard, 6.4.4.4 'Character constants', paragraph 9 'Constraints', specifies:
"The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the type unsigned char for an integer character constant, or the unsigned type corresponding to wchar_t for a wide character constant."
So it must fit in an unsigned type.
In the same section, paragraph 11 'Semantics' adds:
"A wide character constant has type wchar_t, an integer type defined in the <stddef.h> header. The value of a wide character constant containing a single multibyte character that maps to a member of the extended execution character set is the wide character corresponding to that multibyte character, as defined by the mbtowc function, with an implementation-defined current locale. The value of a wide character constant containing more than one multibyte character, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined."
Meaning that your escape sequence must be in the current execution character set, or it can be handled however the compiler likes :-\.

Maybe you could define a BYTE or WORD array, fill it with the values you like, then cast it to CHAR or WCHAR when passing it over.
« Last Edit: December 12, 2012, 12:21:21 PM by frankie »
It is better to be hated for what you are than to be loved for what you are not. - Andre Gide

CommonTater

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #13 on: December 12, 2012, 02:43:46 PM »
Thanks Frankie!

Ya learns something new everyday! :D

douyuan

  • Guest
Re: Why does pellesc cast hex char to compile-time codepage?
« Reply #14 on: December 13, 2012, 08:41:13 AM »
Thank you for the explanation. I had read this before, but mistakenly believed that the encoding conversion only applied to wide character constants / wide string literals.