Pelles C forum

C language => Beginner questions => Topic started by: douyuan on July 29, 2012, 05:38:09 AM

Title: Why does pellesc cast hex char to compile-time codepage?
Post by: douyuan on July 29, 2012, 05:38:09 AM
my environment: 32bit Windows XP SP3, codepage 936.
Code: [Select]
#include <stdio.h>
#include <stdlib.h>

char *s1 = "\xfa\xfb\xfc\xfd\xfe\xff";
char s2[] = {'\xfa', '\xfb', '\xfc', '\xfd', '\xfe', '\xff'};
char s3[] = {0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff};

int main(int argc, char *argv[])
{
    printf("s1: %x,%x,%x,%x,%x,%x\n", *(s1 + 0), *(s1 + 1), *(s1 + 2), *(s1 + 3), *(s1 + 4), *(s1 + 5));
    printf("s2: %x,%x,%x,%x,%x,%x\n", s2[0], s2[1], s2[2], s2[3], s2[4], s2[5]);
    printf("s3: %x,%x,%x,%x,%x,%x\n", s3[0], s3[1], s3[2], s3[3], s3[4], s3[5]);
}
compiler warnings:
Quote
E:\test\pellesc>cc /J chartest.c
chartest.c
chartest.c(4): warning #2223: Unable to convert character '\u00fb' to codepage 936; using default character.
chartest.c(4): warning #2223: Unable to convert character '\u00fd' to codepage 936; using default character.
chartest.c(4): warning #2223: Unable to convert character '\u00fe' to codepage 936; using default character.
chartest.c(4): warning #2223: Unable to convert character '\u00ff' to codepage 936; using default character.
chartest.c(5): warning #2055: Excess characters in 'char' character literal ignored.
chartest.c(5): warning #2223: Unable to convert character '\u00fb' to codepage 936; using default character.
chartest.c(5): warning #2055: Excess characters in 'char' character literal ignored.
chartest.c(5): warning #2223: Unable to convert character '\u00fd' to codepage 936; using default character.
chartest.c(5): warning #2223: Unable to convert character '\u00fe' to codepage 936; using default character.
chartest.c(5): warning #2223: Unable to convert character '\u00ff' to codepage 936; using default character.
polink.exe chartest.obj
output of chartest.exe:
Quote
s1: a8,b2,3f,a8,b9,3f
s2: a8,3f,a8,3f,3f,3f
s3: fa,fb,fc,fd,fe,ff
s1/s2/s3 are completely different! :o

I've searched this forum for the warning message and found some earlier threads about it.
It looks like Pelles C first converts hex escapes in char/string literals to Unicode, then converts that Unicode back to the compile-time codepage. I don't think gcc/visual c++ does this.

My question: why does Pelles C convert hex escapes in char/string literals through the compile-time codepage, and how can I get a plain byte string?

Thanks.
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on July 29, 2012, 05:56:26 AM
You can view the mappings for code page 936  HERE (http://msdn.microsoft.com/en-us/library/cc194886)
 
I'm not an expert on this so I might be wrong... but it may be that you have to use the wide character functions and wchar_t to make this work.
 
A char is only 8 bits... and there are a lot more than 256 characters in that codepage.
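
Something like this might be the idea (an untested sketch; the escapes here are written as wide values, so each one fits):
Code: [Select]
#include <wchar.h>
#include <stdio.h>

int main(void)
{
    /* wide literals stay 16-bit code units on Windows, so each value
       survives as written instead of squeezing into 8 bits */
    wchar_t ws[] = L"\x00fa\x00fb\x00fc\x00fd\x00fe\x00ff";

    for (size_t i = 0; ws[i] != L'\0'; i++)
        wprintf(L"%x ", (unsigned int)ws[i]);
    wprintf(L"\n");
    return 0;
}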

 
 
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: douyuan on July 29, 2012, 08:05:38 PM
Thanks for your reply.

I want a byte string instead of a character string. A byte string need not be a valid character string; the code page should only affect character strings. I don't know how to get a byte string in Pelles C except in the s3 format.
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on July 29, 2012, 08:47:22 PM
Quote
Thanks for your reply.

I want a byte string instead of a character string. A byte string need not be a valid character string; the code page should only affect character strings. I don't know how to get a byte string in Pelles C except in the s3 format.

A byte array would be created with something like unsigned char bytearray[256]; but don't be surprised when it gets scrambled during printing... if you need an array of bytes, treat it as such... don't try to display it as a string.
 
It is being mangled because you are trying to assign the values as strings.  Try assigning them as bytes the way you did your s3 example in your first message... i.e. don't use string functions.
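
For example, a minimal sketch (untested, and it simply mirrors your s3 approach) of stuffing and dumping a byte array without any string machinery:
Code: [Select]
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* filled with numeric values, like the s3 example, so the
       compiler never treats them as text */
    static const unsigned char src[] = { 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff };
    unsigned char buf[sizeof src];

    memcpy(buf, src, sizeof src);      /* copies the bytes verbatim */

    for (size_t i = 0; i < sizeof buf; i++)
        printf("%02x ", buf[i]);       /* dump as numbers, not as a string */
    printf("\n");
    return 0;
}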

If you are working on text, instead of arrays of values, a plain char array will still not hold your character set...
What I do in my code is work internally with wchar_t, which on Windows is the UTF-16LE character set. That way it's compatible with most functions. I also use #define UNICODE, #define _UNICODE and #include <wchar.h>, using the wide character versions of the string functions (wcscpy(), wprintf(), etc.) and prefixing string literals with L for most of my programs.
 
If I need something to be presented outside my code in a specific manner (e.g. UTF-8, for networking) then I will convert it using Windows API calls such as WideCharToMultiByte() (http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx) to create the necessary external coding.
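
A rough sketch of that conversion (error handling kept minimal; the string and buffer size are arbitrary examples):
Code: [Select]
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t *wide = L"caf\x00e9";   /* "café" in UTF-16 */
    char utf8[32];

    /* CP_UTF8 requires the last two arguments to be NULL */
    int n = WideCharToMultiByte(CP_UTF8, 0, wide, -1,
                                utf8, sizeof utf8, NULL, NULL);
    if (n > 0)
    {
        /* n counts the terminating NUL because cchWideChar was -1 */
        for (int i = 0; i < n - 1; i++)
            printf("%02x ", (unsigned char)utf8[i]);
        printf("\n");                  /* prints: 63 61 66 c3 a9 */
    }
    return 0;
}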
 
One of the traps is that just because C11 will support UTF-32 (etc.) does not mean Windows does... So what you end up doing is converting external strings to and from formats that Windows knows what to do with.
 
I should think that working in Simplified Chinese, everything will have to be done in wchar_t with the Unicode defines, since Windows won't handle it any other way.

And yes... Unicode is a pain where you sit.
 
 
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: Bitbeisser on July 29, 2012, 09:51:43 PM
Quote
A byte array would be created with something like unsigned char bytearray[256]; but don't be surprised when it gets scrambled during printing... if you need an array of bytes, treat it as such... don't try to display it as a string.
Well, it might be just a matter of semantics, but one problem is that C (unlike Pascal, for example) doesn't have a specific "byte" data type; everything is a "char", even if the char "string", as our friend described it, isn't even remotely intended to be "displayed" at all...

Ralf
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on July 30, 2012, 03:12:38 AM
Quote
Well, it might be just a matter of semantics, but one problem is that C (unlike Pascal, for example) doesn't have a specific "byte" data type; everything is a "char", even if the char "string", as our friend described it, isn't even remotely intended to be "displayed" at all...

Ralf

True.  However... simply placing a value in a char array does not result in translation by code page. That's done by C's various string-oriented library functions. Check the original message: while the first two examples using string and character literals were mangled, the last one using numerical assignment was not. It's all in how you stuff the array.


Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: Bitbeisser on July 31, 2012, 03:30:40 AM
Quote
True.  However... simply placing a value in a char array does not result in translation by code page. That's done by C's various string-oriented library functions. Check the original message: while the first two examples using string and character literals were mangled, the last one using numerical assignment was not. It's all in how you stuff the array.
Sorry, my bad...   :-X

I just saw for myself that the warnings come only from the first two lines, which indeed use "chars" to fill the array, while there is no warning on the third line, which uses "byte" values. At first I thought I saw the warnings on all three lines; that's what threw me off... (Note to self: increase the coffee-to-water ratio when getting up after 20h days... :o )

Ralf
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on July 31, 2012, 04:57:21 AM
Quote
(Note to self: increase the coffee-to-water ratio when getting up after 20h days... :o )

LOL... No worries.

Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: douyuan on December 11, 2012, 10:52:46 AM
I still don't understand the reason for doing this. Any explanations?

output of the same program, compiled by gcc/visual c++/lcc-win32:
Quote
s1: fffffffa,fffffffb,fffffffc,fffffffd,fffffffe,ffffffff
s2: fffffffa,fffffffb,fffffffc,fffffffd,fffffffe,ffffffff
s3: fffffffa,fffffffb,fffffffc,fffffffd,fffffffe,ffffffff
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on December 11, 2012, 01:25:17 PM
One of the reasons is that you are using signed characters (-128 to 127) and putting in unsigned values greater than 127; the sign bit is set and your array is thus filled with negative numbers.

The array itself is probably filled with the correct values, but you are tripping over Pelles C localization when displaying them as character strings. 

As we said earlier... if you are working with bytes treat them as bytes... don't use string functions.
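
The sign-extension part can be seen in isolation with a plain numeric assignment (a small sketch, nothing Pelles-specific):
Code: [Select]
#include <stdio.h>

int main(void)
{
    char c = (char)0xfa;   /* negative when char is signed */

    /* default argument promotion turns c into an int, so a signed
       char prints sign-extended, as in the gcc/vc output above */
    printf("%x\n", c);                  /* e.g. fffffffa */
    printf("%x\n", (unsigned char)c);   /* fa */
    return 0;
}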
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: douyuan on December 11, 2012, 02:15:57 PM
Quote
One of the reasons is that you are using signed characters (-128 to 127) and putting in unsigned values greater than 127; the sign bit is set and your array is thus filled with negative numbers.

The array itself is probably filled with the correct values, but you are tripping over Pelles C localization when displaying them as character strings.

As we said earlier... if you are working with bytes treat them as bytes... don't use string functions.
I do not think it is a signed/unsigned character problem. When you compare the two attachments below (the output of "pocc /J /Tx86-asm chartest.c" and "pocc /Tx86-asm chartest.c"), you may find the only difference is movzx/movsx. The array itself is filled with the wrong values. The encoding conversion of the string literal occurred at compile time. This is fairly rare behavior; I have never seen another compiler do this, so I want to know why Pelles C does it.
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on December 11, 2012, 08:26:23 PM
Hmmm... about the only answer I can give you is "Because that's how Pelle decided to do it"....

You probably should send him a PM and ask him directly....
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: frankie on December 12, 2012, 12:16:45 PM
You are defining a so-called 'escape sequence', which is, in plain words, a way to name a character belonging to the character set, not a way to embed an arbitrary hexadecimal value.
While it is a little bit annoying, you are not allowed to put in any value you like.
The C99 standard, 6.4.4.4 'Character constants', paragraph 9 'Constraints', specifies:
"The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the type unsigned char for an integer character constant, or the unsigned type corresponding to wchar_t for a wide character constant."
So it must be unsigned.
In the same section, paragraph 11 'Semantics' adds:
"A wide character constant has type wchar_t, an integer type defined in the <stddef.h> header. The value of a wide character constant containing a single multibyte character that maps to a member of the extended execution character set is the
wide character corresponding to that multibyte character, as defined by the mbtowc function, with an implementation-defined current locale. The value of a wide character constant containing more than one multibyte character, or containing a multibyte
character or escape sequence not represented in the extended execution character set, is implementation-defined."
Meaning that your escape sequence must be representable in the (extended) execution character set, or it can be handled however the compiler likes :-\.

Maybe you could define a BYTE or WORD array, fill it with the values you like, then cast it to CHAR or WCHAR when passing it over.
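
A little sketch of that idea (consume() is just a made-up stand-in for whatever wants a char pointer):
Code: [Select]
#include <windows.h>   /* BYTE is a typedef for unsigned char */
#include <stdio.h>

/* hypothetical consumer that wants a char pointer */
static void consume(const char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("%02x ", (unsigned char)p[i]);
    printf("\n");
}

int main(void)
{
    BYTE raw[] = { 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff };

    /* fill as unsigned bytes, cast only when passing it over */
    consume((const char *)raw, sizeof raw);
    return 0;
}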
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on December 12, 2012, 02:43:46 PM
Thanks Frankie!

Ya learns something new everyday! :D
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: douyuan on December 13, 2012, 08:41:13 AM
Thank you for the explanation. I had read this before, but mistakenly believed that the encoding conversion only applied to wide character constants / wide string literals.
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: dienstag on December 14, 2012, 10:54:59 AM
Quote
The C99 standard, 6.4.4.4 'Character constants', paragraph 9 'Constraints', specifies:
"The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the type unsigned char for an integer character constant, or the unsigned type corresponding to wchar_t for a wide character constant."
So it must be unsigned.

That is the wrong conclusion. The standard says that if the hexadecimal sequence fits into an unsigned char, it is correctly specified for any integer character constant, no matter whether signed or unsigned. It should therefore be taken 1:1 by the compiler, even when the leading bit specifies a negative value.

There is no word about strings here; but if you consider a string as consisting of character constants, the hexadecimal numbers, when they specify a byte, must appear 1:1 in the compiled code no matter what those bits actually mean, even when they totally scramble the readability of the string.

Compilers should behave as GCC/VC/LCC do. They have behaved that way all along, and there is no reason that source code that has been working for decades should now produce different results. Escape sequences were invented to put codes into characters and strings that usually do not belong there. To specify encrypted string constants, for example.
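
For instance, the encrypted-constant use case might look like this (a made-up sketch; the bytes are "Hello" XORed with 0x5a, written as numeric values here so it compiles the same way everywhere):
Code: [Select]
#include <stdio.h>

int main(void)
{
    unsigned char secret[] = { 0x12, 0x3f, 0x36, 0x36, 0x35, 0x00 };

    /* decode in place; none of the encoded bytes is zero */
    for (size_t i = 0; secret[i] != 0; i++)
        secret[i] ^= 0x5a;

    printf("%s\n", (char *)secret);   /* prints: Hello */
    return 0;
}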
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: frankie on December 14, 2012, 11:09:53 AM
More subtly, it means that whatever you write is taken as unsigned.
I fully agree that correct compilation must be guaranteed for old code (though old code using wchar_t is not so widespread).
Of course, extending this from char constants to the constants inside strings, as composed of chars, is perfectly legal.
It is also perfectly legal for the compiler to behave as it likes, going by the last sentence of the character constant semantics.
So, in the end, is there any way to classify this as a bug?
Or should we put it on the wish list?
What is your opinion?
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on December 14, 2012, 02:10:01 PM
I would suggest that we need to be careful to understand the difference between a CHAR array and a STRING array (even though C doesn't actually have strings). 

If you want literal storage of every value in a char array... don't use string functions to put them there (i.e. memcpy() instead of strcpy(), etc.).
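
A quick sketch of the difference (hypothetical data with an embedded zero byte):
Code: [Select]
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* not a valid C "string": it contains an embedded 0x00 */
    const unsigned char data[] = { 0xfa, 0x00, 0xfc, 0xfd };
    unsigned char a[4] = { 0 }, b[4] = { 0 };

    memcpy(a, data, sizeof data);           /* copies all four bytes */
    strcpy((char *)b, (const char *)data);  /* stops at the 0x00 */

    for (size_t i = 0; i < 4; i++)
        printf("a[%u]=%02x  b[%u]=%02x\n",
               (unsigned)i, a[i], (unsigned)i, b[i]);
    return 0;
}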

If on the other hand you actually are working with text strings, then the language and character set become a consideration. For example: the symbol ü, #129 in code page 406, might be #213 in code page 1402... To correctly display text across those two languages, some translation of character values is necessary.

I would favour a compiler flag to disable localization functions... but with the complexity of a world with over 400 languages, I suggest it should default to "On".
 
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: Stefan Pendl on December 15, 2012, 09:35:51 AM
Quote
Compilers should behave as GCC/VC/LCC do.

Why should one compiler behave like another one, when there is an ISO  standard to follow?

Wouldn't this result in the same problem IE puts on us, by defining its own standards apart from the ISO standard for HTML?

Any compiler must follow the ISO standard, but can add its own extensions for whatever reason.

If you don't like that Pelles C only implements the ISO standard without any extensions, you are free to use any other compiler that includes behavior different from the ISO standard.
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: CommonTater on December 15, 2012, 08:39:45 PM
Quote
If you don't like that Pelles C only implements the ISO standard without any extensions, you are free to use any other compiler that includes behavior different from the ISO standard.

Actually, if you enable Pelles C extensions in the compiler settings or use /Zx on the command line, there are quite a number of extensions. Each extension and function is clearly marked as "Not Standard C" in the help file, and there are hundreds of them.
 
NO compiler is obligated to follow any standard. This is purely voluntary (although it makes sense that they would).
 
From the help file...
Code: [Select]
/Zx option (POCC) [2.70]

Syntax: /Zx

Description: The /Zx option makes the compiler accept Pelle's extensions to standard C.

The currently supported extensions are:
  Optional arguments - similar to C++.
  Support for the GCC extension __typeof__ [4.00].
  Support for the GCC extension __alignof__ (same as the __alignof operator) [5.00].
  Support for the GCC case range extension: case expr ... expr [4.00].
  Support for the GCC escape sequence \e (ASCII character ESC) [4.00].
  Support for the GCC extension binary constants, using the 0b (or 0B) prefix followed by a sequence of '0' and '1' digits [6.00].

 Example 1: int test2(int a = 100)
{
    return a * 2;
}

int main(void)
{
    return test2();  // Not necessary to specify an argument to test2, the default value 100 is used in this case.
}

Example 2: #define pointer(T)  __typeof__(T *)
#define array(T, N)  __typeof__(T [N])

array(pointer(char), 4) arrname;  /* arrname: array of 4 pointers to char */
 
 Example 3: switch (c)
{
    case '0' ... '9': /* digit */
    case 'a' ... 'z': /* lower case (works for ASCII/ANSI) */
}
 
 Example 4: unsigned int mask = 0b1111;  /* binary 1111 is decimal 15 */

Also see the help file's list of "Private #include files".
 
Title: Re: Why does pellesc cast hex char to compile-time codepage?
Post by: aMarCruz on May 30, 2014, 08:30:40 PM
For all,
in Pelles C v8 there is no translation at compile time.
Test with:

Code: [Select]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tchar.h>

// char to UINT
#define CH2UINT(c) ((unsigned int)((unsigned char)(c)))

#define SDUMP6(s) (void)_tprintf(_T("%s: %x,%x,%x,%x,%x,%x\n"), #s, \
        CH2UINT(s[0]), CH2UINT(s[1]), CH2UINT(s[2]), \
        CH2UINT(s[3]), CH2UINT(s[4]), CH2UINT(s[5]))

char *s1  = "\xfa\xfb\xfc\xfd\xfe\xff";
char s2[] = {'\xfa','\xfb','\xfc','\xfd','\xfe','\xff'};
char s3[] = {0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff};

int _tmain(void)
{
    SDUMP6(s1);
    SDUMP6(s2);
    SDUMP6(s3);

    //note: don't use sizeof(s1) - s1 is a pointer, and its string literal has a zero terminator
    if (memcmp(s1,s2,sizeof(s2)) || memcmp(s2,s3,sizeof(s3)))
        _tprintf(_T("Buffers are NOT equals!\n"));

    return 0;
}

@beto