Author Topic: UTF-8  (Read 4648 times)

mw

  • Guest
UTF-8
« on: April 12, 2007, 03:34:13 PM »
Hi

I wrote a little procedure:

static void test(HWND hWnd)
{
  WCHAR wc[2];
  char c[6], s[20];
  int n;

  wc[0] = L'§';
  wc[1] = L'\0';

  n = wcstombs(c, wc, 5);

  sprintf(s, "%d %d %d", n, (int)strlen(c), c[0]);
  MessageBox(hWnd, s, "Test", MB_OK);
}

The message box displays these values: 1 1 -89

Why is the string length 1? The UTF-8 code of the section sign § is 0xc2 0xa7.
So wcstombs() should convert the wide char § into the 2-byte-sequence 0xc2 0xa7.
But wcstombs() converts § to the single byte 0xa7.

Did I make a mistake? Or do I misunderstand the technique of character conversion?
Or does the behaviour of wcstombs() depend on localization settings?
What is the right way to convert UTF-8 to wide chars (or vice versa)?

Best regards
Martin
« Last Edit: April 12, 2007, 03:39:27 PM by mw »
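
For reference, the two-byte sequence 0xc2 0xa7 quoted in the question follows directly from the UTF-8 bit layout. The sketch below is not taken from any library: utf8_encode() is a hypothetical helper name, and it assumes the input is a Unicode code point (U+00A7 for the section sign).

```c
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns the byte count, or 0 if cp
   is a surrogate or lies above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx plus three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00A7 the second branch applies: 0xC0 | (0xA7 >> 6) is 0xC2 and 0x80 | (0xA7 & 0x3F) is 0xA7, which matches the two bytes expected in the question.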

Pelle

  • Administrator
Re: UTF-8
« Reply #1 on: April 12, 2007, 09:28:59 PM »
The exact wc <-> mb conversion isn't specified by the C standard, so you will (most likely) get different results with different implementations and locale settings. Pelles C currently implements the "C" locale only, so you get a basic 8-bit (single-byte) conversion, which is why § comes out as the single byte 0xa7.

I don't need or use locale settings myself, there have been no requests for them, and they would seriously bloat part of the C runtime, so I have (so far) settled for the "C" locale only...
/Pelle
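
To illustrate the locale dependence described in this reply: on a C runtime that does ship additional locales, the same wcstombs() call may produce the two UTF-8 bytes once a UTF-8 locale is selected. This is only a sketch under assumptions. The locale name passed in (for example "C.UTF-8") is platform-specific and not guaranteed to exist, and section_sign_bytes() is a hypothetical helper. With a runtime that provides only the "C" locale, as stated above for Pelles C, setlocale() would be expected to fail and the helper returns -1.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts the single wide character U+00A7 with
   wcstombs() under the named locale and copies the result to out.
   Returns the number of bytes produced, or -1 if the locale is not
   available or the character cannot be converted in that locale. */
static int section_sign_bytes(const char *locale_name, unsigned char out[8])
{
    wchar_t wc[2] = { (wchar_t)0x00A7, L'\0' };
    char buf[8];
    size_t n;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;                 /* this runtime lacks the locale */

    n = wcstombs(buf, wc, sizeof buf - 1);
    setlocale(LC_ALL, "C");        /* back to the program-startup default */

    if (n == (size_t)-1)
        return -1;                 /* not representable in that locale */

    memcpy(out, buf, n);
    return (int)n;
}
```

Where a UTF-8 locale is available, a call such as section_sign_bytes("C.UTF-8", bytes) should report 2 bytes, 0xc2 and 0xa7. What the plain "C" locale does with the same character differs between implementations, which is exactly the point made above.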

JohnF

  • Guest
Re: UTF-8
« Reply #2 on: April 13, 2007, 06:37:46 AM »
Martin, you could try the Windows API function WideCharToMultiByte().

I don't know if it will work but it's worth a try. You can set CP_UTF8 as the codepage.

EDIT:
Code:
WCHAR wc[2];
char c[6] = {0}, s[20];
int n;

wc[0] = L'§';
wc[1] = L'\0';

n = WideCharToMultiByte(CP_UTF8, // code page: convert to UTF-8
                        0,       // performance and mapping flags
                        wc,      // wide-character string
                        1,       // number of chars in string
                        c,       // buffer for new string
                        6,       // size of buffer
                        NULL,    // default for unmappable chars
                        NULL);   // set when default char used

sprintf(s, "%d %d %hhx %hhx", n, (int)strlen(c), c[0], c[1]);
MessageBox(0, s, "Test", MB_OK);

c[0] and c[1] are displayed as c2 and a7, which are the expected UTF-8 bytes.

John
« Last Edit: April 13, 2007, 09:22:11 AM by JohnF »
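
The opposite direction (UTF-8 to wide chars) can be sketched with MultiByteToWideChar(). This is a minimal, Windows-only sketch rather than a tested recipe for every target. utf8_to_wide() is a hypothetical wrapper name, and passing -1 as the input length assumes the UTF-8 string is NUL-terminated.

```c
#ifdef _WIN32
#include <windows.h>

/* Hypothetical wrapper: decodes a NUL-terminated UTF-8 string into the
   caller's WCHAR buffer. Returns the number of WCHARs written, which
   includes the terminating NUL, or 0 on failure (for example when the
   buffer is too small). */
static int utf8_to_wide(const char *utf8, WCHAR *out, int out_chars)
{
    return MultiByteToWideChar(CP_UTF8, /* source bytes are UTF-8 */
                               0,       /* conversion flags */
                               utf8,    /* UTF-8 input string */
                               -1,      /* input is NUL-terminated */
                               out,     /* buffer for wide chars */
                               out_chars); /* buffer size in WCHARs */
}
#endif
```

Called as utf8_to_wide("\xC2\xA7", w, 4) with WCHAR w[4], this should store 0x00A7 followed by the terminator in w and return 2.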

mw

  • Guest
Re: UTF-8
« Reply #3 on: April 13, 2007, 09:32:16 AM »
Thank you for the answers.

After I submitted my question to this forum, I searched the internet and found some information about WideCharToMultiByte() and MultiByteToWideChar(). I think these functions are right for my purposes. I tested them on Windows Mobile 2003 and they worked well.

Does anybody know whether WideCharToMultiByte() and MultiByteToWideChar() are also available on older versions of Windows Mobile? (I ask because I had some bad experiences with MoveToEx() and LineTo(): both functions worked on Windows Mobile 2003 and later versions, but not on PDAs with older versions of this operating system. Since finding that out, I use Polyline() instead.)

Martin

Stefan Pendl

  • Global Moderator
Re: UTF-8
« Reply #4 on: April 13, 2007, 10:17:42 AM »
These functions have been available since Windows CE 1.01, but an OEM can remove this support.

See http://msdn2.microsoft.com/en-us/library/ms961248.aspx
and http://msdn2.microsoft.com/en-us/library/ms886760.aspx
---
Stefan