Hello,
I am a little bit confused about the TEXT, _TEXT and _T macros.
To use the _T or the _TEXT macro I have to include tchar.h, but this is not the case with the TEXT macro. What is the difference between these three macros?
A related question is: What is the difference between UNICODE, _UNICODE and _MBCS?
nancy
Mostly _T, _TEXT and the like provide an auto-switching trick for programs that need to compile as either Unicode or ANSI versions. They are macros (open tchar.h in the source editor and look at it) that respond to the UNICODE and _UNICODE defines: when Unicode is defined they insert an L before your string literals, and when it's not defined they do nothing ... L"Hello" vs "Hello". The L prefix marks "long characters" or "wide characters" (whichever way you know them).
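To make the switching trick concrete, here is a minimal sketch in plain C. DEMO_UNICODE, demo_tchar, DEMO_T and demo_tcslen are made-up stand-ins for _UNICODE, TCHAR, _T and _tcslen; this is not the actual content of tchar.h, just the same idea in miniature.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical switch, standing in for UNICODE / _UNICODE. */
#define DEMO_UNICODE 1

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;          /* wide build: 'TCHAR' is wchar_t   */
#define DEMO_T_(s) L##s              /* paste an L in front of a literal */
#else
typedef char demo_tchar;             /* narrow build: 'TCHAR' is char    */
#define DEMO_T_(s) s                 /* leave the literal untouched      */
#endif
/* Extra level so that macro arguments are expanded before pasting. */
#define DEMO_T(s) DEMO_T_(s)

/* Counts characters in a demo_tchar string, whichever width is active. */
static size_t demo_tcslen(const demo_tchar *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

With DEMO_UNICODE defined, DEMO_T("Hello") becomes L"Hello"; comment the define out and the same source compiles with plain char strings.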
The UNICODE define in Windows switches all Windows API calls to their wide character versions. To see the switching mechanism, look at some Windows headers in the source editor (just don't edit them or save them!). You will find most API calls have two versions... FunctionNameA() for ANSI and FunctionNameW() for Unicode. With UNICODE defined, FunctionName() gets you the W version; without it you get the A version.
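The A/W switch is nothing more than a macro that picks one of two real functions. A minimal sketch of the pattern, using made-up names (DEMO_UNICODE, DemoLengthA, DemoLengthW) rather than anything from the real Windows headers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical switch, standing in for UNICODE. */
#define DEMO_UNICODE 1

/* "A" version: works on narrow (char) strings. */
static size_t DemoLengthA(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* "W" version: works on wide (wchar_t) strings. */
static size_t DemoLengthW(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* The unsuffixed name is only a macro that selects one of the two. */
#ifdef DEMO_UNICODE
#define DemoLength DemoLengthW
#else
#define DemoLength DemoLengthA
#endif
```

Both suffixed versions stay callable by name whichever way the switch is set; only the unsuffixed name changes meaning.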
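In Windows, UNICODE means utf16le (http://en.wikipedia.org/wiki/Unicode)... On Linux and other operating systems "Unicode" may signify a different encoding... and there are several, including utf8. Windows treats utf8 as just another multibyte code page (CP_UTF8); its older term mbcs (multibyte character set) mostly refers to the legacy double-byte code pages such as Shift-JIS.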
Text isn't just text anymore... now it's a huge jumble of formats and character sets. To give you an example, here's what it takes to open a simple playlist where the source file might be any of several unicode formats:
// parse input file to strings
void CopyWChar(PWCHAR Buf)
  { PWCHAR tok;            // tokenizer state
    PWCHAR nt;             // next token
    WCHAR  fp[MAX_PATH];   // line buffer
    nt = wcstok(Buf,L"\r\n",&tok);
    while(nt)
      { // ignore comments and urls
        if ((nt[0] != L'#') && (!PathIsURL(nt)))
          { // test for relative paths
            if (PathIsRelative( nt ))
              { wcscpy(fp,FilePath);
                wcscat(fp,nt); }
            else
              wcscpy(fp,nt);
            // test for folders
            if (PathFileExists( fp ))
              { if (PathIsDirectory( fp ))
                  ExpandFolder( fp );
                else
                  AddLine( fp ); } }
        // pass NULL to continue tokenizing the same buffer
        nt = wcstok(NULL,L"\r\n",&tok); }
    // randomize here
    ShuffleList();
    SavePlayerFile(); }
// convert mbyte to utf16le for parser
void CopyMByte(PBYTE Buf, DWORD Bytes)
  { PWCHAR ut = calloc(Bytes + 1,sizeof(WCHAR));  // unicode buffer, zero filled
    try
      { // the last argument is the output size in WCHARs, not in bytes
        if (MultiByteToWideChar(CP_UTF8,0,(PCHAR)Buf,Bytes,ut,Bytes) < 1)
          Exception(0xE0640006);
        CopyWChar( ut ); }
    finally
      { free (ut); } }
// convert UTF-16 byte order
void FlipEndian(PBYTE Buf, DWORD Bytes)
  { BYTE t;  // temp for swaps
    // swap whole byte pairs only; an odd trailing byte is left alone
    for (DWORD i = 0; i + 1 < Bytes; i += 2)
      { t = Buf[i];
        Buf[i] = Buf[i + 1];
        Buf[i + 1] = t; } }
// open and translate file
BOOL M3ULaunch(PWCHAR FileName)
  { PBYTE rf;  // raw file data
    DWORD br;  // bytes read
    // load the raw file
    { HANDLE pl;  // playlist file handle
      DWORD  fs;  // file size
      // get path to file
      wcsncpy(FilePath,FileName,MAX_PATH);
      PathRemoveFileSpec(FilePath);
      wcscat(FilePath,L"\\");
      // open the file
      pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if (pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);
      // 4 spare zero bytes: string terminator plus room for the DWORD bom read
      rf = calloc(fs + 4, sizeof(BYTE));
      if (! ReadFile(pl, rf, fs, &br, NULL))
        Exception(GetLastError());
      CloseHandle(pl);
      if (br != fs)
        Exception(0xE0640007); }
    try
      { DWORD bom = *(DWORD*)rf;
        if ((bom == 0x0000FEFF) ||               // utf32le bom
            (bom == 0xFFFE0000))                 // utf32be bom
          Exception(0xE0640002);
        else if ((bom & 0xFFFF) == 0xFFFE)       // utf16be bom
          { FlipEndian(rf,br);
            CopyWChar((PWCHAR) rf + 1); }
        else if ((bom & 0xFFFF) == 0xFEFF)       // utf16le bom
          CopyWChar((PWCHAR) rf + 1);
        else if ((bom & 0xFFFFFF) == 0xBFBBEF)   // utf8 bom
          CopyMByte(rf + 3, br - 3);
        else                                     // no known bom, probe the file
          { if (! memchr(rf, 0x00, br))          // 8 bit text has no nulls
              CopyMByte(rf,br);                  // ansi / utf8 no bom
            else
              { PBYTE lf = memchr(rf,0x0A,br);   // lf is always present as 1 byte.
                if (!lf)
                  Exception(0xE0640003);
                if ((!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||   // utf32be no bom
                    (!(*(DWORD*)lf & 0xFFFFFF00)))            // utf32le no bom
                  Exception(0xE0640002);
                if ((lf - rf) & 1)               // big endian? (lf at odd offset)
                  FlipEndian(rf,br);             // utf16be no bom
                CopyWChar((PWCHAR) rf); } } }    // utf16le no bom
    finally
      { free(rf); }
    return 1; }
... and that's just to open the file and convert it to Windows compatible wide character strings...
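The BOM tests in the routine above can also be written byte by byte, which avoids casting the buffer to a DWORD and works the same on any byte order. A minimal sketch in portable C; the demo_ names and the enum are made up for illustration and are not part of the playlist code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    DEMO_ENC_UNKNOWN,   /* no recognizable BOM */
    DEMO_ENC_UTF8,
    DEMO_ENC_UTF16LE,
    DEMO_ENC_UTF16BE,
    DEMO_ENC_UTF32LE,
    DEMO_ENC_UTF32BE
} demo_encoding;

/* Looks at the first bytes of a buffer and reports which BOM, if any,
   it starts with. *bom_len receives the number of BOM bytes to skip. */
static demo_encoding demo_detect_bom(const unsigned char *p, size_t n,
                                     size_t *bom_len)
{
    assert(p != NULL && bom_len != NULL);
    *bom_len = 0;

    /* 4-byte BOMs first: the UTF-32LE BOM begins with the UTF-16LE one. */
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) {
        *bom_len = 4;
        return DEMO_ENC_UTF32LE;
    }
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) {
        *bom_len = 4;
        return DEMO_ENC_UTF32BE;
    }
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        *bom_len = 3;
        return DEMO_ENC_UTF8;
    }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        *bom_len = 2;
        return DEMO_ENC_UTF16LE;
    }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        *bom_len = 2;
        return DEMO_ENC_UTF16BE;
    }
    return DEMO_ENC_UNKNOWN;
}
```

One known blind spot of this ordering: a utf16le file whose first character after the BOM is U+0000 would be reported as utf32le. That is rare enough in text files that most detectors accept it.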
Yes... it's confusing at first... but unicode is the best means of internationalization we have at this time. It can handle even the most complex languages (such as traditional Japanese and Farsi)... Most modern programs are written exclusively in unicode formats, so that at least user entered text is language independent.
For in-depth information on Unicode and its various formats... Google is your friend :D There's a ton of information out there.
Thank you!
I understand that UNICODE is to toggle between utf16le and ansi.
But what is _UNICODE for?
And what happens if I use _MBCS? There are no special functions which can handle utf8.
In Pelles C I cannot find any reference to this macro.
The biggest mystery is the difference between _T, _TEXT and TEXT. When should I use each of them? Why is the inclusion of tchar.h sometimes necessary, sometimes not?
Is there a nice tutorial anywhere?
nancy
Phew!
There are also TCHAR and _TCHAR?
I have found this discussion:
http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
How can I work with utf8 or utf32 in windows?
nancy
Quote from: nancy on February 01, 2012, 09:46:34 AM
Thank you!
I understand that UNICODE is to toggle between utf16le and ansi.
But what is _UNICODE for?
_UNICODE is for the C runtime headers (tchar.h in Pelles C) ... UNICODE is for the Windows API headers. Simple plan, just define them both.
Quote
And what happens if I use _MBCS? There are no special functions which can handle utf8.
In Pelles C I cannot find any reference to this macro.
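That's because it's a VC++ macro: it makes Microsoft's tchar.h map the generic string routines to the multibyte (_mbs...) versions in the VC++ runtime, so it only works with that library.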
In Pelles C use the WCHAR versions of things. You can convert with the
MultiByteToWideChar() (http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx) and
WideCharToMultiByte() (http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx) API functions using the CP_UTF8 code page.
Here are a couple of examples of the conversion routines...
// WCHAR to UTF8 Converter
PCHAR WStrToNStr(PWCHAR WStr)
  { PCHAR nstr = NULL;  // output string
    INT   size;         // buffer size in bytes, including the terminator
    // test and size for conversion
    size = WideCharToMultiByte(CP_UTF8,0,WStr,-1,NULL,0,NULL,NULL);
    // create dynamic string
    if (size > 0)
      { nstr = malloc(size * sizeof(CHAR));
        if (nstr)  // returns NULL if the allocation failed
          WideCharToMultiByte(CP_UTF8,0,WStr,-1,nstr,size,NULL,NULL); }
    return nstr; }
// UTF8 to WCHAR converter
PWCHAR NStrToWStr(PCHAR NStr)
  { PWCHAR wstr = NULL;  // output string
    INT    size;         // buffer size in WCHARs, including the terminator
    // test and size for conversion
    size = MultiByteToWideChar(CP_UTF8,0,NStr,-1,NULL,0);
    // create dynamic string
    if (size > 0)
      { wstr = malloc(size * sizeof(WCHAR));
        if (wstr)  // returns NULL if the allocation failed
          MultiByteToWideChar(CP_UTF8,0,NStr,-1,wstr,size); }
    return wstr; }
Please note, both of these return string pointers that must be released with free().
Quote
The biggest mystery is the difference between _T, _TEXT and TEXT. When should I use each of them? Why is the inclusion of tchar.h sometimes necessary, sometimes not?
In tchar.h both _T and _TEXT map to the __T macro; you can see it defined in the header, and it switches on _UNICODE.
TEXT does the same job, but as far as I know it comes from the Windows headers (winnt.h in the Windows SDK) and switches on UNICODE, which is why you get it from windows.h without including tchar.h. So for all intents and purposes they're the same thing, as long as UNICODE and _UNICODE are defined together. Most likely all three versions are included to "catch" differences between the libraries attached to various compilers.
Quote
Is there a nice tutorial anywhere?
Not that I know of. :(
It's been my experience that trying to write code of any size that will compile as either Ansi or Unicode without a lot of twiddling and tweaking is nearly impossible. I don't write Ansi code for Windows GUI anymore. Everything I do in windows is done with utf16le strings. This is actually a relatively simple process... Define both UNICODE and _UNICODE; use WCHAR and wchar_t, prefix all string literals with L (as in wprintf(L"Hello World"); ), use wchar.h instead of string.h ... etc. It goes along a whole lot easier than trying to get all crafty with dual compiles.
Thing is that when a program is coded exclusively in unicode, the English language user is unaware of the difference, but the guys over in Japan and Russia will be extremely happy to have software they can actually use.
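To make the "wide only" recipe concrete, here is a minimal sketch in standard C: wchar_t buffers, L-prefixed literals, and wchar.h functions in place of their string.h and stdio.h cousins. demo_greet is a made-up helper, and it assumes a C99-conforming swprintf (the one that takes a buffer size).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Formats a greeting into a wide-character buffer.
   Returns the number of characters written, or a negative value
   if the text did not fit into 'cap' characters. */
static int demo_greet(wchar_t *dst, size_t cap, const wchar_t *name)
{
    assert(dst != NULL && name != NULL);
    /* %ls is the conversion for a wchar_t string */
    return swprintf(dst, cap, L"Hello, %ls!", name);
}
```

Every piece of text in the call chain is wide from start to finish, so there is no TCHAR or _T() switching left to get wrong.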
Quote from: nancy on February 01, 2012, 10:24:25 AM
How can I work with utf8 or utf32 in windows?
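Windows up to Win 7 doesn't support utf32 ... That is to say, none of the API calls will work with it, so you have to convert to utf16le first. A careless down-conversion (simply truncating each value to 16 bits) corrupts every character above U+FFFF; a correct one has to write those characters as surrogate pairs.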
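For reference, turning one utf32 value into utf16 is a small piece of arithmetic: values up to U+FFFF are copied as they are, anything above is split into a high and a low surrogate. A minimal sketch in portable C; demo_utf32_to_utf16 is a made-up name, not a Windows API function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts one code point to UTF-16 code units.
   Returns the number of units written to out (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
static size_t demo_utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* reject values past U+10FFFF and the surrogate range itself */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {              /* fits in a single 16 bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate: low 10 bits  */
    return 2;
}
```

Truncating 0x1F600 to 16 bits would give 0xF600, a completely different character; the surrogate pair keeps all of the bits.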
Your best bet is to work internally in utf16le (WCHAR, etc) and use the functions I linked in the previous message to convert your file/network inputs and outputs to and from utf8. I'm aware this is some nasty overhead, but it is only a microsecond or so per string, so your code would still be throttled by disk/network devices with little or no apparent loss of performance.
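If you are curious what the conversion at the output boundary actually does, here is a rough portable sketch of a utf16 to utf8 encoder. On Windows you would simply call WideCharToMultiByte() with CP_UTF8; demo_utf16_to_utf8 is a made-up function written only to show the bit shuffling, and it assumes the input is in the machine's native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes len UTF-16 units from src as UTF-8 into dst (capacity cap bytes)
   and null-terminates the result. Returns the number of bytes written, not
   counting the terminator, or (size_t)-1 on bad input or a too-small buffer. */
static size_t demo_utf16_to_utf8(const uint16_t *src, size_t len,
                                 unsigned char *dst, size_t cap)
{
    size_t i = 0, o = 0;
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;

    while (i < len) {
        uint32_t cp = src[i++];
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            if (i >= len || src[i] < 0xDC00 || src[i] > 0xDFFF)
                return (size_t)-1;                 /* missing low surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i++] - 0xDC00u);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                     /* stray low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need >= cap)                       /* keep room for the terminator */
            return (size_t)-1;

        if (need == 1) {
            dst[o++] = (unsigned char)cp;
        } else if (need == 2) {
            dst[o++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[o++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            dst[o++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    dst[o] = 0;
    return o;
}
```

Plain English text comes out one byte per character, which is why the overhead is hard to notice next to the disk or network time.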
Some handy links...
http://unicode.org/
http://www.unicode.org/standard/WhatIsUnicode.html
http://www.joelonsoftware.com/articles/Unicode.html
http://msdn.microsoft.com/en-us/library/windows/desktop/dd318661(v=vs.85).aspx (vc++ oriented)
http://en.wikipedia.org/wiki/Byte_order_mark
http://en.wikipedia.org/wiki/Endianness