Author Topic: TEXT (Read 6153 times)

nancy · « **on:** February 01, 2012, 12:32:10 AM »

Hello,

I am a little bit confused about the TEXT, _TEXT and _T macros.
To use the _T or the _TEXT macro I have to include tchar.h, but this is not the case with the TEXT macro. What is the difference between these three macros?

A related question is: What is the difference between UNICODE, _UNICODE and _MBCS?

nancy

CommonTater · « **Reply #1 on:** February 01, 2012, 05:18:26 AM »

Mostly the _T, _TEXT and such provide an auto-switching trick for programs that need to be compiled as either Unicode or Ansi versions. These are macros (open tchar.h in the source editor and look at it) that respond to the UNICODE and _UNICODE defines... when the define is present they insert an L before your string literals, when it's not they do nothing ... L"Hello" vs "Hello". The L prefix means "long characters" or "wide characters" (whichever way you know them), so the literal becomes an array of wchar_t instead of char.
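As a rough illustration of that switching trick, here is a minimal, portable sketch that re-creates it by hand. DEMO_UNICODE, DEMO_T, DEMO_TCHAR and demo_tcslen are made-up names for this example only; the real tchar.h keys off _UNICODE and uses names such as _T and TCHAR, and its actual definitions may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for _UNICODE; remove this line to get the char build. */
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
  typedef wchar_t DEMO_TCHAR;      /* character type follows the switch */
  #define DEMO_T_(s) L##s          /* paste an L in front of the literal */
  #define demo_tcslen wcslen       /* length function follows it too */
#else
  typedef char DEMO_TCHAR;
  #define DEMO_T_(s) s             /* leave the literal alone */
  #define demo_tcslen strlen
#endif
#define DEMO_T(s) DEMO_T_(s)       /* extra level so macro arguments expand first */

/* Size of one character in the selected build. */
size_t demo_char_size(void)
{
    return sizeof(DEMO_TCHAR);
}

/* Length of the same source text, whichever build is selected. */
size_t demo_hello_len(void)
{
    const DEMO_TCHAR *s = DEMO_T("Hello");
    assert(s != NULL);
    return demo_tcslen(s);
}
```

The source text stays the same in both builds; only the define at the top decides whether "Hello" is stored as char or wchar_t.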

The UNICODE define in Windows switches all windows API calls to their wide character versions. To see the switching mechanism look at some windows headers in the source editor (just don't edit them or save them!) you will find most API calls have two versions... FunctionNameA() for Ansi and FunctionNameW() for unicode. With UNICODE defined, FunctionName() gets you the W version, with it not defined you get the A version.
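The A/W switching can be sketched the same way. This is not taken from the Windows headers: DemoLengthA, DemoLengthW, DemoLength, DEMO_TEXT and DEMO_UNICODE are hypothetical names that only mimic the FunctionNameA / FunctionNameW pattern, so treat it as an illustration of the idea rather than the real mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for UNICODE; remove this line to select the A version. */
#define DEMO_UNICODE

/* Two real functions, one per character type (names are invented). */
size_t DemoLengthA(const char *s)    { assert(s != NULL); return strlen(s); }
size_t DemoLengthW(const wchar_t *s) { assert(s != NULL); return wcslen(s); }

/* The undecorated name is only a macro that picks one of the two. */
#ifdef DEMO_UNICODE
  #define DemoLength   DemoLengthW
  #define DEMO_TEXT(s) L##s
#else
  #define DemoLength   DemoLengthA
  #define DEMO_TEXT(s) s
#endif

/* Calling code uses the neutral names and never mentions A or W. */
size_t demo_call(void)
{
    return DemoLength(DEMO_TEXT("playlist.m3u"));
}
```

Because the literal and the function are switched by the same define, the call keeps compiling in either mode; mixing a plain "..." literal with the W version is the usual source of type warnings.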

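In windows UNICODE is utf16le unicode... In linux and other operating systems it may signify a different unicode format... and there are several, including utf8, which Windows handles as one of its multibyte code pages (CP_UTF8). Note that "mbcs" (multibyte character set) in the Microsoft headers usually refers to the older double-byte code pages rather than to utf8.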

Text isn't just text anymore... now it's a huge jumble of formats and character sets. To give you an example, here's what it takes to open a simple playlist where the source file might be any of several unicode formats:

Code:


// parse input file to strings
void CopyWChar(PWCHAR Buf)
  { PWCHAR tok;           // token string
    PWCHAR nt;            // next token
    WCHAR  fp[MAX_PATH];  // line buffer
    nt = wcstok(Buf,L"\r\n",&tok);
    while(nt)
      { // ignore comments and urls
        if ((nt[0] != '#') && (!PathIsURL(nt)))
          { // test for relative paths 
            if (PathIsRelative( nt ))
              { wcscpy(fp,FilePath);
                wcscat(fp,nt); }
            else
              wcscpy(fp,nt);
            // test for folders
           if (PathFileExists( fp ))
              { if (PathIsDirectory( fp ))
                  ExpandFolder( fp );
                else
                  AddLine( fp ); } }
        nt = wcstok(NULL,L"\r\n",&tok); }   // NULL continues from the position saved in tok
    // randomize here
    ShuffleList();
    SavePlayerFile(); }

// convert mbyte to utf16le for parser
void CopyMByte(PBYTE Buf, DWORD Bytes)
  { PWCHAR ut = calloc(Bytes + 1,sizeof(WCHAR));     // unicode buffer
    try
      { if (!ut || MultiByteToWideChar(CP_UTF8,0,(PCHAR)Buf,Bytes,ut,Bytes) < 1)   // last argument counts WCHARs, not bytes
          Exception(0xE0640006);
        CopyWChar( ut ); }    
    finally
      { free (ut); } }
 
// convert UTF-16 byte order
void FlipEndian(PBYTE Buf, DWORD Bytes)
  { BYTE t; // temp for swaps
    for (DWORD i = 0; i + 1 < Bytes; i += 2)   // unsigned index; stops before a trailing odd byte
      { t = Buf[i];
        Buf[i] = Buf[i + 1];
        Buf[i + 1] = t; } }
 
// open and translate file
BOOL M3ULaunch(PWCHAR FileName)
  { PBYTE  rf;      // raw file data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle 
      DWORD  fs;    // file size
      // get path to file
      wcsncpy(FilePath,FileName,MAX_PATH);
      PathRemoveFileSpec(FilePath);
      wcscat(FilePath,L"\\");
      // open the file
      pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if (pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc(fs + 4, sizeof(BYTE));   // zero padding keeps the 4-byte BOM/LF probes in bounds
      if (! ReadFile(pl, rf, fs, &br, NULL))
        Exception(GetLastError());
      CloseHandle(pl);  
      if (br != fs)
        Exception(0xE0640007); } 
    try                                   
     { DWORD bom = *(DWORD*)rf;
       if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le bom  
         Exception(0xE0640002);                         // utf32be bom  
       else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
         { FlipEndian(rf,br);
           CopyWChar((PWCHAR) rf + 1); }
       else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
         CopyWChar((PWCHAR) rf + 1);  
       else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
         CopyMByte(rf + 3, br - 3);
       else                                             // no known bom, probe the file
         { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
             CopyMByte(rf,br);                          // ansi / utf8 no bom
           else 
            { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
              if (!lf) 
                Exception(0xE0640003);
              if (((lf - rf >= 3) && !(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||  // utf32be no bom (never read before rf)
                   (!(*(DWORD*)lf & 0xFFFFFF00)))           // utf32le no bom
                 Exception(0xE0640002);    
              if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                FlipEndian(rf,br);                      // utf16be no bom  
              CopyWChar((PWCHAR) rf);  } } }            // utf16le no bom
     finally  
      { free(rf); }
    return 1; }

... and that's just to open the file and convert it to Windows compatible wide character strings...
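Stripped of the file handling, the BOM test in the code above boils down to something like the following. It is only a sketch: demo_encoding and demo_detect_bom are hypothetical names, the bytes are compared one at a time so the result does not depend on the CPU's byte order, and the no-BOM probing is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
} demo_encoding;

/* Looks for a byte order mark at the start of p (n bytes long).
   Stores the BOM length in *bom_len (0 if there is none). */
demo_encoding demo_detect_bom(const unsigned char *p, size_t n, size_t *bom_len)
{
    assert(p != NULL && bom_len != NULL);
    *bom_len = 0;

    /* UTF-32 first: FF FE 00 00 also begins with the UTF-16LE mark. */
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    return ENC_UNKNOWN;   /* no BOM: the caller has to probe the content */
}
```

The order of the tests matters for the same reason it does in the playlist code: a utf16le file whose first character is U+0000 is indistinguishable from a utf32le BOM, so the 4-byte marks are checked before the 2-byte ones.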

Yes... it's confusing at first... but unicode is the best means of internationalization we have at this time. It can handle even the most complex languages (such as traditional Japanese and Farsi)... Most modern programs are written exclusively in unicode formats, so that at least user entered text is language independent.

For in-depth information on Unicode and its various formats... Google is your friend

There's a ton of information out there.

nancy · « **Reply #2 on:** February 01, 2012, 09:46:34 AM »

Thank you!

I understand that UNICODE is to toggle between utf16le and ansi.
But what is _UNICODE for?
And what happens if I use _MBCS? There are no special functions which can handle utf8.
In Pelles C I cannot find any reference to this macro.

The biggest mystery is the difference between _T, _TEXT and TEXT. And when should I use one of them? Why is the inclusion of tchar.h sometimes necessary, sometimes not?

Is there a nice tutorial anywhere?

nancy

nancy · « **Reply #3 on:** February 01, 2012, 10:24:25 AM »

Phew!

There are also TCHAR and _TCHAR?

I have found this discussion:

http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

How can I work with utf8 or utf32 in windows?

nancy

CommonTater · « **Reply #4 on:** February 01, 2012, 04:17:51 PM »

Quote from: nancy on February 01, 2012, 09:46:34 AM

Thank you!

I understand that UNICODE is to toggle between utf16le and ansi.
But what is _UNICODE for?

_UNICODE is for the C runtime headers that ship with Pelles C (tchar.h and friends) ... UNICODE is for the Windows headers. Simple plan, just define them both.

Quote

And what happens if I use _MBCS? There are no special functions which can handle utf8.
In Pelles C I cannot find any reference to this macro.

That's because it's a VC++ macro that only works with its Strings library.

In Pelles C use the WCHAR versions of things. You can convert with the
MultiByteToWideChar() and WideCharToMultiByte() API functions using the CP_UTF8 code page.

Here are a couple of examples of the conversion routines...

Code:

// WCHAR to UTF8 Converter
PCHAR WStrToNStr(PWCHAR WStr)
  { PCHAR nstr = NULL;    // output string
    INT   size;           // buffer size
    // test and size for conversion
    size = WideCharToMultiByte(CP_UTF8,0,WStr,-1,NULL,0,NULL,NULL);
    // create dynamic string
    if (size > 0)
      { nstr = malloc(size * sizeof(CHAR));
        if (nstr)   // malloc can fail; return NULL in that case
          WideCharToMultiByte(CP_UTF8,0,WStr,-1,nstr,size,NULL,NULL); }
    return nstr; }

Code:

// UTF8 to WCHAR converter
PWCHAR NStrToWStr(PCHAR NStr)
  { PWCHAR  wstr = NULL;        // output string
    INT     size;               // buffer size
    // test and size for conversion
    size = MultiByteToWideChar(CP_UTF8,0,NStr,-1,NULL,0); 
    // create dynamic string  
    if (size > 0)
      { wstr = malloc(size * sizeof(WCHAR));
        if (wstr)   // malloc can fail; return NULL in that case
          MultiByteToWideChar(CP_UTF8,0,NStr,-1,wstr,size); }
        
    return wstr; }

Please note, both of these return string pointers that must be released with free().

Quote

The biggest mystery is the difference between _T, _TEXT and TEXT. And when should I use one of them? Why is the inclusion of tchar.h sometimes necessary, sometimes not?

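In tchar.h both _T and _TEXT map to the __T macro, you can see it defined in the header, and they switch on _UNICODE.
TEXT does the same job but comes from the Windows headers (winnt.h in the Microsoft SDK, pulled in by windows.h) and switches on UNICODE instead, which is why it works without tchar.h. So for all intents and purposes they're the same thing as long as you define both UNICODE and _UNICODE. Most likely all three versions are included to "catch" differences between the libraries attached to various compilers.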

Quote

Is there a nice tutorial anywhere?

Not that I know of.

It's been my experience that trying to write code of any size that will compile as either Ansi or Unicode without a lot of twiddling and tweaking is nearly impossible. I don't write Ansi code for Windows GUI anymore. Everything I do in windows is done with utf16le strings. This is actually a relatively simple process... Define both UNICODE and _UNICODE; use WCHAR and wchar_t, prefix all string literals with L (as in wprintf(L"Hello World"); ), use wchar.h instead of string.h ... etc. It goes along a whole lot easier than trying to get all crafty with dual compiles.
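To give a feel for that Unicode-only style, here is a small sketch that uses nothing but wchar.h. demo_join_path is a hypothetical helper, not part of any library, and it is written with plain wchar_t (rather than WCHAR) so it also compiles outside Windows.

```c
#include <assert.h>
#include <wchar.h>

/* Builds folder + L"\\" + file in out (capacity cap, counted in wchar_t).
   Returns 1 on success, 0 if the result would not fit. */
int demo_join_path(wchar_t *out, size_t cap,
                   const wchar_t *folder, const wchar_t *file)
{
    assert(out != NULL && folder != NULL && file != NULL);

    /* folder + separator + file + terminating L'\0' */
    if (wcslen(folder) + 1 + wcslen(file) + 1 > cap)
        return 0;

    wcscpy(out, folder);      /* wide versions of strcpy / strcat */
    wcscat(out, L"\\");       /* every literal carries the L prefix */
    wcscat(out, file);
    return 1;
}
```

Everything is wide from end to end: the buffer type, the literals and the library calls. There is no TCHAR or _T anywhere, which is what makes this style easier to keep consistent than a dual Ansi/Unicode build.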

Thing is that when a program is coded exclusively as unicode, the English language user is unaware of the difference but the guys over in Japan and Russia will be extremely happy to have software they can actually use.

CommonTater · « **Reply #5 on:** February 01, 2012, 04:32:23 PM »

Quote from: nancy on February 01, 2012, 10:24:25 AM

How can I work with utf8 or utf32 in windows?

Windows, up to Win 7 at least, doesn't support utf32 ... That is to say none of the API calls will work with it, and a careless down-conversion to utf16le can corrupt data: any code point above U+FFFF has to be written as a surrogate pair, not truncated to 16 bits.

Your best bet is to work internally in utf16le (WCHAR, etc) and use the functions I posted in the previous message to convert your file/network inputs and outputs to and from utf8. I'm aware this is some nasty overhead, but it is only a microsecond or so per string, so your code would still be throttled by disk/network devices with little or no apparent loss of performance.

Some handy links...
http://unicode.org/
http://www.unicode.org/standard/WhatIsUnicode.html
http://www.joelonsoftware.com/articles/Unicode.html
http://msdn.microsoft.com/en-us/library/windows/desktop/dd318661(v=vs.85).aspx (vc++ oriented)
http://en.wikipedia.org/wiki/Byte_order_mark
http://en.wikipedia.org/wiki/Endianness
