Unicode BOM....

CommonTater · February 16, 2011, 08:49:31 PM

I've been working on a small program that has to be able to flexibly open either ascii or unicode text files. Internally the program is operating on unicode so ascii files have to be converted when loading and possibly re-converted when saving... The idea is to open either format and give the user his choice of save formats.

What I need is a simple means of telling which is which, "on the fly" as it were. Where I'm not quite getting my head around this part of the problem is the Byte Order Mark... I know what it is, but I'm not entirely sure how to use it in this case. My current understanding says...

1) If the first character is > 0 and < 128 it's an ascii file
2) If the BOM is not present then either the first or second character will be 0.
3) If the BOM is present the first two characters should be FE FF for utf16be or FF FE for utf16le.
4) if #3 is FE FF then I have to flip the byte order for Windows (which is utf16le)

Is this correct?

This is a windows only program but the files may get passed around...
When writing Unicode, should I ask about byte order or just use windows native format?

Bitbeisser · February 17, 2011, 01:12:23 AM

Quote from: CommonTater on February 16, 2011, 08:49:31 PM
1) If the first character is > 0 and < 128 it's an ascii file

Maybe, and then only US ASCII, not accounting for non-Unicode Windows text files, which can have valid characters above 127.

Quote
2) If the BOM is not present then either the first or second character will be 0.

No, 00 00 FE FF indicates a UTF-32, big-endian encoding

Quote
3) If the BOM is present the first two characters should be FE FF for utf16be or FF FE for utf16le.

Maybe, as valid BOM sequences are

00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF      UTF-16, big-endian
FF FE      UTF-16, little-endian
EF BB BF      UTF-8
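
The table above can be turned into a small lookup. This is only a sketch: the `Encoding` enum and `detect_bom` name are made up for illustration and do not come from Notepad++ or any other code mentioned in this thread. The one real subtlety is ordering: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE), so the longer signatures have to be tested first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; none of them come from the thread. */
typedef enum {
    ENC_UNKNOWN,    /* no recognizable BOM */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UTF32LE,
    ENC_UTF32BE
} Encoding;

/* Inspect the first bytes of buf and report the encoding its BOM implies.
   *bom_len receives the number of BOM bytes to skip (0 if none was found). */
Encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    /* Longest signatures first, so UTF-32LE is not mistaken for UTF-16LE. */
    static const struct {
        Encoding enc;
        size_t n;
        unsigned char sig[4];
    } table[] = {
        { ENC_UTF32BE, 4, { 0x00, 0x00, 0xFE, 0xFF } },
        { ENC_UTF32LE, 4, { 0xFF, 0xFE, 0x00, 0x00 } },
        { ENC_UTF8,    3, { 0xEF, 0xBB, 0xBF, 0x00 } },
        { ENC_UTF16BE, 2, { 0xFE, 0xFF, 0x00, 0x00 } },
        { ENC_UTF16LE, 2, { 0xFF, 0xFE, 0x00, 0x00 } },
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].n && memcmp(buf, table[i].sig, table[i].n) == 0) {
            *bom_len = table[i].n;
            return table[i].enc;
        }
    }
    *bom_len = 0;
    return ENC_UNKNOWN;
}
```

A UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE by this ordering. That is rare in text files, but it is one reason BOM sniffing is a strong hint rather than proof.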

Quote
4) if #3 is FE FF then I have to flip the byte order for Windows (which is utf16le)

Well, don't forget UTF-32LE

Quote
When writing Unicode, should I ask about byte order or just use windows native format?

You could get away with assuming what you refer to as "native Windows".

In general, I don't think that there is a 100% "detection algorithm" for the encoding of a text file...

Ralf

Stefan Pendl · February 17, 2011, 01:41:53 AM

You may check out the source code of Notepad++ or Scintilla, whichever is responsible for detecting the encoding.
Both are available at http://sf.net/

It is C++, but one should be able to decipher the needed code.

CommonTater · February 17, 2011, 01:50:01 AM

Quote
In general, I don't think that there is a 100% "detection algorithm" for the encoding of a text file...

Hmmmm... bummer. Makes the whole idea of writing a format marker kind of pointless doesn't it?

Then of course there's the problem of line endings... I've seen all combinations of cr, lf, cr-lf, lf-cr and I'm going to have to auto detect these as well.

CommonTater · February 17, 2011, 02:02:27 AM

Quote
Both are available at http://sf.net/

WOW... so much for sourceforge... the site repeatedly caused browser lockups... I couldn't get to anything...

CommonTater · February 17, 2011, 02:07:50 AM

Ok... so how common is utf32?

I know about it (and now I know more about it) but I don't think I've ever seen it in use...

Edit: Oi such a day I'm having!

Stefan Pendl · February 17, 2011, 02:23:47 AM

Quote from: CommonTater on February 17, 2011, 02:02:27 AM
WOW... so much for sourceforge... the site repeatedly caused browser lockups... I couldn't get to anything...

Try http://notepad-plus.svn.sourceforge.net/viewvc/notepad-plus/trunk/ and download the tarball (link at the lower left corner of the page).

CommonTater · February 17, 2011, 02:45:45 AM

Quote from: Stefan Pendl on February 17, 2011, 02:23:47 AM
Try http://notepad-plus.svn.sourceforge.net/viewvc/notepad-plus/trunk/ and download the tarball (link at the lower left corner of the page).

That one worked, thanks Stefan...

I've taken a look at how they do it (uni8_16.cpp ... DetermineEncoding) and it looks easy enough. They're really just examining the first byte to determine what to do next... This I think I can manage.

Thanks guys!

I think, since Windows doesn't have a 32bit character mode, and C has no library functions for it, I'm going to skip utf32 for now and see how long it takes to come back and bite me... The utf16 support is a big step forward for this type of file anyway so I should be good to go for a couple of years (fingers crossed)

CommonTater · February 19, 2011, 05:34:49 PM

Thought you might be interested in what I came up with to solve this problem...
I'm still testing it, but so far so good...

BOOL M3UOpen(PWCHAR Filename)
  { PBYTE  rf;      // raw data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle
      DWORD  fs;    // file size
      pl = CreateFile(Filename,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if(pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc((fs + 2), sizeof(BYTE));
      ReadFile(pl,rf,fs,&br,NULL);
      CloseHandle(pl);  
      if (br != fs)
        Exception(GetLastError()); } 
    // Identify text format
    try                                   
     { if (*(DWORD*)rf == 0x0000FEFF ||             // utf32le bom  
           *(DWORD*)rf == 0xFFFE0000)               // utf32be bom  
         Exception(0xE0640002);
       else if (*(WORD*)rf == 0xFFFE)               // utf16be bom
         { FlipBytes(rf + 2,br - 2);
           CopyUnicode(rf + 2,br - 2); }
       else if (*(WORD*)rf == 0xFEFF)               // utf16le bom
         CopyUnicode(rf + 2,br - 2);  
       else if (*(WORD*)rf == 0xBBEF)               // utf8 bom
         CopyMByte(rf + 3,br - 3);
       else                                         // no known bom, probe the file
         { PBYTE lf = NULL; // line feed
           lf = memchr(rf,0x0A,br);                 // lf is always recognizable in 1 byte.
           if (!lf)
             Exception(0xE0640003);
           if (*(lf - 1) != 0 || *(lf + 1) != 0)    // utf8 no bom
             CopyMByte(rf,br);
           else if ((DWORD)(lf - rf) & 1)           // big endian (lf at odd offset)
             { if (!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) //utf32be no bom
                 Exception(0xE0640002);    
               else
                 { FlipBytes(rf,br);                // utf16be no bom  
                   CopyUnicode(rf,br); } }
           else                                     // little endian (lf at even offset)
             { if (!(*(DWORD*)lf & 0xFFFFFF00))     // utf32le no bom
                 Exception(0xE0640002);                           
               else 
                 CopyUnicode(rf,br); } } }          // utf16le no bom
    finally  
      { free(rf); }
    return 1; }

The exception codes say "utf32 not supported" and "File has errors".
The copy routines translate and save the file as utf16le (i.e. Windows Unicode) with BOM.

Thanks for the help, guys...
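
FlipBytes itself isn't shown in the post. A minimal sketch, assuming it does nothing more than swap each adjacent pair of bytes in place (which is all a UTF-16BE to UTF-16LE conversion needs), might look like this. The name and argument order mirror the `FlipBytes(rf + 2, br - 2)` call above; the body is a guess, not the author's code.

```c
#include <stddef.h>

/* Swap each adjacent pair of bytes in place, turning UTF-16BE code units
   into UTF-16LE (or the reverse). A trailing odd byte is left untouched.
   Assumed implementation; the thread only shows the call site. */
void FlipBytes(unsigned char *data, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
}
```

Because the swap is its own inverse, the same routine could be reused when saving a file back out as big-endian.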

Bitbeisser · February 20, 2011, 12:38:45 AM

Quote from: CommonTater on February 17, 2011, 01:50:01 AM
Quote
In general, I don't think that there is a 100% "detection algorithm" for the encoding of a text file...

Hmmmm... bummer. Makes the whole idea of writing a format marker kind of pointless doesn't it?

Then of course there's the problem of line endings... I've seen all combinations of cr, lf, cr-lf, lf-cr and I'm going to have to auto detect these as well.

There should be only three that you would have to consider

CR-LF : DOS/Windows
CR : Mac OS
LF : Unix/Linux

Years back when I worked at a CAD/CAM software manufacturer and was responsible for modules to perform data exchange via DXF, IGES, etc, I ran a lot into the CR-CR-LF format with data coming from mainly IBM, DEC and SGI minis but can't say that I have seen this in the last 10 years or so.

It isn't very likely (though not impossible) that you have more than one line-ending style within one file, so a simple test algorithm would be:
- read n bytes from the start of the file, where n is at least 50% larger than the average expected line length.
- scan for the first occurrence of a CR
- if found, test for the character after the CR, if it's LF, you have a DOS/Windows text file, if not, you have likely a Mac file
- if not found, scan for the first occurrence of a LF, if found you have a Unix/Linux text file, if not you aren't likely to have a text file at all or our buffer size was too small
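
The steps above could be sketched roughly as follows for 8-bit text. The `EolStyle` enum and `detect_eol` name are invented for this example; for UTF-16 data the same logic would have to compare 16-bit code units instead of bytes.

```c
#include <stddef.h>

/* Hypothetical names for this sketch. */
typedef enum { EOL_NONE, EOL_CRLF, EOL_CR, EOL_LF } EolStyle;

/* Classify the line-ending style of 8-bit text from its first n bytes. */
EolStyle detect_eol(const unsigned char *buf, size_t n)
{
    size_t i;

    /* Find the first CR and look at the byte that follows it. */
    for (i = 0; i < n; i++) {
        if (buf[i] == '\r')
            return (i + 1 < n && buf[i + 1] == '\n') ? EOL_CRLF : EOL_CR;
    }
    /* No CR anywhere in the sample, so fall back to looking for a bare LF. */
    for (i = 0; i < n; i++) {
        if (buf[i] == '\n')
            return EOL_LF;
    }
    return EOL_NONE;    /* not text, or the sample was too small */
}
```

One edge case to keep in mind: if the sample happens to end exactly on a CR, this reports a Mac-style file even though an LF may follow just past the buffer, so reading a little more data before deciding would be safer.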

Ralf

Bitbeisser · February 20, 2011, 12:56:50 AM

Quote from: CommonTater on February 17, 2011, 02:07:50 AM
Ok... so how common is utf32?

I know about it (and now I know more about it) but I don't think I've ever seen it in use...

Edit: Oi such a day I'm having!

I think you might have a chance of UTF32 in Asian language setups (CJKV) of Windows, not sure about Linux. The only time I ever ran into it was after the download of a document from a Chinese web site.
Depending on the purpose/realm of use of your software, you might be able to ignore it...

Ralf

CommonTater · February 20, 2011, 03:19:52 AM

Quote from: Bitbeisser on February 20, 2011, 12:38:45 AM
There should be only three that you would have to consider

CR-LF : DOS/Windows
CR : Mac OS
LF : Unix/Linux

Thanks Ralf... I hit on a really simple solution to that problem. Since I'm breaking the file up into strings for some parsing... I just went through and trashed all the CRs and LFs... the combination doesn't matter... they're gone. I can reinsert Windows style line ends when I save the files.
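
For anyone reading along, the "trash every CR and LF" approach could look something like the sketch below. It assumes the text has already been converted to wide characters; `split_lines` is a made-up helper name, not the routine actually used in the program.

```c
#include <stddef.h>
#include <wchar.h>

/* Overwrite every CR and LF in text[0..len) with L'\0' and record the start
   of each non-empty run in lines[]. Returns the number of lines stored (at
   most max_lines). text must have room for a terminator at text[len].
   Hypothetical helper; the thread does not show the author's version. */
size_t split_lines(wchar_t *text, size_t len, wchar_t **lines, size_t max_lines)
{
    size_t count = 0;
    size_t i;
    int in_line = 0;

    text[len] = L'\0';                     /* terminate the final line */
    for (i = 0; i < len; i++) {
        if (text[i] == L'\r' || text[i] == L'\n') {
            text[i] = L'\0';               /* the break itself is discarded */
            in_line = 0;
        } else if (!in_line) {
            if (count == max_lines)
                break;
            lines[count++] = &text[i];     /* first character of a new line */
            in_line = 1;
        }
    }
    return count;
}
```

A side effect of ignoring the exact combination is that blank lines disappear, since CR-LF, LF-CR and two consecutive LFs all collapse the same way. For playlist entries that is probably what you want anyway.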

As you may have deduced from the function name these are M3U format playlists that need to be transformed for use on windows. Almost all of them are either English or French. Problem is they've been handed around so much and saved on other machines often enough they've ended up being in a bunch of disparate formats...

TimoVJL · February 21, 2011, 08:11:55 AM

http://en.wikipedia.org/wiki/Byte_order_mark

Change this:

else if (*(WORD*)rf == 0xBBEF) // utf8 bom

to this:

else if ((*(DWORD*)rf & 0x00FFFFFF) == 0x00BFBBEF) // utf8 bom
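
The parentheses around the mask in that last line are doing real work: in C, `==` binds tighter than `&`, so without them the comparison happens first and the mask is applied to its 0-or-1 result. A small illustration follows, where `first4` stands for the first four bytes of the file loaded as a little-endian 32-bit value (as on Windows/x86). Both function names are made up for this example.

```c
/* The masked test with explicit parentheses: true when the low three
   bytes are EF BB BF, whatever the fourth byte is. */
int utf8_bom_masked(unsigned long first4)
{
    return (first4 & 0x00FFFFFFUL) == 0x00BFBBEFUL;
}

/* How C parses "first4 & 0x00FFFFFF == 0x00BFBBEF" when the inner
   parentheses are left out: the two constants are compared first (giving 0),
   then ANDed with first4, so the result is always false. */
int utf8_bom_as_parsed_without_parens(unsigned long first4)
{
    return (first4 & (0x00FFFFFFUL == 0x00BFBBEFUL)) != 0;
}
```

With the unparenthesized form, a file that really does start with a UTF-8 BOM would fall through to the "no known bom" probing branch.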

CommonTater · February 21, 2011, 03:11:45 PM

Quote from: timovjl on February 21, 2011, 08:11:55 AM
http://en.wikipedia.org/wiki/Byte_order_mark

else if (*(WORD*)rf == 0xBBEF) // utf8 bom
to this
else if ((*(DWORD*)rf & 0x00FFFFFF) == 0x00BFBBEF) // utf8 bom

Thanks Timo... Already done...
Also had to redo the line about utf8 with no bom... it's now split into the two pieces in the last section and working much better...

Here's the current version...

BOOL M3UOpen(PWCHAR Filename)
  { PBYTE  rf;      // raw data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle
      DWORD  fs;    // file size
      pl = CreateFile(Filename,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if(pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc((fs + 2), sizeof(BYTE));
      ReadFile(pl,rf,fs,&br,NULL);
      CloseHandle(pl);  
      if (br != fs)
        Exception(GetLastError()); } 
    // Identify text format
    try                                   
     { if (*(DWORD*)rf == 0x0000FEFF ||             // utf32le bom  
           *(DWORD*)rf == 0xFFFE0000)               // utf32be bom  
         Exception(0xE0640002);
       else if (*(WORD*)rf == 0xFFFE)               // utf16be bom
         { FlipBytes(rf + 2,br - 2);
           CopyUnicode(rf + 2,br - 2); }
       else if (*(WORD*)rf == 0xFEFF)               // utf16le bom
         CopyUnicode(rf + 2,br - 2);  
       else if ((*(DWORD*)rf & 0x00FFFFFF) == 0x00BFBBEF) // utf8 bom
         CopyMByte(rf + 3,br - 3);
       else                                         // no known bom, probe the file
         { PBYTE lf = NULL; // points to line feed
           lf = memchr(rf,0x0A,br);                 // lf is always recognizable in 1 byte.
           if (!lf)
             Exception(0xE0640003);
           if ((DWORD)(lf - rf) & 1)                // big endian? (lf at odd offset)
             { if (*(lf - 1) != 0)                  // utf8  no bom
                 CopyMByte(rf,br);                  
               else if (!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) //utf32be no bom
                 Exception(0xE0640002);    
               else
                 { FlipBytes(rf,br);                // utf16be no bom  
                   CopyUnicode(rf,br); } }
           else                                     // little endian? (lf at even offset)
             { if (*(lf + 1) != 0)
                 CopyMByte(rf,br);                  // utf8 no bom
               else if (!(*(DWORD*)lf & 0xFFFFFF00)) // utf32le no bom
                   Exception(0xE0640002);                           
               else 
                 CopyUnicode(rf,br); } } }          // utf16le no bom
    finally  
      { free(rf); }
    return 1; }

CommonTater · March 09, 2011, 08:58:53 PM

By way of an update, here's the final code, after some debugging...

// open and translate file
BOOL M3UOpen(PWCHAR FileName)
  { PBYTE  rf;      // raw file data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle 
      DWORD  fs;    // file size
      // get path to file
      wcsncpy(FilePath,FileName,MAX_PATH);
      PathRemoveFileSpec(FilePath);
      wcscat(FilePath,L"\\");
      // open the file
      pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if (pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc(fs + 2, sizeof(BYTE));
      if (! ReadFile(pl, rf, fs, &br, NULL))
        Exception(GetLastError());
      CloseHandle(pl);  
      if (br != fs)
        Exception(0xE0640007); } 
    try                                   
     { DWORD bom = *(DWORD*)rf;
       if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le bom  
         Exception(0xE0640002);                         // utf32be bom  
       else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
         { FlipEndian(rf,br);
           CopyWchar((PWCHAR) rf + 1); }
       else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
         CopyWchar((PWCHAR) rf + 1);  
       else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
         CopyMByte(rf + 3, br - 3);
       else                                             // no known bom, probe the file
         { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
             CopyMByte(rf,br);                          // ansi / utf8 no bom
           else 
            { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
              if (!lf) 
                Exception(0xE0640003);
              if ((!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||    //utf32be no bom
                   (!(*(DWORD*)lf & 0xFFFFFF00)))           //utf32le no bom
                 Exception(0xE0640002);    
              if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                FlipEndian(rf,br);                      // utf16be no bom  
              CopyWchar((PWCHAR) rf);  } } }            // utf16le no bom
     finally  
      { free(rf); }
    return 1; }
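
For comparison, the BOM-less probing branch above can also be written byte by byte, which avoids the unaligned DWORD reads and never looks before the start or past the end of the buffer. This is only a sketch of the same heuristic under the same assumption (the first 0x0A byte really is a line feed); `ProbeResult` and `probe_no_bom` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
typedef enum {
    PROBE_8BIT,       /* ANSI or UTF-8 without BOM */
    PROBE_UTF16LE,
    PROBE_UTF16BE,
    PROBE_UTF32,      /* either byte order; unsupported in the thread's code */
    PROBE_ERROR       /* NUL bytes present but no line feed found */
} ProbeResult;

/* Guess the encoding of BOM-less text from the bytes around its first LF. */
ProbeResult probe_no_bom(const unsigned char *buf, size_t len)
{
    const unsigned char *lf;
    size_t pos;

    if (memchr(buf, 0x00, len) == NULL)
        return PROBE_8BIT;                 /* 8-bit text has no NUL bytes */

    lf = memchr(buf, 0x0A, len);
    if (lf == NULL)
        return PROBE_ERROR;
    pos = (size_t)(lf - buf);

    /* UTF-32BE stores LF as 00 00 00 0A, UTF-32LE as 0A 00 00 00. */
    if (pos >= 3 && lf[-1] == 0 && lf[-2] == 0 && lf[-3] == 0)
        return PROBE_UTF32;
    if (pos + 3 < len && lf[1] == 0 && lf[2] == 0 && lf[3] == 0)
        return PROBE_UTF32;

    /* UTF-16BE stores LF as 00 0A (odd offset), UTF-16LE as 0A 00 (even). */
    return (pos & 1) ? PROBE_UTF16BE : PROBE_UTF16LE;
}
```

Like the original, this can be fooled by UTF-16 text containing a character whose encoding includes a 0x0A byte before the first real line feed (U+010A, for instance), so it remains a heuristic rather than a guarantee.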
