I've been working on a small program that has to be able to flexibly open either ascii or unicode text files. Internally the program is operating on unicode so ascii files have to be converted when loading and possibly re-converted when saving... The idea is to open either format and give the user his choice of save formats.
What I need is a simple means of telling which is which, "on the fly" as it were. Where I'm not quite getting my head around this part of the problem is the Byte Order Mark... I know what it is, but I'm not entirely sure how to use it in this case. My current understanding says...
1) If the first character is > 0 and < 128 it's an ascii file
2) If the BOM is not present then either the first or second character will be 0.
3) If the BOM is present the first two characters should be FE FF for utf16be or FF FE for utf16le.
4) if #3 is FE FF then I have to flip the byte order for Windows (which is utf16le)
Is this correct?
This is a windows only program but the files may get passed around...
When writing Unicode, should I ask about byte order or just use windows native format?
Quote from: CommonTater on February 16, 2011, 08:49:31 PM
1) If the first character is > 0 and < 128 it's an ascii file
Maybe, and then only US ASCII, not accounting for non-Unicode Windows text files which can have valid characters > 128
Quote
2) If the BOM is not present then either the first or second character will be 0.
No, 00 00 FE FF indicates a UTF-32, big-endian encoding
Quote
3) If the BOM is present the first two characters should be FE FF for utf16be or FF FE for utf16le.
Maybe, as valid BOM sequences are
00 00 FE FF UTF-32, big-endian
FF FE 00 00 UTF-32, little-endian
FE FF UTF-16, big-endian
FF FE UTF-16, little-endian
EF BB BF UTF-8
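As a rough illustration of how the table above could be checked, here is a minimal sketch in portable C. The names detect_bom and bom_encoding are made up for this example and come from no library. It compares bytes one at a time, so the result does not depend on the endianness of the machine running it. The 4-byte UTF-32 entries are tested before the 2-byte UTF-16 ones because FF FE 00 00 also begins with FF FE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
typedef enum {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
} bom_encoding;

/* Compare the first bytes of buf against the known BOM sequences.
   On a match, *bom_len receives the number of BOM bytes to skip;
   otherwise it is set to 0 and ENC_UNKNOWN is returned. */
static bom_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    /* Longest sequences first, so UTF-32LE is not mistaken for UTF-16LE. */
    static const struct {
        unsigned char bytes[4];
        size_t n;
        bom_encoding enc;
    } table[] = {
        { {0x00, 0x00, 0xFE, 0xFF}, 4, ENC_UTF32BE },
        { {0xFF, 0xFE, 0x00, 0x00}, 4, ENC_UTF32LE },
        { {0xEF, 0xBB, 0xBF},       3, ENC_UTF8    },
        { {0xFE, 0xFF},             2, ENC_UTF16BE },
        { {0xFF, 0xFE},             2, ENC_UTF16LE },
    };

    assert(buf != NULL && bom_len != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].n && memcmp(buf, table[i].bytes, table[i].n) == 0) {
            *bom_len = table[i].n;
            return table[i].enc;
        }
    }
    *bom_len = 0;
    return ENC_UNKNOWN;
}
```

One known ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE. That should be very rare in real text files.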
Quote
4) if #3 is FE FF then I have to flip the byte order for Windows (which is utf16le)
Well, don't forget UTF-32LE
Quote
When writing Unicode, should I ask about byte order or just use windows native format?
You could get away with assuming what you refer to as "native Windows".
In general, I don't think that there is a 100% "detection algorithm" for the encoding of a text file...
Ralf
You may check out the source code of Notepad++ or Scintilla, whichever is responsible for detecting the encoding.
Both are available at http://sf.net/
It is C++, but one should be able to decipher the needed code.
Quote
In general, I don't think that there is a 100% "detection algorithm" for the encoding of a text file...
Hmmmm... bummer. Makes the whole idea of writing a format marker kind of pointless doesn't it?
Then of course there's the problem of line endings... I've seen all combinations of cr, lf, cr-lf, lf-cr and I'm going to have to auto detect these as well.
Quote
Both are available at http://sf.net/
WOW... so much for sourceforge... the site repeatedly caused browser lockups... I couldn't get to anything...
Ok... so how common is utf32?
I know about it (and now I know more about it) but I don't think I've ever seen it in use...
Edit: Oi such a day I'm having!
Quote from: CommonTater on February 17, 2011, 02:02:27 AM
WOW... so much for sourceforge... the site repeatedly caused browser lockups... I couldn't get to anything...
Try http://notepad-plus.svn.sourceforge.net/viewvc/notepad-plus/trunk/ and download the tarball (link at the lower left corner of the page).
Quote from: Stefan Pendl on February 17, 2011, 02:23:47 AM
Try http://notepad-plus.svn.sourceforge.net/viewvc/notepad-plus/trunk/ and download the tarball (link at the lower left corner of the page).
That one worked, thanks Stefan...
I've taken a look at how they do it (uni8_16.cpp ... DetermineEncoding) and it looks easy enough. They're really just examining the first byte to determine what to do next... This I think I can manage.
Thanks guys!
I think, since Windows doesn't have a 32bit character mode, and C has no library functions for it, I'm going to skip utf32 for now and see how long it takes to come back and bite me... The utf16 support is a big step forward for this type of file anyway so I should be good to go for a couple of years (fingers crossed) 8)
Thought you might be interested in what I came up with to solve this problem...
I'm still testing it, but so far so good...
BOOL M3UOpen(PWCHAR Filename)
{ PBYTE rf; // raw data
DWORD br; // bytes read
// load the raw file
{ HANDLE pl; // playlist file handle
DWORD fs; // file size
pl = CreateFile(Filename,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
if(pl == INVALID_HANDLE_VALUE)
Exception(GetLastError());
fs = GetFileSize(pl,NULL);
rf = calloc((fs + 2), sizeof(BYTE));
ReadFile(pl,rf,fs,&br,NULL);
CloseHandle(pl);
if (br != fs)
Exception(GetLastError()); }
// Identify text format
try
{ if (*(DWORD*)rf == 0x0000FEFF || // utf32le bom
*(DWORD*)rf == 0xFFFE0000) // utf32be bom
Exception(0xE0640002);
else if (*(WORD*)rf == 0xFFFE) // utf16be bom
{ FlipBytes(rf + 2,br - 2);
CopyUnicode(rf + 2,br - 2); }
else if (*(WORD*)rf == 0xFEFF) // utf16le bom
CopyUnicode(rf + 2,br - 2);
else if (*(WORD*)rf == 0xBBEF) // utf8 bom
CopyMByte(rf + 3,br - 3);
else // no known bom, probe the file
{ PBYTE lf = NULL; // line feed
lf = memchr(rf,0x0A,br); // lf is always recognizable in 1 byte.
if (!lf)
Exception(0xE0640003);
if (*(lf - 1) != 0 || *(lf + 1) != 0) // utf8 no bom
CopyMByte(rf,br);
else if ((DWORD)(lf - rf) & 1) // big endian (lf at odd offset)
{ if (!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) //utf32be no bom
Exception(0xE0640002);
else
{ FlipBytes(rf,br); // utf16be no bom
CopyUnicode(rf,br); } }
else // little endian (lf at even offset)
{ if (!(*(DWORD*)lf & 0xFFFFFF00)) // utf32le no bom
Exception(0xE0640002);
else
CopyUnicode(rf,br); } } } // utf16le no bom
finally
{ free(rf); }
return 1; }
The exception codes say "utf32 not supported" and "File has errors".
The copy routines translate and save the file as utf16le (i.e. Windows Unicode) with BOM.
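The byte-flipping helper is not shown in the thread, so the following is only a guess at what such a routine might look like, written as portable C. The name flip_bytes and its signature are assumptions for this sketch. It swaps the two bytes of every 16-bit unit in place, which turns UTF-16BE data into UTF-16LE (or the reverse).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the thread's byte-flipping helper.
   Swaps the two bytes of each 16-bit unit in place.
   If len is odd, the final byte is left untouched. */
static void flip_bytes(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

Because the swap works on 16-bit units only, it is also correct for surrogate pairs: each half of the pair is flipped independently and their order is preserved.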
Thanks for the help, guys...
Quote from: CommonTater on February 17, 2011, 01:50:01 AM
Quote
In general, I don't think that there is a 100% "detection algorithm" for the encoding of a text file...
Hmmmm... bummer. Makes the whole idea of writing a format marker kind of pointless doesn't it?
Then of course there's the problem of line endings... I've seen all combinations of cr, lf, cr-lf, lf-cr and I'm going to have to auto detect these as well.
There should be only three that you would have to consider
CR-LF : DOS/Windows
CR : Mac OS
LF : Unix/Linux
Years back when I worked at a CAD/CAM software manufacturer and was responsible for modules to perform data exchange via DXF, IGES, etc, I ran a lot into the CR-CR-LF format with data coming from mainly IBM, DEC and SGI minis but can't say that I have seen this in the last 10 years or so.
It isn't very likely (though not impossible) that you have more than one line-ending style within one file, so a simple test algorithm would be:
- read n bytes from the start of the file, where n is at least 50% larger than the average expected line length.
- scan for the first occurrence of a CR
- if found, test for the character after the CR, if it's LF, you have a DOS/Windows text file, if not, you have likely a Mac file
- if not found, scan for the first occurrence of a LF, if found you have a Unix/Linux text file, if not you aren't likely to have a text file at all or our buffer size was too small
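The steps above can be sketched roughly as follows for an 8-bit text buffer. The names detect_eol and eol_style are invented for this example. It assumes the caller has already read the first n bytes into buf. For UTF-16 data the same test would have to run after conversion or on 16-bit units instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
typedef enum { EOL_NONE, EOL_CRLF, EOL_CR, EOL_LF } eol_style;

/* Classify the line-ending style of the first len bytes of buf. */
static eol_style detect_eol(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* Scan for the first CR and look at the byte that follows it. */
    const unsigned char *cr = memchr(buf, '\r', len);
    if (cr != NULL) {
        size_t pos = (size_t)(cr - buf);
        return (pos + 1 < len && cr[1] == '\n') ? EOL_CRLF : EOL_CR;
    }

    /* No CR at all: a lone LF means Unix/Linux style. */
    return memchr(buf, '\n', len) != NULL ? EOL_LF : EOL_NONE;
}
```

One caveat of this sketch: if the buffer happens to end exactly between a CR and its LF, the result is EOL_CR. Reading a few more bytes in that case would avoid the misclassification.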
Ralf
Quote from: CommonTater on February 17, 2011, 02:07:50 AM
Ok... so how common is utf32?
I know about it (and now I know more about it) but I don't think I've ever seen it in use...
Edit: Oi such a day I'm having!
I think you might have a chance of UTF32 in Asian language setups (CJKV) of Windows, not sure about Linux. The only time I ever ran into it was after the download of a document from a Chinese web site.
Depending on the purpose/realm of use of your software, you might be able to ignore it...
Ralf
Quote from: Bitbeisser on February 20, 2011, 12:38:45 AM
There should be only three that you would have to consider
CR-LF : DOS/Windows
CR : Mac OS
LF : Unix/Linux
Thanks Ralf... I hit on a really simple solution to that problem. Since I'm breaking the file up into strings for some parsing... I just went through and trashed all the CRs and LFs... the combination doesn't matter... they're gone. I can reinsert Windows style line ends when I save the files.
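The actual splitting code is not posted, so this is just one way the idea could look in C. The name split_lines and its parameters are assumptions for the sketch. Every CR or LF in the wide-character text is overwritten with a terminator, and a pointer to the start of each non-empty line is recorded, so CR, LF, CR-LF and LF-CR are all handled the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: split a null-terminated wide string in place.
   Each CR or LF becomes L'\0'; lines[] receives the start of every
   non-empty line. Returns the number of lines stored (at most max_lines). */
static size_t split_lines(wchar_t *text, wchar_t **lines, size_t max_lines)
{
    size_t count = 0;
    int at_line_start = 1;

    assert(text != NULL && lines != NULL);
    for (wchar_t *p = text; *p != L'\0'; p++) {
        if (*p == L'\r' || *p == L'\n') {
            *p = L'\0';            /* the line break is simply discarded */
            at_line_start = 1;
        } else if (at_line_start) {
            if (count == max_lines)
                break;             /* no room left in lines[] */
            lines[count++] = p;
            at_line_start = 0;
        }
    }
    return count;
}
```

A side effect of this approach is that blank lines disappear entirely, which fits the case described here since the line ends are re-inserted on save.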
As you may have deduced from the function name these are M3U format playlists that need to be transformed for use on Windows. Almost all of them are either English or French. Problem is they've been handed around so much and saved on other machines often enough they've ended up being in a bunch of disparate formats...
http://en.wikipedia.org/wiki/Byte_order_mark
Change this
else if (*(WORD*)rf == 0xBBEF) // utf8 bom
to this
else if ((*(DWORD*)rf & 0x00FFFFFF) == 0x00BFBBEF) // utf8 bom
Quote from: timovjl on February 21, 2011, 08:11:55 AM
http://en.wikipedia.org/wiki/Byte_order_mark
Change this
else if (*(WORD*)rf == 0xBBEF) // utf8 bom
to this
else if ((*(DWORD*)rf & 0x00FFFFFF) == 0x00BFBBEF) // utf8 bom
Thanks Timo... Already done...
Also had to redo the line about utf8 with no bom... it's now split into the two pieces in the last section and working much better...
Here's the current version...
BOOL M3UOpen(PWCHAR Filename)
{ PBYTE rf; // raw data
DWORD br; // bytes read
// load the raw file
{ HANDLE pl; // playlist file handle
DWORD fs; // file size
pl = CreateFile(Filename,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
if(pl == INVALID_HANDLE_VALUE)
Exception(GetLastError());
fs = GetFileSize(pl,NULL);
rf = calloc((fs + 2), sizeof(BYTE));
ReadFile(pl,rf,fs,&br,NULL);
CloseHandle(pl);
if (br != fs)
Exception(GetLastError()); }
// Identify text format
try
{ if (*(DWORD*)rf == 0x0000FEFF || // utf32le bom
*(DWORD*)rf == 0xFFFE0000) // utf32be bom
Exception(0xE0640002);
else if (*(WORD*)rf == 0xFFFE) // utf16be bom
{ FlipBytes(rf + 2,br - 2);
CopyUnicode(rf + 2,br - 2); }
else if (*(WORD*)rf == 0xFEFF) // utf16le bom
CopyUnicode(rf + 2,br - 2);
else if ((*(DWORD*)rf & 0x00FFFFFF) == 0x00BFBBEF) // utf8 bom
CopyMByte(rf + 3,br - 3);
else // no known bom, probe the file
{ PBYTE lf = NULL; // points to line feed
lf = memchr(rf,0x0A,br); // lf is always recognizable in 1 byte.
if (!lf)
Exception(0xE0640003);
if ((DWORD)(lf - rf) & 1) // big endian? (lf at odd offset)
{ if (*(lf - 1) != 0) // utf8 no bom
CopyMByte(rf,br);
else if (!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) //utf32be no bom
Exception(0xE0640002);
else
{ FlipBytes(rf,br); // utf16be no bom
CopyUnicode(rf,br); } }
else // little endian? (lf at even offset)
{ if (*(lf + 1) != 0)
CopyMByte(rf,br); // utf8 no bom
else if (!(*(DWORD*)lf & 0xFFFFFF00)) // utf32le no bom
Exception(0xE0640002);
else
CopyUnicode(rf,br); } } } // utf16le no bom
finally
{ free(rf); }
return 1; }
By way of an update, here's the final code, after some debugging...
// open and translate file
BOOL M3UOpen(PWCHAR FileName)
{ PBYTE rf; // raw file data
DWORD br; // bytes read
// load the raw file
{ HANDLE pl; // playlist file handle
DWORD fs; // file size
// get path to file
wcsncpy(FilePath,FileName,MAX_PATH);
PathRemoveFileSpec(FilePath);
wcscat(FilePath,L"\\");
// open the file
pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
if (pl == INVALID_HANDLE_VALUE)
Exception(GetLastError());
fs = GetFileSize(pl,NULL);
rf = calloc(fs + 4, sizeof(BYTE)); // zero padding keeps the DWORD reads below inside the buffer
if (! ReadFile(pl, rf, fs, &br, NULL))
Exception(GetLastError());
CloseHandle(pl);
if (br != fs)
Exception(0xE0640007); }
try
{ DWORD bom = *(DWORD*)rf;
if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000)) // utf32le or utf32be bom
Exception(0xE0640002); // utf32 not supported
else if ((bom & 0xFFFF) == 0xFFFE) // utf16be bom
{ FlipEndian(rf,br);
CopyWchar((PWCHAR) rf + 1); }
else if ((bom & 0xFFFF) == 0xFEFF) // utf16le bom
CopyWchar((PWCHAR) rf + 1);
else if ((bom & 0xFFFFFF) == 0xBFBBEF) // utf8 bom
CopyMByte(rf + 3, br - 3);
else // no known bom, probe the file
{ if (! memchr(rf, 0x00, br)) // 8 bit text has no nulls
CopyMByte(rf,br); // ansi / utf8 no bom
else
{ PBYTE lf = memchr(rf,0x0A,br); // lf is always present as 1 byte.
if (!lf)
Exception(0xE0640003);
if (((lf - rf >= 3) && !(*(DWORD*)(lf - 3) & 0x00FFFFFF)) || //utf32be no bom (only if 3 bytes precede lf)
(!(*(DWORD*)lf & 0xFFFFFF00))) //utf32le no bom
Exception(0xE0640002);
if ((lf - rf) & 1) // big endian? (lf at odd offset)
FlipEndian(rf,br); // utf16be no bom
CopyWchar((PWCHAR) rf); } } } // utf16 no bom, little endian by now
finally
{ free(rf); }
return 1; }
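For anyone wanting to try the no-BOM probing idea outside of Windows, here is a rough portable sketch of the same heuristic. The names probe_no_bom and probe_result are made up for this example. It looks at individual bytes instead of casting to DWORD, so it never reads outside the buffer and does not depend on the host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
typedef enum {
    PROBE_8BIT, PROBE_UTF16LE, PROBE_UTF16BE, PROBE_UTF32, PROBE_ERROR
} probe_result;

/* Guess the encoding of a buffer that carries no BOM. */
static probe_result probe_no_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* 8-bit text (ANSI or UTF-8) contains no null bytes. */
    if (memchr(buf, 0x00, len) == NULL)
        return PROBE_8BIT;

    /* Use the first 0x0A byte as an anchor. */
    const unsigned char *lf = memchr(buf, 0x0A, len);
    if (lf == NULL)
        return PROBE_ERROR;

    size_t pos = (size_t)(lf - buf);

    /* UTF-32BE stores LF as 00 00 00 0A, UTF-32LE as 0A 00 00 00. */
    if (pos >= 3 && lf[-1] == 0 && lf[-2] == 0 && lf[-3] == 0)
        return PROBE_UTF32;
    if (len - pos >= 4 && lf[1] == 0 && lf[2] == 0 && lf[3] == 0)
        return PROBE_UTF32;

    /* UTF-16BE puts the 0A in the second byte of the unit (odd offset),
       UTF-16LE puts it in the first (even offset). */
    return (pos & 1) ? PROBE_UTF16BE : PROBE_UTF16LE;
}
```

This is still only a heuristic. A 16-bit character whose encoding happens to contain a 0x0A byte before the first real line feed (U+010A, for example) will throw off the offset test, so it is best treated as a fallback for files without a BOM.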