Unicode Read/Write File SOLVED

Juni · August 27, 2010, 02:37:59 PM

Hello Forum,
I'm trying to build a simple text editor
which fully supports Unicode
(load/save even if the filename contains, for example, Chinese characters),
like the one in Vista,
reduced to loading/saving as Unicode only.

I wrote a demo to test some functions,
but it always crashes when it comes to file handling.

Any help appreciated,
thank you

(I'm not sure which functions to use to read, write, etc.)
demo.zip attached

CommonTater · August 27, 2010, 03:33:25 PM

I haven't compiled and run your demo piece but scanning the source code I see a couple of problems...

1) You are defining variables inside switch statements, which may cause problems for the stack handlers. It would be better to define them at the entry point of the function or globally.

2) You are forgetting to adjust your read/write sizes for the size of your characters. Disk I/O works in bytes, but on Windows a wchar_t is two bytes. So when you specify how many bytes to read or save you should use

BufferSize * sizeof(wchar_t).

3) In Windows, when you define UNICODE you do not have to explicitly add the W on the end of your function calls. The UNICODE define does that for you. (This is minor, but there are a few cases where it does make a difference.)

Juni · August 27, 2010, 06:47:22 PM

Hi CommonTater,
thank you for the hints

1) Moving definitions out of the switches indeed solved the random crashes,
but the file I/O problem remains.

2) The problem is I'm unsure what is inside those strings that I get from the WinAPI,
how those w-file functions work,
and whether my attempts in general are on the right route.

At the moment I guess it's a lot about string conversion,
and those constants I put in were approximations with a bit of space left
to handle all cases which might appear (i.e. Unicode coded with 2, 3 or 5 chars)

3) Do you mean UNICODE or _UNICODE?
And could you please give an example where it matters?

I attached new demo.zip

Thank you very much

CommonTater · August 28, 2010, 02:34:09 AM

Quote
Hi CommonTater,
thank you for the hints

1) Moving definitions out of the switches indeed solved the random crashes,
but the file I/O problem remains.

You still have the same problem... Switch statements are not intended to be the meat of your program. They are basically intended as a way of selecting subroutines to call...

Like this...

Code Select


// Enable system contact
INT EnableLogon(void)
  { HKEY    rk;         // registry hive key
    CHAR    un[100];    // username string
    CHAR    pw[100];    // password string
    CHAR    cn[100];    // computer name
    ULONG   dl;         // data length
    HANDLE  ut;         // user token 

    ... code ...
    
    // all done
    RegCloseKey(rk);
    return 0; }


// Message Loop
LRESULT CALLBACK MsgProc(HWND wnd,UINT msg,WPARAM wparm,LPARAM lparm)
  { switch (msg)
      { case WM_COMMAND :
          switch (LOWORD(wparm))
            { case 100 :            // enable button
                return EnableLogon();
              case 200 :            // disable button
                return DisableLogon();
              case 300 :            // quit button
                PostMessage(Win[0],WM_CLOSE,0,0);
                return 0;
              default :
                return DefWindowProc(wnd,msg,wparm,lparm); }

        // NC Exit button
        case WM_CLOSE :
          DestroyWindow(Win[0]);
          return 0;
        case WM_DESTROY :
  ... etc ...

Note how the variables are created in the subroutine? These are created temporarily on the stack and then disposed of when the procedure exits. If you need global variables (window or file handles, for example) create them at the top of your source file. Everything else should be done in function calls... In C everything is a function.

Quote
2) The problem is I'm unsure what is inside those strings that I get from the WinAPI, how those w-file functions work, and whether my attempts in general are on the right route.

One of the first things you learn in programming is that input is always a lot more difficult than output because you never know what's coming your way and you do have to guard for it.

Do you have a copy of the WinAPI documentation? If not, you should get it. Working in Windows with C99 function calls and variable definitions is not the best way to do things. You can get a copy of the docs at...

http://www.microsoft.com/downloads/details.aspx?FamilyID=6b6c21d2-2006-4afa-9702-529fa782d63b&displaylang=en

Quote
3) Do you mean UNICODE or _UNICODE?
And could you please give an example where it matters?

I mean UNICODE ... _UNICODE is for the C runtime headers (tchar.h and friends), not the Windows headers.

Here is the header from one of my current projects to give you some idea what needs to be included in a typical project. This is pretty much what it takes to set a project up for proper Unicode support...

Code Select


// for the compiler
#define UNICODE
#define _UNICODE
#define WIN32_DEFAULT_LIBS
#define WIN32_LEAN_AND_MEAN
#define _WIN32_WINNT 0x0502
#define _X86_

// Windows headers
#include <winsock2.h>
#include <windows.h>
#include <ws2tcpip.h>
#include <commctrl.h>
#include <commdlg.h>
#include <shellapi.h>
#include <shlwapi.h>
#include <shlobj.h>

//  PellesC headers
#include <stdlib.h>
#include <wchar.h>

Please note that each Windows function call that takes strings has 3 flavours... CallA, CallW and Call ... The last one is actually a macro that will call the correct A or W version depending on your #defines. You should always use the macro and let the UNICODE switch in Windows make the decisions for you. That way all you have to do is un-define UNICODE and recompile to make an ANSI version of your project for older OSs.

Also you should be using Windows variable definitions. Instead of wchar_t use the Windows TCHAR, which is also changed between CHAR and WCHAR by the UNICODE define.

Finally, when using Unicode in the Pelles-C IDE you have to prefix your strings with L ... L"My name is fred" not just "my name is fred" ... the first is Unicode, the second is not. (If you go the TCHAR route, wrap the literal in TEXT("...") instead, which adds the L for you when UNICODE is defined.)
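
To make the A/W/macro point a bit more concrete, here is a minimal sketch (not taken from Juni's demo). It assumes a Win32 compiler such as Pelles C; TitleSizeInBytes is a made-up name used only for illustration.

```c
#ifdef _WIN32                 /* Win32-only sketch */
#define UNICODE               /* must come before <windows.h> */
#define _UNICODE
#include <windows.h>

/* Hypothetical helper: byte size of a fixed title string, without the null. */
DWORD TitleSizeInBytes(void)
{
    /* TEXT("Editor") becomes L"Editor" because UNICODE is defined above. */
    const TCHAR *title = TEXT("Editor");

    /* The generic name lstrlen resolves to lstrlenW in a Unicode build,
       and sizeof(TCHAR) is sizeof(WCHAR), so 6 characters give 12 bytes. */
    return (DWORD)(lstrlen(title) * sizeof(TCHAR));
}
#endif
```

Remove the two #define lines and the same source should compile as an ANSI build, where the function returns 6 instead.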

I hope I'm helping you and not scaring you off...

JohnF · August 28, 2010, 11:44:03 AM

Juni, have you used the debugger and checked that vars contain what you expect them to have at various stages?

John

Juni · August 28, 2010, 12:56:30 PM

@CommonTater

Thank you again, your posts really help a lot for a Unicode beginner like me.

1) I separated the code; new demo.zip attached to 1st post

2) Maybe I expressed my problem a bit unclearly:
it's not about the return values of Windows functions,
it's about how Unicode is encoded in Windows functions.

E.g. if I use Chinese (Windows IME Pinyin)
and possibly mix it with Latin letters
in a dialog text field,
is every Unicode code point coded in 2 bytes?

Or are there exceptions, as some sources say,
which would make results vary from 1 to 5 bytes?

2-1)
If every symbol is coded in 2 bytes,
are the C99 file functions bad practice?

Help says it writes wide chars,
but both write and open lead to wrong results.
(write makes 1 totally different Chinese letter out of 2)
(load seems to make Greek or something out of it)

3) Thank you for the basics,
I used mixed sources for learning and ended up on the wrong track, it seems.

@JohnF

The hex values the Windows functions return seem to be OK;
for file I/O I don't know how to debug because the results really make no sense to me @.@

Thanks a lot

new demo.zip attached to first post

TimoVJL · August 28, 2010, 02:04:47 PM

http://www.joelonsoftware.com/articles/Unicode.html
http://en.wikibooks.org/wiki/Windows_Programming/Unicode

Unicode can represent all of the world's characters in modern computer use, including technical symbols and special characters used in publishing. Because each Unicode code value is 16 bits wide, it is possible to have separate values for up to 65,536 characters. Unicode-enabled functions are often referred to as "wide-character" functions. Note that the implementation of Unicode in actual 16-bit values is referred to as UTF-16. For compatibility with existing environments, there are two lossless transformations to convert 16-bit Unicode values into forms appropriate for 8- or 7-bit environments: UTF-8 and UTF-7. For more information, see The Unicode Standard, Version 2.0.

Win32 functions support applications that use either Unicode or the regular ANSI character set. Mixed use in the same application is also possible. Adding Unicode support to an application is easy, and you can even maintain a single set of sources from which to compile an application that supports either Unicode or the Windows ANSI character set.

JohnF · August 28, 2010, 03:07:06 PM

There is a project on my web site that implements a Unicode editor.

http://www.johnfindlay.plus.com/lcc-win32/winprog/SmallEd2.zip

Maybe it will help.

John

CommonTater · August 28, 2010, 03:26:37 PM

Quote
1) I separated the code; new demo.zip attached to 1st post

Well, you're getting closer. Now you need to move your variable definitions into the functions.

For example:

Code Select


// function to write a file
BOOL WriteToFile(PTCHAR String)
  { HANDLE file;        // file handle
    DWORD  fsize;       // size of file 

    // ... open and write file here ...

    CloseHandle(file);    
    return 1; }

By "non-globalizing" as many variables as possible you forestall the risk of ending up with unexpected or unknown contents when you do your function. Since the variables are created in the function, used only in the function and destroyed when it exits, you know exactly what you are dealing with. As general practice I use as few global variables as possible and have written entire applications with none at all.

I noticed the comment in your source that windows likes small text buffers. While this is true, it is always smartest to allocate extra space unless you have the means to discover the correct buffer size "on the fly".

For example: The maximum size of a file path is defined as MAX_PATH which is currently 260 TCHARs. You should always allocate MAX_PATH + 1 and use an initializer to guarantee you know the contents going in. The +1 guarantees your string is null terminated.

Like this...

Code Select


TCHAR CurrentDirectory[MAX_PATH + 1] = {0};

// get working directory
GetCurrentDirectory(MAX_PATH,CurrentDirectory);
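
Where an API can tell you how much room it needs, you can also ask first and allocate exactly that. A rough sketch of the "on the fly" approach, assuming Win32: GetWorkingDirectory is a made-up helper name, and it relies on GetCurrentDirectory returning the required size (in characters, terminating null included) when the buffer passed in is too small.

```c
#ifdef _WIN32                 /* Win32-only sketch */
#include <windows.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed copy of the working directory,
   or NULL on failure. The caller releases it with free(). */
WCHAR *GetWorkingDirectory(void)
{
    DWORD needed, written;
    WCHAR *dir;

    needed = GetCurrentDirectoryW(0, NULL);   /* size in WCHARs, null included */
    if (needed == 0)
        return NULL;

    dir = malloc(needed * sizeof(WCHAR));     /* bytes = characters * sizeof */
    if (dir == NULL)
        return NULL;

    /* On success the return value is the length written, without the null. */
    written = GetCurrentDirectoryW(needed, dir);
    if (written == 0 || written >= needed) {
        free(dir);
        return NULL;
    }
    return dir;
}
#endif
```

For short-lived paths the fixed MAX_PATH + 1 buffer is simpler; this variant is only worth it when the length is not bounded in advance.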

In my own code I always have large lists of predefined buffer sizes that I apply throughout the code from a global .h file.

For Example...

Code Select


// buffer sizes
#define MAX_HOSTNAME      MAX_COMPUTERNAME_LENGTH
#define MAX_LOGLINE       300   // tchars
#define MAX_TEXTIP        16    // tchars
#define MAX_TEXTPORT      6     // tchars
#define MAX_PASSWORD      24    // tchars
#define MAX_UNCPATH       256   // tchars
#define MAX_TOOLTIP       32    // tchars
#define MAX_REMOTENAME    24    // tchars
#define MAX_PROGRAMNAME   24    // tchars
#define MAX_TYPENAME      20    // tchars
#define MAX_DGRAMDATA     1024  // bytes

Quote
2) Maybe I expressed my problem a bit unclearly:
it's not about the return values of Windows functions,
it's about how Unicode is encoded in Windows functions.

Well, the programmer in me says "you don't need to know" since Windows function calls (as opposed to C99 calls) handle this seamlessly. But the fact is that a TCHAR is a 16-bit value when UNICODE is defined (and 8 bits when it is not), so in a Unicode build it's two bytes. However, you should not hard-code that. With new OSs like Win7 (spit, curse!) and X64 versions of everything that may change.

The smarter way is to use the sizeof operator...

Code Select


// function to get byte size of string
DWORD GetBufferSize(PTCHAR String)
  { DWORD sl;                 // string length
    sl = lstrlen(String);     // get string length in TCHARS
    sl *= sizeof(TCHAR);      // convert to byte size
    return sl; }              // send back the answer
    
// more efficient version
DWORD GetBufferSize(PTCHAR String)
  { return lstrlen(String) * sizeof(TCHAR); }

Either version will return the correct byte size whether UNICODE is defined or not.

Quote
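E.g. if I use Chinese (Windows IME Pinyin)
and possibly mix it with Latin letters
in a dialog text field,
is every Unicode code point coded in 2 bytes?

Or are there exceptions, as some sources say,
which would make results vary from 1 to 5 bytes?

In a Unicode build every TCHAR is 2 bytes, and Windows strings are UTF-16. Latin letters and the common Chinese characters each fit in one TCHAR. Characters outside the Basic Multilingual Plane (above U+FFFF, some rarer Chinese characters among them) take two TCHARs, a so-called surrogate pair, so 4 bytes. The one-to-several-byte variation your sources mention is most likely UTF-8, a different encoding that the Windows wide-character functions do not hand you.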
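
For reference, the UTF-16 rule can be sketched in portable C. This is an illustration only, not Windows code, and utf16_encode is a made-up name. Code points up to U+FFFF take one 16-bit unit; anything above takes two units (a surrogate pair), i.e. 4 bytes.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid code point. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range, or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: one unit, 2 bytes */
        return 1;
    }

    cp -= 0x10000;                              /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

U+4E2D (a common Chinese character) comes out as the single unit 4E2D, while U+20000 comes out as the pair D840 DC00. A length function that counts 16-bit units, such as lstrlen, therefore reports 2 for the second one.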

Quote
2-1)
If every symbol is coded in 2 bytes,
are the w-file functions bad practice?

Help says it writes wide chars,
but both write and open lead to wrong results.
(write makes 1 totally different Chinese letter out of 2)
(load seems to make Greek or something out of it)

Generally I prefer to use Windows calls over C calls as much as possible. The problem is that Windows' definition of Unicode can and sometimes does differ from that in C99. Staying with Windows functions provides a much more integrated approach that will update itself across different versions of Windows as you recompile.

The thing to remember is that Microsoft (Spit, Curse!) is under some obligation to keep things compatible, C language maintainers are not (although Pelle has always done a great job of it).

Quote
3) Thank you for the basics,
i used mixed sources for learning and ended up wrong it seems.

Oh boy... if you only knew how many times I got it all wrong when I was first starting with C. Fact is I'm still amazed when I write something and it actually works the first try.

The best way to learn WinAPI programming is to study WinAPI code samples and read the tutorials. Windows is C, but the API calls are totally different, so it does take a while to learn, even if you already know C.
(Hence the suggestion that you download and install the Windows API documentation.)

Juni · August 28, 2010, 03:43:38 PM

@timovjl

Thank you for the links,
especially the wikibook link looks really good

@CommonTater

I updated the file 4+ times in the last hours, not sure which version you received, sorry.
Global variables should be reduced to the minimum now.

And about safety when defining buffers, I will try to do that
once I have found out which functions to use,
because at the moment I don't know how big the demo code will become
and I don't want to obfuscate the code with too many allocations.
(Totally agree with you)

Which Windows calls could I use to save to a file?
(sorry, I'm really a noob at that)

Is the MSDK download different from the online version?

Thank you again

@JohnF

Thanks a lot, I've got a problem with the included .exe, please read my next post

CommonTater · August 28, 2010, 04:16:52 PM

Quote from: Juni on August 28, 2010, 03:43:38 PM
@CommonTater
Which Windows calls could I use to save to a file?
(sorry, I'm really a noob at that)

Generally I use:
GetOpenFileName
GetSaveFileName
CreateFile
ReadFile
WriteFile
CloseHandle
etc.
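
A minimal sketch of how those calls can fit together for a UTF-16 text file, assuming a Win32 build. SaveTextUtf16 and LoadTextUtf16 are made-up names, not from Juni's demo, and error handling is cut to the bone. Two things it tries to show: the path goes to CreateFileW as a wide string, so Chinese file names work, and the sizes handed to WriteFile and ReadFile are in bytes, not characters. The FF FE byte order mark at the start is what Notepad writes for its "Unicode" format.

```c
#ifdef _WIN32                 /* Win32-only sketch */
#include <windows.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: save a null-terminated wide string as UTF-16LE with a BOM. */
BOOL SaveTextUtf16(const WCHAR *path, const WCHAR *text)
{
    static const BYTE bom[2] = { 0xFF, 0xFE };              /* UTF-16LE byte order mark */
    DWORD bytes = (DWORD)(lstrlenW(text) * sizeof(WCHAR));  /* byte count, not characters */
    DWORD written = 0;
    BOOL ok;
    HANDLE file = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return FALSE;

    ok = WriteFile(file, bom, sizeof bom, &written, NULL) && written == sizeof bom;
    if (ok)
        ok = WriteFile(file, text, bytes, &written, NULL) && written == bytes;

    CloseHandle(file);
    return ok;
}

/* Hypothetical helper: load a UTF-16LE file into a malloc'ed, null-terminated
   wide string (BOM removed). Returns NULL on failure; caller calls free(). */
WCHAR *LoadTextUtf16(const WCHAR *path)
{
    DWORD size, got = 0;
    size_t units;
    WCHAR *buf;
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    size = GetFileSize(file, NULL);             /* bytes; assumes a file under 4 GB */
    if (size == INVALID_FILE_SIZE) {
        CloseHandle(file);
        return NULL;
    }

    buf = malloc(size + sizeof(WCHAR));         /* extra room for the terminator */
    if (buf == NULL || !ReadFile(file, buf, size, &got, NULL)) {
        free(buf);
        CloseHandle(file);
        return NULL;
    }
    CloseHandle(file);

    units = got / sizeof(WCHAR);                /* convert bytes back to WCHARs */
    buf[units] = L'\0';
    if (units > 0 && buf[0] == 0xFEFF)          /* skip the BOM if present */
        memmove(buf, buf + 1, units * sizeof(WCHAR));   /* terminator moves along */
    return buf;
}
#endif
```

GetOpenFileName and GetSaveFileName would supply the path argument in a real editor; they are left out here to keep the sketch short.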

Quote
Is the MSDK download different from the online version?

The function call descriptions are about the same (naturally) but the download version includes a ton of code samples as well as debugging tools and utilities that are very handy at times.

Juni · August 28, 2010, 04:38:49 PM

@CommonTater

I checked the functions and found no switch for UTF-8/UTF-16 etc.,
is there a WinAPI function for that too?

Thanks a lot

@JohnF

Update: it seems your editor doesn't work on Vista,
I guess I opened your demo files with the Vista editor before.

If I try running your editor and opening the included .uni files or my project's .txt files,
there's only a mix of a few characters, but for sure not Russian or Chinese.
(it looks the same as in my demo, I think)

Should the included .exe be Vista enabled?
Or might I have to recompile it for Vista?

Thanks a lot

JohnF · August 28, 2010, 04:44:48 PM

Quote from: Juni on August 28, 2010, 04:38:49 PM
@CommonTater

Thanks a lot

@JohnF

Update: it seems your editor doesn't work on Vista,
I guess I opened your demo files with the Vista editor before.

If I try running your editor and opening the included .uni files or my project's .txt files,
there's only a mix of a few characters, but for sure not Russian or Chinese.

Should the included .exe be Vista enabled?
Or might I have to recompile it for Vista?

Thanks a lot

It's not my editor and I don't know if it should work on Vista, sorry. However, you can still study the code, which might help you.

John

Juni · August 28, 2010, 04:48:09 PM

Thank you

Unfortunately it seems the program has the same problem as mine
and I'm not even close to figuring out what that could be

TimoVJL · August 28, 2010, 05:26:45 PM

If I open your example test file with WordPad, is this what you want to see?
(Except that it is the Finnish version of it.)
Look at the attached picture.
