getwline bug

JohnF · November 06, 2010, 10:44:59 AM

Pelle, I'm not sure if you are aware of this, but 'getwline' does not read lines of wide-character text. It does load lines of ANSI text, though.

#define __STDC_WANT_LIB_EXT2__  1

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    FILE *fp;
    size_t len = 0;
    wchar_t * line = NULL;
    ssize_t read;

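    /* "uni.txt" holds wide-character text and shows the problem;
       "ansi.txt" holds ANSI text and is read correctly. */
//  fp = fopen("uni.txt", "r");
    fp = fopen("ansi.txt", "r");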
    if (fp == NULL)
        exit(1);

    while ((read = getwline(&line, &len, fp)) != -1)
    {
        wprintf(L"%ls", line);
    }

    free(line);
    fclose(fp);

    return 0;
}

John

Pelle · November 19, 2010, 05:18:36 PM

I have to think about this. Part of the problem is what to expect from the file on disk... (should it always be a sequence of bytes, with the wide stream functions just mapping to/from bytes, or what...)

JohnF · November 19, 2010, 10:15:40 PM

Quote from: Pelle on November 19, 2010, 05:18:36 PM
I have to think about this. Part of the problem is what to expect from the file on disk... (should it always be a sequence of bytes, with the wide stream functions just mapping to/from bytes, or what...)

I'm not sure either - in fact reading various docs only makes it more confusing.

For example, one would think that fgetws would read a line of wide-character text, but in fact the buffer ends up with a wide-character null after the first wide character. So when printing to the screen one sees only the first character, even though the whole line has been read into the buffer. In other words, each wide character is followed by a null.

The same occurs with other compilers as well.
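
A minimal sketch of what seems to be happening, assuming the file is UTF-16LE and the runtime converts each byte on disk to one wide character (as a single-byte ANSI codepage conversion would). The helper names widen_bytes and visible_length_of_hi are made up for illustration and are not part of any library:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: widen each byte to one wchar_t, the way a
   single-byte (ANSI-style) conversion would. Returns the count stored. */
size_t widen_bytes(const unsigned char *src, size_t n, wchar_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (wchar_t)src[i];
    return n;
}

/* Length that wprintf(L"%ls", ...) would see after the UTF-16LE bytes
   of "Hi\n" have been widened byte by byte. */
size_t visible_length_of_hi(void)
{
    /* "Hi\n" as UTF-16LE: every ASCII character is followed by a 0 byte. */
    static const unsigned char utf16le[] = { 'H', 0, 'i', 0, '\n', 0 };
    wchar_t buf[sizeof utf16le + 1];
    size_t n = widen_bytes(utf16le, sizeof utf16le, buf);

    buf[n] = L'\0';
    return wcslen(buf);   /* stops at the null that followed 'H' */
}
```

Under those assumptions the whole line is in the buffer, but any function that treats it as a null-terminated wide string stops after the first character.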

John

Pelle · November 19, 2010, 11:56:05 PM

I snipped this part from my copy of the C99 standard:

"Although both text and binary wide-oriented streams are conceptually sequences of wide characters, the external file associated with a wide-oriented stream is a sequence of multibyte characters, generalized as follows:
- Multibyte encodings within files may contain embedded null bytes (unlike multibyte encodings valid for use internal to the program).
- A file need not begin nor end in the initial shift state.

Moreover, the encodings used for multibyte characters may differ among files. Both the nature and choice of such encodings are implementation-defined.

The wide character input functions read multibyte characters from the stream and convert them to wide characters as if they were read by successive calls to the fgetwc function. Each conversion occurs as if by a call to the mbrtowc function, with the conversion state described by the stream's own mbstate_t object. The byte input functions read characters from the stream as if by successive calls to the fgetc function.

The wide character output functions convert wide characters to multibyte characters and write them to the stream as if they were written by successive calls to the fputwc function. Each conversion occurs as if by a call to the wcrtomb function, with the conversion state described by the stream's own mbstate_t object. The byte output functions write characters to the stream as if by successive calls to the fputc function."
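
As a rough illustration of the input side of that text (not the actual library code), the per-character conversion can be sketched with mbrtowc and an explicit mbstate_t. The helper name mb_to_wide is hypothetical, and a string in memory stands in for the bytes of the file:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string src to wide
   characters, one mbrtowc call per character, storing at most cap-1
   wide characters in dst plus a terminating null. Returns the number
   of wide characters stored, or (size_t)-1 on an invalid or
   incomplete multibyte sequence. */
size_t mb_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    mbstate_t state;
    size_t remaining, count = 0;

    assert(src != NULL && dst != NULL && cap > 0);
    memset(&state, 0, sizeof state);   /* initial conversion state */
    remaining = strlen(src);

    while (remaining > 0 && count + 1 < cap) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, src, remaining, &state);

        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete sequence */
        if (used == 0)
            break;                     /* converted a null character */
        dst[count++] = wc;
        src += used;
        remaining -= used;
    }
    dst[count] = L'\0';
    return count;
}
```

How the bytes are interpreted depends on the current locale (LC_CTYPE), which is the implementation-defined part the standard text refers to; in the default "C" locale plain ASCII bytes map to the same wide-character values.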

Some years ago, when I fiddled with the locale settings, it seemed better (or maybe just easier) to use the appropriate Windows ANSI codepage for "multibyte" characters (screen, files, etc.). Some other character representation for files would be better, but it would lead to many new problems.

I can't see anything wrong enough to be fixed here, at least not for version 6.50...

JohnF · November 20, 2010, 08:00:20 AM

Ok, thanks for looking into it.

John
