Hi Pelle:
The following code from
https://en.cppreference.com/w/c/io/fscanf
compiled with 12.0 RC1 outputs:
Converted 7 fields:
i = 25
x = 5.432000
str1 = Thompson
j = 56
y = 789.000000
str2 = 56
warr[0] = U+df
warr[1] = U+0
Expected output
Converted 7 fields:
i = 25
x = 5.432000
str1 = Thompson
j = 56
y = 789.000000
str2 = 56
warr[0] = U+df
warr[1] = U+6c34
// Code from https://en.cppreference.com/w/c/io/fscanf
#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>
#include <stddef.h>
#include <locale.h>

int main(void)
{
    int i, j;
    float x, y;
    char str1[10], str2[4];
    wchar_t warr[2];
    setlocale(LC_ALL, "en_US.utf8");

    char input[] = "25 54.32E-1 Thompson 56789 0123 56ß水";
    /* parse as follows:
       %d: an integer
       %f: a floating-point value
       %9s: a string of at most 9 non-whitespace characters
       %2d: two-digit integer (digits 5 and 6)
       %f: a floating-point value (digits 7, 8, 9)
       %*d: an integer which isn't stored anywhere
       ' ': all consecutive whitespace
       %3[0-9]: a string of at most 3 decimal digits (digits 5 and 6)
       %2lc: two wide characters, using multibyte to wide conversion */
    int ret = sscanf(input, "%d%f%9s%2d%f%*d %3[0-9]%2lc",
                     &i, &x, str1, &j, &y, str2, warr);

    printf("Converted %d fields:\n"
           "i = %d\n"
           "x = %f\n"
           "str1 = %s\n"
           "j = %d\n"
           "y = %f\n"
           "str2 = %s\n"
           "warr[0] = U+%x\n"
           "warr[1] = U+%x\n",
           ret, i, x, str1, j, y, str2, warr[0], warr[1]);

#ifdef __STDC_LIB_EXT1__
    int n = sscanf_s(input, "%d%f%s", &i, &x, str1, (rsize_t)sizeof str1);
    // writes 25 to i, 5.432 to x, the 9 bytes "Thompson\0" to str1, and 3 to n.
#endif
}
Hello Robert,
I assume you are using the /utf-8 compiler option.
Seems to be an old problem. An internal counter was counting bytes rather than characters. Not a visible problem before, with single-byte ANSI characters, but clearly a problem now with "exotic" (multi-byte) UTF-8 characters. Only warr[0] is updated, warr[1] picks up some random stack value (because it's never initialized).
Are you stuck on something because of this, or can I wait a little before uploading a corrected version (RC2 or final)?
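To make the bytes-versus-characters distinction concrete, here is a minimal sketch in C. It is not the library's internal code; utf8_code_points is a hypothetical helper written only for illustration, and it assumes the input is well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h> /* strlen, used for the byte count in the usage note */

/* Hypothetical helper: counts UTF-8 code points in s by skipping
   continuation bytes (bit pattern 10xxxxxx).
   Assumption: s holds well-formed UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Every byte that is not a continuation byte starts a new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the tail of the test input, "ß水" is the byte sequence "\xC3\x9F\xE6\xB0\xB4": strlen reports 5 bytes, while utf8_code_points reports 2 characters. A field width of 2 that is decremented per byte would already be used up after ß (2 bytes), which fits the reported output where only warr[0] was stored.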
Quote from: Pelle on April 02, 2023, 03:14:18 PM
Hello Robert,
I assume you are using the /utf-8 compiler option.
Seems to be an old problem. An internal counter was counting bytes rather than characters. Not a visible problem before, with single-byte ANSI characters, but clearly a problem now with "exotic" (multi-byte) UTF-8 characters. Only warr[0] is updated, warr[1] picks up some random stack value (because it's never initialized).
Are you stuck on something because of this, or can I wait a little before uploading a corrected version (RC2 or final)?
Hi Pelle:
Thank you for your consideration. No hurry. I just stumbled upon this while looking for something else.
Hi Pelle:
Is it the same problem in the code below as with Wonky Exotic posted above? Or is it maybe Windows / Linux "never the twain shall meet" wchar_t difference?
The output when compiled with Pelle's 12.0 RC1 (64-bit) is
Length of source string (excluding terminator):
6 bytes
0 multibyte characters
Wide character string is: G聒ʄ (0 characters)
G alpha upper
聒 !alpha
ʄ !alpha
Expected output
Length of source string (excluding terminator):
8 bytes
6 multibyte characters
Wide character string is: Grüße! (6 characters)
G alpha upper
r alpha lower
ü alpha lower
ß alpha lower
e alpha lower
! !alpha
Command line usage is
mbstowcs.exe de_DE.UTF-8 Grüße!
Source code from
https://man7.org/linux/man-pages/man3/mbstowcs.3.html
#include <wctype.h>
#include <locale.h>
#include <wchar.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    size_t mbslen;      /* Number of multibyte characters in source */
    wchar_t *wcs;       /* Pointer to converted wide character string */

    if (argc < 3) {
        fprintf(stderr, "Usage: %s <locale> <string>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Apply the specified locale. */
    if (setlocale(LC_ALL, argv[1]) == NULL) {
        perror("setlocale");
        exit(EXIT_FAILURE);
    }

    /* Calculate the length required to hold argv[2] converted to
       a wide character string. */
    mbslen = mbstowcs(NULL, argv[2], 0);
    if (mbslen == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    /* Describe the source string to the user. */
    printf("Length of source string (excluding terminator):\n");
    printf(" %zu bytes\n", strlen(argv[2]));
    printf(" %zu multibyte characters\n\n", mbslen);

    /* Allocate wide character string of the desired size. Add 1
       to allow for terminating null wide character (L'\0'). */
    wcs = calloc(mbslen + 1, sizeof(*wcs));
    if (wcs == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    /* Convert the multibyte character string in argv[2] to a
       wide character string. */
    if (mbstowcs(wcs, argv[2], mbslen + 1) == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    printf("Wide character string is: %ls (%zu characters)\n",
           wcs, mbslen);

    /* Now do some inspection of the classes of the characters in
       the wide character string. */
    for (wchar_t *wp = wcs; *wp != 0; wp++) {
        printf(" %lc ", (wint_t) *wp);

        if (!iswalpha(*wp))
            printf("!");
        printf("alpha ");

        if (iswalpha(*wp)) {
            if (iswupper(*wp))
                printf("upper ");
            if (iswlower(*wp))
                printf("lower ");
        }

        putchar('\n');
    }

    exit(EXIT_SUCCESS);
}
Hello Robert,
Quote from: Robert on April 03, 2023, 10:00:59 PM
Is it the same problem in the code below as with Wonky Exotic posted above? Or is it maybe Windows / Linux "never the twain shall meet" wchar_t difference?
The previous problem was very specifically about the scanf function family with a %lc specifier (%2lc in your case).
This problem seems to be about standard C (C99/C11/C17/C2X) vs Linux, starting here:
mbslen = mbstowcs(NULL, argv[2], 0);
Passing NULL for the first argument (and zero for the last) does not mean the same thing in standard C as it does on Linux.
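One way to avoid relying on that call, offered as a sketch only: the restartable function mbsrtowcs is specified by standard C to accept a null destination and return the number of wide characters the conversion would produce. wide_length below is a hypothetical helper name, and this has not been checked against the 12.0 RC1 runtime. The result still depends on the locale set with setlocale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns the number of wide characters needed to
   hold mbs after conversion (excluding the terminator), or (size_t)-1
   if mbs contains an invalid multibyte sequence in the current locale. */
size_t wide_length(const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;

    /* Start from the initial conversion state. */
    memset(&state, 0, sizeof state);

    /* With a null destination nothing is stored and the length argument
       is ignored; only the count is returned. */
    return mbsrtowcs(NULL, &src, 0, &state);
}
```

In the man page program this could stand in for the length query, as in mbslen = wide_length(argv[2]); with the (size_t) -1 check left unchanged.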