Author Topic: Wonky V12 RC1 Exotic Formatting  (Read 2167 times)

Offline Robert

  • Member
  • *
  • Posts: 247
Wonky V12 RC1 Exotic Formatting
« on: April 02, 2023, 09:29:05 AM »
Hi  Pelle:

The following code from

https://en.cppreference.com/w/c/io/fscanf

compiled with 12.0 RC1 outputs:

Code: [Select]
Converted 7 fields:
i = 25
x = 5.432000
str1 = Thompson
j = 56
y = 789.000000
str2 = 56
warr[0] = U+df
warr[1] = U+0


Expected output

Code: [Select]
Converted 7 fields:
i = 25
x = 5.432000
str1 = Thompson
j = 56
y = 789.000000
str2 = 56
warr[0] = U+df
warr[1] = U+6c34

Code: [Select]
// Code from https://en.cppreference.com/w/c/io/fscanf

#define __STDC_WANT_LIB_EXT1__ 1
#include <stdio.h>
#include <stddef.h>
#include <locale.h>
 
int main(void)
{
    int i, j;
    float x, y;
    char str1[10], str2[4];
    wchar_t warr[2];
    setlocale(LC_ALL, "en_US.utf8");
 
    char input[] = "25 54.32E-1 Thompson 56789 0123 56ß水";
    /* parse as follows:
       %d: an integer
       %f: a floating-point value
       %9s: a string of at most 9 non-whitespace characters
       %2d: two-digit integer (digits 5 and 6)
       %f:  a floating-point value (digits 7, 8, 9)
       %*d: an integer which isn't stored anywhere
       ' ': all consecutive whitespace
       %3[0-9]: a string of at most 3 decimal digits (digits 5 and 6)
       %2lc: two wide characters, using multibyte to wide conversion  */
    int ret = sscanf(input, "%d%f%9s%2d%f%*d %3[0-9]%2lc",
                     &i, &x, str1, &j, &y, str2, warr);
 
    printf("Converted %d fields:\n"
           "i = %d\n"
           "x = %f\n"
           "str1 = %s\n"
           "j = %d\n"
           "y = %f\n"
           "str2 = %s\n"
           "warr[0] = U+%x\n"
           "warr[1] = U+%x\n",
           ret, i, x, str1, j, y, str2, warr[0], warr[1]);
 
#ifdef __STDC_LIB_EXT1__
    int n = sscanf_s(input, "%d%f%s", &i, &x, str1, (rsize_t)sizeof str1);
    // writes 25 to i, 5.432 to x, the 9 bytes "Thompson\0" to str1, and 3 to n.
#endif
}

Offline Pelle

  • Administrator
  • Member
  • *****
  • Posts: 2266
    • http://www.smorgasbordet.com
Re: Wonky V12 RC1 Exotic Formatting
« Reply #1 on: April 02, 2023, 03:14:18 PM »
Hello Robert,

I assume you are using the /utf-8 compiler option.

Seems to be an old problem. An internal counter was counting bytes rather than characters. Not a visible problem before, with single-byte ANSI characters, but clearly a problem now with "exotic" (multi-byte) UTF-8 characters. Only warr[0] is updated, warr[1] picks up some random stack value (because it's never initialized).

Are you stuck on something because of this, or can I wait a little before uploading a corrected version (RC2 or final)?
/Pelle

Offline Robert

  • Member
  • *
  • Posts: 247
Re: Wonky V12 RC1 Exotic Formatting
« Reply #2 on: April 03, 2023, 04:08:20 AM »
Hi Pelle:

Thank you for your consideration. No hurry. I just stumbled upon this while looking for something else.

Offline Robert

  • Member
  • *
  • Posts: 247
Re: Wonky V12 RC1 Exotic Formatting
« Reply #3 on: April 03, 2023, 10:00:59 PM »
Hi Pelle:

Is it the same problem in the code below as with Wonky Exotic posted above? Or is it maybe Windows / Linux "never the twain shall meet" wchar_t difference?

Pelle's 12.0 RC1 64-bit compile output is

Code: [Select]
Length of source string (excluding terminator):
    6 bytes
    0 multibyte characters

Wide character string is: G聒ʄ (0 characters)
    G alpha upper
    聒 !alpha
    ʄ !alpha
   
Expected output

Code: [Select]
Length of source string (excluding terminator):
    8 bytes
    6 multibyte characters

Wide character string is: Grüße! (6 characters)
    G alpha upper
    r alpha lower
    ü alpha lower
    ß alpha lower
    e alpha lower
    ! !alpha

Command line usage is

Code: [Select]
mbstowcs.exe de_DE.UTF-8 Grüße!

Source code from

https://man7.org/linux/man-pages/man3/mbstowcs.3.html

Code: [Select]
#include <wctype.h>
#include <locale.h>
#include <wchar.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    size_t mbslen;      /* Number of multibyte characters in source */
    wchar_t *wcs;       /* Pointer to converted wide character string */

    if (argc < 3) {
        fprintf(stderr, "Usage: %s <locale> <string>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Apply the specified locale. */

    if (setlocale(LC_ALL, argv[1]) == NULL) {
        perror("setlocale");
        exit(EXIT_FAILURE);
    }

    /* Calculate the length required to hold argv[2] converted to
       a wide character string. */

    mbslen = mbstowcs(NULL, argv[2], 0);
    if (mbslen == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    /* Describe the source string to the user. */

    printf("Length of source string (excluding terminator):\n");
    printf("    %zu bytes\n", strlen(argv[2]));
    printf("    %zu multibyte characters\n\n", mbslen);

    /* Allocate wide character string of the desired size.  Add 1
       to allow for terminating null wide character (L'\0'). */

    wcs = calloc(mbslen + 1, sizeof(*wcs));
    if (wcs == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    /* Convert the multibyte character string in argv[2] to a
       wide character string. */

    if (mbstowcs(wcs, argv[2], mbslen + 1) == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    printf("Wide character string is: %ls (%zu characters)\n",
            wcs, mbslen);

    /* Now do some inspection of the classes of the characters in
       the wide character string. */

    for (wchar_t *wp = wcs; *wp != 0; wp++) {
        printf("    %lc ", (wint_t) *wp);

        if (!iswalpha(*wp))
            printf("!");
        printf("alpha ");

        if (iswalpha(*wp)) {
            if (iswupper(*wp))
                printf("upper ");

            if (iswlower(*wp))
                printf("lower ");
        }

        putchar('\n');
    }

    exit(EXIT_SUCCESS);
}

Offline Pelle

  • Administrator
  • Member
  • *****
  • Posts: 2266
    • http://www.smorgasbordet.com
Re: Wonky V12 RC1 Exotic Formatting
« Reply #4 on: April 03, 2023, 11:12:04 PM »
Hello Robert,

Quote from: Robert on April 03, 2023, 10:00:59 PM
Is it the same problem in the code below as with Wonky Exotic posted above? Or is it maybe Windows / Linux "never the twain shall meet" wchar_t difference?

The previous problem was very specifically about the scanf function family with a %lc specifier (%2lc in your case).

This problem seems to be about standard C (C99/C11/C17/C2X) vs Linux, starting here:
Code: [Select]
mbslen = mbstowcs(NULL, argv[2], 0);
Passing NULL as the first argument (and zero as the last) means different things in standard C and on Linux.

/Pelle