Pelles C forum

Pelles C => Feature requests => Topic started by: Robert on February 22, 2021, 12:17:02 am

Title: UTF-8 Identifiers
Post by: Robert on February 22, 2021, 12:17:02 am
UTF-8 identifiers, now supported by Clang and GCC 10, would be nice. The code below compiles and executes as expected with Martin Storsjö's LLVM-MinGW.

Code:

char * Όνομα_μήνα (int μετρητής)
{
  static const PCHAR DATA[]=
  {
    "ΦΟΙΝΙΚΑΙΟΣ","ΚΡΑΝΕΙΟΣ","ΛΑΝΟΤΡΟΠΙΟΣ","ΜΑΧΑΝΕΥΣ",
    "ΔΩΔΕΚΑΤΕΥΣ","ΕΥΚΛΕΙΟΣ","ΑΡΤΕΜΙΣΙΟΣ","ΨΥΔΡΕΥΣ",
    "ΓΑΜΕΙΛΙΟΣ","ΑΓΡΙΑΝΙΟΣ","ΠΑΝΑΜΟΣ","ΑΠΕΛΛΑΙΟΣ"
  };

Title: Re: UTF-8 Identifiers
Post by: Pelle on February 22, 2021, 07:59:18 am
Well, using the u8 string prefix should work (at least with the source file in an encoding like UTF-16, on my machine) ...

Code:
char *Όνομα_μήνα(int μετρητής)
{
    /*static */ const char *DATA[] =
    {
        u8"ΦΟΙΝΙΚΑΙΟΣ", u8"ΚΡΑΝΕΙΟΣ", u8"ΛΑΝΟΤΡΟΠΙΟΣ", u8"ΜΑΧΑΝΕΥΣ",
        u8"ΔΩΔΕΚΑΤΕΥΣ", u8"ΕΥΚΛΕΙΟΣ", u8"ΑΡΤΕΜΙΣΙΟΣ", u8"ΨΥΔΡΕΥΣ",
        u8"ΓΑΜΕΙΛΙΟΣ", u8"ΑΓΡΙΑΝΙΟΣ", u8"ΠΑΝΑΜΟΣ", u8"ΑΠΕΛΛΑΙΟΣ"
    };
}

Otherwise I think there will be problems with Microsoft/Windows compatibility. I'm not sure it's 100%, but the current behavior seems to match MSVC.
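To make the effect of the u8 prefix concrete: it fixes the bytes stored in the string to UTF-8, independent of the ANSI code page; it says nothing about which characters are accepted in identifiers. A minimal sketch, assuming a C11 or later compiler. The names `omega_utf8` and `omega_byte_count` are made up for this illustration, and the character is written as `\u03A9` so the result does not depend on the source file's encoding.

```c
#include <stddef.h>

/* GREEK CAPITAL LETTER OMEGA (U+03A9) as a UTF-8 string literal.
   An unsigned char array is used so the bytes can be compared against
   hex values without sign surprises. */
const unsigned char omega_utf8[] = u8"\u03A9";

/* Number of bytes in the encoded string, excluding the terminating NUL. */
size_t omega_byte_count(void)
{
    return sizeof(omega_utf8) - 1;
}
```

U+03A9 encodes to the two bytes 0xCE 0xA9 in UTF-8, so the array holds those two bytes plus the terminator.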


Title: Re: UTF-8 Identifiers
Post by: Robert on February 22, 2021, 09:48:41 pm
Hi Pelle:

I was referring to the identifiers: the names of variables, types, functions, labels, etc. In the example I posted, this part:

Code:

char *Όνομα_μήνα(int μετρητής)


From ISO/IEC 9899:202x, Annex D (normative) Universal character names for identifiers

Quote

                                  Annex D
                               (normative)
                 Universal character names for identifiers

1 This clause lists the hexadecimal code values that are valid in universal character names in identifiers.

                 D.1 Ranges of characters allowed
1 00A8, 00AA, 00AD, 00AF, 00B2–00B5, 00B7–00BA, 00BC–00BE, 00C0–00D6, 00D8–00F6, 00F8–00FF
2 0100–167F, 1681–180D, 180F–1FFF
3 200B–200D, 202A–202E, 203F–2040, 2054, 2060–206F
4 2070–218F, 2460–24FF, 2776–2793, 2C00–2DFF, 2E80–2FFF
5 3004–3007, 3021–302F, 3031–303F
6 3040–D7FF
7 F900–FD3D, FD40–FDCF, FDF0–FE44, FE47–FFFD
8 10000–1FFFD, 20000–2FFFD, 30000–3FFFD, 40000–4FFFD, 50000–5FFFD, 60000–6FFFD, 70000–7FFFD, 80000–8FFFD, 90000–9FFFD, A0000–AFFFD, B0000–BFFFD, C0000–CFFFD, D0000–DFFFD, E0000–EFFFD
               D.2 Ranges of characters disallowed initially
1 0300–036F, 1DC0–1DFF, 20D0–20FF, FE20–FE2F
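The ranges quoted above translate fairly directly into a lookup. This is only a sketch: the names `ucn_allowed` and `ucn_allowed_initially` are hypothetical and do not come from the standard or from any compiler. The basic letters, digits and underscore belong to the ordinary identifier grammar, so they are deliberately not covered here.

```c
#include <stdbool.h>
#include <stddef.h>

struct cp_range { unsigned long first, last; };

/* D.1, items 1 to 7 (item 8 is handled arithmetically below). */
static const struct cp_range d1_ranges[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* D.2: allowed in an identifier, but not as its first character. */
static const struct cp_range d2_ranges[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

static bool in_ranges(unsigned long cp, const struct cp_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp >= r[i].first && cp <= r[i].last)
            return true;
    return false;
}

/* True if cp may appear somewhere in an identifier according to D.1. */
bool ucn_allowed(unsigned long cp)
{
    /* D.1 item 8: planes 1 to 14, each one minus its last two code points. */
    if (cp >= 0x10000UL && cp <= 0xEFFFDUL)
        return (cp & 0xFFFFUL) <= 0xFFFDUL;
    return in_ranges(cp, d1_ranges, sizeof d1_ranges / sizeof d1_ranges[0]);
}

/* True if cp may also be the first character of an identifier (D.1 minus D.2). */
bool ucn_allowed_initially(unsigned long cp)
{
    return ucn_allowed(cp) &&
           !in_ranges(cp, d2_ranges, sizeof d2_ranges / sizeof d2_ranges[0]);
}
```

For instance, U+038C (the Ό that starts Όνομα_μήνα) passes both checks, while a combining accent such as U+0301 is allowed inside an identifier but not at its start.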


Martin Storsjö's LLVM-MinGW has implemented this and I have used it on Windows. I believe that Martin has also done this on the latest MinGW64.

Microsoft C/C++ identifiers are still ASCII:

Quote

nondigit: one of
    _ a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

digit: one of
    0 1 2 3 4 5 6 7 8 9


quoted from

Quote
https://docs.microsoft.com/en-us/cpp/c-language/c-identifiers?view=msvc-160

but in general Microsoft is moving toward UTF-8 and away from UTF-16.

Quote

-A vs. -W APIs
Win32 APIs often support both -A and -W variants.

-A variants recognize the ANSI code page configured on the system and support char*, while -W variants operate in UTF-16 and support WCHAR.

Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.


Quoted from

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page



Title: Re: UTF-8 Identifiers
Post by: Pelle on February 23, 2021, 07:24:50 am
Quote
I was referring to the identifiers, the names of variables, types, functions, labels etc. In the example I posted, this part
...

Ah! OK ...

Quote
...
but in general moving toward UTF-8 and away from UTF-16.

I wasn't aware of this. I will look at it... (but can't promise anything right now).

I guess the standard C way of using "universal-character-names" should work...
Code:
\uxxxx  (xxxx = four hex digits)
\Uxxxxxxxx  (xxxxxxxx = eight hex digits)
... but it gets tedious rather quickly...
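
For a sense of how tedious, here is a sketch of the function from the first post with every Greek letter in the identifiers spelled as a universal character name. It assumes a compiler that accepts UCNs in identifiers (whether Pelles C does is not confirmed in this thread). The `month_names` table, the bounds check and the three-entry list are additions made only to keep the example short and complete.

```c
#include <stddef.h>

/* First three month names from the example at the top of the thread. */
static const char *const month_names[] = {
    "ΦΟΙΝΙΚΑΙΟΣ", "ΚΡΑΝΕΙΟΣ", "ΛΑΝΟΤΡΟΠΙΟΣ"
};

/* Same identifiers as Όνομα_μήνα(int μετρητής), written with UCNs:
   038C = Ό, 03BD = ν, 03BF = ο, 03BC = μ, 03B1 = α, 03AE = ή,
   03B5 = ε, 03C4 = τ, 03C1 = ρ, 03B7 = η, 03C2 = ς */
const char *\u038C\u03BD\u03BF\u03BC\u03B1_\u03BC\u03AE\u03BD\u03B1(
    int \u03BC\u03B5\u03C4\u03C1\u03B7\u03C4\u03AE\u03C2)
{
    size_t count = sizeof month_names / sizeof month_names[0];

    /* Return NULL for an index outside the table. */
    if (\u03BC\u03B5\u03C4\u03C1\u03B7\u03C4\u03AE\u03C2 < 0 ||
        (size_t)\u03BC\u03B5\u03C4\u03C1\u03B7\u03C4\u03AE\u03C2 >= count)
        return NULL;

    return month_names[\u03BC\u03B5\u03C4\u03C1\u03B7\u03C4\u03AE\u03C2];
}
```

A compiler that also accepts UTF-8 source directly should treat `Όνομα_μήνα` and the escaped spelling as the same identifier.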
Title: Re: UTF-8 Identifiers
Post by: Robert on February 23, 2021, 08:21:57 am
My head hurts just thinking about the standard C way of using "universal-character-names"!

Your IDE already defaults to UTF-8, so it would be nice to add a level of sophistication and accessibility for non-ASCII coders.

Thank you Pelle.
Title: Re: UTF-8 Identifiers
Post by: Pelle on February 24, 2021, 05:24:19 pm
Good news: it wasn't too hard to add a new compiler option (/utf-8) that switches from the default ANSI code page to UTF-8 (both for the runtime and for source files without a BOM). It will be in the next version.
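
As background on the "without a BOM" part: a byte order mark is a short signature at the start of a file that identifies its encoding, so only files lacking one need a default. The sketch below shows how such a signature can be recognized. It is not Pelles C's actual detection logic, which is not shown in this thread; `detect_bom` and `enum bom_kind` are hypothetical names, and UTF-32 signatures are ignored for brevity.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first bytes of a buffer and report which BOM, if any, it starts with. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    /* UTF-8 signature: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    /* UTF-16 little-endian signature: FF FE */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    /* UTF-16 big-endian signature: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    /* No signature: the encoding has to come from a default or an option. */
    return BOM_NONE;
}
```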
Title: Re: UTF-8 Identifiers
Post by: Robert on February 24, 2021, 10:08:13 pm
شكرا لك
આભાર
Баярлалаа
Cảm ơn bạn
謝謝
Thank you