Pelles C > Feature requests

UTF-8 Identifiers


Robert:
UTF-8 identifiers, now supported by Clang and GCC 10, would be nice. The code below compiles and executes as expected with Martin Storsjö's LLVM-MinGW:


--- Code: ---
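#include <windows.h>  /* for PCHAR */

/* Όνομα_μήνα = "month name", μετρητής = "counter" */
char * Όνομα_μήνα (int μετρητής)
{
  static const PCHAR DATA[]=
  {
    "ΦΟΙΝΙΚΑΙΟΣ","ΚΡΑΝΕΙΟΣ","ΛΑΝΟΤΡΟΠΙΟΣ","ΜΑΧΑΝΕΥΣ",
    "ΔΩΔΕΚΑΤΕΥΣ","ΕΥΚΛΕΙΟΣ","ΑΡΤΕΜΙΣΙΟΣ","ΨΥΔΡΕΥΣ",
    "ΓΑΜΕΙΛΙΟΣ","ΑΓΡΙΑΝΙΟΣ","ΠΑΝΑΜΟΣ","ΑΠΕΛΛΑΙΟΣ"
  };
  /* The posted snippet stopped after the table. Assumption: the function
     returns the entry at index μετρητής (0..11, not range-checked here). */
  return DATA[μετρητής];
}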


--- End code ---

Pelle:
Well, using the u8 string prefix should work (at least with the source file in an encoding like UTF-16, on my machine) ...


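--- Code: ---
char *Όνομα_μήνα(int μετρητής)
{
    /*static */ const char *DATA[] =
    {
        u8"ΦΟΙΝΙΚΑΙΟΣ", u8"ΚΡΑΝΕΙΟΣ", u8"ΛΑΝΟΤΡΟΠΙΟΣ", u8"ΜΑΧΑΝΕΥΣ",
        u8"ΔΩΔΕΚΑΤΕΥΣ", u8"ΕΥΚΛΕΙΟΣ", u8"ΑΡΤΕΜΙΣΙΟΣ", u8"ΨΥΔΡΕΥΣ",
        u8"ΓΑΜΕΙΛΙΟΣ", u8"ΑΓΡΙΑΝΙΟΣ", u8"ΠΑΝΑΜΟΣ", u8"ΑΠΕΛΛΑΙΟΣ"
    };
    /* Added so the function returns a value (assumed intent); the cast
       drops const to match the declared return type. */
    return (char *)DATA[μετρητής];
}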

--- End code ---

Otherwise I think there will be problems with Microsoft/Windows compatibility. I'm not sure it's 100%, but the current behavior seems to match MSVC.
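
For readers who want to check what the u8 prefix actually produces, here is a minimal sketch, assuming a C11 (or later) compiler. The names `omega`, `omega_length` and `omega_self_check` are made up for this illustration. The character is written as a universal character name inside the literal, so the source file stays pure ASCII and the source encoding mentioned above does not come into play.

```c
#include <assert.h>
#include <stddef.h>

/* U+03A9 GREEK CAPITAL LETTER OMEGA, spelled as a universal character
   name so this file is pure ASCII. The u8 prefix asks the compiler to
   store the literal as UTF-8 bytes. */
static const unsigned char omega[] = u8"\u03A9";

/* Number of bytes in the literal, not counting the terminating NUL. */
static size_t omega_length(void)
{
    return sizeof omega - 1;
}

/* Hypothetical self-check: U+03A9 encodes as the two bytes CE A9. */
static void omega_self_check(void)
{
    assert(omega_length() == 2);
    assert(omega[0] == 0xCE && omega[1] == 0xA9);
}
```

This only shows the encoding of string data. It says nothing about identifiers, which are a separate matter.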


Robert:

--- Quote from: Pelle on February 22, 2021, 07:59:18 AM ---Well, using the u8 string prefix should work (at least with the source file in an encoding like UTF-16, on my machine) ...
...

--- End quote ---


Hi Pelle:

I was referring to the identifiers: the names of variables, types, functions, labels, etc. In the example I posted, this part:


--- Code: ---
char *Όνομα_μήνα(int μετρητής)


--- End code ---

From ISO/IEC 9899:202x, Annex D (normative), "Universal character names for identifiers":


--- Quote ---
                                  Annex D
                               (normative)
                 Universal character names for identifiers

1 This clause lists the hexadecimal code values that are valid in universal character names in identifiers.

                 D.1 Ranges of characters allowed
1 00A8, 00AA, 00AD, 00AF, 00B2–00B5, 00B7–00BA, 00BC–00BE, 00C0–00D6, 00D8–00F6, 00F8–00FF
2 0100–167F, 1681–180D, 180F–1FFF
3 200B–200D, 202A–202E, 203F–2040, 2054, 2060–206F
4 2070–218F, 2460–24FF, 2776–2793, 2C00–2DFF, 2E80–2FFF
5 3004–3007, 3021–302F, 3031–303F
6 3040–D7FF
7 F900–FD3D, FD40–FDCF, FDF0–FE44, FE47–FFFD
8 10000–1FFFD, 20000–2FFFD, 30000–3FFFD, 40000–4FFFD, 50000–5FFFD, 60000–6FFFD, 70000–7FFFD,
80000–8FFFD, 90000–9FFFD, A0000–AFFFD, B0000–BFFFD, C0000–CFFFD, D0000–DFFFD, E0000–EFFFD
               D.2 Ranges of characters disallowed initially
1 0300–036F, 1DC0–1DFF, 20D0–20FF, FE20–FE2F


--- End quote ---
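
The ranges quoted above translate fairly directly into a lookup. The following is only a sketch of how a lexer might test a code point against them, not how any particular compiler does it. The names (`cp_range`, `annex_d_allowed`, `annex_d_allowed_initially`, `annex_d_self_check`) are invented for this example, and the tables are copied from the quote. ASCII letters, digits and the underscore are covered by the ordinary identifier grammar, so they are deliberately not in these tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cp_range { uint32_t lo, hi; };

/* D.1, paragraphs 1 to 7 (everything below U+10000). */
static const struct cp_range d1_bmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* D.2: allowed inside an identifier, but not as its first character. */
static const struct cp_range d2_not_initial[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

#define COUNT_OF(a) (sizeof (a) / sizeof (a)[0])

/* Linear search; the tables are small enough that this is adequate. */
static int in_ranges(uint32_t cp, const struct cp_range *r, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    return 0;
}

/* Nonzero if cp is listed in D.1. */
static int annex_d_allowed(uint32_t cp)
{
    /* D.1 paragraph 8: planes 1 to 14, each ending at xFFFD. */
    if (cp >= 0x10000 && cp <= 0xEFFFD)
        return (cp & 0xFFFF) <= 0xFFFD;
    return in_ranges(cp, d1_bmp, COUNT_OF(d1_bmp));
}

/* Nonzero if cp is listed in D.1 and not excluded by D.2. */
static int annex_d_allowed_initially(uint32_t cp)
{
    return annex_d_allowed(cp)
        && !in_ranges(cp, d2_not_initial, COUNT_OF(d2_not_initial));
}

/* Hypothetical self-check using characters from the example identifier. */
static void annex_d_self_check(void)
{
    assert(annex_d_allowed_initially(0x038C));   /* the capital letter that starts the identifier */
    assert(annex_d_allowed_initially(0x03BC));   /* Greek small letter mu */
    assert(annex_d_allowed(0x0301));             /* combining acute accent */
    assert(!annex_d_allowed_initially(0x0301));  /* ... but not in first position */
    assert(!annex_d_allowed(0x00D7));            /* multiplication sign is excluded */
}
```

Every character of the Greek identifier in my example falls in the 0100–167F range, so it passes both checks.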

Martin Storsjö's LLVM-MinGW has implemented this and I have used it on Windows. I believe that Martin has also done this on the latest MinGW64.

According to Microsoft's documentation, Microsoft C/C++ identifiers are still ASCII-only:


--- Quote ---
nondigit: one of
    _ a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

digit: one of
    0 1 2 3 4 5 6 7 8 9


--- End quote ---

quoted from


--- Quote ---https://docs.microsoft.com/en-us/cpp/c-language/c-identifiers?view=msvc-160
--- End quote ---

In general, though, Microsoft is moving toward UTF-8 and away from UTF-16:


--- Quote ---
-A vs. -W APIs
Win32 APIs often support both -A and -W variants.

-A variants recognize the ANSI code page configured on the system and support char*, while -W variants operate in UTF-16 and support WCHAR.

Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.


--- End quote ---

Quoted from

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page



Pelle:

--- Quote from: Robert on February 22, 2021, 09:48:41 PM ---I was referring to the identifiers, the names of variables, types, functions, labels etc. In the example I posted, this part
...

--- End quote ---
Ah! OK ...


--- Quote from: Robert on February 22, 2021, 09:48:41 PM ---...
but in general moving toward UTF-8 and away from UTF-16.

--- End quote ---
I wasn't aware of this. I will look at it... (but can't promise anything right now).

I guess the standard C way of using "universal-character-names" should work...

--- Code: ---
\uxxxx      (xxxx = four hex digits)
\Uxxxxxxxx  (xxxxxxxx = eight hex digits)

--- End code ---
... but it gets tedious rather quickly...
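
To take some of the tedium out of it, the spelling can be generated. Below is a rough sketch of a helper that turns a UTF-8 encoded name into its universal-character-name form, using \uxxxx for code points up to FFFF and \Uxxxxxxxx above that. `utf8_decode` and `utf8_to_ucn` are hypothetical names for this illustration, not part of any library, and the decoder is simplified: it does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence from s (avail bytes remain) into *cp.
   Returns the number of bytes consumed, or 0 if the input is malformed. */
static size_t utf8_decode(const unsigned char *s, size_t avail, unsigned long *cp)
{
    size_t n, i;

    if (avail == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;                   /* stray continuation or invalid lead byte */

    if (n > avail) return 0;         /* sequence cut off by end of string */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Write the UCN spelling of the UTF-8 string `in` into `out` (outsz bytes).
   ASCII characters are copied unchanged. Returns 0 on success, -1 on
   malformed input or if `out` is too small (out then holds a partial result). */
static int utf8_to_ucn(const char *in, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t left, used = 0;

    assert(in != NULL && out != NULL);
    if (outsz == 0) return -1;
    out[0] = '\0';
    left = strlen(in);

    while (left > 0) {
        unsigned long cp;
        size_t n = utf8_decode(s, left, &cp);
        int w;

        if (n == 0) return -1;
        if (cp < 0x80)
            w = snprintf(out + used, outsz - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            w = snprintf(out + used, outsz - used, "\\u%04lX", cp);
        else
            w = snprintf(out + used, outsz - used, "\\U%08lX", cp);
        if (w < 0 || (size_t)w >= outsz - used) return -1;

        used += (size_t)w;
        s += n;
        left -= n;
    }
    return 0;
}
```

Fed the UTF-8 bytes of the function name from the first post, `utf8_to_ucn` yields `\u038C\u03BD\u03BF\u03BC\u03B1_\u03BC\u03AE\u03BD\u03B1`, which illustrates the readability problem rather well.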

Robert:

--- Quote from: Pelle on February 23, 2021, 07:24:50 AM ---I guess the standard C way of using "universal-character-names" should work...
...
... but it gets tedious rather quickly...

--- End quote ---

My head hurts just thinking about the standard C way of using "universal-character-names"!

Your IDE already defaults to UTF-8, so it would be nice to add a level of sophistication and accessibility for non-ASCII coders.

Thank you Pelle.
