UTF-8 Identifiers

Pelles C > Feature requests

Robert:
UTF-8 identifiers, now supported by Clang and GCC 10, would be nice. The code below compiles and executes as expected with Martin Storsjö's LLVM-MinGW.

--- Code: ---
char * Όνομα_μήνα (int μετρητής)
{
    static const PCHAR DATA[]=
    {
        "ΦΟΙΝΙΚΑΙΟΣ","ΚΡΑΝΕΙΟΣ","ΛΑΝΟΤΡΟΠΙΟΣ","ΜΑΧΑΝΕΥΣ",
        "ΔΩΔΕΚΑΤΕΥΣ","ΕΥΚΛΕΙΟΣ","ΑΡΤΕΜΙΣΙΟΣ","ΨΥΔΡΕΥΣ",
        "ΓΑΜΕΙΛΙΟΣ","ΑΓΡΙΑΝΙΟΣ","ΠΑΝΑΜΟΣ","ΑΠΕΛΛΑΙΟΣ"
    };
}

--- End code ---

Pelle:
Well, using the u8 string prefix should work (at least with the source file in an encoding like UTF-16, on my machine) ...

--- Code: ---
char *Όνομα_μήνα(int μετρητής)
{
    /*static */ const char *DATA[] =
    {
        u8"ΦΟΙΝΙΚΑΙΟΣ", u8"ΚΡΑΝΕΙΟΣ", u8"ΛΑΝΟΤΡΟΠΙΟΣ", u8"ΜΑΧΑΝΕΥΣ",
        u8"ΔΩΔΕΚΑΤΕΥΣ", u8"ΕΥΚΛΕΙΟΣ", u8"ΑΡΤΕΜΙΣΙΟΣ", u8"ΨΥΔΡΕΥΣ",
        u8"ΓΑΜΕΙΛΙΟΣ", u8"ΑΓΡΙΑΝΙΟΣ", u8"ΠΑΝΑΜΟΣ", u8"ΑΠΕΛΛΑΙΟΣ"
    };
}

--- End code ---

Otherwise I think there will be problems with Microsoft/Windows compatibility. I'm not sure it's 100%, but the current behavior seems to match MSVC.
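
As a small illustration of what the u8 prefix does to string contents (not to identifiers), here is a minimal sketch. The function name omega_utf8 is made up for this example, and the character is written as a universal character name so the result should not depend on the source file encoding.

```c
#include <stddef.h>

/* Hypothetical helper: returns the UTF-8 bytes of GREEK CAPITAL LETTER OMEGA
   (U+03A9) and stores their count, excluding the terminating NUL, in *len. */
const unsigned char *omega_utf8(size_t *len)
{
    /* A u8 literal is encoded as UTF-8 whatever the execution character set is. */
    static const unsigned char bytes[] = u8"\u03A9";

    *len = sizeof bytes - 1;
    return bytes;
}
```

With a conforming compiler the two bytes should be 0xCE 0xA9, the UTF-8 encoding of U+03A9.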

Robert:

--- Quote from: Pelle on February 22, 2021, 07:59:18 AM ---Well, using the u8 string prefix should work (at least with the source file in an encoding like UTF-16, on my machine) ...

--- End quote ---

Hi Pelle:

I was referring to the identifiers: the names of variables, types, functions, labels, etc. In the example I posted, I mean this part:

--- Code: ---
char *Όνομα_μήνα(int μετρητής)

--- End code ---

From ISO/IEC 9899:202x, Annex D (normative), Universal character names for identifiers:

--- Quote ---
Annex D
(normative)
Universal character names for identifiers

1 This clause lists the hexadecimal code values that are valid in universal character names in identifiers.

D.1 Ranges of characters allowed
1 00A8, 00AA, 00AD, 00AF, 00B2–00B5, 00B7–00BA, 00BC–00BE, 00C0–00D6, 00D8–00F6, 00F8–00FF
2 0100–167F, 1681–180D, 180F–1FFF
3 200B–200D, 202A–202E, 203F–2040, 2054, 2060–206F
4 2070–218F, 2460–24FF, 2776–2793, 2C00–2DFF, 2E80–2FFF
5 3004–3007, 3021–302F, 3031–303F
6 3040–D7FF
7 F900–FD3D, FD40–FDCF, FDF0–FE44, FE47–FFFD
8 10000–1FFFD, 20000–2FFFD, 30000–3FFFD, 40000–4FFFD, 50000–5FFFD, 60000–6FFFD, 70000–7FFFD, 80000–8FFFD, 90000–9FFFD, A0000–AFFFD, B0000–BFFFD, C0000–CFFFD, D0000–DFFFD, E0000–EFFFD
D.2 Ranges of characters disallowed initially
1 0300–036F, 1DC0–1DFF, 20D0–20FF, FE20–FE2F

--- End quote ---
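
To make the ranges above a bit more concrete, here is a rough sketch of how a lexer might test a code point against D.1 and D.2. The names cp_range, ucn_allowed and ucn_allowed_initially are made up for this post; this is not how any particular compiler does it.

```c
#include <stdbool.h>
#include <stddef.h>

/* One inclusive range of code points. */
struct cp_range { unsigned long lo, hi; };

/* D.1 items 1 to 7 as quoted above (Basic Multilingual Plane). */
static const struct cp_range allowed_bmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* D.2 as quoted above: allowed in identifiers, but not as the first character. */
static const struct cp_range disallowed_initially[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

/* Linear search; the tables are small enough that this is fine for a sketch. */
static bool in_ranges(unsigned long cp, const struct cp_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp >= r[i].lo && cp <= r[i].hi)
            return true;
    return false;
}

/* True if cp may appear in an identifier according to the quoted D.1 list. */
bool ucn_allowed(unsigned long cp)
{
    /* D.1 item 8: planes 1 to 14, minus the last two code points of each plane. */
    if (cp >= 0x10000UL && cp <= 0xEFFFDUL)
        return (cp & 0xFFFFUL) <= 0xFFFDUL;
    return in_ranges(cp, allowed_bmp, sizeof allowed_bmp / sizeof allowed_bmp[0]);
}

/* True if cp may also start an identifier (D.1 minus D.2). */
bool ucn_allowed_initially(unsigned long cp)
{
    return ucn_allowed(cp) &&
           !in_ranges(cp, disallowed_initially,
                      sizeof disallowed_initially / sizeof disallowed_initially[0]);
}
```

For example, U+038C (the first letter of Όνομα_μήνα) passes both checks, while a combining accent such as U+0301 is accepted by ucn_allowed but rejected by ucn_allowed_initially.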

Martin Storsjö's LLVM-MinGW has implemented this, and I have used it on Windows. I believe that Martin has also done this on the latest MinGW64.

According to Microsoft's documentation, Microsoft C/C++ identifiers are still ASCII:

--- Quote ---
nondigit: one of
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

digit: one of
0 1 2 3 4 5 6 7 8 9

--- End quote ---

quoted from

--- Quote ---https://docs.microsoft.com/en-us/cpp/c-language/c-identifiers?view=msvc-160
--- End quote ---

but in general Microsoft is moving toward UTF-8 and away from UTF-16:

--- Quote ---
-A vs. -W APIs
Win32 APIs often support both -A and -W variants.

-A variants recognize the ANSI code page configured on the system and support char*, while -W variants operate in UTF-16 and support WCHAR.

Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.

--- End quote ---

Quoted from

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
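
To illustrate the point about the ANSI code page, here is a minimal sketch of how a program could check whether the -A APIs will be working in UTF-8. GetACP and the value 65001 (CP_UTF8) are real Win32; the helper names is_utf8_code_page and ansi_apis_use_utf8 are made up for this post.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Code page identifier Windows uses for UTF-8 (CP_UTF8 in <windows.h>). */
#define UTF8_CODE_PAGE 65001u

/* Hypothetical helper: true if the given code page identifier is UTF-8. */
bool is_utf8_code_page(unsigned int code_page)
{
    return code_page == UTF8_CODE_PAGE;
}

#ifdef _WIN32
/* Windows only: true when the active ANSI code page of the process is UTF-8,
   in which case the -A APIs interpret char* strings as UTF-8. */
bool ansi_apis_use_utf8(void)
{
    return is_utf8_code_page(GetACP());
}
#endif
```

On a system where the code page is still something like 1252, ansi_apis_use_utf8 would return false and char* strings passed to -A functions would not be treated as UTF-8.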

Pelle:

--- Quote from: Robert on February 22, 2021, 09:48:41 PM ---I was referring to the identifiers, the names of variables, types, functions, labels etc. In the example I posted, this part
...

--- End quote ---
Ah! OK ...

--- Quote from: Robert on February 22, 2021, 09:48:41 PM ---...
but in general Microsoft is moving toward UTF-8 and away from UTF-16.

--- End quote ---
I wasn't aware of this. I will look at it... (but can't promise anything right now).

I guess the standard C way of using "universal-character-names" should work...

--- Code: ---
\uxxxx (xxxx = four hex digits)
\Uxxxxxxxx (xxxxxxxx = eight hex digits)

--- End code ---
... but it gets tedious rather quickly...
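
For what it's worth, a minimal sketch of the universal-character-name spelling in an identifier might look like the following. The variable is the Greek word μήνας ("month") written out with \u escapes, and the function name current_month is made up for this example.

```c
/* \u03BC\u03AE\u03BD\u03B1\u03C2 spells the Greek word "μήνας" (month):
   mu, eta with tonos, nu, alpha, final sigma. */
static int \u03BC\u03AE\u03BD\u03B1\u03C2 = 2;

/* Returns the value stored in the variable named with universal character names. */
int current_month(void)
{
    return \u03BC\u03AE\u03BD\u03B1\u03C2;
}
```

Every non-ASCII letter costs six characters of source text this way, which is why a whole function name spelled like this is hard to read.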

Robert:

--- Quote from: Pelle on February 23, 2021, 07:24:50 AM ---
I guess the standard C way of using "universal-character-names" should work...

--- End quote ---

My head hurts just thinking about the standard C way of using "universal-character-names"!

Your IDE already defaults to UTF-8, so it would be nice to add this level of sophistication and accessibility for non-ASCII coders.

Thank you Pelle.
