Title: UTF-8 Identifiers
Post by: Robert on February 22, 2021, 12:17:02 AM

UTF-8 identifiers, now supported by Clang and GCC 10, would be nice. The code below compiles and executes as expected with Martin Storsjö's LLVM-MinGW.

Code:

char *Όνομα_μήνα(int μετρητής)
{
  static char *DATA[] =
  {
    "ΦΟΙΝΙΚΑΙΟΣ","ΚΡΑΝΕΙΟΣ","ΛΑΝΟΤΡΟΠΙΟΣ","ΜΑΧΑΝΕΥΣ",
    "ΔΩΔΕΚΑΤΕΥΣ","ΕΥΚΛΕΙΟΣ","ΑΡΤΕΜΙΣΙΟΣ","ΨΥΔΡΕΥΣ",
    "ΓΑΜΕΙΛΙΟΣ","ΑΓΡΙΑΝΙΟΣ","ΠΑΝΑΜΟΣ","ΑΠΕΛΛΑΙΟΣ"
  };

  /* Valid months are 1..12 */
  if (μετρητής < 1 || μετρητής > 12)
    return 0;
  return DATA[μετρητής - 1];
}

Title: Re: UTF-8 Identifiers
Post by: Pelle on February 22, 2021, 07:59:18 AM

Well, using the u8 string prefix should work (at least with the source file in an encoding like UTF-16, on my machine) ...

Code:

char *Όνομα_μήνα(int μετρητής)
{
    /*static */ const char *DATA[] =
    {
        u8"ΦΟΙΝΙΚΑΙΟΣ", u8"ΚΡΑΝΕΙΟΣ", u8"ΛΑΝΟΤΡΟΠΙΟΣ", u8"ΜΑΧΑΝΕΥΣ",
        u8"ΔΩΔΕΚΑΤΕΥΣ", u8"ΕΥΚΛΕΙΟΣ", u8"ΑΡΤΕΜΙΣΙΟΣ", u8"ΨΥΔΡΕΥΣ",
        u8"ΓΑΜΕΙΛΙΟΣ", u8"ΑΓΡΙΑΝΙΟΣ", u8"ΠΑΝΑΜΟΣ", u8"ΑΠΕΛΛΑΙΟΣ"
    };
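
    /* Valid months are 1..12; the cast drops const to match the return type. */
    if (μετρητής < 1 || μετρητής > 12)
        return 0;
    return (char *)DATA[μετρητής - 1];
}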

Otherwise I think there will be problems with Microsoft/Windows compatibility. I'm not sure it's 100%, but the current behavior seems to match MSVC.

Title: Re: UTF-8 Identifiers
Post by: Robert on February 22, 2021, 09:48:41 PM

Hi Pelle:

I was referring to the identifiers, the names of variables, types, functions, labels etc. In the example I posted, this part:

Code:

char *Όνομα_μήνα(int μετρητής)

From ISO/IEC 9899:202x, Annex D (normative), Universal character names for identifiers:

Quote

Annex D
(normative)
Universal character names for identifiers

1 This clause lists the hexadecimal code values that are valid in universal character names in identifiers.

D.1 Ranges of characters allowed
1 00A8, 00AA, 00AD, 00AF, 00B2–00B5, 00B7–00BA, 00BC–00BE, 00C0–00D6, 00D8–00F6, 00F8–00FF
2 0100–167F, 1681–180D, 180F–1FFF
3 200B–200D, 202A–202E, 203F–2040, 2054, 2060–206F
4 2070–218F, 2460–24FF, 2776–2793, 2C00–2DFF, 2E80–2FFF
5 3004–3007, 3021–302F, 3031–303F
6 3040–D7FF
7 F900–FD3D, FD40–FDCF, FDF0–FE44, FE47–FFFD
8 10000–1FFFD, 20000–2FFFD, 30000–3FFFD, 40000–4FFFD, 50000–5FFFD, 60000–6FFFD, 70000–7FFFD, 80000–8FFFD, 90000–9FFFD, A0000–AFFFD, B0000–BFFFD, C0000–CFFFD, D0000–DFFFD, E0000–EFFFD
D.2 Ranges of characters disallowed initially
1 0300–036F, 1DC0–1DFF, 20D0–20FF, FE20–FE2F

Martin Storsjö's LLVM-MinGW has implemented this and I have used it on Windows. I believe that Martin has also done this on the latest MinGW64.
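
To make the Annex D table quoted above a little more concrete, here is a minimal sketch of a range check in C. The function names `ucn_allowed` and `ucn_allowed_initially` are made up for illustration and are not taken from any compiler. The tables simply transcribe D.1 and D.2 from the draft quoted above, so a compiler following a different revision of the standard may accept a different set. Basic ASCII letters, digits and the underscore are handled by the ordinary identifier grammar and are deliberately not covered here.

```c
#include <stdbool.h>
#include <stddef.h>

/* One inclusive range of code points. */
struct range { unsigned long lo, hi; };

/* D.1, items 1 to 7: allowed ranges in the Basic Multilingual Plane. */
static const struct range allowed_bmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* D.2: allowed in an identifier, but not as its first character. */
static const struct range not_initial[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

static bool in_ranges(unsigned long cp, const struct range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp >= r[i].lo && cp <= r[i].hi)
            return true;
    return false;
}

/* True if cp may appear somewhere in an identifier according to D.1. */
bool ucn_allowed(unsigned long cp)
{
    /* D.1, item 8: planes 1 to 14, excluding the last two values of each. */
    if (cp >= 0x10000 && cp <= 0xEFFFD)
        return (cp & 0xFFFF) <= 0xFFFD;
    return in_ranges(cp, allowed_bmp,
                     sizeof allowed_bmp / sizeof allowed_bmp[0]);
}

/* True if cp may also be the first character (D.1 minus D.2). */
bool ucn_allowed_initially(unsigned long cp)
{
    return ucn_allowed(cp) &&
           !in_ranges(cp, not_initial,
                      sizeof not_initial / sizeof not_initial[0]);
}
```

For example, `ucn_allowed_initially(0x038C)` (the Ό that starts Όνομα_μήνα) is true, while a combining accent such as U+0301 is allowed inside an identifier but not as its first character.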

Microsoft C/C++ identifiers are still ASCII:

Quote

nondigit: one of
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

digit: one of
0 1 2 3 4 5 6 7 8 9

quoted from

https://docs.microsoft.com/en-us/cpp/c-language/c-identifiers?view=msvc-160

but in general Microsoft is moving toward UTF-8 and away from UTF-16.

Quote

-A vs. -W APIs
Win32 APIs often support both -A and -W variants.

-A variants recognize the ANSI code page configured on the system and support char*, while -W variants operate in UTF-16 and support WCHAR.

Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.

Quoted from

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

Title: Re: UTF-8 Identifiers
Post by: Pelle on February 23, 2021, 07:24:50 AM

Quote from: Robert on February 22, 2021, 09:48:41 PM
I was referring to the identifiers, the names of variables, types, functions, labels etc. In the example I posted, this part
...

Ah! OK ...

Quote from: Robert on February 22, 2021, 09:48:41 PM
...
but in general moving toward UTF-8 and away from UTF-16.

I wasn't aware of this. I will look at it... (but can't promise anything right now).

I guess the standard C way of using "universal-character-names" should work...

Code:

\uxxxx  (xxxx = four hex digits)
\Uxxxxxxxx  (xxxxxxxx = eight hex digits)

... but it gets tedious rather quickly...
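
Writing universal character names by hand is indeed tedious, but the spelling can be generated mechanically. The sketch below converts a UTF-8 string into its \uXXXX / \UXXXXXXXX form. `utf8_to_ucn` is a hypothetical helper written for this example, not part of any toolchain, and it only checks the lead and continuation byte structure, so it is not a full UTF-8 validator.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the universal-character-name spelling of the
 * UTF-8 string `in` into `out` (capacity `cap`, terminator included).
 * ASCII bytes are copied unchanged. Returns the number of characters
 * written, or -1 if a lead or continuation byte is malformed or the
 * buffer is too small.
 */
int utf8_to_ucn(const char *in, char *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    if (cap == 0)
        return -1;

    while (*p) {
        unsigned long cp;
        int extra;   /* number of continuation bytes still expected */
        int n;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;
        p++;

        /* Fold in the continuation bytes (10xxxxxx), six bits each. */
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
        }

        if (cp < 0x80)
            n = snprintf(out + used, cap - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            n = snprintf(out + used, cap - used, "\\u%04lX", cp);
        else
            n = snprintf(out + used, cap - used, "\\U%08lX", cp);

        /* snprintf reports the length it wanted; reject truncation. */
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    out[used] = '\0';
    return (int)used;
}
```

Given the UTF-8 bytes of μήνα, the buffer should end up holding \u03BC\u03AE\u03BD\u03B1, which is the spelling a compiler without UTF-8 source support would need for that identifier fragment.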

Title: Re: UTF-8 Identifiers
Post by: Robert on February 23, 2021, 08:21:57 AM

My head hurts just thinking about the standard C way of using "universal-character-names"!

Your IDE already defaults to UTF-8, so it would be nice to add a level of sophistication and accessibility for non-ASCII coders.

Thank you Pelle.

Title: Re: UTF-8 Identifiers
Post by: Pelle on February 24, 2021, 05:24:19 PM

Good news: it wasn't too hard to add a new compiler option (/utf-8) that switches from the default ANSI code page to UTF-8 (both for the runtime, and for source files without a BOM). It will be in the next version.
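
The rule described above (a BOM decides on its own, otherwise the option chooses between the ANSI code page and UTF-8) can be sketched as follows. This is only an illustration of the described behaviour, not how pocc is implemented; `has_utf8_bom`, `choose_source_encoding` and the `SRC_*` names are assumptions made up for the example. The only hard fact used is that the UTF-8 byte order mark is the byte sequence EF BB BF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for how a source file would be decoded. */
enum source_encoding { SRC_ANSI, SRC_UTF8 };

/* True if the buffer starts with the UTF-8 byte order mark EF BB BF. */
bool has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}

/*
 * Assumed decision rule: a UTF-8 BOM always selects UTF-8; without a BOM,
 * the (hypothetical) utf8_option flag decides between UTF-8 and the
 * default ANSI code page.
 */
enum source_encoding choose_source_encoding(const unsigned char *buf,
                                            size_t len, bool utf8_option)
{
    if (has_utf8_bom(buf, len))
        return SRC_UTF8;
    return utf8_option ? SRC_UTF8 : SRC_ANSI;
}
```

With this rule a BOM-less file containing Greek identifiers would only be read as UTF-8 when the option is set, which matches the reason the flag was requested in this thread.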

Title: Re: UTF-8 Identifiers
Post by: Robert on February 24, 2021, 10:08:13 PM

شكرا لك
આભાર
Баярлалаа
Cảm ơn bạn
謝謝
Thank you

Title: Re: UTF-8 Identifiers
Post by: Robert on July 24, 2021, 05:12:09 AM

Quote from: Robert on February 24, 2021, 10:08:13 PM
Quote from: Pelle on February 24, 2021, 05:24:19 PM
Good news: it wasn't too hard adding a new compiler option (/utf-8) that switches from the default ANSI code page (both for runtime, and source files without a BOM). Will be in the next version.

شكرا لك
આભાર
Баярлалаа
Cảm ơn bạn
謝謝
Thank you

+1

Use the /utf-8 flag on the pocc 11.0 compiler command line.

Code:

#include <windows.h>
#include <stdio.h>    // ISO StdLib
#include <stdlib.h>   // ISO StdLib
#include <conio.h>    // Πρωτόγονη είσοδος / έξοδος (primitive input/output)

// *************************************************
//   Καθολικές μεταβλητές χρηστών (user global variables)
// *************************************************

static int αρχικός_Κώδικας_σελίδα;

// *************************************************
//   Πρωτότυπα χρήστη (user prototypes)
// *************************************************

char* Όνομα_μήνα (int);
char* εργάσιμες (int);

// *************************************************
//   Διαδικασίες χρήστη (user procedures)
// *************************************************

char * Όνομα_μήνα (int μετρητής)
{
  static char* στοιχεία[]=
  {

  // The Antikythera mechanism, the oldest example of an analogue computer, has the
  //  following 12 month names of the Corinthian calendar inscribed on the Metonic dial.
  //  https://en.wikipedia.org/wiki/Antikythera_mechanism

 "ΦΟΙΝΙΚΑΙΟΣ","ΚΡΑΝΕΙΟΣ","ΛΑΝΟΤΡΟΠΙΟΣ","ΜΑΧΑΝΕΥΣ",
 "ΔΩΔΕΚΑΤΕΥΣ","ΕΥΚΛΕΙΟΣ","ΑΡΤΕΜΙΣΙΟΣ","ΨΥΔΡΕΥΣ",
 "ΓΑΜΕΙΛΙΟΣ","ΑΓΡΙΑΝΙΟΣ","ΠΑΝΑΜΟΣ","ΑΠΕΛΛΑΙΟΣ"
  };
  if(μετρητής<1||μετρητής>12 )
  {
   return 0;
  }
 return στοιχεία[μετρητής-1];
}

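char* εργάσιμες (int μετρητής)
{
  static char* στοιχεία[]=
  {
 "Κυριακή","Δευτέρα","Τρίτη","Τετάρτη",
 "Πέμπτη","Παρασκευή","Σάββατο"
  };
  // Valid days are 1..7; reject anything else, as Όνομα_μήνα does for months
  if(μετρητής<1||μετρητής>7 )
  {
   return 0;
  }
 return στοιχεία[μετρητής-1];
}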

// *************************************************
//   Κύριο πρόγραμμα (main program)
// *************************************************

int main(int argc, char *argv[])
{
  int ιώτα;

  // Remember the console code page, then switch the console to UTF-8 (65001)
  αρχικός_Κώδικας_σελίδα=GetConsoleOutputCP();
  SetConsoleOutputCP(65001);
  printf("%s\n","Εδώ είναι τα ονόματα του μήνα:");
  printf("\n");
  for(ιώτα=1; ιώτα<=12; ιώτα+=1)
    {
      printf("%s\n",Όνομα_μήνα(ιώτα));
    }
  printf("\n");
  printf("%s\n","Εδώ είναι τα ονόματα των ημερών της εβδομάδας:");
  printf("\n");
  for(ιώτα=1; ιώτα<=7; ιώτα+=1)
    {
      printf("%s\n",εργάσιμες(ιώτα));
    }
  printf("\n%s\n","Πατήστε οποιοδήποτε κουμπί για να συνεχίσετε . . .");
  _getch();
  // Restore the original code page only after the last UTF-8 output
  SetConsoleOutputCP(αρχικός_Κώδικας_σελίδα);

  return EXIT_SUCCESS;   // Τέλος του κύριου προγράμματος (end of main program)
}

Result:

Code:

Εδώ είναι τα ονόματα του μήνα:

ΦΟΙΝΙΚΑΙΟΣ
ΚΡΑΝΕΙΟΣ
ΛΑΝΟΤΡΟΠΙΟΣ
ΜΑΧΑΝΕΥΣ
ΔΩΔΕΚΑΤΕΥΣ
ΕΥΚΛΕΙΟΣ
ΑΡΤΕΜΙΣΙΟΣ
ΨΥΔΡΕΥΣ
ΓΑΜΕΙΛΙΟΣ
ΑΓΡΙΑΝΙΟΣ
ΠΑΝΑΜΟΣ
ΑΠΕΛΛΑΙΟΣ

Εδώ είναι τα ονόματα των ημερών της εβδομάδας:

Κυριακή
Δευτέρα
Τρίτη
Τετάρτη
Πέμπτη
Παρασκευή
Σάββατο

Πατήστε οποιοδήποτε κουμπί για να συνεχίσετε . . .

Pelles C forum

Pelles C => Feature requests => Topic started by: Robert on February 22, 2021, 12:17:02 AM