Export C source as HTML or PDF file

Robert · January 18, 2026, 12:17:30 AM

Quote from: TimoVJL on January 17, 2026, 01:21:16 PM
A small stupid C to RTF project to RichEdit.

It can help to debug some code.

Hi Timo:
The RTF output from this code is different from the RTF output of John Z's IDE addin.
I'm going to have a look at the output in an ImHex editor and see what's going on.

You know ImHex ?
https://imhex.werwolv.net/
https://github.com/WerWolv/ImHex

TimoVJL · January 18, 2026, 05:00:36 AM

Quote from: Robert on January 18, 2026, 12:17:30 AM
The RTF output from this code is different from the RTF output of John Z's IDE addin.

It might just have been created for a different purpose, like writing to a RichEdit control.
Also, it was only for ANSI source.

Code:

int printf(const char * restrict format, ...);

// https://stackoverflow.com/questions/5603559/one-file-lib-to-conv-utf8-char-to-wchar-t
short utf8_to_wchar(char **utf8)
{
    short sz = 0;
    short c;
    char *p = *(char **)utf8;
    char v = (*p);
    if ((v & 0x80) == 0)    // single-byte (ASCII) character
    {
        (*utf8)++;
        return v;
    }
    int shiftCount = 0;
    if ((v & 0xE0) == 0xC0)
    {
        shiftCount = 1;
        c = v & 0x1F;
    }
    else if ((v & 0xF0) == 0xE0)
    {
        shiftCount = 2;
        c = v & 0xF;
    }
    else
        return 0;
    ++p; (*utf8)++;
    while (shiftCount)
    {
        v = *p;
        ++p; (*utf8)++;
        if ((v & 0xC0) != 0x80)
            return 0;
        c <<= 6;
        c |= (v & 0x3F);
        --shiftCount;
    }
    sz += c;
    return sz;
}

int ShortToStrPos(int n, char *s)
{
    int i, sign, idx, nl, len;

    idx = 0;
/*    if ((sign = n) < 0) {    // record sign
        n = -n;    // make n positive
        idx++;
    }*/
    i = 0;
    nl = n;
    while ((nl /= 10) > 0)    /* count nums */
        idx++;
    len = idx+1;
    s[idx+1] = '\0';
    do {    /* generate digits in reverse order */
        s[idx--] = n % 10 + '0';    /* get next digit */
    } while ((n /= 10) > 0);    /* delete it */
//    if (sign < 0)
//        s[0] = '-';
    return len;
}

int __cdecl main(void)
{
    char utf8[] = u8"σκατ";
    char *p = utf8;
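    while (*p) {
        if (*(unsigned char*)p > 127) {    // UTF8 ?
            char *start = p;
            short uc = utf8_to_wchar(&p);
            if (p == start)
                ++p;    // unsupported lead byte: skip it so the loop advances
            printf("%Xh\t", (unsigned)(unsigned short)uc);
        } else {
            ++p;    // ASCII byte: nothing to decode
        }
    }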
    printf("\n%p\n%p\n", utf8, p);
    return 0;
}

EDIT 2025-01-19: Unicode version in RE_Test3, and Esc closes the window, but there are still bugs

Vortex · January 18, 2026, 10:11:36 AM

Hi Timo,

My apologies, it was my mistake. Your application works fine and I removed my previous message #41859

Robert · January 21, 2026, 03:56:32 AM

Quote from: TimoVJL on January 18, 2026, 05:00:36 AM
Quote from: Robert on January 18, 2026, 12:17:30 AM
The RTF output from this code is different from the RTF output of John Z's IDE addin.
It might just have been created for a different purpose, like writing to a RichEdit control.
Also, it was only for ANSI source.

EDIT 2025-01-19: Unicode version in RE_Test3, and Esc closes the window, but there are still bugs

Hei TimoVJL:

The code snippet above is interesting. Thanks.

What bugs ? I don't see bugs in RE_Test3 output.

Robert · January 21, 2026, 04:37:53 AM

Quote from: TimoVJL on January 18, 2026, 05:00:36 AM
Code:
.... if (*(unsigned char*)p > 127) { // UTF8 ? ....

Hei TimoVJL:

"Nearly all invalid UTF-8 cases can be detected by looking at the first two bytes of a character (in fact, the first 12 bits)."

Quoted from:
'Validating UTF-8 In Less Than One Instruction Per Byte'
available at
https://arxiv.org/pdf/2010.03090.pdf

See also:
Ridiculously fast unicode (UTF-8) validation
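
For reference, a plain scalar check along these lines could look like the sketch below. It is not the SIMD method from the paper, only a byte-by-byte validator; the function name utf8_is_valid is made up for this example. It does show why the first two bytes carry most of the information: the lead byte fixes the sequence length, and the allowed range of the second byte catches overlong forms, surrogates and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..len) is well-formed UTF-8.
   Hypothetical helper for illustration, not part of the addin. */
bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b0 = buf[i];
        size_t need;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */

        if (b0 < 0x80) {                    /* ASCII */
            i++;
            continue;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;      /* reject overlong 3-byte forms */
            if (b0 == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;      /* reject overlong 4-byte forms */
            if (b0 == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
        } else {
            return false;                   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
        }

        if (len - i <= need)                /* sequence is truncated */
            return false;
        if (buf[i + 1] < lo || buf[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

Calling utf8_is_valid on the eight bytes of σκατ returns true, while the overlong pair C0 AF is rejected on its lead byte alone.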

Thanks again for the code.

Nothing is impossible.

TimoVJL · January 21, 2026, 07:14:06 AM

Quote:
RTF SYNTAX

An RTF file consists of unformatted text, control words, control symbols, and groups. For ease of transport, a standard RTF file can consist of only 7-bit ASCII characters. (Converters that communicate with Microsoft Word for Windows or Microsoft Word for the Macintosh should expect 8-bit characters.) There is no set maximum line length for an RTF file.

RTF uses ASCII chars 32 to 127 and some Latin-1 (ISO/IEC 8859-1) chars without encoding.

So I was just lazy about checking chars, like many others.
UTF-8 with a BOM can be processed conditionally.
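
One way to stay inside the 7-bit rule quoted above is to write every non-ASCII character as an RTF \uN escape. The sketch below assumes the input is UTF-16 code units and that the RTF header sets \uc1, so each escape is followed by one fallback character (here '?'). The function name rtf_escape_unit is invented for this example, and the handling of tabs and newlines is only one possible choice.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write the RTF form of one UTF-16 code unit into out (capacity cap).
   Returns the number of chars written (excluding the NUL), or 0 if
   out is too small. Hypothetical helper, assumes \uc1 in the header. */
size_t rtf_escape_unit(uint16_t u, char *out, size_t cap)
{
    int n;

    if (u == '\\' || u == '{' || u == '}') {
        n = snprintf(out, cap, "\\%c", (char)u);    /* RTF specials */
    } else if (u == '\n') {
        n = snprintf(out, cap, "\\par\n");          /* end of paragraph */
    } else if (u == '\t') {
        n = snprintf(out, cap, "\\tab ");
    } else if (u >= 0x20 && u < 0x7F) {
        n = snprintf(out, cap, "%c", (char)u);      /* printable ASCII as-is */
    } else {
        /* \uN takes a signed 16-bit decimal value, so units from
           0x8000 upwards are written as negative numbers. */
        int v = (u >= 0x8000) ? (int)u - 0x10000 : (int)u;
        n = snprintf(out, cap, "\\u%d?", v);
    }
    return (n > 0 && (size_t)n < cap) ? (size_t)n : 0;
}
```

With this, σ (U+03C3) comes out as \u963? and needs no matching charset in the font table, although the font still has to contain the glyph.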

Robert · January 21, 2026, 10:48:55 AM

Hi TimoVJL and John Z:

RTF SYNTAX.
Oh that !
Yeah, well, I think I'm beginning to remember why I'm here.

Export C source etc.

Anyway, you solved what I had considered the hard part, that is, dealing with the UTF-16LE text which is what the export addin function AddIn_GetSourceTextW has to process.

However, as John Z mentioned, and as you yelled "RTF SYNTAX", I had to look and see what was expected from

static const char* σκατ;

and saw that it was an RTF encoding of

\par }{\rtlch\fcs1 \af67 \ltrch\fcs0 \f67\insrsid15157589\charrsid15157589 static const char* \'f3\'ea\'e1\'f4;

Hmmmm $:-\$

TimoVJL · January 22, 2026, 12:42:11 PM

Better to show important things too:

Code:

{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier New;}{\f1\fnil\fcharset161{\*\fname Courier New;}Courier New Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\lang1035\f0\fs22 #include <windows.h>\par
#include <stdio.h>\par
\par
static int OrigCodePage;\par
static const char* \f1\'f3\'ea\'e1\'f4;\par
static const char* \'e4\'f5\'f3\'ea\'e1\'f4\'e1\'ed\'ef\'de\'f4\'f9\'ed;\par

Streamed parsing doesn't work, because the RTF header has to be handled separately while processing.

Robert · January 22, 2026, 08:38:24 PM

The streamed parsing is a problem because the AddIn_GetSourceText function extracts UTF-16LE with embedded nulls. The code extracted by AddIn_GetSourceText should be converted to UTF-8, removing the embedded nulls, so that it can be processed with standard, non-wide, C functions.
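
A portable sketch of that conversion is below. It assumes the text arrives as an array of UTF-16 code units (on Windows, WideCharToMultiByte with CP_UTF8 does the same job); the function name utf16_to_utf8 and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n UTF-16 code units to UTF-8.
   Returns the number of bytes written (no NUL is added), or (size_t)-1
   if out is too small. Unpaired surrogates become U+FFFD.
   Hypothetical helper for illustration. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out, size_t cap)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* combine a high/low surrogate pair into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                    /* lone surrogate */
        }

        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (cap - o < len)
            return (size_t)-1;

        if (len == 1) {
            out[o++] = (unsigned char)cp;
        } else if (len == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

The four code units of σκατ (03C3 03BA 03B1 03C4) become the eight bytes CF 83 CE BA CE B1 CF 84, with no embedded nulls.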

The RTF encoding of UTF-8 is beyond my understanding; for example, the encoding of the eight-byte UTF-8 string

σκατ;

into the expected RTF representation

\'f3\'ea\'e1\'f4;

TimoVJL · January 22, 2026, 11:03:02 PM

Those are connected.

Code:

{\f1\fnil\fcharset161{\*\fname Courier New;}Courier New Greek;}
\f1\'f3\'ea\'e1\'f4

With Unicode UTF-16LE there is a bit less conversion, but you still have to find the right font charset for the chars.

https://www.oreilly.com/library/view/rtf-pocket-guide/9781449302047/ch04.html
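
To make the connection concrete: \fcharset161 selects the Greek charset (Windows code page 1253), and each \'hh is the byte that code page assigns to the letter. A minimal sketch, covering only the Greek letters U+0391 to U+03C9 where the mapping is a fixed offset, could look like this. The function names are made up for the example, and a real exporter would need full tables for every charset it supports.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Map a Unicode code point to its Windows-1253 byte.
   Only U+0391..U+03C9 is handled; U+03A2 is unassigned.
   Returns 0 when the code point is not covered by this sketch. */
unsigned char greek_to_cp1253(uint32_t cp)
{
    if (cp >= 0x0391 && cp <= 0x03C9 && cp != 0x03A2)
        return (unsigned char)(cp - 0x0391 + 0xC1);
    return 0;
}

/* Write the RTF \'hh escape for a Greek letter into out (capacity cap).
   Returns the number of chars written, or 0 if the letter is not
   covered or out is too small. The text must be in a font whose
   \fcharset is 161 for the escape to show the right glyph. */
size_t rtf_hex_escape(uint32_t cp, char *out, size_t cap)
{
    unsigned char b = greek_to_cp1253(cp);
    int n;

    if (b == 0)
        return 0;
    n = snprintf(out, cap, "\\'%02x", (unsigned)b);
    return (n > 0 && (size_t)n < cap) ? (size_t)n : 0;
}
```

Feeding σ, κ, α, τ (U+03C3, U+03BA, U+03B1, U+03C4) through rtf_hex_escape gives \'f3 \'ea \'e1 \'f4, the same bytes as in the Msftedit output.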

I have little interest in that.

Robert · January 23, 2026, 03:25:53 AM

Ah yes, Code Pages and code page fonts.

Thanks Timo.

John Z · January 23, 2026, 12:59:59 PM

Assuming the code comments are in the user's default code page language, then

Code:

#include <windows.h>
#include <stdio.h>

int main() {
    UINT user_codepage = GetACP(); // Retrieve the system default Windows ANSI code page

    printf("The user's default Windows ANSI code page is: %u\n", user_codepage);

    // Optional: Keep the console window open to view the output
    printf("Press Enter to exit...");
    getchar();

    return 0;
}

or a variation thereof can get the correct code page to encode in the output file(s).
Code snippet provided by Google 'AI' overview; however, only the first line is relevant.

John Z

TimoVJL · January 23, 2026, 05:43:50 PM

How does that help RTF coding?

EDIT:

Quote from: Robert on January 23, 2026, 06:03:43 PM
Unfortunately, the RTFDEFS.H document referenced is not obviously available.

How to Obtain the WinWord Converter SDK (GC1039)

HTML
https://unicodelookup.com/

Robert · January 23, 2026, 06:03:43 PM

Hi John Z:

My inaccurate "Code Pages" statement should have stated

"Ah yes, charsets and charset fonts."

There is some information in

https://www.biblioscape.com/rtf15_spec.htm

where it is written

Quote:
\fcharsetN Specifies the character set of a font in the font table. Values for N are defined by Windows header files, and in the file RTFDEFS.H accompanying this document.

Unfortunately, the RTFDEFS.H document referenced is not obviously available.

There is a webpage at

https://www.n2pdf.de/fileadmin/user_upload/n2pdf/files/en/help/client_enu/unicode.htm

that has a table of codepage - charset equivalents.

My interest in your resurrection of the Pelles C Export addin is in the "Export to HTML" facility. I think it can handle Unicode identifiers and Unicode strings embedded in quotation marks. RTF ?? I really doubt it. PDF ?? Definitely beyond my pay grade.

If you are interested in developing a Unicode capable "Export to HTML" facility, you might find some help studying the BCX translated C codes of the example on the webpage

https://bcxbasiccoders.com/webhelp/html/bcxunicode.htm#widetoansi
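
For the HTML side, one simple approach (an assumption for this example, not how the existing addin works) is to convert the source to UTF-8, declare charset utf-8 in the page, and escape only the three characters HTML treats as markup. Unicode identifiers and string contents then pass through untouched. The function name html_escape is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copy NUL-terminated UTF-8 text from src to out, replacing &, < and >
   with HTML entities. All other bytes, including multi-byte UTF-8
   sequences, are copied unchanged.
   Returns the length of the result, or (size_t)-1 if out is too small.
   Hypothetical helper; the page is assumed to declare charset utf-8. */
size_t html_escape(const char *src, char *out, size_t cap)
{
    size_t o = 0;

    if (cap == 0)
        return (size_t)-1;

    for (; *src != '\0'; src++) {
        const char *rep = NULL;

        switch (*src) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  break;
        }

        if (rep != NULL) {
            size_t n = strlen(rep);
            if (cap - o <= n)               /* keep room for the NUL */
                return (size_t)-1;
            memcpy(out + o, rep, n);
            o += n;
        } else {
            if (cap - o <= 1)
                return (size_t)-1;
            out[o++] = *src;
        }
    }
    out[o] = '\0';
    return o;
}
```

The escaped text would then go inside a pre element in a page whose head contains a meta charset="utf-8" declaration.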

Robert · January 23, 2026, 10:54:58 PM

Quote from: TimoVJL on January 23, 2026, 05:43:50 PM
How does that help RTF coding?

EDIT:
Quote from: Robert on January 23, 2026, 06:03:43 PM
Unfortunately, the RTFDEFS.H document referenced is not obviously available.
How to Obtain the WinWord Converter SDK (GC1039)

Thanks TimoVJL, the rtfdefs.h file is in the download, and the charset defines are:

Code:


// \fcharset, \cchs argument values
// some of these values may also be #defined in windows.h; here's the
// complete list
#define ANSI_CHARSET                  0
#define DEFAULT_CHARSET               1
#define SYMBOL_CHARSET                2
#define INVALID_CHARSET               3 // nil value
#define MAC_CHARSET                  77
#define SHIFTJIS_CHARSET            128 // CP 932: Japanese
#define HANGEUL_CHARSET             129 // CP 949: Korean
#define JOHAB_CHARSET               130
#define GB2312_CHARSET              134 // CP 936: PRC             
#define CHINESEBIG5_CHARSET         136 // CP 950: Taiwan
#define GREEK_CHARSET               161
#define TURKISH_CHARSET             162
#define HEBREW_CHARSET              177
#define ARABIC_CHARSET              178
#define ARABICTRADITIONAL_CHARSET   179
#define ARABICUSER_CHARSET          180
#define HEBREWUSER_CHARSET          181
#define BALTIC_CHARSET              186
#define RUSSIAN_CHARSET             204
#define THAI_CHARSET                222
#define EASTEUROPE_CHARSET          238
#define PC437_CHARSET               254
#define OEM_CHARSET                 255
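
The charset values above can be tied to the Windows ANSI code pages with a small lookup table, sketched below. The pairs are the usual Windows associations (the list above only spells out four of them in its comments), and the function name charset_from_codepage is made up for this example. On Windows, TranslateCharsetInfo with TCI_SRCCODEPAGE should give the same answer without a hand-written table.

```c
#include <stddef.h>

/* One code page / RTF \fcharset pair. */
struct cp_charset {
    unsigned codepage;
    int      charset;
};

/* Usual Windows code page to charset associations. */
static const struct cp_charset cp_charset_table[] = {
    { 1252,   0 },  /* ANSI_CHARSET (Western European) */
    {  932, 128 },  /* SHIFTJIS_CHARSET */
    {  949, 129 },  /* HANGEUL_CHARSET */
    { 1361, 130 },  /* JOHAB_CHARSET */
    {  936, 134 },  /* GB2312_CHARSET */
    {  950, 136 },  /* CHINESEBIG5_CHARSET */
    { 1253, 161 },  /* GREEK_CHARSET */
    { 1254, 162 },  /* TURKISH_CHARSET */
    { 1255, 177 },  /* HEBREW_CHARSET */
    { 1256, 178 },  /* ARABIC_CHARSET */
    { 1257, 186 },  /* BALTIC_CHARSET */
    { 1251, 204 },  /* RUSSIAN_CHARSET */
    {  874, 222 },  /* THAI_CHARSET */
    { 1250, 238 },  /* EASTEUROPE_CHARSET */
};

/* Return the \fcharset value for a Windows code page, or
   1 (DEFAULT_CHARSET) when the code page is not in the table. */
int charset_from_codepage(unsigned codepage)
{
    size_t count = sizeof cp_charset_table / sizeof cp_charset_table[0];

    for (size_t i = 0; i < count; i++)
        if (cp_charset_table[i].codepage == codepage)
            return cp_charset_table[i].charset;
    return 1;
}
```

Passing the result of GetACP() to charset_from_codepage would give the \fcharset value to put in the font table for text in the user's default code page.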

Correlation of Unicode chars to the above RTF charset data may be possible using the International Components for Unicode (ICU) library functions to process locale data contained in the Unicode Common Locale Data Repository (CLDR).

https://github.com/unicode-org/icu
https://github.com/unicode-org/cldr

There are several RTF-charset to UTF-8 converters but what is needed here, for the Export to RTF addin, is a Unicode to RTF-charset converter.

Obviously, from the above list of charsets, the conversions from Unicode would be limited. For example, C coders working with the Native American Osage language script or the International Phonetic Alphabet (Hello, anyone out there ?) would be excluded.
