
Topic: Allocation error loading UTF-8-encoded text to array of wide character strings

Darius Runge

Hello,

I am trying to load a text file of UTF-8-encoded strings separated by newline characters into a dynamically allocated array of wide character strings. This works until I read a string containing a non-English character (the lowercase ü as it exists in the German language). I have tried different locale settings, tried providing more memory for the wide characters than should be required, and also checked whether wcslen might be the wrong function.

Perhaps someone can tell me the reason my program is not working as intended. The main (and only) file as well as 'stop.txt' are attached. (Don't worry about the naming of "stop", it's just that they are stop words - words that are to be ignored - in a parser I am working on).

I appreciate any help and hope it is not something obvious I simply missed.

Greetings,
Darius

frankie (Global Moderator)

Happy new year.
It's too late for me to check your code in depth, but as a first point, the following is incorrect:
Code:
stop_words = calloc(1, sizeof(wchar_t));
It should be:
Code:
stop_words = calloc(1, sizeof(wchar_t *));
And also:
Code:
stop_words = realloc(stop_words, i);
Should be:
Code:
stop_words = realloc(stop_words, i * sizeof(wchar_t *));
The debugger is your friend, use it.
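
Putting both corrections together, a minimal sketch of growing the array one entry at a time might look like this. The names `append_word` and `free_words` are hypothetical and not taken from the attached program; the temporary `grown` pointer is an addition, so that the old block is not lost if realloc fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: append a copy of `word` to a growing array of
   wide strings. `*list` may be NULL while `*count` is 0.
   Returns 0 on success, -1 on allocation failure. */
int append_word(wchar_t ***list, size_t *count, const wchar_t *word)
{
    assert(list != NULL && count != NULL && word != NULL);

    /* The array holds pointers, so each slot is sizeof(wchar_t *). */
    wchar_t **grown = realloc(*list, (*count + 1) * sizeof(wchar_t *));
    if (grown == NULL)
        return -1;              /* *list is untouched and still valid */
    *list = grown;

    /* The string itself holds characters, so each one is sizeof(wchar_t);
       the +1 makes room for the terminating L'\0'. */
    size_t len = wcslen(word);
    wchar_t *copy = malloc((len + 1) * sizeof(wchar_t));
    if (copy == NULL)
        return -1;              /* array is one slot larger, but consistent */
    wmemcpy(copy, word, len + 1);

    grown[*count] = copy;
    *count += 1;
    return 0;
}

/* Release every string and then the pointer array itself. */
void free_words(wchar_t **list, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(list[i]);
    free(list);
}
```

Starting from `wchar_t **words = NULL; size_t n = 0;`, each call to `append_word(&words, &n, some_word)` adds one entry, and `free_words(words, n)` cleans up at the end.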
"It is better to be hated for what you are than to be loved for what you are not." - Andre Gide

Darius Runge

Thank you for the quick reply! I really missed those mistakes and will take a closer look at how to use the Pelles C debugger.

John Z

Hi Darius,

You might also consider that pointer names in particular can benefit from some type of identifier, like a p_ prefix.
So your variable named 'stop_words', which, as frankie points out, is being assigned a pointer, could be named
'p_stop_words'.  Then, when reading your code later, pointers are easily identified. Just a thought....

John Z

Update: I looked over your sample code.  If you have a lot of stop words to import, you'll be beating the memory management system up ;)  You might consider allocating the p_stop_words pointer space in blocks of 10 or more, then counting down the remaining free slots and only calling realloc when one spot is left.  The code will be faster overall, with less memory management overhead.
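
A rough sketch of that block-wise growth, under some assumptions: the `word_list` struct, the function names and the `WORD_BLOCK` constant are all hypothetical, and this version grows when the current block is completely full rather than when one spot is left.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define WORD_BLOCK 10   /* slots added per reallocation; the value is arbitrary */

/* Hypothetical container: the pointer array plus its bookkeeping. */
typedef struct {
    wchar_t **items;    /* array of pointers to wide strings */
    size_t    used;     /* slots currently filled */
    size_t    capacity; /* slots currently allocated */
} word_list;

/* Make sure at least one free slot exists.
   Returns 0 on success, -1 if realloc fails (list left unchanged). */
int word_list_reserve_one(word_list *wl)
{
    assert(wl != NULL);
    if (wl->used < wl->capacity)
        return 0;                       /* still room in the current block */

    size_t new_cap = wl->capacity + WORD_BLOCK;
    wchar_t **grown = realloc(wl->items, new_cap * sizeof(wchar_t *));
    if (grown == NULL)
        return -1;
    wl->items = grown;
    wl->capacity = new_cap;
    return 0;
}

/* Store the pointer itself (no copy is made).
   Returns 0 on success, -1 on allocation failure. */
int word_list_push(word_list *wl, wchar_t *word)
{
    if (word_list_reserve_one(wl) != 0)
        return -1;
    wl->items[wl->used++] = word;
    return 0;
}
```

With a zero-initialized `word_list`, realloc is called once per `WORD_BLOCK` pushes instead of once per word.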
« Last Edit: January 02, 2023, 12:27:40 PM by John Z »

Darius Runge

Hi John,

I appreciate your suggestions on my code! The 'p_' is a good idea and I will do so in the future. However, I will probably write it as a suffix (stop_words_p) to match the convention of the C standard library which has types like div_t, time_t or the _s suffix for structures which I have often seen in other code as well. I understand this to be a matter of mere preference (or is it supposed to be possible to have a prefix and suffix like p_div_t?)

I was worried myself that I might beat up the memory management system, but I understood that realloc is supposed to be implemented smartly enough to actually provide more space than asked for, to avoid relocating the block on every single call. Do you know if that's true?
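
The C standard does not promise that realloc over-provisions, so the behaviour depends on the allocator in use. One way to get a feel for it on a given platform is to count how often the block actually moves. The sketch below is only a measurement aid; `count_realloc_moves` is a hypothetical name, and it assumes `uintptr_t` is available.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Grow a buffer one int at a time up to `n` ints and count how many
   realloc calls returned a different address than before.
   Returns that count, or (size_t)-1 if an allocation fails. */
size_t count_realloc_moves(size_t n)
{
    assert(n > 0);

    int *buf = malloc(sizeof *buf);
    if (buf == NULL)
        return (size_t)-1;

    size_t moves = 0;
    for (size_t i = 2; i <= n; i++) {
        /* Save the address as an integer first: once realloc has moved
           the block, the old pointer value must not be used any more. */
        uintptr_t before = (uintptr_t)buf;

        int *grown = realloc(buf, i * sizeof *grown);
        if (grown == NULL) {
            free(buf);
            return (size_t)-1;
        }
        if ((uintptr_t)grown != before)
            moves++;
        buf = grown;
    }

    free(buf);
    return moves;
}
```

The result is implementation-specific: a count far below `n - 1` suggests the allocator often extends blocks in place, but even then every call still carries bookkeeping cost, which is what growing in blocks avoids.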

To everyone: The original issue is fixed, and I will soon share the working code in case someone encounters the same issue. (Technically I have only checked it on Linux/GCC so far; I will check it in Pelles C/Win11 as well before posting here, to be sure.)
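
For readers with the same problem, here is a minimal sketch of one way to size and fill a wide string from a multibyte (e.g. UTF-8) string. It is not Darius's code, which is not shown in this thread; `mb_to_wide_dup` is a hypothetical name, and the conversion only treats the input as UTF-8 if the program has already selected a UTF-8 locale with setlocale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (interpreted according
   to the current locale) into a newly allocated wide string.
   Returns NULL on an invalid byte sequence or allocation failure.
   The caller frees the result. */
wchar_t *mb_to_wide_dup(const char *src)
{
    assert(src != NULL);

    mbstate_t state;
    const char *p = src;

    /* First pass: with a NULL destination, mbsrtowcs only counts the
       wide characters needed (not including the terminator). */
    memset(&state, 0, sizeof state);
    size_t needed = mbsrtowcs(NULL, &p, 0, &state);
    if (needed == (size_t)-1)
        return NULL;            /* bytes are not valid in this locale */

    /* Characters are sized with sizeof(wchar_t), plus one terminator. */
    wchar_t *dst = malloc((needed + 1) * sizeof(wchar_t));
    if (dst == NULL)
        return NULL;

    /* Second pass: do the actual conversion, terminator included. */
    memset(&state, 0, sizeof state);
    p = src;
    mbsrtowcs(dst, &p, needed + 1, &state);
    return dst;
}
```

A two-byte UTF-8 sequence such as ü becomes a single wide character, so the wide length can be smaller than the byte length returned by strlen. Which locale names select UTF-8 is platform-specific; calling `setlocale(LC_ALL, "")` picks up the user's environment.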

Darius

frankie (Global Moderator)

Maybe you'll better understand how Hungarian notation should be used by reading here. Microsoft uses it extensively.
The postfix format is generally used for user-defined data types, as in:
Code:
typedef unsigned int count_t;
count_t  my_count;
For more efficient memory usage, as John suggested, you should allocate a block of pointers at a time (e.g. 100 pointers). When finished, you can use realloc to shrink the allocated structure/array, freeing the excess memory.
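
The final shrink step could be sketched as follows. `shrink_to_fit` is a hypothetical helper, and keeping one slot for an empty list is an assumption made here to avoid calling realloc with a size of zero, whose behaviour is not portable.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: hand back the unused tail of an over-allocated
   pointer array. `used` is the number of slots actually filled.
   Returns the (possibly moved) array. If realloc fails, the original
   block is returned; it is still valid, just larger than needed. */
wchar_t **shrink_to_fit(wchar_t **items, size_t used)
{
    assert(items != NULL);

    /* Keep at least one slot so realloc is never asked for 0 bytes. */
    if (used == 0)
        used = 1;

    wchar_t **fitted = realloc(items, used * sizeof(wchar_t *));
    return fitted != NULL ? fitted : items;
}
```

Because the function always returns a usable pointer, the caller can write `items = shrink_to_fit(items, used);` once loading has finished without risking a leak.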
« Last Edit: January 03, 2023, 02:13:14 AM by frankie »