Pelles C forum

Pelles C => Bug reports => Topic started by: John Z on October 11, 2023, 01:54:16 PM

Title: V12 time optimization
Post by: John Z on October 11, 2023, 01:54:16 PM: To start - I've used Maximize Speed plus checked Extra Optimizations for what follows since about version 9. Unfortunately starting when I finally switched to version 12 a problem has occurred which I failed to catch until this week. My bad for not testing each and every feature of my program after upgrading to version 12. I use third party mini zip sources to complete the assembly of an ODS format spreadsheet. Starting in version 12 it was created but was always corrupt and UN-openable. Today I isolated the source code where the problem occurs.

This code does not work correctly when optimized for time, but works well when no optimization or only size optimization is used. Here is the code -
Code: [Select]
#pragma optimize(none) // needed for Pelles C v12 optimization bug in time static char *strrpl(const char *str, size_t n, char oldchar, char newchar) { char c; size_t i; char *rpl = (char *)calloc((1 + n), sizeof(char)); char *begin = rpl; if (!rpl) { return NULL; } for (i = 0; (i < n) && (c = *str++); ++i) { if (c == oldchar) { c = newchar; } *rpl++ = c; } Write_To_Log("c:/temp/vcardz.log", (char *)begin, 0, 0); // added by John Z for testing return begin; } #pragma optimize(time) // back on for remaining source
When optimized for time the begin character pointer somehow ends up pointing to the string start plus 1 char - so it ends up pointing to the 2nd character in the string as the start of the string. ONLY happens when using the time optimization.
The attached log file shows the output with various optimizations.

So bracketing this one procedure with #pragma fixes my issue.
Maybe information can help find a weak optimization point that can be improved in future versions.

John Z
Title: Re: V12 time optimization
Post by: frankie on October 13, 2023, 07:56:49 PM: John I can't reproduce the problem.
Anyway some points:
Code: [Select]
char *rpl = (char *)calloc((1 + n), sizeof(char));The cast to character pointer returned from calloc call is not required, in some cases can cause problems, moreover sizeof(char) is always 1, because by language standard it is always the smallest allocation unit for a system.

But what sound strange is the cast to character pointer of the variable begin inside the call to Write_To_Log function:
Code: [Select]
Write_To_Log("c:/temp/vcardz.log", (char *)begin, 0, 0);Is it a typo? What is exactly the definition of the function Write_To_Log?
Which compilation is it? 32 or 64 bits? The bug happens for both bitness?
The character pointer begin gets wrong inside the function strrpl or only inside the function Write_To_Log?
Title: Re: V12 time optimization
Post by: John Z on October 13, 2023, 10:24:33 PM: Thanks frankie,

Wow - it is 100% repeatable in my program/setup. I'm going to follow Pelle's inputs from before and move this single procedure into its own file. The code I've shown is preceded with about 950 lines of code, and this is now the last procedure in the file. I relocated from the middle it to the bottom once I isolated it with #pragma statements that showed where the corruption was occurring. I found this was the problem proc basically using a search inserting #pragma optimize(none) and #pragma optimize(time) at locations until I narrowed it down to one procedure. Yes - it took a while but each test you half the search space :)

I added my Write_to_Log call after I found where the problem was. The log file was easier to show the output than using the ODS zip file which has the 1st character missing on each of the 4 file names inside of the zip.

Yes the cast on 'begin' is redundant removing it has no effect. The pointer 'begin' becomes wrong before the Write_to_Log.
I removed the cast and re-ran, everything is the same with or without the cast, as expected since I added the proc to log the issue that was already happening. I should have clarified that in the original post - sorry that is a red herring ....

When I move it into it's own file, I'll also be able to generate a complete ppj and example. If that shows the issue I'll post the complete minimized project. If it all of the sudden works ok when broken out I'll be stumped and may have to re-install Pelles C.....

It is only a 32 bit program BTW - way too much to change to make it into 64 bit ;(
The zip procedure where the issue is found is unchanged from GitHub.

it only happens in version 12 not in version 11.

Thanks very much for looking at this and your analysis.

John Z
Title: Re: V12 time optimization
Post by: Marco on October 14, 2023, 11:19:29 AM: Hello John, I tried the function on my system, and I have the same issue (32-bit compilation - Pelles C v12). The 64-bit executable seems to work as expected (i.e. no errors). To fix the issue you need to add the volatile keyword to the 'i' variable.

Code: [Select]
static char *strrpl(const char *str, size_t n, char oldchar, char newchar) { [...] volatile size_t i; // <-- FIX [...] }
Here are the options used during the compilation, if it can help:

Code: [Select]
pocc: -fp:precise -W2 -Gz -Ze -Zx -MT -Tx86-coff polink: -debug:no -subsystem:console -machine:x86
Marco
Title: Re: V12 time optimization
Post by: John Z on October 14, 2023, 11:30:13 AM: SUCCESS! Well sort of.....

Attached here is a simple minimum (mostly) program that demonstrates the issue under V12 on my WIN 11 system.
A simple Hello_world program. When executed click the OK button, first an optimized procedure is tested, then the
same thing is sent to an unoptimized procedure. the exact procedure is duplicated so that one will be optimized and one will not. This was easier than rebuilding for each test.

Clearly shows that the problem exists on my system under V12. It does not manifest when run under V11, both outputs are correct and identical. It also does not manifest when run under V10, all good with or without optimization. Both V11 and V10 were run on the same system as the V12 test.

Also it can easily demonstrate that it is only the Maximize Speed optimization, with and without Extra Optimization.

So with help this could show it is my installation, or not....BTW tested with all add-ins disabled too.

John Z
Title: Re: V12 time optimization
Post by: John Z on October 14, 2023, 11:32:41 AM: Quote from: Marco on October 14, 2023, 11:19:29 AM
Hello John, I tried the function on my system, and I have the same issue (32-bit compilation - Pelles C v12). The 64-bit executable seems to work as expected (i.e. no errors). To fix the issue you need to add the volatile keyword to the 'i' variable.Marco

Great NEWS! I'll give that a try. Easy to test now with the standalone. Appreciate your help!

John Z

Update: Tried Marco's volatile fix and can confirm that he's found the workaround fix for V12! Nice work Marco!
Title: Re: V12 time optimization
Post by: Vortex on October 14, 2023, 11:40:33 AM: Quote
If we do not use volatile qualifier, the following problems may arise:
1) Code may not work as expected when optimization is turned on.

https://www.geeksforgeeks.org/understanding-volatile-qualifier-in-c/
Title: Re: V12 time optimization
Post by: John Z on October 14, 2023, 11:58:06 AM: Quote from: Vortex on October 14, 2023, 11:40:33 AM
Quote
If we do not use volatile qualifier, the following problems may arise:
1) Code may not work as expected when optimization is turned on.

https://www.geeksforgeeks.org/understanding-volatile-qualifier-in-c/

Yes but but how about -
Quote
"Objects declared as volatile are omitted from optimization because their values can be changed by code outside the scope of current code at any time."

In this case i is local in scope and can not be changed outside of the scope ....
Then there is the issue is everything to be volatile? How to determine which local scope variable needs volatile?

John Z
Title: Re: V12 time optimization
Post by: Vortex on October 14, 2023, 12:06:30 PM: Quote
Some compilers allow you to implicitly declare all variables as volatile. Resist this temptation, since it is essentially a substitute for thought. It also leads to potentially less efficient code.

Also, resist the temptation to blame the optimizer or turn it off when you encounter unexpected program behavior. Modern C/C++ optimizers are so good that I cannot remember the last time I came across an optimization bug. In contrast, I regularly come across failures by programmers to use volatile.

https://barrgroup.com/embedded-systems/how-to/c-volatile-keyword

Quote
In C, the volatile keyword is used to indicate to the compiler that a variable's value may change unexpectedly, so it should not rely on the value being cached in a register or optimized away.

https://www.javatpoint.com/volatile-keyword-in-c#:~:text=In%20C%2C%20the%20volatile%20keyword,a%20register%20or%20optimized%20away.
Title: Re: V12 time optimization
Post by: John Z on October 14, 2023, 02:52:59 PM: Quote
A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

1. Memory-mapped peripheral registers

2. Global variables modified by an interrupt service routine

3. Global variables accessed by multiple tasks within a multi-threaded application
same source ->https://barrgroup.com/embedded-systems/how-to/c-volatile-keyword

As is clear from the source code the variable i is neither Global nor a Register variable as written.....

Quote
It is because the value of a volatile variable may be changed by external factors, such as hardware or other threads running concurrently.
Few cases where the usage of 'volatile' is appropriate: Accessing memory-mapped hardware registers, Sharing data between threads, Using non-local jumps
same source ->https://www.javatpoint.com/volatile-keyword-in-c#:~:text=In%20C%2C%20the%20volatile%20keyword,a%20register%20or%20optimized%20away.

None of these appear to apply to variable i ..... and not a threaded application

Anyway I do appreciate you trying to clarify it for me. Suggest we wait for Pelle, or some others to comment ?

John Z
Title: Re: V12 time optimization
Post by: Pelle on October 14, 2023, 06:26:44 PM: This appears to be a bug in the X86 code generator only, the X64 code generator seems fine.

I'm somewhat busy with other things right now, so I'm not going to analyze this until later, but you seem to have enough info to work around the problem for now...
Title: Re: V12 time optimization
Post by: John Z on October 14, 2023, 06:54:36 PM: Quote from: Pelle on October 14, 2023, 06:26:44 PM
you seem to have enough info to work around the problem for now...

Definitely do, just treat as something to put in the hopper for later.

Thanks,

John Z
Title: Re: V12 time optimization
Post by: frankie on October 14, 2023, 10:02:27 PM: It took a long time to discover the optimizer problem, the main reason is that compiling with debug the bug disappear.
Then I disassembled the routine and analyzed the assembler source.
Here the problem is evident, due to an aggressive optimization, the loop erroneously starts including the increment of the string pointer:
Code: [Select]
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; Bad implementation is the exact copy, obtained disassembling the C ; source strrpl_asm_bad@16: push ebp mov ebp,esp sub esp,4 push ebx push esi push edi mov edi,dword ptr [n] mov bl,byte ptr [oldchar] push 1 lea esi,[edi+1] push esi call calloc add esp,8 mov esi,eax mov dword ptr [loc_var],esi test eax,eax je lbl_52 test edi,edi jbe lbl_4D xor esi,esi ; Here is the optimizer bug. ; The loop is entered on pointer increment ; The result is that the loop starts on 2nd character lbl_2C: add dword ptr [pstr],1 mov ecx,dword ptr [pstr] mov cl,byte ptr [ecx] mov edx,dword ptr [pstr] cmp byte ptr [edx],0 je lbl_4D mov dl,byte ptr [newchar] cmp cl,bl cmove ecx,edx mov byte ptr [eax],cl inc eax inc esi cmp esi,edi jb lbl_2C lbl_4D: mov eax,dword ptr [loc_var] jmp lbl_54 lbl_52: xor eax,eax lbl_54: pop edi pop esi pop ebx mov esp,ebp pop ebp ret 10hThe correct code should be:
Code: [Select]
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; This is the correct version strrpl_asm_good@16: push ebp mov ebp,esp sub esp,4 push ebx push esi push edi mov edi,dword ptr [n] mov bl,byte ptr [oldchar] push 1 lea esi,[edi+1] push esi call calloc add esp,8 mov esi,eax mov dword ptr [loc_var],esi test eax,eax je lbl_g_52 test edi,edi jbe lbl_g_4D xor esi,esi ; The correct version jumps in the loop after the increment jmp lbl_should_jump_here lbl_g_2C: add dword ptr [pstr],1 lbl_should_jump_here: mov ecx,dword ptr [pstr] mov cl,byte ptr [ecx] mov edx,dword ptr [pstr] cmp byte ptr [edx],0 je lbl_g_4D mov dl,byte ptr [newchar] cmp cl,bl cmove ecx,edx mov byte ptr [eax],cl inc eax inc esi cmp esi,edi jb lbl_g_2C lbl_g_4D: mov eax,dword ptr [loc_var] jmp lbl_g_54 lbl_g_52: xor eax,eax lbl_g_54: pop edi pop esi pop ebx mov esp,ebp pop ebp ret 10hI attach the project from JohnZ with an assembler module containing the 2 assembly procedures.
I had to write it because there was no other way to debug the problem.
Title: Re: V12 time optimization
Post by: frankie on October 14, 2023, 10:27:39 PM: Quote from: John Z on October 14, 2023, 11:58:06 AM
Quote from: Vortex on October 14, 2023, 11:40:33 AM
Quote
If we do not use volatile qualifier, the following problems may arise:
1) Code may not work as expected when optimization is turned on.

https://www.geeksforgeeks.org/understanding-volatile-qualifier-in-c/

Yes but but how about -
Quote
"Objects declared as volatile are omitted from optimization because their values can be changed by code outside the scope of current code at any time."

In this case i is local in scope and can not be changed outside of the scope ....
Then there is the issue is everything to be volatile? How to determine which local scope variable needs volatile?

John Z
It doesn't matter if the variable is global or not. Consider that you can have passed the address of a local variable to another procedure that modify it externally on a timer (to not mention interrupt) or an asynchronous event.
Even single threaded programs can be influenced by external concurrent processes or system I/O and have then volatile content modifications. For this reason the language doesn't put any limitation about the scope of the volatile element.
Anyway, as Vortex highlighted, its use can can be troublesome, and heavily limits the optimization process forcing the reload of the value each time it is used.
Now the workaround force a small variation in code emission that remove the bug. Marco introduced a variation declaring volatile the variable i, this added code to service the volatile property forcing the optimizer to modify strategy.
But it could have been any other small modification. I.e. modifing the for loop:
Code: [Select]
for (i = 0; (i < n) && (c = *str++); ++i)With a while loop:
Code: [Select]
while (n-- && (c = *str++))Remove the bug.
Title: Re: V12 time optimization
Post by: John Z on October 14, 2023, 11:38:29 PM: Excellent! Thanks. I too noticed that a small change in the code made the issue disappear.
For example when I tried to Write_to_Log the incoming string before any processing was performed the problem never appeared. That is also why I was concerned that putting it into a separate project might obscure the problem. Fortunately it didn't.

Very elegant while loop !

John Z
Title: Re: V12 time optimization
Post by: Pelle on October 15, 2023, 11:14:11 PM: The problem turned out to be in the spiller/rewriter.

The way a pointed-to value is fetched from a post-incremented pointer, like:
Code: [Select]
c = *str++
is turned into something like this internally (LCSE=Local Common SubExpression):
Code: [Select]
LCSE-temp = *str str = str + 1 c = LCSE-tempThis construct reduces the internal set of operators, which is a good thing, but needs careful handling all over.

There was a logical error in that a GCSE (Global Common SubExpression) like:
Code: [Select]
GCSE-temp = <value> (other code) ... first use of GCSE-temp (other code) ... second use of GCSE-temp etc.
can be rewritten as:
Code: [Select]
(other code) ... first use of <value> (other code) ... second use of <value>
this will always work, but this is not always true for a LCSE-temp (due to the above special construct).

The spiller/rewriter is rarely touched, last time was apparently in 2010, so this logical error has remained undetected for a long time. I guess it needed a register-starved architecture like X86 to trigger (most people are presumably using X64 these days).

Quote from: frankie on October 14, 2023, 10:02:27 PM
It took a long time to discover the optimizer problem, the main reason is that compiling with debug the bug disappear.
- Full debugging info can at best be used to find logical errors, but since most optimizer passes that move around code are disabled in this mpde, there is little chance of finding code generator problems.
- Line number debugging info will not disable any optimizer passes, but line numbers can still get lost (complicating debugging) because it can be hard to explain sequences like this:
line #1
line #8
line #3
line #9
line #2
The common concept is to relate to the source file, but how to do that without confusing everyone involved?
Title: Re: V12 time optimization
Post by: frankie on October 16, 2023, 10:22:48 AM: Quote from: Pelle on October 15, 2023, 11:14:11 PM
Quote from: frankie on October 14, 2023, 10:02:27 PM
It took a long time to discover the optimizer problem, the main reason is that compiling with debug the bug disappear.
- Full debugging info can at best be used to find logical errors, but since most optimizer passes that move around code are disabled in this mpde, there is little chance of finding code generator problems.
- Line number debugging info will not disable any optimizer passes, but line numbers can still get lost (complicating debugging) because it can be hard to explain sequences like this:
line #1
line #8
line #3
line #9
line #2
The common concept is to relate to the source file, but how to do that without confusing everyone involved?
Thanks for the problem sharing Pelle.
I understand the points about debug, but I met also another problem. After I have created assembler module with disassembled code, at first, I set a breakpoint on the assembler routines entry to debug in assembler from there, but the debugger passed over ignoring them. The only way I was able to debug was setting breakpoint before the call in C source, then on break I had to change debug mode to assembly then using step-in I was able at last to see the assembler code.
Title: Re: V12 time optimization
Post by: Pelle on October 16, 2023, 01:02:54 PM: Quote from: frankie on October 16, 2023, 10:22:48 AM
I understand the points about debug, but I met also another problem.
Ah, OK. I will have a look and see if I can improve things...
Title: Re: V12 time optimization
Post by: John Z on October 18, 2023, 03:00:50 PM: Quote from: Pelle on October 15, 2023, 11:14:11 PM
The problem turned out to be in the spiller/rewriter.

The spiller/rewriter is rarely touched, last time was apparently in 2010, so this logical error has remained undetected for a long time. I guess it needed a register-starved architecture like X86 to trigger (most people are presumably using X64 these days).

Is it possible that versions prior to version 12 were not utilizing this section/feature of the optimizer, or perhaps had something that mitigated the effect? I ask because when running the exact test code on Version 10 or Version 11 they do not display the error. That might explain why the issue has remained undetected since 2010... I think a lot of code was 32bit in 2010....

John Z
Title: Re: V12 time optimization
Post by: Pelle on October 18, 2023, 05:25:25 PM: Quote from: John Z on October 18, 2023, 03:00:50 PM
Is it possible that versions prior to version 12 were not utilizing this section/feature of the optimizer, or perhaps had something that mitigated the effect?
This piece of code is not part of the optimizer, only affected by it: the register-allocator will (basically) attempt to change the generated machine code into holding as many variables as possible in fast CPU-registers, to avoid having to load the variable again and again from slow RAM-memory. This register-allocator will optimistically assume that all variables can be placed in a CPU-register. When this is not possible, the variable is "spilled" - meaning the machine code is changed into loading this variable from RAM-memory. There is a guessed cost of spilling a variable, and the one with the smallest cost is normally spilled.

An unoptimized function will keep all variables in RAM memory, leaving no work for the register-allocator.

An optimized function that performs actual work is unlikely to keep all variables in CPU-registers. This means the spiller-function (with this bug) is called again and again and again (especially with 32-code that has few CPU-registers to begin with). However, this function can take several different paths depending on the kind/class of the variable to spill. The path that leads to this bug is less likely, but obviously not impossible. A small change to the generated machine code can affect when a spill is needed, and then the actual spilling will affect the generated machine code that may affect further spilling, etc. etc.

Quote from: John Z on October 18, 2023, 03:00:50 PM
I ask because when running the exact test code on Version 10 or Version 11 they do not display the error. That might explain why the issue has remained undetected since 2010...
Basically, your code manage to combine some unlikely factors into revealing this bug. I'm not going to investigate further than that...

Quote from: John Z on October 18, 2023, 03:00:50 PM
I think a lot of code was 32bit in 2010....
I doubt it, but let's not get into that...
Title: Re: V12 time optimization
Post by: John Z on October 18, 2023, 06:56:53 PM: OK, understandable Thanks for figuring it out considering it is only 32 bit...

I'll invest more time in trying to move to 64 bit code.

John Z
Title: Re: V12 time optimization
Post by: Pelle on October 23, 2023, 08:38:36 PM: Quote from: frankie on October 16, 2023, 10:22:48 AM
I understand the points about debug, but I met also another problem. After I have created assembler module with disassembled code, at first, I set a breakpoint on the assembler routines entry to debug in assembler from there, but the debugger passed over ignoring them. The only way I was able to debug was setting breakpoint before the call in C source, then on break I had to change debug mode to assembly then using step-in I was able at last to see the assembler code.
I have now looked at this.

Setting breakpoints in source files (before starting the debugger) requires debug info (with source lines) when starting the debugger. Source lines in the debug info is always relative the start of a function. For an assembly file you can only get the concept of a function by using PROC .. ENDP, just a code label is not enough.

OTOH, starting the debugger with "always break at entry-point" and then setting breakpoints should work...
Title: Re: V12 time optimization
Post by: frankie on October 23, 2023, 10:10:11 PM: Quote from: Pelle on October 23, 2023, 08:38:36 PM
I have now looked at this.

Setting breakpoints in source files (before starting the debugger) requires debug info (with source lines) when starting the debugger. Source lines in the debug info is always relative the start of a function. For an assembly file you can only get the concept of a function by using PROC .. ENDP, just a code label is not enough.

OTOH, starting the debugger with "always break at entry-point" and then setting breakpoints should work...
Ok. Get it.
Thanks