Optimizing issue

cosh · February 01, 2023, 04:16:58 PM

Here's a small issue I ran into while using the Pelles C compiler.
This is a simple function that computes the quotient and remainder of two integers:


#include <stddef.h> /* size_t */

/* Result of unsigned integer division. */
typedef struct st_stdiv_t {
	size_t quot; /* Quotient.  */
	size_t rem;  /* Remainder. */
} stdiv_t;

stdiv_t stdiv(size_t numerator, size_t denominator)
{
	stdiv_t result;
	/* Ideally the compiler folds the following two operations into a single div instruction. */
	result.quot = numerator / denominator;
	result.rem  = numerator % denominator;
	return result;
}

Now we paste this snippet into https://godbolt.org/ and choose x86-64 gcc.
We enable level 3 optimization with -O3 and look at the disassembly view on the right.

stdiv:
        mov     rax, rdi
        xor     edx, edx
        div     rsi
        ret

Now we can see that gcc has folded the / and % operations into a single div instruction.
I also tested the code with MSVC and clang. Both of them perform the same optimization.
Now let's see how the Pelles C compiler does. In the Pelles C IDE we go to Project->Project options->Compiler->Code generation and select Maximize speed and Extra optimizations.

mov eax
xor edx
div ebx
mov esi
mov eax
xor edx
div ebx
mov eax

Unfortunately, the Pelles C compiler did not fold the code into a single div instruction.
We know that the div instruction performs one division and stores its results in two registers: the quotient in eax (rax) and the remainder in edx (rdx).
This means one div instruction is enough instead of two, and the result can be returned directly in registers, which improves performance.
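Until the compiler does this by itself, a possible source-level workaround is to divide once and derive the remainder from the quotient. The following is only a minimal sketch: stdiv_onediv is a hypothetical name, not part of any library, and I have not verified what code Pelles C actually generates for it. The idea is simply that a multiply and a subtract are usually cheaper than a second div.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the stdiv_t shown earlier in this post. */
typedef struct st_stdiv_t {
	size_t quot; /* Quotient.  */
	size_t rem;  /* Remainder. */
} stdiv_t;

/* Hypothetical variant of stdiv(): one division, then multiply and subtract. */
stdiv_t stdiv_onediv(size_t numerator, size_t denominator)
{
	stdiv_t result;
	assert(denominator != 0); /* Division by zero is undefined behavior. */
	result.quot = numerator / denominator;
	/* For unsigned operands n == q * d + r, so r = n - q * d. */
	result.rem  = numerator - result.quot * denominator;
	return result;
}
```

For example, stdiv_onediv(17, 5) yields quot == 3 and rem == 2. This only avoids the second division; the struct may still be returned through memory, depending on the compiler and calling convention.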

Could the back end of the Pelles C compiler be changed to support this?
It would reduce code size and speed up execution.
PS: If we try to compile glibc (the GNU C library) with Pelles C, we will not get the benefit of this optimization there either, since glibc implements its div function the same way.

Pelle · March 05, 2023, 05:01:59 PM

This is hard to do nicely. The resulting struct from div() and ldiv() will fit in a 64-bit register, so this can be handled. In this case, assuming X64, the result will always go to memory. The code generator can't describe multiple resulting registers, so this would have to be some kind of pattern matching with some kind of hack, only to produce a rather sub-optimal result (might improve this exact case, but if you inline stdiv() it will probably look crappy again...)
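To illustrate the size argument, here is a rough sketch. fits_in_one_reg64 is a hypothetical helper, and the 8-byte limit is an assumption about a 64-bit target where a small struct can come back in a single register.

```c
#include <stddef.h>
#include <stdlib.h> /* div_t */

/* Same layout as the stdiv_t from the first post. */
typedef struct st_stdiv_t {
	size_t quot; /* Quotient.  */
	size_t rem;  /* Remainder. */
} stdiv_t;

/* Hypothetical helper: nonzero if an object of `size` bytes fits in one 64-bit register. */
int fits_in_one_reg64(size_t size)
{
	return size <= 8;
}
```

Where int is 32 bits, fits_in_one_reg64(sizeof(div_t)) is nonzero, because the struct holds two ints. Where size_t is 64 bits, fits_in_one_reg64(sizeof(stdiv_t)) is zero, because the struct holds two 8-byte members, which is why the result goes to memory.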
