JMP vs CALL (For the beginners)

Jokaste · October 30, 2017, 12:06:58 PM

When you code a CALL, the processor does roughly the following:

-MOV RAX,Next Instruction Address (RIP) ; RAX is only an example, no register is actually used
-SUB RSP,8
-MOV [RSP],RAX
-JMP Function

It executes MOV RAX,RIP and SUB RSP,8 in the same cycle,
and then executes MOV [RSP],RAX.
After that it sleeps, sorry, it waits until the address in the RSP register has really been modified.
Once the address has been modified, it goes ahead with the JUMP.

SUB RSP,8 followed by MOV [RSP],RAX is the same as PUSH RAX, so we can write:

MOV RAX,OFFSET @ReturnAddressFromFunction
PUSH RAX
JMP Function
@ReturnAddressFromFunction:

It is quicker because the JMP does not have to wait the way the CALL does.
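The equivalence between CALL and PUSH + JMP can be sketched with a toy model in C. This is only an illustration: `ToyCpu` and every `toy_*` name are hypothetical, the "addresses" are plain numbers, and the model covers only the architectural effect (RSP, RIP and stack contents). It says nothing about timing, so it neither confirms nor refutes the speed claim.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the x86-64 stack, for illustration only.
   ToyCpu and every toy_* name are hypothetical, not a real API. */
enum { TOY_STACK_BYTES = 64 };

typedef struct {
    uint64_t rsp;                    /* byte offset into mem; the stack grows downward */
    uint64_t rip;                    /* address of the next instruction */
    uint8_t  mem[TOY_STACK_BYTES];
} ToyCpu;

static void toy_init(ToyCpu *cpu) {
    memset(cpu, 0, sizeof *cpu);
    cpu->rsp = TOY_STACK_BYTES;      /* empty stack: RSP at the top */
}

/* CALL target: SUB RSP,8 / MOV [RSP],return address / JMP target */
static void toy_call(ToyCpu *cpu, uint64_t return_address, uint64_t target) {
    assert(cpu->rsp >= 8);
    cpu->rsp -= 8;
    memcpy(&cpu->mem[cpu->rsp], &return_address, sizeof return_address);
    cpu->rip = target;
}

/* PUSH value: SUB RSP,8 / MOV [RSP],value */
static void toy_push(ToyCpu *cpu, uint64_t value) {
    assert(cpu->rsp >= 8);
    cpu->rsp -= 8;
    memcpy(&cpu->mem[cpu->rsp], &value, sizeof value);
}

/* JMP target: only RIP changes */
static void toy_jmp(ToyCpu *cpu, uint64_t target) {
    cpu->rip = target;
}

/* RET: MOV RIP,[RSP] / ADD RSP,8 */
static void toy_ret(ToyCpu *cpu) {
    memcpy(&cpu->rip, &cpu->mem[cpu->rsp], sizeof cpu->rip);
    cpu->rsp += 8;
}

/* Returns 1 if CALL and PUSH + JMP leave identical state and RET comes
   back to the same address in both cases, 0 otherwise. */
int toy_call_matches_push_jmp(uint64_t return_address, uint64_t target) {
    ToyCpu a, b;
    toy_init(&a);
    toy_init(&b);

    toy_call(&a, return_address, target);   /* CALL Function */
    toy_push(&b, return_address);           /* PUSH return address */
    toy_jmp(&b, target);                    /* JMP Function */

    if (a.rsp != b.rsp || a.rip != b.rip ||
        memcmp(a.mem, b.mem, sizeof a.mem) != 0)
        return 0;

    toy_ret(&a);
    toy_ret(&b);
    return a.rip == return_address && b.rip == return_address &&
           a.rsp == TOY_STACK_BYTES && b.rsp == TOY_STACK_BYTES;
}
```

Calling `toy_call_matches_push_jmp` with any two made-up addresses returns 1 in this model, because both sequences decrement RSP by 8, store the return address at [RSP] and set RIP to the target.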

Agner explains it better than I do:

Quote
; Example 9.2, Splitting instructions into uops
push eax
call SomeFunction
The push eax instruction does two things. It subtracts 4 from the stack pointer and stores eax to the address pointed to by the stack pointer. Assume now that eax is the result of a long and time-consuming calculation. This delays the push instruction. The call instruction depends on the value of the stack pointer which is modified by the push instruction. If instructions were not split into μops then the call instruction would have to wait until the push instruction was finished. But the CPU splits the push eax instruction into sub esp,4 followed by mov [esp],eax. The sub esp,4 micro-operation can be executed before eax is ready, so the call instruction will wait only for sub esp,4, not for mov [esp],eax.

http://agner.org/optimize/

frankie · October 30, 2017, 04:37:40 PM

Sincerely, I don't understand all this effort to force the CISC (Complex Instruction Set Computer) architecture of the IAPX32-64 family to mimic a RISC (Reduced Instruction Set Computer) one.

If the manufacturer (INTEL, AMD or VIA) introduces any future improvement or change to the microarchitecture, the pipelining, or some other characteristic of the architecture, those adjustments could become counterproductive...
I'm not even sure that JMP instructions, which invalidate caches, are a good choice for speeding up program execution. Nor does replacing a call to a system DLL function with a stack change and a JMP seem a good idea, apart from the SEH problem that I already explained in another post, if only because of the stack probing techniques used in compiled code.
IMHO the only way to make improvements that make any sense is to refer to the manufacturers' optimization manuals.

Jokaste · October 30, 2017, 06:04:55 PM

I do not agree with you, but I do not know how to explain it, even in English, even using Google Translate. I have this document, as well as all the others, and AMD's too.
I think, and I'm not the only one, that the only optimization that is worthwhile is improving the algorithms.
Recently I spent two days to save 500 bytes! Madness.
Given the number of people who post on this forum, I try to get members to react. Apart from you, Timo, Vortex, JJ2007 and one or two others I forget (excuse me, gentlemen), it is always the same people who post.
I also try to share what I have read or tested. I can be wrong, and that's where it becomes constructive. As long as there are answers.

Philippe

Jokaste · October 30, 2017, 07:36:14 PM

Quote

Calls and returns are expensive; use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch prediction within the routine.
• A mispredicted branch can lead to performance penalties inside a small function that are larger than those that would occur if that function is inlined.

Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession; consider transforming the program with inline to reduce the call depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is a call to another function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the return stack buffer.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk.

I was right.

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf page 104
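As a rough illustration of the inlining rules quoted above (Rules 5 and 8), here is a minimal C sketch. The names `clamp_int` and `sum_clamped_to_byte` are hypothetical and not taken from the Intel manual, and whether a compiler actually inlines the helper depends on the compiler and its optimization settings.

```c
#include <assert.h>

/* Hypothetical helper: small, contains branches, called from a hot loop.
   "static inline" only asks for inlining; the compiler decides. */
static inline int clamp_int(int value, int lo, int hi) {
    assert(lo <= hi);
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* Hypothetical caller: sums the values after clamping them to 0..255.
   If clamp_int is inlined, no arguments are passed and no CALL/RET pair
   is executed per element; the bounds become constants inside the loop. */
long sum_clamped_to_byte(const int *values, int count) {
    long total = 0;
    for (int i = 0; i < count; ++i)
        total += clamp_int(values[i], 0, 255);
    return total;
}
```

The helper is small and its call site sits inside a loop, which is the situation Rule 5 describes; Rule 6 is the counterweight, since inlining a large function at many call sites grows the code.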

frankie · October 31, 2017, 09:10:03 AM

Quote from: Jokaste on October 30, 2017, 07:36:14 PM

You're wrong.
The suggestion is to avoid passing parameters, i.e. to use inlining, and to convert calls to jumps if, and only if, the last opcode in a function is a call. But "converting" means that you must prepare the jump, not simply add some space to the stack!

Consider jumping from inside a variadic function...
Anyway, explaining this in detail would take a lot of time that I, unfortunately, don't have right now. So, if this makes you happy, good luck with your tests.
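The "last statement is a call" condition of Rule 9 can be sketched in C. This is a hedged illustration with hypothetical function names; a compiler may or may not turn the tail call into a JMP, depending on optimization level and calling convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper with an accumulator, so that the recursive call
   is the very last action of the function. */
static size_t count_nonzero_from(const unsigned char *p, size_t n, size_t acc) {
    if (n == 0)
        return acc;
    /* Tail position: nothing is left to do after the call returns. */
    return count_nonzero_from(p + 1, n - 1, acc + (*p != 0));
}

/* Tail call: the callee's result is returned unchanged, so a compiler
   may emit JMP instead of CALL followed by RET. */
size_t count_nonzero(const unsigned char *p, size_t n) {
    assert(p != NULL || n == 0);
    return count_nonzero_from(p, n, 0);
}

/* Not a tail call: the addition runs after the callee returns, so
   control has to come back here and a return address is needed. */
size_t count_nonzero_plus_one(const unsigned char *p, size_t n) {
    return count_nonzero(p, n) + 1;
}
```

In `count_nonzero` the caller's own return address can simply be reused by the callee, which is what "converting the call to a jump" relies on. In `count_nonzero_plus_one` that is not possible, because work remains after the call.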

Jokaste · October 31, 2017, 12:35:17 PM

Inlining in X64 is not possible.

TimoVJL · October 31, 2017, 12:50:47 PM

read this

Jokaste · October 31, 2017, 02:52:42 PM

read this.

frankie · October 31, 2017, 06:30:52 PM

It seems very childish to continue with the "read this", but I don't see another solution.
Read this.

Jokaste · November 01, 2017, 02:53:55 AM

I looked at the code of memset as an intrinsic function: rep stosb, not very optimized. There are many functions like this one! It is an unoptimized kind of inlining.

TimoVJL · November 01, 2017, 11:08:17 AM

It is a size-optimized version for small sizes.
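For readers who do not know rep stosb: it stores one byte per iteration, count times. A minimal C sketch of the same behaviour follows. `fill_bytes` is a hypothetical name, and this is not the compiler's actual intrinsic, only a byte-at-a-time equivalent of what memset has to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name. Stores the low byte of value into dest, count times,
   one byte per iteration, which is what a "rep stosb" loop does. */
void *fill_bytes(void *dest, int value, size_t count) {
    unsigned char *p = dest;
    assert(dest != NULL || count == 0);
    while (count--)
        *p++ = (unsigned char)value;
    return dest;
}
```

The code is tiny, which fits the size-optimized reading above; faster versions usually store wider units at a time, at the cost of more code.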
