When you code a CALL, the processor does the following:
-MOV RAX,Next Instruction Address (RIP) ; RAX is only an example, no real register is used
-SUB RSP,8
-MOV [RSP],RAX
-JMP Function
It executes MOV RAX,RIP and SUB RSP,8 in the same cycle
and then executes MOV [RSP],RAX.
After that it waits until the address in the RSP register has really been modified.
Once the address is modified, it performs the JUMP.
SUB RSP,8 followed by MOV [RSP],RAX is the same as PUSH RAX, so we can write:
MOV RAX,OFFSET @ReturnAddressFromFunction
PUSH RAX
JMP Function
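To check the equivalence on paper, here is a small C model of the stack (only a sketch, not real CPU behaviour). The names Cpu, push64, call_direct and push_then_jmp are made up for this illustration, and the model assumes a 64-bit push first decrements the stack pointer by 8 and then stores at the new top of the stack.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: a stack pointer, an instruction pointer
   and 64 bytes of memory used as the stack. */
typedef struct {
    uint64_t rsp;
    uint64_t rip;
    uint8_t  mem[64];
} Cpu;

void store64(Cpu *c, uint64_t addr, uint64_t value) {
    assert(addr + 8 <= sizeof c->mem);   /* stay inside the toy memory */
    memcpy(&c->mem[addr], &value, 8);
}

uint64_t load64(const Cpu *c, uint64_t addr) {
    uint64_t value;
    assert(addr + 8 <= sizeof c->mem);
    memcpy(&value, &c->mem[addr], 8);
    return value;
}

/* PUSH: decrement the stack pointer, then store at the new top. */
void push64(Cpu *c, uint64_t value) {
    c->rsp -= 8;
    store64(c, c->rsp, value);
}

/* CALL target: push the return address, then jump. */
void call_direct(Cpu *c, uint64_t return_address, uint64_t target) {
    push64(c, return_address);
    c->rip = target;
}

/* MOV RAX,return address / PUSH RAX / JMP target. */
void push_then_jmp(Cpu *c, uint64_t return_address, uint64_t target) {
    uint64_t rax = return_address;
    push64(c, rax);
    c->rip = target;
}

/* Returns 1 when both machines ended in the same state. */
int same_state(const Cpu *a, const Cpu *b) {
    return a->rsp == b->rsp && a->rip == b->rip &&
           memcmp(a->mem, b->mem, sizeof a->mem) == 0;
}
```

Starting two copies of the same Cpu with rsp = 64 and running call_direct on one and push_then_jmp on the other leaves both with rsp = 56, the return address stored at that location, and rip set to the target. The model only shows that the final state is the same; it says nothing about timing.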
It is quicker because the JMP, unlike the CALL, does not have to wait.
Agner explains it better than I can:
; Example 9.2, Splitting instructions into uops
push eax
call SomeFunction
The push eax instruction does two things. It subtracts 4 from the stack pointer and stores eax to the address pointed to by the stack pointer. Assume now that eax is the result of a long and time-consuming calculation. This delays the push instruction. The call instruction depends on the value of the stack pointer which is modified by the push instruction. If instructions were not split into μops then the call instruction would have to wait until the push instruction was finished. But the CPU splits the push eax instruction into sub esp,4 followed by mov [esp],eax. The sub esp,4 micro-operation can be executed before eax is ready, so the call instruction will wait only for sub esp,4, not for mov [esp],eax.
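The dependency argument in the quote can be put into numbers with a very rough C sketch. The function names and the one-cycle latency are assumptions chosen for illustration only; they are not taken from Agner's manuals and real latencies differ between processors.

```c
/* Returns the larger of two cycle numbers. */
int max_int(int a, int b) {
    return a > b ? a : b;
}

/* If PUSH EAX were a single indivisible operation, it could only
   finish one cycle (assumed latency) after both the instruction has
   been issued and EAX is ready. CALL needs the new stack pointer,
   so it would wait for that whole operation. */
int call_start_unsplit(int issue_cycle, int eax_ready_cycle) {
    return max_int(issue_cycle, eax_ready_cycle) + 1;
}

/* With PUSH EAX split into SUB ESP,4 and MOV [ESP],EAX, the SUB
   does not depend on EAX. CALL waits only for the SUB, which is
   assumed to finish one cycle after issue. */
int call_start_split(int issue_cycle, int eax_ready_cycle) {
    (void)eax_ready_cycle;   /* the SUB does not read EAX */
    return issue_cycle + 1;
}
```

With the push issued at cycle 0 and EAX ready only at cycle 20, the unsplit model lets the call start at cycle 21, while the split model lets it start at cycle 1.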
http://agner.org/optimize/