JJ2007 on 29/10/2017
int 3
or rax, -1 ; quick way to set a register to -1, equivalent to xor rax,rax then not rax
xor eax, eax
or rax, -1
xor rax, rax
---------------------------------------------------
MOV RAX,0BADF00DBADCAFEh ;RAX=00BADF00DBADCAFE
OR EAX,-1 ;RAX=00000000FFFFFFFF
MOV RAX,0BADF00DBADCAFEh ;RAX=00BADF00DBADCAFE
OR RAX,-1 ;RAX=FFFFFFFFFFFFFFFF
---------------------------------------------------
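The result of a 32-bit AND is zero-extended to 64 bits (the upper 32 bits of the destination register
are cleared), as with any other 32-bit operation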
---------------------------------------------------
Avoid INC or DEC immediately before a LOOP or Jcc
Prefer ADD / SUB: unlike INC and DEC, they update all the arithmetic flags, including CF
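A minimal sketch of this rule in MASM x64 syntax. The procedure name SumBytes and the register assignment (RCX = pointer, RDX = byte count, assumed non-zero) are hypothetical and chosen only for illustration; the point is that the counter is updated with SUB rather than DEC right before the conditional jump.

```asm
.code
; Hypothetical example: sum RDX bytes starting at [RCX], result in RAX.
; Assumes RDX > 0 (arguments follow the Windows x64 register convention).
SumBytes PROC
    xor   r9d, r9d              ; R9 = 0 (a 32-bit write clears the upper half too)
next_byte:
    movzx r8d, BYTE PTR [rcx]   ; load one byte, zero-extended into R8
    add   r9, r8                ; accumulate
    add   rcx, 1                ; advance the pointer
    sub   rdx, 1                ; SUB instead of DEC: writes all flags, including CF
    jnz   next_byte             ; branch on the ZF produced by SUB
    mov   rax, r9               ; return value
    ret
SumBytes ENDP
END
```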
---------------------------------------------------
A CALL immediately followed by a RET should be replaced by a JMP (tail call)
CALL Fonction
ret
becomes
ADD RSP,XXXXX
JMP Fonction
---------------------------------------------------
REPRET TEXTEQU <DB 0F3h, 0C3h>
To make sure the RET instruction is predicted well, replace it with REP RET
---------------------------------------------------
JZ Label
RET
must be replaced by
JZ LABEL
NOP
RET
---------------------------------------------------
For repeat counts of less than 4k, expand REP string instructions into equivalent sequences of simple
AMD64 instructions
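As a rough illustration of such an expansion, the sketch below replaces a REP MOVSQ copy with a plain load/store loop, in MASM x64 syntax. The procedure name CopyQwords, the register assignment, and the returned pointer are assumptions made for the example; whether this form is actually faster depends on the processor and should be measured.

```asm
.code
; Hypothetical example: copy R8 qwords from [RDX] to [RCX].
; Assumes R8 > 0 and non-overlapping buffers.
; REP form being replaced:  mov rdi,rcx / mov rsi,rdx / mov rcx,r8 / rep movsq
CopyQwords PROC
copy_next:
    mov   rax, QWORD PTR [rdx]  ; load one qword from the source
    mov   QWORD PTR [rcx], rax  ; store it to the destination
    add   rdx, 8                ; advance the source pointer
    add   rcx, 8                ; advance the destination pointer
    sub   r8, 1                 ; one qword less to copy
    jnz   copy_next
    mov   rax, rcx              ; return the address just past the last qword written
    ret
CopyQwords ENDP
END
```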
---------------------------------------------------
Replace :
LABEL :
.
.
.
LOOP LABEL
by
LABEL :
.
.
.
DEC RCX
JNZ LABEL
---------------------------------------------------
MOV REG,0 becomes XOR REG,REG
---------------------------------------------------
Replace XOR R64, R64 with XOR R32, R32: a 32-bit operation also clears the upper 32 bits of the 64-bit register
---------------------------------------------------
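XOR R64, R64 followed by a MOV R32, value:
remove the XOR, because a 32-bit write is zero-extended (not sign-extended) into the upper 32 bits.
The register ends up holding the unsigned 32-bit value, so MOV R32, value can stand in for
MOV R64, value only for non-negative values that fit in 32 bits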
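The effect can be sketched as follows in MASM x64 syntax. The procedure name LoadConstants and the constant 12345678h are hypothetical; the register values in the comments follow from the zero-extension rule for 32-bit writes.

```asm
.code
; Hypothetical demonstration procedure; the final value is returned in RAX.
LoadConstants PROC
    ; Redundant form: the XOR adds nothing
    xor   rax, rax              ; RAX = 0000000000000000
    mov   eax, 12345678h        ; RAX = 0000000012345678

    ; Equivalent shorter form: the 32-bit write already clears bits 63..32
    mov   eax, 12345678h        ; RAX = 0000000012345678

    ; Zero-extension, not sign-extension
    mov   eax, -1               ; RAX = 00000000FFFFFFFF

    ; A 64-bit -1 needs a 64-bit destination
    mov   rax, -1               ; RAX = FFFFFFFFFFFFFFFF
    ret
LoadConstants ENDP
END
```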
---------------------------------------------------
An instruction with RIP-relative addressing is not micro-fused in the following cases:
• An additional immediate is needed, for example:
    • CMP [RIP+400], 27
    • MOV [RIP+3000], 142
• The instruction is a control flow instruction with an indirect target specified using RIP-relative
  addressing, for example:
    • JMP [RIP+5000000]
In these cases, an instruction that cannot be micro-fused will require decoder 0 to issue two micro-ops,
resulting in a slight loss of decode bandwidth.
---------------------------------------------------
Macro-fusion merges two instructions into a single micro-op. In Intel Core microarchitecture, this hardware
optimization is subject to conditions on both the first and the second instruction of the macro-fusable
pair.
• The first instruction of the macro-fused pair modifies the flags. The following instructions can be
  macro-fused:
    — In Intel microarchitecture code name Nehalem: CMP, TEST.
    — In Intel microarchitecture code name Sandy Bridge: CMP, TEST, ADD, SUB, AND, INC, DEC.
    — These instructions can fuse if:
        • The first source / destination operand is a register.
        • The second source operand (if it exists) is one of: immediate, register, or non RIP-relative
          memory.
• The second instruction of the macro-fusable pair is a conditional branch. Table 3-1 describes, for each
  instruction, what branches it can fuse with.
---------------------------------------------------
Calls and returns are expensive; use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch
prediction within the routine.
• A mispredicted branch can lead to performance penalties inside a small function that are larger than
those that would occur if that function is inlined.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if
doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so
increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested
calls and returns in rapid succession, consider transforming the program with inline to reduce the call
depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that
contain branches with poor prediction rates. If a branch misprediction results in a RETURN being
prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is
a call to another function, consider converting the call to a jump. This will save the call/return overhead
as well as an entry in the return stack buffer.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four
branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop
branches in a 16-byte chunk.
---------------------------------------------------
Macro-fusion merges two instructions into a single micro-op. Intel Core microarchitecture performs this
hardware optimization under limited circumstances.
The first instruction of the macro-fused pair must be a CMP or TEST instruction. This instruction can be
REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The second instruction (adjacent in the
instruction stream) should be a conditional branch.
Since these pairs are a common ingredient in basic iterative programming sequences, macro-fusion
improves performance even on un-recompiled binaries. All of the decoders can decode one macro-fused
pair per cycle, with up to three other instructions, resulting in a peak decode bandwidth of 5 instructions
per cycle.
Each macro-fused instruction executes with a single dispatch. This process reduces latency, which in this
case shows up as a cycle removed from the branch mispredict penalty. Software also gains all other fusion
benefits: increased rename and retire bandwidth, more storage for instructions in-flight, and power
savings from representing more work in fewer bits.
The following list details when you can use macro-fusion:
• CMP or TEST can be fused when comparing:
    REG-REG. For example: CMP EAX,ECX; JZ label
    REG-IMM. For example: CMP EAX,0x80; JZ label
    REG-MEM. For example: CMP EAX,[ECX]; JZ label
    MEM-REG. For example: CMP [EAX],ECX; JZ label
• TEST can be fused with all conditional jumps.
• CMP can be fused with only the following conditional jumps in Intel Core microarchitecture. These
  conditional jumps check the carry flag (CF) or the zero flag (ZF). The macro-fusion-capable
  conditional jumps are:
    JA or JNBE
    JAE or JNB or JNC
    JE or JZ
    JNA or JBE
    JNAE or JC or JB
    JNE or JNZ
CMP and TEST cannot be fused when comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label). Macro-fusion
is not supported in 64-bit mode for Intel Core microarchitecture.
• Intel microarchitecture code name Nehalem supports the following enhancements in macro-fusion:
    — CMP can be fused with the following conditional jumps (which were not supported in Intel Core
      microarchitecture):
        • JL or JNGE
        • JGE or JNL
        • JLE or JNG
        • JG or JNLE
    — Macro-fusion is supported in 64-bit mode.
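The adjacency condition above can be sketched as follows in MASM x64 syntax. The procedure name FusionPairs, its arguments (two integers in ECX and EDX) and its return codes are hypothetical; since this is 64-bit code, the comments only apply from Nehalem onward, and whether a given pair really fuses depends on the processor.

```asm
.code
; Hypothetical example: returns 1 if ECX = EDX, 2 if ECX is below 80h, else 3.
FusionPairs PROC
    cmp   ecx, edx              ; REG-REG compare ...
    je    is_equal              ; ... immediately followed by JE: candidate for macro-fusion

    cmp   ecx, 80h              ; REG-IMM compare ...
    mov   eax, 2                ; ... but this MOV separates it from the jump: no fusion
    jb    finished              ; MOV leaves the flags untouched, so JB still tests the CMP

    mov   eax, 3
    jmp   finished
is_equal:
    mov   eax, 1
finished:
    ret
FusionPairs ENDP
END
```

Moving the MOV above the second CMP would keep the result identical and make that CMP / JB pair adjacent again.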