Pelles C forum
Assembly language => Assembly discussions => Topic started by: Jokaste on October 28, 2017, 07:30:47 PM
-
Tell me if I am right; ghosts, you can answer too.
mov [hWndSysInput+rip],rcx
mov rdx,[hInstance+rip]
xor rax,rax
mov [rsp + 64],rcx
mov [rsp + 80],rdx
mov [rsp + 88],rax
mov [rsp + 72],rax
mov [rsp + 56],rax
mov [rsp + 48],rax
mov [rsp + 40],rax
mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NOSORTHEADER or LVS_SORTASCENDING or LVS_REPORT
mov r8,OFFSET szNullString
mov rdx,OFFSET WC_LISTVIEW
xor rcx,rcx
call CreateWindowExA
If I have understood Agner Fog and others correctly, this must execute like this:
mov [hWndSysInput+rip],rcx    mov rdx,[hInstance+rip]                NO PENALTY
xor rax,rax                   mov [rsp + 64],rcx                     PENALTY because XOR is quicker than MOV
mov [rsp + 80],rdx            mov [rsp + 88],rax                     NO PENALTY
mov [rsp + 72],rax            mov [rsp + 56],rax                     NO PENALTY
mov [rsp + 48],rax            mov [rsp + 40],rax                     NO PENALTY
mov [rsp + 32],rax            mov r9,WS_CHILD or LVS_NOSORTHEADER    PENALTY because of the 1st instruction
mov r8,OFFSET szNullString    mov rdx,OFFSET WC_LISTVIEW             NO PENALTY
xor rcx,rcx                                                          PENALTY because the CALL cannot be executed in the same cycle as the XOR
call CreateWindowExA                                                 PENALTY
This could be improved:
mov [hWndSysInput+rip],rcx    mov rdx,[hInstance+rip]
mov [rsp + 64],rcx            mov [rsp + 80],rdx
xor rax,rax                   xor rcx,rcx
mov r8,OFFSET szNullString    mov rdx,OFFSET WC_LISTVIEW
mov [rsp + 88],rax            mov [rsp + 72],rax
mov [rsp + 56],rax            mov [rsp + 48],rax
mov [rsp + 40],rax            mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NO..   xchg rax,rax (NOP)
call CreateWindowExA
And the final code is:
mov [hWndSysInput+rip],rcx
mov rdx,[hInstance+rip]
mov [rsp + 64],rcx
mov [rsp + 80],rdx
xor rax,rax
xor rcx,rcx
mov r8,OFFSET szNullString
mov rdx,OFFSET WC_LISTVIEW
mov [rsp + 88],rax
mov [rsp + 72],rax
mov [rsp + 56],rax
mov [rsp + 48],rax
mov [rsp + 40],rax
mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NOSORTHEADER or LVS_SORTASCENDING or LVS_REPORT
xchg rax,rax ; NOP
call CreateWindowExA
Give me advice about this kind of parallel code.
Thanks
Jokaste / Grincheux
-
xor rax,rax mov [rsp + 64],rcx PENALTY because XOR is quicker than MOV
Launch a debugger and try this:
int 3
or rax, -1
xor eax, eax
or rax, -1
xor rax, rax
No idea whether xor eax, eax is any faster or slower than xor rax, rax, but it is one byte shorter and does the same: writing a 32-bit register zeroes the upper 32 bits of the 64-bit register.
-
Thanks, that's the kind of code I am looking for.
I have found that constants are not good for an assembler program, because when they have the value 0 the assembler encodes them the same way as if they were greater than 0.
MOV RAX,SW_HIDE = MOV RAX,0 rather than XOR RAX,RAX
I would like to create a PDF with all the tricks experienced programmers have.
Yesterday I downloaded MasmBasic and tried to read what the editor displays... I stopped before the end, this program really is rich.
-
MOV RAX,0BADF00DBADCAFEh ;RAX=00BADF00DBADCAFE
OR EAX,-1 ;RAX=00000000FFFFFFFF
MOV RAX,0BADF00DBADCAFEh ;RAX=00BADF00DBADCAFE
OR RAX,-1 ;RAX=FFFFFFFFFFFFFFFF
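The four lines above show the x86-64 rule that an instruction writing a 32-bit register clears bits 63..32 of the full register, while the 64-bit form works on all 64 bits. Below is a minimal C model of that rule, as a sketch only: the names `or_r32`, `or_r64` and `demo_zero_extension` are made up for illustration, and the code merely imitates the register arithmetic, it does not execute the instructions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Model of OR r32, imm: only the low 32 bits take part, and the
   32-bit result is zero-extended into the 64-bit register. */
uint64_t or_r32(uint64_t reg, uint32_t imm)
{
    uint32_t low = (uint32_t)reg | imm;
    return (uint64_t)low;
}

/* Model of OR r64, imm: the immediate is sign-extended to 64 bits
   and all 64 bits of the register take part. */
uint64_t or_r64(uint64_t reg, int32_t imm)
{
    return reg | (uint64_t)(int64_t)imm;
}

/* Replays the values from the post (hypothetical helper name). */
void demo_zero_extension(void)
{
    uint64_t rax = UINT64_C(0x00BADF00DBADCAFE);

    uint64_t after_or_eax = or_r32(rax, UINT32_C(0xFFFFFFFF)); /* OR EAX,-1 */
    uint64_t after_or_rax = or_r64(rax, -1);                   /* OR RAX,-1 */

    assert(after_or_eax == UINT64_C(0x00000000FFFFFFFF));
    assert(after_or_rax == UINT64_C(0xFFFFFFFFFFFFFFFF));

    printf("OR EAX,-1 -> RAX=%016" PRIX64 "\n", after_or_eax);
    printf("OR RAX,-1 -> RAX=%016" PRIX64 "\n", after_or_rax);
}
```

Calling `demo_zero_extension()` from a `main` replays the two results quoted in the post. The same model explains why xor eax, eax is enough to zero all of RAX.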
-
I have found that constants are not good for an assembler program, because when they have the value 0 the assembler encodes them the same way as if they were greater than 0.
MOV RAX,SW_HIDE = MOV RAX,0 rather than XOR RAX,RAX
Sometimes I use conditional assembly to decide which instruction to take:
if (IMAGE_ICON-IMAGE_BITMAP) eq 1
inc edi
else
add edi, IMAGE_ICON-IMAGE_BITMAP
endif
In this case, it is not really necessary because IMAGE_ICON and IMAGE_BITMAP are Windows constants that will never change IMHO, so it will be inc edi, always. But in this excerpt from the For_ ... Next macro, it saves a few bytes (MasmBasic.inc, lines 9719ff):
ifdifi tmpReg, vtS$
if atVt eq atImmediate
ife vtS$
xor tmpReg, tmpReg
elseif vtS$ eq -1
or tmpReg, -1
elseif (vtS$ le 127) and (vtS$ ge -128)
push vtS$
pop tmpReg
else
mov tmpReg, vtS$
endif
else
mov tmpReg, vtS$
endif
endif
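The decision tree in the excerpt above can be restated as a small C function, purely as a reading aid. This is a sketch under assumptions: the names `Encoding` and `pick_immediate_load` are hypothetical, and the byte counts are not taken from MasmBasic.inc; they assume a plain 32-bit register such as eax in 32-bit code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *pattern; /* instruction pattern the macro would emit */
    int bytes;           /* code size, assuming a register like eax  */
} Encoding;

/* Mirrors the branches of the macro excerpt for an immediate value. */
Encoding pick_immediate_load(int32_t value)
{
    if (value == 0)
        return (Encoding){ "xor reg, reg", 2 };
    if (value == -1)
        return (Encoding){ "or reg, -1", 3 };
    if (value >= -128 && value <= 127)
        return (Encoding){ "push imm8 / pop reg", 3 };
    return (Encoding){ "mov reg, imm32", 5 };
}

/* Hypothetical self-check of the four branches. */
void demo_pick_immediate_load(void)
{
    assert(strcmp(pick_immediate_load(0).pattern, "xor reg, reg") == 0);
    assert(strcmp(pick_immediate_load(-1).pattern, "or reg, -1") == 0);
    assert(pick_immediate_load(100).bytes == 3);
    assert(pick_immediate_load(100000).bytes == 5);
}
```

Under those assumptions every special case is shorter than the 5-byte mov with a 32-bit immediate, which is where the saved bytes come from.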
Yesterday I downloaded MasmBasic and tried to read what the editor displays... I stopped before the end, this program really is rich.
Thanks ;)
-
For the moment it exists only in my mind, but I was thinking it would be useful to create a macro like this one (MASM-style sketch, untested):
MOV_CONST_TO_REG MACRO reg, value
IF value EQ 0
XOR reg, reg ; zero the register without encoding an immediate
ELSE
MOV reg, value
ENDIF
ENDM
-
INC/DEC instructions must not be used before LOOP or Jcc because they update the flags only partially.
ADD/SUB rewrite the whole set of arithmetic flags, so they are quicker.
-
Very old stuff, I use it in the source of the RichMasm editor:
movi MACRO M1, M2
LOCAL oa, num
num = M2
oa = (opattr num) and 127
if oa ne 36
echo <M2> is not an immediate
.err
endif
if type(M1) ne DWORD
% echo M1 is not a DWORD
.err
endif
ife num
if (opattr M1) eq atRegister
xor M1, M1
else
and M1, 0
endif
elseif (num le 127) and (num ge -128)
pushd M2
pop dword ptr M1
else
mov M1, M2
endif
ENDM
INC/DEC instructions must not be used before LOOP or Jcc because they update the flags only partially.
ADD/SUB rewrite the whole set of arithmetic flags, so they are quicker.
If you absolutely need the carry flag, use add/sub (but I never saw a need for that). Otherwise don't, because it needs 3 bytes instead of 1, and we have frequently confirmed with measurements that add/sub is not faster.
-
If you absolutely need the carry flag, use add/sub (but I never saw a need for that). Otherwise don't, because it needs 3 bytes instead of 1, and we have frequently confirmed with measurements that add/sub is not faster.
ADD RAX,1
JBE xxxxxx
Not possible using INC
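The reason is that JBE is taken when CF = 1 or ZF = 1, and INC does not write CF. A minimal C model of just those two flags, as a sketch only: the type `Flags` and the functions `model_add1`, `model_inc`, `jbe_taken` and `demo_jbe_after_inc` are hypothetical names, and the scenario with a stale carry is an assumed example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the two flags that JBE looks at. */
typedef struct {
    bool cf; /* carry flag */
    bool zf; /* zero flag  */
} Flags;

/* Model of ADD reg, 1: writes both CF and ZF. */
uint64_t model_add1(uint64_t reg, Flags *f)
{
    uint64_t result = reg + 1;
    f->cf = (result < reg); /* carry out of bit 63 */
    f->zf = (result == 0);
    return result;
}

/* Model of INC reg: writes ZF but leaves CF exactly as it was. */
uint64_t model_inc(uint64_t reg, Flags *f)
{
    uint64_t result = reg + 1;
    f->zf = (result == 0);
    return result;
}

/* JBE is taken when CF = 1 or ZF = 1. */
bool jbe_taken(const Flags *f)
{
    return f->cf || f->zf;
}

/* Assumed scenario: an earlier instruction left CF set. */
void demo_jbe_after_inc(void)
{
    Flags after_add = { .cf = true, .zf = false };
    Flags after_inc = { .cf = true, .zf = false };

    model_add1(5, &after_add);
    model_inc(5, &after_inc);

    assert(!jbe_taken(&after_add)); /* CF was rewritten by ADD */
    assert(jbe_taken(&after_inc));  /* stale CF survives INC   */
}
```

In this model, with an addend of exactly 1, ADD sets CF only when the result wraps to zero, which is also the only case where it sets ZF.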
-
3.5.1.1 Use of the INC and DEC Instructions
The INC and DEC instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
Assembly/Compiler Coding Rule 33. (M impact, H generality) INC and DEC instructions should be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore creating false dependencies on earlier instructions that set the flags.
From Intel: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf page 117
I was right too.
-
ADD RAX,1
JBE xxxxxx
Not possible using INC
1. Use JLE instead.
2. Put the inc some lines after instructions that modify flags.
3. MEASURE THE TIMINGS, everything else is theory, esoterics, hearsay.
-
JJ, you are right. In one place we read something and elsewhere we read the opposite! I say that because I read that it is no longer possible to measure timings because of the many cores... Don't use RD... use Time!
I would like to have something for measuring timings. If it comes from you or Vortex I accept it; from anyone else I refuse.
-
I would like to have something for measuring timings.
GetTickCount is OK if you choose the number of loops so high that the test takes a second or two (granularity is 16 milliseconds). Otherwise QPC (https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904%28v=vs.85%29.aspx). NanoTimer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1171) is based on QPC, and I use it all the time.
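The idea of raising the loop count until the test runs long enough for a coarse timer can be sketched in C. This is only an illustration under assumptions: the standard `clock()` stands in for GetTickCount or QPC so that the sketch stays portable, and `work_fn`, `sample_work`, `time_loops`, `calibrate_loops` and `seconds_per_call` are hypothetical names, not part of any library mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*work_fn)(void);

/* Dummy workload; volatile keeps the compiler from deleting it. */
volatile uint64_t sink;
void sample_work(void) { sink = sink + 1; }

/* Runs fn `loops` times and returns the time reported by clock(),
   converted to seconds. */
double time_loops(work_fn fn, uint64_t loops)
{
    clock_t start = clock();
    for (uint64_t i = 0; i < loops; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Doubles the loop count until one run lasts at least min_seconds,
   so the timer granularity becomes small relative to the total. */
uint64_t calibrate_loops(work_fn fn, double min_seconds)
{
    uint64_t loops = 1000;
    while (time_loops(fn, loops) < min_seconds)
        loops *= 2;
    return loops;
}

/* Average seconds per call, measured over a calibrated run. */
double seconds_per_call(work_fn fn, double min_seconds)
{
    uint64_t loops = calibrate_loops(fn, min_seconds);
    return time_loops(fn, loops) / (double)loops;
}
```

A call such as `seconds_per_call(sample_work, 1.0)` follows the "a second or two" advice; the result still includes the loop and call overhead, so it is best used to compare two variants measured the same way.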
-
Thank you, I will test it now.