Pelles C forum

Assembly language => Assembly discussions => Topic started by: Jokaste on October 28, 2017, 07:30:47 PM

Title: I suppose...
Post by: Jokaste on October 28, 2017, 07:30:47 PM: Tell me if I am right; the ghosts can answer too.

Code: [Select]
mov [hWndSysInput+rip],rcx
mov rdx,[hInstance+rip]
xor rax,rax
mov [rsp + 64],rcx
mov [rsp + 80],rdx
mov [rsp + 88],rax
mov [rsp + 72],rax
mov [rsp + 56],rax
mov [rsp + 48],rax
mov [rsp + 40],rax
mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NOSORTHEADER or LVS_SORTASCENDING or LVS_REPORT
mov r8,OFFSET szNullString
mov rdx,OFFSET WC_LISTVIEW
xor rcx,rcx
call CreateWindowExA

If I have understood Agner Fog and others correctly, this should execute like this:

Code: [Select]
mov [hWndSysInput+rip],rcx
mov rdx,[hInstance+rip]                ; NO PENALTY

xor rax,rax
mov [rsp + 64],rcx                     ; PENALTY because XOR is quicker than MOV

mov [rsp + 80],rdx
mov [rsp + 88],rax                     ; NO PENALTY

mov [rsp + 72],rax
mov [rsp + 56],rax                     ; NO PENALTY

mov [rsp + 48],rax
mov [rsp + 40],rax                     ; NO PENALTY

mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NOSORTHEADER    ; PENALTY because of the 1st instruction

mov r8,OFFSET szNullString
mov rdx,OFFSET WC_LISTVIEW             ; NO PENALTY

xor rcx,rcx                            ; PENALTY because the CALL cannot be executed in the same cycle as the XOR

call CreateWindowExA                   ; PENALTY

This could be improved:

Code: [Select]
mov [hWndSysInput+rip],rcx
mov rdx,[hInstance+rip]
mov [rsp + 64],rcx
mov [rsp + 80],rdx
xor rax,rax
xor rcx,rcx
mov r8,OFFSET szNullString
mov rdx,OFFSET WC_LISTVIEW
mov [rsp + 88],rax
mov [rsp + 72],rax
mov [rsp + 56],rax
mov [rsp + 48],rax
mov [rsp + 40],rax
mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NO..
xchg rax,rax                           ; (NOP)
call CreateWindowExA

And the final code is:

Code: [Select]
mov [hWndSysInput+rip],rcx
mov rdx,[hInstance+rip]
mov [rsp + 64],rcx
mov [rsp + 80],rdx
xor rax,rax
xor rcx,rcx
mov r8,OFFSET szNullString
mov rdx,OFFSET WC_LISTVIEW
mov [rsp + 88],rax
mov [rsp + 72],rax
mov [rsp + 56],rax
mov [rsp + 48],rax
mov [rsp + 40],rax
mov [rsp + 32],rax
mov r9,WS_CHILD or LVS_NOSORTHEADER or LVS_SORTASCENDING or LVS_REPORT
xchg rax,rax                           ; (NOP)
call CreateWindowExA

Please give me advice on this kind of parallel code.
Thanks

Jokaste / Grincheux
Title: Re: I suppose...
Post by: jj2007 on October 29, 2017, 12:45:41 PM: Quote from: Jokaste on October 28, 2017, 07:30:47 PM
xor rax,rax mov [rsp + 64],rcx PENALTY because XOR is quicker than MOV

Launch a debugger and try this:
Code: [Select]
int 3
or rax, -1
xor eax, eax
or rax, -1
xor rax, rax

No idea whether xor eax, eax is any faster or slower than xor rax, rax, but it is one byte shorter and does the same.
Title: Re: I suppose...
Post by: Jokaste on October 29, 2017, 03:46:51 PM: Thanks, that's the kind of code I am looking for.
I have found that constants are not good for an assembly program: if a constant has the value 0, the assembler encodes it the same way as if it were greater than 0.
MOV RAX,SW_HIDE = MOV RAX,0 rather than XOR RAX,RAX
I would like to create a PDF with all the tricks experienced programmers have.
Yesterday I downloaded MasmBasic and tried to read what the editor displays... I stopped before the end; this program really is rich.
Title: Re: I suppose...
Post by: Jokaste on October 29, 2017, 04:48:17 PM: Code: [Select]
MOV RAX,00BADF00DBADCAFEh ;RAX=00BADF00DBADCAFE
OR EAX,-1                 ;RAX=00000000FFFFFFFF
MOV RAX,0BADF00DBADCAFEh  ;RAX=00BADF00DBADCAFE
OR RAX,-1                 ;RAX=FFFFFFFFFFFFFFFF
Title: Re: I suppose...
Post by: jj2007 on October 29, 2017, 05:28:27 PM: Quote from: Jokaste on October 29, 2017, 03:46:51 PM
I have found that constants are not good for an assembly program: if a constant has the value 0, the assembler encodes it the same way as if it were greater than 0.
MOV RAX,SW_HIDE = MOV RAX,0 rather than XOR RAX,RAX

Sometimes I use conditional assembly to decide which instruction to take:
Code: [Select]
if (IMAGE_ICON-IMAGE_BITMAP) eq 1
    inc edi
else
    add edi, IMAGE_ICON-IMAGE_BITMAP
endif

In this case, it is not really necessary because IMAGE_ICON and IMAGE_BITMAP are Windows constants that will never change IMHO, so it will be inc edi, always. But in this excerpt from the For_ ... Next macro, it saves a few bytes (MasmBasic.inc, lines 9719ff):
Code: [Select]
ifdifi tmpReg, vtS$
    if atVt eq atImmediate
        ife vtS$
            xor tmpReg, tmpReg
        elseif vtS$ eq -1
            or tmpReg, -1
        elseif (vtS$ le 127) and (vtS$ ge -128)
            push vtS$
            pop tmpReg
        else
            mov tmpReg, vtS$
        endif
    else
        mov tmpReg, vtS$
    endif
endif

Quote
Yesterday I downloaded MasmBasic and tried to read what the editor displays... I stopped before the end; this program really is rich.

Thanks ;)
Title: Re: I suppose...
Post by: Jokaste on October 29, 2017, 08:07:38 PM: For the moment it is only in my mind, but I was thinking it would be useful to create a macro like this one (pseudocode):
Quote
MOV_CONST_TO_REG MACRO \1 \2
IF \2 == 0
XOR \1,\1
ELSE
MOV \1,\2
ENDIF
ENDM
Title: Re: I suppose...
Post by: Jokaste on October 29, 2017, 08:11:15 PM: INC/DEC instructions should not be used before LOOP or Jcc because they only partially update the flags (they leave CF unchanged).
ADD/SUB update all the arithmetic flags, which is quicker.
Title: Re: I suppose...
Post by: jj2007 on October 30, 2017, 09:00:40 AM: Very old stuff, I use it in the source of the RichMasm editor:
Code: [Select]
movi MACRO M1, M2
  LOCAL oa, num
  num = M2
  oa = (opattr num) and 127
  if oa ne 36
    echo <M2> is not an immediate
    .err
  endif
  if type(M1) ne DWORD
    % echo M1 is not a DWORD
    .err
  endif
  ife num
    if (opattr M1) eq atRegister
      xor M1, M1
    else
      and M1, 0
    endif
  elseif (num le 127) and (num ge -128)
    pushd M2
    pop dword ptr M1
  else
    mov M1, M2
  endif
ENDM

Quote from: Jokaste on October 29, 2017, 08:11:15 PM
INC/DEC instructions should not be used before LOOP or Jcc because they only partially update the flags (they leave CF unchanged).
ADD/SUB update all the arithmetic flags, which is quicker.

If you absolutely need the carry flag, use add/sub (but I never saw a need for that). Otherwise don't, because it needs 3 bytes instead of 1, and we have frequently confirmed with measurements that add/sub is not faster.
Title: Re: I suppose...
Post by: Jokaste on October 30, 2017, 12:22:22 PM: Quote

If you absolutely need the carry flag, use add/sub (but I never saw a need for that). Otherwise don't, because it needs 3 bytes instead of 1, and we have frequently confirmed with measurements that add/sub is not faster.

ADD RAX,1
JBE xxxxxx

Not possible using INC
Title: Re: I suppose...
Post by: Jokaste on October 30, 2017, 07:49:59 PM: Quote
3.5.1.1 Use of the INC and DEC Instructions

The INC and DEC instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.

Assembly/Compiler Coding Rule 33. (M impact, H generality) INC and DEC instructions should be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore creating false dependencies on earlier instructions that set the flags.

From Intel: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf page 117

I was right too.
Title: Re: I suppose...
Post by: jj2007 on November 01, 2017, 05:55:19 AM: Quote from: Jokaste on October 30, 2017, 12:22:22 PM

ADD RAX,1
JBE xxxxxx

Not possible using INC

1. Use JLE instead.
2. Put the inc some lines after instructions that modify flags.
3. MEASURE THE TIMINGS, everything else is theory, esoterics, hearsay.
Title: Re: I suppose...
Post by: Jokaste on November 01, 2017, 05:43:54 PM: JJ, you are right. In one place we read something and elsewhere we read the opposite! I say that because I read that it is no longer possible to measure timings because of the many cores... Don't use RD... use Time!
I would like to have something for measuring timings. If it comes from you or Vortex I accept; from anyone else I refuse.
Title: Re: I suppose...
Post by: jj2007 on November 02, 2017, 12:12:25 PM: Quote from: Jokaste on November 01, 2017, 05:43:54 PM
I would like to have something for measuring timings.

GetTickCount is OK if you choose the number of loops so high that the test takes a second or two (granularity is 16 milliseconds). Otherwise QPC (https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904%28v=vs.85%29.aspx). NanoTimer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1171) is based on QPC, and I use it all the time.
Title: Re: I suppose...
Post by: Jokaste on November 05, 2017, 07:27:08 AM: Thank you, I will test it now.