C declaration explainer


Jokaste:
When do you make the same program for the assembler?

frankie:

--- Quote from: Jokaste on October 28, 2017, 05:11:26 PM ---When do you make the same program for the assembler?

--- End quote ---
I'm working on it.
A compiler is a complicated thing. I can't even think about producing assembler output before the preprocessor is working, and that is what I'm dealing with now.
As soon as I have something, I'll publish it. :)
Thanks for trusting me. ;)

Jokaste:
It can help you...

--- Quote ---
Sched
=====
Sched takes specially commented source code and reoutputs the code in an order
that can be more optimally executed by an in-order but superscalar or
pipelined CPU.
Why would I do this?
====================
Most modern CPUs of relevance have out-of-order execution engines, that will
automatically schedule instructions to extract better parallelism anyways. So
what is the point?
1) Well, as a side effect, sched exposes the critical path dependencies in a
kind of obvious way. This will allow you to see where you need to apply
code tweaks to improve your performance as opposed to throwing out red
herrings at you.
2) Different OOE architectures vary in quality. Certainly none of them that I
have encountered can perform perfect reordering. If you do the reordering
yourself, then there is no issue about the quality of OOE engine for your
CPU.
3) Itanium and Sparc are in-order processors. So are a lot of really
low-end processors and DSPs. So this is still useful for some platforms.
Ok how do I make it work?
=========================
Each line of code in the input corresponds to "one cycle" (yes, this is
dramatically simplified from reality, but it works for integer code on most
CPUs.) In each line of your code you need a tail comment which describes the
resources that are read and written. Then simply feed your input to sched
and it will output the code reordered.
The tail comment includes a resource description in the following format:
@ wr(...), rd(...)
The ellipses (...) are replaced by the sequence of comma separated resources
either written or read, respectively. For example:
mov eax, ebx ; @ wr(eax), rd(ebx)
add eax, ecx ; @ wr(eax), rd(eax,ecx)
shl eax, 2 ; @ wr(eax), rd(eax)
mov edx, ebx ; @ wr(edx), rd(ebx)
add edx, ecx ; @ wr(edx), rd(edx,ecx)
shl edx, 2 ; @ wr(edx), rd(edx)
After being run through sched, this gives the following:
mov eax, ebx ; @ wr(eax), rd(ebx)
mov edx, ebx ; @ wr(edx), rd(ebx)
; /* cycle 1 */
add eax, ecx ; @ wr(eax), rd(eax,ecx)
add edx, ecx ; @ wr(edx), rd(edx,ecx)
; /* cycle 2 */
shl eax, 2 ; @ wr(eax), rd(eax)
shl edx, 2 ; @ wr(edx), rd(edx)
; /* cycle 3 */
Note that the "resources" are just arbitrary (case sensitive) strings, and no
interpretation of the source code is done. So obviously you could write some
C code, and assume a 3-operand CPU model:
_t0 = a + b; /* @ wr(_t0), rd(a,b) */
_t1 = c + d; /* @ wr(_t1), rd(c,d) */
a = _t0 + _t1; /* @ wr(a), rd(_t1,_t0) */
_t2 = a | b; /* @ wr(_t2), rd(a,b) */
_t3 = c | d; /* @ wr(_t3), rd(c,d) */
b = _t3 | _t2; /* @ wr(b), rd(_t3,_t2) */
_t4 = a ^ b; /* @ wr(_t4), rd(a,b) */
_t5 = c ^ d; /* @ wr(_t5), rd(c,d) */
c = _t5 ^ _t4; /* @ wr(c), rd(_t5,_t4) */
which outputs:
_t0 = a + b; /* @ wr(_t0), rd(a,b) */
_t1 = c + d; /* @ wr(_t1), rd(c,d) */
_t3 = c | d; /* @ wr(_t3), rd(c,d) */
_t5 = c ^ d; /* @ wr(_t5), rd(c,d) */
; /* cycle 1 */
a = _t0 + _t1; /* @ wr(a), rd(_t1,_t0) */
; /* cycle 2 */
_t2 = a | b; /* @ wr(_t2), rd(a,b) */
; /* cycle 3 */
b = _t3 | _t2; /* @ wr(b), rd(_t3,_t2) */
; /* cycle 4 */
_t4 = a ^ b; /* @ wr(_t4), rd(a,b) */
; /* cycle 5 */
c = _t5 ^ _t4; /* @ wr(c), rd(_t5,_t4) */
; /* cycle 6 */
Woohoo!
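
The tail comment format described above can be read with a few lines of C. What follows is only a minimal sketch and not the code sched itself uses: the function name parse_resources, the MAX_NAME limit and the choice of the last '@' on the line as the marker are all assumptions made for illustration. It treats resource names as opaque, case sensitive strings, and it accepts optional blanks between the tag and the opening parenthesis.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 32   /* hypothetical limit on a resource name, including NUL */

/* Extract the comma separated names inside tag(...) from the tail comment
 * of `line`, i.e. the text after the last '@'. `tag` is "wr" or "rd".
 * Returns the number of names stored in out[], or 0 if the line carries
 * no such annotation. Names longer than MAX_NAME - 1 are truncated. */
int parse_resources(const char *line, const char *tag,
                    char out[][MAX_NAME], int max_out)
{
    const char *p = strrchr(line, '@');   /* assumed tail comment marker */
    size_t taglen = strlen(tag);
    int count = 0;

    assert(taglen > 0);
    if (p == NULL)
        return 0;

    /* Look for the tag followed by optional blanks and '('. */
    for (p = p + 1; *p != '\0'; p++) {
        if (strncmp(p, tag, taglen) == 0) {
            const char *q = p + taglen;
            while (*q == ' ' || *q == '\t')
                q++;
            if (*q == '(') {
                p = q + 1;                /* first character of the list */
                break;
            }
        }
    }

    /* Split the list on commas until the closing ')'. */
    while (*p != '\0' && *p != ')' && count < max_out) {
        size_t n = 0;
        while (*p == ' ' || *p == '\t' || *p == ',')
            p++;
        while (*p != '\0' && *p != ',' && *p != ')' &&
               *p != ' ' && *p != '\t') {
            if (n < MAX_NAME - 1)
                out[count][n++] = *p;
            p++;
        }
        if (n > 0) {
            out[count][n] = '\0';
            count++;
        }
    }
    return count;
}
```

Called with the line "add eax, ecx ; @ wr(eax), rd(eax,ecx)" and the tag "rd", this stores the two names eax and ecx. Since no interpretation of the source code is done, the same call works unchanged on the C style lines with /* ... */ comments.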
This looks like a hack I wrote in a couple hours
================================================
It is. But it uses modules that I didn't have in my toolchest until
relatively recently. I was asked by someone at a CPU vendor company to build
such a tool for an old CPU whose OOE capabilities were relatively modest. The
problem is that the C language by itself is piss poor at string and ADT
handling using its native libraries, and at the time I didn't quite
understand that solving the most general problem helps solve the specific
problem.
Over time I have been writing simple components to solve standard problems in
C with the same ease in which I can solve problems in higher level languages.
I have also seen enough CPU architectures now, that it has dawned on me that
assuming you have an infinite number of pipelines allows you to see all the
parallelism inherent in the software source itself.
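
The infinite-pipeline idea can be sketched in C as follows. This is an illustration under stated assumptions, not the algorithm sched actually implements: the struct insn layout, the MAX_RES limit and the function names are hypothetical, and the dependency rules are a guess that happens to reproduce the two sample outputs shown earlier. A line is placed strictly after any earlier line whose written resource it reads or writes again, and no earlier than any earlier line that reads a resource it overwrites.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RES 4   /* hypothetical cap on resources per list */

struct insn {
    const char *text;           /* source line, kept only for printing */
    const char *wr[MAX_RES];    /* resources written, NULL padded */
    const char *rd[MAX_RES];    /* resources read, NULL padded */
};

/* Nonzero if the two name lists share a (case sensitive) name. */
static int overlaps(const char *const *a, const char *const *b)
{
    for (int i = 0; i < MAX_RES && a[i] != NULL; i++)
        for (int j = 0; j < MAX_RES && b[j] != NULL; j++)
            if (strcmp(a[i], b[j]) == 0)
                return 1;
    return 0;
}

/* Give each line the earliest 1-based cycle its dependencies on earlier
 * lines allow, assuming unlimited pipelines. Returns the cycle count. */
int schedule(const struct insn *code, int n, int *cycle)
{
    int last = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int c = 1;
        for (int j = 0; j < i; j++) {
            if (overlaps(code[j].wr, code[i].rd) ||
                overlaps(code[j].wr, code[i].wr)) {
                /* read-after-write or write-after-write: strictly later */
                if (c < cycle[j] + 1)
                    c = cycle[j] + 1;
            } else if (overlaps(code[j].rd, code[i].wr)) {
                /* write-after-read: same cycle is fine, earlier is not */
                if (c < cycle[j])
                    c = cycle[j];
            }
        }
        cycle[i] = c;
        if (c > last)
            last = c;
    }
    return last;
}

/* The six-line x86 sample from the text, with its annotations. */
const struct insn example[6] = {
    { "mov eax, ebx", { "eax" }, { "ebx" } },
    { "add eax, ecx", { "eax" }, { "eax", "ecx" } },
    { "shl eax, 2",   { "eax" }, { "eax" } },
    { "mov edx, ebx", { "edx" }, { "ebx" } },
    { "add edx, ecx", { "edx" }, { "edx", "ecx" } },
    { "shl edx, 2",   { "edx" }, { "edx" } },
};
```

Calling schedule(example, 6, cycle) assigns cycles 1, 2, 3, 1, 2, 3 to the six lines, so printing the lines grouped by cycle gives the three groups shown in the first sample output. The number of cycles returned is the length of the critical path, which is the figure that tells you where a code tweak can actually help.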
It should also be said, that to solve the problem exactly for a finite
pipelined machine (with other strange restrictions like slotting or random
scheduling when faced with too much parallelism, as a couple x86 CPU
micro-architectures are limited by) is extremely difficult, as it requires
a full search.
This is the result, now that I know the right way to do it. :)
Is this free?
=============
Well, I am covering it with the BSD licence. So it's very free, but not quite
public domain -- just don't go taking credit for my work.
--
Paul Hsieh

--- End quote ---

frankie:

--- Quote from: Jokaste on October 28, 2017, 07:05:49 PM ---It can help you...

--- End quote ---
I'll have a look.
Thanks
