Tuesday, April 13, 2010

06 - Analysis

The source code generated by GCC 3.3.5 perfectly reveals the greatest weaknesses of the C conventions. The first step to slow down C code markedly is the abuse of EBP as base pointer, reducing our set of available registers to six: EAX, EBX, ECX, EDX, EDI and ESI. One of these remaining registers, EAX, is used to pass results or error codes from a called function to the calling function, reducing our register set to five. By default, ECX and EDX are neither saved nor restored, so every time we call another function their content is sent to the great formatter, or, in other words: if we stored frequently used parameters in ECX or EDX prior to the call, they will probably have been overwritten by the called function. Due to these counterproductive conventions, only three registers are left to store our frequently used parameters. If four or more parameters are required throughout a function, parameters 4 and up must be reloaded whenever we call another function, because the content of EAX, ECX and EDX is changed by the called function. Reloading parameters from the slower memory subsystem, instead of keeping them in registers where they can be processed much faster, is quite time consuming. It is not about those five additional clock cycles we waste with reloading a parameter over and over again. The most negative side-effect is the immediate interruption of parallel execution, forcing two execution pipes to idle until the required parameter has been reloaded, again. It is one of the reasons why C and C++ applications are so slow.
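
The effect can be sketched as follows. This is a hand-written fragment, not actual compiler output; the helper function and the parameter layout are assumed purely for illustration. Under cdecl, the fourth parameter lives at 20(%ebp), and because EAX, ECX and EDX are trashed by every call, it has to be fetched from the stack again and again:

    movl    20(%ebp), %eax    # load the fourth parameter from the stack
    pushl   %eax              # pass it to the (hypothetical) helper
    call    _helper           # EAX, ECX and EDX may be trashed by the call
    addl    $4, %esp          # cdecl: the caller removes the argument
    movl    20(%ebp), %eax    # the same parameter has to be reloaded afterwards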

An exemplary sequence is the following code snippet taken from GCC's output. It demonstrates how a strict implementation of the C conventions slows down the execution of standard C and C++ code:

    ...
    movl    _GVAR, %eax
    movl    7252(%eax), %eax
    pushl    %eax
    ...

This sequence turns any parallel execution off. Instead of executing up to three instructions at once in the three available execution pipes, the code listed above is executed strictly sequentially, or, in other words: two execution pipes have to wait until the previous instruction has been executed and its result is available. If all data are present in the L1 cache, it takes (approximately) nine clock cycles to execute the three lines listed above. Two execution pipes are idling while _GVAR (_GVAR is BNR, defined as an array of dwords to satisfy GCC) is loaded into EAX. Next, two pipes are idling while 1C54[BNR] (7252 in decimal) is loaded into EAX. Finally, EAX is pushed onto the stack, blocking ESP for two clock cycles. If some data have not been loaded into the L1 cache yet, execution time is extended markedly: reading data from the L2 cache costs about six clock cycles, while loads from main memory consume about 27 clock cycles.

In the end, the primary intention behind leaving ECX and EDX unpreserved, namely saving the 14 clock cycles it takes to push and pop them in every function's prologue and epilogue, turns out to be a bad idea. In the best case, the assumed advantage is eaten up after the second reload of a frequently used parameter. In general, there is no advantage at all: the saved 14 clock cycles are wasted with the first reload from main memory (the small sketch below puts this trade-off into instructions). This is the reason why conventional software creeps through the execution pipes of John Doe's hypermodern Gigantium processor like thick dough. Perhaps it would be a good idea to purify the sap called software, giving John Doe the chance to at least partially enjoy the overwhelming computational powers of his expensive Gigantium machine?
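
The following hand-written sketch contrasts the two alternatives; _callee is an invented name and the fragment is not compiler output. The upper part is what the convention saves, the lower part is what it provokes instead:

    # what the convention saves: ECX and EDX preserved in the callee
    _callee:
        pushl   %ecx
        pushl   %edx
        # ... function body ...
        popl    %edx
        popl    %ecx              # roughly the 14 clock cycles mentioned above
        ret

    # what the convention provokes instead: the caller reloads after every call
        movl    _GVAR, %eax       # fetch the base address again ...
        movl    7252(%eax), %eax  # ... and the frequently used value
        pushl   %eax
        call    _callee
        addl    $4, %esp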

Another obstacle to unleashing the power of recent processors is the floating stack pointer ESP. Modern software design should use the capabilities of modern processors instead of blocking them with outdated mechanisms and ancient techniques. A pointlessly floating stack pointer with ever-changing content cannot seriously be used to address stack elements, as the small fragment below illustrates. How to solve these problems elegantly is the topic of the following post.
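
The fragment is hand-written and the offsets are invented purely for illustration: every push or pop changes ESP, so the very same stack element must be addressed with a different displacement a few instructions later.

    movl    8(%esp), %eax     # some stack element, currently at 8(%esp)
    pushl   %ecx              # ESP is decremented by 4 ...
    movl    12(%esp), %eax    # ... so the same element is now at 12(%esp)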

