Tuesday, April 13, 2010

11 - Rules

Whenever we want to build a greater whole out of many independent parts created by individual contributors, we have to use a common set of rules. Anyone contributing a part to the whole has to obey these rules. If all contributors applied their own rules, we finally ended up with a lot of parts, but none of them were able to work together with any other part. Using different interfaces, single parts either could not communicate at all or misunderstood received messages completely. Therefore, we need a set of rules before we start to create parts of a greater whole. This set of rules, like a common language, defines a common interface, forcing any contributor to use standardised communication protocols, so other parts are able to understand all sent messages and do the right work. The big idea behind any set of rules lives or dies with strict obedience.


Rule 1

The instructions enter, leave, push and pop are archaic remains of the past era. They must not be used in Intelligent Design code.

Any of these instructions change the content of the stack pointer automatically. All Intelligent Design functions depend on one principle: The stack pointer must not change its content for their entire runtime. We either can enjoy the advantages of a static stack pointer or we can waste precious time with obligatory corrections and repetitive writes of one and the same parameter. Using a 'flying' stack pointer and its urgently required compagnion called base pointer is mutual exclusive with Intelligent Design. Both concepts use the stack pointer in a very different way - there is no chance to use a mix of both concepts in one and the same function.


Rule 2

Stack frames are created by subtracting 0x3C in 32 bit functions or 0x38 in 64 bit functions from the stack pointer ESP/RSP. If larger stack frames are required, the basic value 0x3C or0x38 can be expanded in 64 byte steps (0x7C, 0xBC, 0xFC, ... in 32 bit functions, 0x78, 0xB8, 0xF8, ... in 64 bit functions).

This rule is valid for IDEOS, only. Any other operating system will not benefit from the advantages of a properly aligned stack. Because the stack generally isn't aligned in old fashioned operating systems, you have to align it with some lines of redundant bloat if you use XMM registers.

If you write IDEOS code, you have to apply this rule. The only exception is the rare case that you do not use other registers than rAX and XMM0 through XMM3. If there are registers to preserve, you need a stack frame. The advantage of this rule is a properly aligned stack pointer, starting at the beginning of a cache line. IDEOS's concept guarantees that the stack pointer is naturally aligned to a multiple of 64 before any function is called. Because the call instruction stores EIP/RIP in the next stack element, subtracting the correct value automatically aligns the stack pointer to a multiple of 64 and the corresponding cache line is preloaded into L1 cache without a single line of additional code.


Rule 3

All registers used in a function must be saved before they are overwritten and must be restored before returning to the caller.

Any software running on modern processors is most efficient if it can access as much hardware resources as possible. The most valuable resources of a x86 processor are its registers. The more frequently used parameters are held in registers, the less time is required to perform a given task. If we reload parameters with slow memory reads rather than to keep them in registers, we slow down our code with superfluous operations and interrupt simultaneous execution completely. Preloading all required parameters into registers in one gulp reduces all dependencies, supporting parallel execution, keeping all three pipes busy all of the time.

Saving precious clock cycles with much faster register operations outweighs the cost of the few clock cycles needed to save and restore all used registers by far. Many functions work with a subset of all available registers. The less registers we use, the faster they are saved and restored. Another welcome side-effekt is based on the fact that continuous writes to ascending addresses trigger write combining, where up to 16 doublewords are written to a cache line in one gulp. This surely is much faster than 16 single writes. Moreover, pairing of multiple reads or writes is a good opportunity to support simultaneous execition flow as long as possible.


Rule 4

The registers RAX, XMM0, XMM1, XMM2 and XMM3 are reserved for special purposes - they neither are saved nor restored by default.

This rule partially overrides rule 3:

Per definitionem, Register RAX is used to pass return values or errorcodes to the calling function. It does not make sense to save a register that is overwritten in all epilogues by default. If a function declaration defines its return value as VOID, rAX must be cleared with a simple xor %rAX,%rAX on exit.

Registers XMM0 through XMM3 with their size of 4 * 128 bit = 64 byte cover an entire cache line. They are used to clear, move or manipulate large data areas in steps of 64 byte, preferably aligned to a multiple of 64. In general, these operations are executed in small loops where no other functions are called, so we can skip to save and restore these registers per se. This saves a lot of clock cycles if your application makes heavy use of such operations. If your function calls others while manipulating large memory blocks, the called function might use XMM0 through XMM3 as temporary storage, overwriting whatever was stored in these registers. Hence, it is recommended to use XMM4 through XMM15 if preloaded values must not change while other functions are called. As stipulated by rule 3, you have to save and restore
 XMM4 through XMM15 if you use them.


Rule 5

Accessing registers is much faster than accessing memory locations. The most frequently used parameters in a function should be preloaded into registers as soon as these registers were saved on the stack.

Even if old-fashioned programmers aren't used to work with multiple execution pipes: Obeying to this rule can turn an ox cart into a rocket (see Rule 3).


Rule 6

Modern processors execute three (LETNi: 1.5) instructions simultaneously. Good code should make use of these capabilities rather than to ignore them. Dependency chains interrupt simultaneous execution. While one instruction is executed in one pipe, the other two pipes are waiting for the result. Sending two of three pipes to sleep for some clocks wastes two thirds of the available resources.

This extraordinary product of a classical compiler

        ...
    pushl    %esi
    pushl    $0
    pushl    $2
    pushl    $0
    pushl    $11
    movl     _GVAR, %eax
    movl     7252(%eax), %eax
    pushl    %eax
    call     _FDacc
    movl     _GVAR, %eax
    addl     $24, %esp
    movl     472(%eax), %ebx
    pushl    %ebx
    pushl    $0
    pushl    $2
    pushl    $4
    pushl    $11
    movl     7252(%eax), %ecx
    pushl    %ecx
    call     _FDacc
    addl     $20, %esp
    movl     _GVAR, %eax
    movl     $0, 10464(%eax)
        ...

can be improved with a few useful changes

        ...
        movl _GVAR,%esi    # at least 3 clocks ahead
        ...
        ...
        ...
        movl 0x1C54(%esi),%eax
        movl 0x01D8(%esi),%ecx
        movl $0x00,0x28E0(%esi)
        movl %eax,0x00(%esp)
        movl $0x0B,0x04(%esp)
        movl $0x00,0x08(%esp)
        movl $0x02,0x0C(%esp)
        movl $0x00,0x10(%esp)
        movl %edi,0x14(%esp)
        call _FDacc
        movl $0x04,0x08(%esp)
        movl %ecx,0x14(%esp)
        call _FDacc
        ...

to run at least twice as fast, now. Removing all superfluous instructions, code size could be reduced from 23 to 14 instructions. Placing all instructions in the proper order removes dependencies and keeps all pipes busy without delays.


Rule 7

Jump tables belong to the data segment. AS is able to build jump tables in the data segment, so there's absolutely no reason to pollute the code segment with data of any kind.

No comment is required for this rule. Just do it!




No comments:

Post a Comment