Tuesday, April 13, 2010

04 - Caveats

As mentioned before, conventions, methods and techniques introduced with the programming language C have a lot of advantages over monolithic code we used to write applications for DOS and other ancient operating systems. All functions easily are portable to other operating systems or processor architectures and can be used multiple times, leading to versatile applications for multiple platforms. Unfortunately, after those C conventions were established, they never were updated or revised to keep pace with the development of processor architectures. If we compare an 8086 against a recent quad-core Athlon, a picturesque comparison were a cart dragged by a tired ox versus an Airbus A380. While no person with sane mind ever wasted one thought about equipping an A380 with a harness and motivate it to lift off with reins, whip and loud 'hee' and 'ho' shouts, software designers all over the world practise such weird things every day. They still create 'flying' stack frames, use slow leave, pop and push instructions with the obligatory update of the stack pointer instead of the faster mov and continue to abuse valuable resources by using EBP as base pointer. This reduces the too small register set of the x86 architecture by 1/7th, forcing the programmer to use slow memory reads and writes instead of much faster register operations.

To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. DLGtxt() sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.

    .text
    .p2align 4,,15

.globl MoveDlg
MoveDlg:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %esi
    pushl   %ebx
    movl    12(%ebp), %edx
    movl    8(%ebp), %esi
    movl    16(%ebp), %ecx
    movl    20(%ebp), %ebx
    cmpl    $48, %edx
    je    L12
    ja    L39
    cmpl    $32, %edx
    je    L4

L37:movl    %ebx, 20(%ebp)
    movl    %ecx, 16(%ebp)
    movl    %edx, 12(%ebp)
    movl    %esi, 8(%ebp)
    leal    -8(%ebp), %esp
    popl    %ebx
    popl    %esi
    popl    %ebp
    jmp    _DefDP

Compared against GCC 2.6.1. (1993), GCC 3.3.5. (2006) generates much worse code, spiced with a lot of counterproductive, superfluous instructions. Writing back parameters to the stack violates some basic rules. First, there's absolutely no need to write data back to the locations we took them from some instructions before. Secondly, writes to stack locations above ESP violate all rules of proper programming. If any function starts to write data to the stack frames of other functions, it is just a question of time until we encounter desastrous malfunctions.

    .p2align 4,,7

 L4:movl    %ecx, %eax
    andl    $65535, %eax
    cmpl    $4658, %eax
    je    L7
    jg    L11
    cmpl    $4657, %eax
    je    L6

Because we only need the low word stored in message parameter 1, it was a good idea to extract this lower word with the instruction movzwl 0x10(%ebp),%ecx rather than to waste valuable clock cycles with the code sequence shown above. Recent processors have three, not just one execution pipe(s). The choosen way sends two execution pipes to sleep while one pipe is busy to extract data from a register. This is repeated two times. We could switch off two execution pipes while this code is executed, because we created two avoidable dependencies.

 L9:movl    %ebx, 20(%ebp)
    movl    %ecx, 16(%ebp)
    movl    %edx, 12(%ebp)
    movl    %esi, 8(%ebp)
    leal    -8(%ebp), %esp
    popl    %ebx
    popl    %esi
    popl    %ebp
    jmp    _DefDP

Obviously, L37 and L9 provide identical code. Using our brain, these eighteen redundant (partially pointless) lines can be reduced to five really necessary instructions.

 L6:movl    _GVAR, %eax        # [1]
    subl    $12, %esp
    movl    $0, 10464(%eax)

L42:pushl    %esi
    call    _WinDD

L41:addl    $16, %esp

    .p2align 4,,7

Up to seven nops are executed every time we branch to this part of code. It is a good way to slow down execution flow as much as possible.

 L2:leal    -8(%ebp), %esp
    xorl    %eax, %eax
    popl    %ebx
    popl    %esi
    popl    %ebp
    ret

L11:cmpl    $4659, %eax
    jne    L9
    subl    $12, %esp
    pushl    $17
    call    _Help
    jmp    L41

 L7:pushl    %edx
    pushl    %edx
    pushl    $18
    movl    _GVAR, %eax        # [1]
    addl    $10464, %eax
    pushl    %eax
    call    _FlgS
    popl    %eax
    jmp    L42

The branch prediction logic assumes every first branch as false if there is no entry in its internal table. Almost all of the above branches trigger the obligatory ten penalty cycles, because the wrong branch instructions were chosen.

    .p2align 4,,7

L39:cmpl    $59, %edx
    jne    L37
    pushl   $233
    pushl   $211
    pushl   $210
    pushl   %esi
    call    _DLGtxt
    movl    $0, (%esp)
    pushl   $-1
    pushl   $288
    pushl   $4672
    pushl   %esi
    call    _SnDIM
    addl    $20, %esp
    pushl   $0
    pushl   $-1
    pushl   $288
    pushl   $4680
    pushl   %esi
    call    _SnDIM
    addl    $20, %esp
    pushl   $0
    pushl   $-1
    pushl   $288
    pushl   $4688
    pushl   %esi
    call    _SnDIM
    addl    $20, %esp
    pushl   $0
    pushl   $-1
    pushl   $288
    pushl   $4696
    pushl   %esi
    call    _SnDIM
    addl    $24, %esp

Only the second parameter changes for the four consecutive calls of SnDIM(). Twelve of these twenty push instructions (60 percent) are redundant.

    pushl    %esi
    pushl    $0
    pushl    $2
    pushl    $0
    pushl    $11
    movl     _GVAR, %eax        # [1]
    movl     7252(%eax), %eax
    pushl    %eax
    call     _FDacc
    movl     _GVAR, %eax        # [1]
    addl     $24, %esp
    movl     472(%eax), %ebx
    pushl    %ebx
    pushl    $0
    pushl    $2
    pushl    $4
    pushl    $11
    movl     7252(%eax), %ecx
    pushl    %ecx
    call    _FDacc

Only parameters 3 and 6 change for the both calls of FDacc(). While we are bound to push instructions, there is no way to change just these parameters. We have to push all six parameters, again, because the parameter on top must be changed.

    addl    $20, %esp
    movl    _GVAR, %eax        # [1]
    movl    $0, 10464(%eax)
    pushl   %esi
    call    _CtrWn
    movl    %esi, (%esp)
    call    _DlgShow
    jmp    L41

    .p2align 4,,7

L12:movl    %ecx, %eax
    andl    $65535, %eax
    subl    $4672, %eax
    cmpl    $30, %eax
    ja    L37
    jmp    *L36(,%eax,4)

Again, it were better to extract the low word via one movzwl 0x10(%ebp),%ecx rather than to use the chosen way. Probably, parallel execution was considered to be too fast?

    .p2align 2
    .align 2,0xcc

I don't know what the second .align is good for. Any hints? My jump table is too large, because C programmers do not think about side effects like blowing up code while assigning resource IDs to 'straight' numbers. Due to the gaps between those IDs, there are a lot of superfluous entries in this jump table.

L36:.long    L14
    .long    L15
    .long    L16
    .long    L17
    .long    L2
    .long    L37
    .long    L37
    .long    L37
    .long    L18
    .long    L19
    .long    L37
    .long    L37
    .long    L37
    .long    L37
    .long    L37
    .long    L37
    .long    L21
    .long    L23
    .long    L25
    .long    L27
    .long    L29
    .long    L31
    .long    L33
    .long    L37
    .long    L21
    .long    L23
    .long    L25
    .long    L27
    .long    L29
    .long    L31
    .long    L33

By the way: Jump tables belong to the .data, not to the .code segment. AS supports jump tables in the .data segment, so there's no need to violate the recommendations of AMD and LETNi as all versions of GCC do...

L14:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andb    $225, %ah
    orb     $16, %ah

    .p2align 4,,7

L40:movl    %eax, 10464(%edx)
    jmp    L2

L15:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andb    $225, %ah
    orb     $8, %ah
    jmp    L40

L16:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andb    $225, %ah
    orb     $4, %ah
    jmp    L40

L17:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andb    $225, %ah
    orb     $2, %ah
    jmp    L40

L18:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-385, %eax
    orb     $1, %ah
    jmp    L40

L19:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-385, %eax
    orb     $-128, %al
    jmp    L40

L21:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-128, %eax
    orl     $64, %eax
    jmp    L40

L23:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-384, %eax
    orl     $32, %eax
    jmp    L40

L25:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-384, %eax
    orl     $16, %eax
    jmp    L40

L27:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-384, %eax
    orl     $8, %eax
    jmp    L40

L29:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-384, %eax
    orl     $4, %eax
    jmp    L40

L31:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-384, %eax
    orl     $2, %eax
    jmp    L40

L33:movl    _GVAR, %edx        # [1], [2]
    movl    10464(%edx), %eax
    andl    $-384, %eax
    orl     $1, %eax
    jmp    L40

Many redundant instructions unnecessarily blow up code size. The first two lines of all jump targets could be reduced to a common read before the distributor branches to the selected table entry.

.comm _hab,16
.comm _DEBUG,16
.comm _USE_LDF,16
.comm _LDR_AVAIL,16
.comm _MSGLD,16
.comm _BMM,16
.comm _BNR,16
.comm _GVAR,16
.comm _BST,16
.comm _BBF,16
.comm _TST,16
.comm _MHSTR,16
.comm _LDF,16
.comm _DUMPLINE,16
.comm _DUMPCNT,16
.comm _OLH_MODE,16
.comm _SEC,16
.comm _XXX,16
.comm _FLD_XXX,16
.comm _FLD_SEC,16

We only use _GVAR in this file - enumerating all global variables is quite stupid. All globals are defined as doublewords (32 bit), so we might ask why GCC expands all these variables to a size of 128 bit, filling up the .data segment with garbage.


Footnotes

[1] This is the typical side effect of the C convention saying 'Thou shalt not save ECX and EDX'. Using two of six registers as a sanitary fill for temporary data, we reduce the set of safe registers to three, forcing us to reload frequently required parameters from memory over and over again. Reloading parameters more than two times eats up the advantage of omitting the two push and pop instructions to save and restore the content of ECX and EDX. After the third reload operation, we are on the losing side. Adding unnecessary loads of parameters to our code slows down our functions and does not speed them up.

[2] Loading _GVAR into EDX and the flags MVP_FLAGS into EAX could be done in L12. Applying some brain didn't just save 24 lines of code, it also sped up the 13 target functions, because loading and evaluating MVP_FLAGS were separated and the dependency chain were reduced to two single rather than three consecutive dependencies.

Go to the next post (05 - Improvements).

No comments:

Post a Comment