Tuesday, April 13, 2010

05 - Improvements

Obviously, the code generated by GCC 3.3.5 is anything but optimised. Even if you cannot read source code, you can surely grasp what the comments say. This document is not an introduction to programming, so you will have to take my word for it, but rest assured: I know what I am talking about.

Re-arranging some parts and reducing code sequences to the instructions that are really required shrinks GCC's draft markedly. With some human brainpower applied, the remaining (optimised) code should now run about 30 percent faster.

        .data

        .p2align 4,0x00
    jt0:.long  L02
        .long  L03
        .long  L04
        .long  L05
        .long  L16
        .long  L15
        .long  L15
        .long  L15
        .long  L06
        .long  L07
        .long  L15
        .long  L15
        .long  L15
        .long  L15
        .long  L15
        .long  L15
        .long  L08
        .long  L09
        .long  L10
        .long  L11
        .long  L12
        .long  L13
        .long  L14
        .long  L15
        .long  L08
        .long  L09
        .long  L10
        .long  L11
        .long  L12
        .long  L13
        .long  L14

Following the recommendations of AMD and LETNi, the jump table was moved to the .data segment. This is much better than mixing code and data in the .text segment.

        .text

        .align 2,0x90
.globl MoveDlg
MoveDlg:pushl  %ebp
        movl   %esp,%ebp
        pushl  %edi
        pushl  %esi
        movl   0x08(%ebp),%edi
        movl   0x0C(%ebp),%eax
        movzwl 0x10(%ebp),%ecx
        movl   _GVAR,%esi
        cmpl   $0x30,%eax
        je     L01
        cmpl   $0x20,%eax
        je     L00
        cmpl   $0x3B,%eax
        jne    L15

The distributor was optimised for the branch prediction logic. WM_CONTROL was put at the top of the distributor because most messages sent are WM_CONTROL messages. WM_COMMAND is sent only when the user pushes a button. No user can perceive delays of about 5 ns, so we can live with a ten-cycle penalty if a branch is mispredicted. WM_INITDLG is sent only once. As long as the 1st comparison is 'guessed' as not taken, the branch does not trigger a penalty. The 2nd and all following comparisons are assumed to be taken, so the branch to the default routine (DefDP()) does not trigger penalties either.
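For readers who prefer C, the order of the three comparisons can be sketched roughly as follows. The message numbers are the ones compared in the listing (0x30, 0x20, 0x3B); the `route` tags and the function name `distribute` are hypothetical and exist only to make the routing visible. This is an illustration of the ordering, not the original C source.

```c
#include <stdint.h>

/* Message numbers as compared in the listing. */
enum { WM_COMMAND = 0x20, WM_CONTROL = 0x30, WM_INITDLG = 0x3B };

/* Hypothetical tags, one per branch target of the distributor. */
enum route { ROUTE_CONTROL, ROUTE_COMMAND, ROUTE_INITDLG, ROUTE_DEFAULT };

/* Most frequent message first, the one-time message last. */
enum route distribute(uint32_t msg)
{
    if (msg == WM_CONTROL) return ROUTE_CONTROL;   /* je  L01 */
    if (msg == WM_COMMAND) return ROUTE_COMMAND;   /* je  L00 */
    if (msg != WM_INITDLG) return ROUTE_DEFAULT;   /* jne L15 */
    return ROUTE_INITDLG;                          /* falls through */
}
```

Any message other than these three ends up at the default routine, just as in the assembly version.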

        pushl  $0xE9
        pushl  $0xD3
        pushl  $0xD2
        pushl  %edi
        call   _DLGtxt
        pushl  $0x00
        pushl  $-0x01
        pushl  $0x0120
        pushl  $0x1240
        pushl  %edi
        call   _SnDIM
        addl   $0x08,%esp
        pushl  $0x1248
        pushl  %edi
        call   _SnDIM
        addl   $0x08,%esp
        pushl  $0x1250
        pushl  %edi
        call   _SnDIM
        addl   $0x08,%esp
        pushl  $0x1258
        pushl  %edi
        call   _SnDIM

The three parameters on top are pushed for the first call only. This saves nine redundant instructions.

        addl   $0x14,%esp
        movl   0x1C54(%esi),%ecx
        movl   0x01D8(%esi),%edx

Both parameters can be preloaded at this point, because FDacc() is a function taken from ST-Open's library. Functions in my libraries restore all registers (including ECX and EDX) by default - they are 'clean'. But watch out: MoveDlg() is a function following the C conventions. ECX and EDX are neither saved nor restored - MoveDlg() is a 'dirty' function.

        pushl  %edi
        pushl  $0x00
        pushl  $0x02
        pushl  $0x00
        pushl  $0x0B
        pushl  %ecx
        call   _FDacc
        pushl  %edx
        pushl  $0x00
        pushl  $0x02
        pushl  $0x04
        pushl  $0x0B
        pushl  %ecx
        call   _FDacc
        addl   $0x40,%esp             # 12 FDacc args (0x30) + 4 DLGtxt args (0x10), assuming the caller cleans up as with SnDIM
        movl   $0x00,0x28E0(%esi)
        pushl  %edi
        call   _CtrWn
        call   _DlgShow
        jmp   3f

The next distributor was optimised for code reduction. Because none of the three buttons is generally pushed more than once, we can live with a ten-cycle penalty for one or two mispredicted branches. The delay added by the penalty is at least six orders of magnitude shorter than anything human senses could perceive.
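The subtract-and-decrement chain used by this distributor may be easier to follow in C. The sketch below mirrors the three tests in the listing; the enum names and `button_action` are hypothetical, since the source does not name the buttons or their actions.

```c
#include <stdint.h>

/* Hypothetical tags for the four outcomes of the button distributor. */
enum action { ACT_CLEAR_FLAGS, ACT_SET_BIT, ACT_HELP, ACT_DEFAULT };

/* One subtraction, then two decrements, instead of three full compares. */
enum action button_action(uint32_t id)
{
    id -= 0x1231u;                           /* subl $0x1231,%ecx */
    if (id == 0)   return ACT_CLEAR_FLAGS;   /* je  0f            */
    if (--id == 0) return ACT_SET_BIT;       /* decl, je 1f       */
    if (--id != 0) return ACT_DEFAULT;       /* decl, jne L15     */
    return ACT_HELP;                         /* pushl $0x11, call _Help */
}
```

Each `decl` is a one-byte instruction, which is where the size saving over repeated `cmpl` instructions with 32-bit immediates comes from.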

    L00:subl  $0x1231,%ecx
        je    0f
        decl  %ecx
        je    1f
        decl  %ecx
        jne   L15
        pushl $0x11
        call  _Help
        jmp   3f
      0:movl  $0x00,0x28E0(%esi)
        jmp   2f
      1:orl   $0x00040000,0x28E0(%esi)
      2:pushl %edi
        call  _WinDD
      3:addl  $0x04,%esp
        jmp   L16

    L01:subl   $0x1240,%ecx
        js     L15
        cmpl   $0x1E,%ecx
        ja     L15
        movl   0x28E0(%esi),%eax
        jmp    *jt0(, %ecx, 4)

The jump table was moved to the .data segment. To keep an overview, symbols for jump tables should generally be given distinctive names. ST-Open uses the symbol 'jtX', where X is the number of the current jump table.
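The range check in front of the indexed jump can be sketched in C as follows. The base 0x1240 and the upper index 0x1E come from the listing; the names `ID_BASE`, `ID_COUNT` and `table_index` are hypothetical. The sketch also shows that a single unsigned comparison covers both bounds, because an ID below the base wraps around to a very large unsigned value.

```c
#include <stdint.h>

#define ID_BASE  0x1240u   /* first control ID served by the table   */
#define ID_COUNT 0x1Fu     /* 31 table entries, indices 0x00 to 0x1E */

/* Returns the jump table index, or -1 if the ID belongs to the default routine. */
int table_index(uint16_t id)
{
    uint32_t idx = (uint32_t)id - ID_BASE;   /* wraps around for id < ID_BASE */
    if (idx >= ID_COUNT)                     /* same effect as cmpl $0x1E / ja */
        return -1;
    return (int)idx;                         /* used as jt0[idx] in the listing */
}
```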

    L02:andl   $0xFFFFE1FF,%eax
        orl    $0x1000,%eax
        jmp    0f
    L03:andl   $0xFFFFE1FF,%eax
        orl    $0x0800,%eax
        jmp    0f
    L04:andl   $0xFFFFE1FF,%eax
        orl    $0x0400,%eax
        jmp    0f
    L05:andl   $0xFFFFE1FF,%eax
        orl    $0x0200,%eax
        jmp    0f
    L06:andl   $0xFFFFFE7F,%eax
        orl    $0x0100,%eax
        jmp    0f
    L07:andl   $0xFFFFFE7F,%eax
        orl    $0x80,%eax
        jmp    0f
    L08:andl   $0xFFFFFF80,%eax
        orl    $0x40,%eax
        jmp    0f
    L09:andl   $0xFFFFFE80,%eax
        orl    $0x20,%eax
        jmp    0f
    L10:andl   $0xFFFFFE80,%eax
        orl    $0x10,%eax
        jmp    0f
    L11:andl   $0xFFFFFE80,%eax
        orl    $0x08,%eax
        jmp    0f
    L12:andl   $0xFFFFFE80,%eax
        orl    $0x04,%eax
        jmp    0f
    L13:andl   $0xFFFFFE80,%eax
        orl    $0x02,%eax
        jmp    0f
    L14:andl   $0xFFFFFE80,%eax
        orl    $0x01,%eax
      0:movl   %eax,0x28E0(%esi)
        jmp    L16

This part surely could be reduced further if I could remember what all those flags are good for...
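One possible direction for such a reduction, sketched in C under the assumption that every handler from L02 to L14 does nothing but "clear a group of bits, then set one bit": the thirteen code sequences collapse into one data table and one expression. The mask values are copied from the listing as they stand; the struct, the table name and `apply_flag_op` are hypothetical. Whether the mask of L08 is meant to differ from those of L09 to L14 cannot be told from the source, so the table keeps the difference.

```c
#include <stdint.h>

/* Bits to keep and the single bit to set, one pair per handler. */
struct flag_op { uint32_t keep; uint32_t set; };

/* Entries 0..12 correspond to the handlers L02..L14 of the listing. */
const struct flag_op flag_ops[13] = {
    {0xFFFFE1FFu, 0x1000u}, {0xFFFFE1FFu, 0x0800u},   /* L02, L03 */
    {0xFFFFE1FFu, 0x0400u}, {0xFFFFE1FFu, 0x0200u},   /* L04, L05 */
    {0xFFFFFE7Fu, 0x0100u}, {0xFFFFFE7Fu, 0x0080u},   /* L06, L07 */
    {0xFFFFFF80u, 0x0040u}, {0xFFFFFE80u, 0x0020u},   /* L08, L09 */
    {0xFFFFFE80u, 0x0010u}, {0xFFFFFE80u, 0x0008u},   /* L10, L11 */
    {0xFFFFFE80u, 0x0004u}, {0xFFFFFE80u, 0x0002u},   /* L12, L13 */
    {0xFFFFFE80u, 0x0001u}                            /* L14      */
};

/* Clears the group the handler belongs to, then sets the handler's bit. */
uint32_t apply_flag_op(uint32_t flags, unsigned handler)
{
    return (flags & flag_ops[handler].keep) | flag_ops[handler].set;
}
```

In assembly, the same idea would replace the jump table of code addresses with a table of mask pairs, at the cost of a separate check for the IDs that still go to L15 and L16.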

    L15:popl  %esi
        popl  %edi
        popl  %ebp
        jmp  _DefDP

    L16:xorl  %eax, %eax
        popl  %esi
        popl  %edi
        popl  %ebp
        ret

'Exits' belong at the bottom of a function. First, the processor does not have to jump back and forth to random locations within the instruction chain. Second, human senses perceive structured (sorted) input much faster than random patterns spread all over the screen.

.comm _GVAR,4

Only the one global variable that is actually used remains; all unused ones are gone. The size of the global variable was reset to its proper size of 32 bits, so four variables, rather than one, fit into one paragraph (16 bytes) again.

Go to the next post (06 - Analysis).
