Intelligent Design: 13

The second appendix provides a collection of recommendations, how existing code can be optimised. This collection will be extended whenever I stumble upon another piece of badly designed code...

Optimisation 01

To access compound data stored in doublewords, GCC uses one of these constructs by default:

Lower Word

Version 1:

...
movl %edx,%eax     # DP 1
andl $65535,%eax   # DP 1 + wait for EAX
cmpl $4623,%eax    # DP 1 + wait for EAX
...

Version 2:

...
movl %edx,%eax     # DP 1
cmpw $4623,%ax     # DP 1 + wait for EAX
...                #    + wait for AX

The real task is the comparison of the lower word in a compound datatype - in our case a message parameter - against a given number. The first version copies the message parameter stored in EDX to EAX, clears the upper word via AND and finally compares the result against the ID. It takes three clock cycles to execute this construct, because the processor has to wait for the updated content of EAX after each of both instructions. The second version copies the content of EDX to EAX, then compares AX against the ID. That's worse than the first version. Loading the 32 bit register EAX and accessing the lower portion AX in the next instruction is rewarded with some extra clock cycles, because some internal processing known as register merging is required to make AX accessible for the comparison.

Actually, EDX already contains the entire message parameter, so there is no reason to copy it to anywhere else. We also can assume, the message parameter was loaded into EDX several clock cycles before, so no register merging is required. Therefore, the straight way

...
cmpw $0x120F,%dx # DP 1
...

is the only proper solution, executed in one clock cycle, saving 9 (2) byte.

Upper Word

...
movl %esi,%eax     # DP 1
shrl $16,%eax      # DP 1 + auf EAX warten
andl $65535,%eax   # DP 1 + auf EAX warten
cmpl $523,%eax     # DP 1 + auf EAX warten
...

The message parameter stored in ESI is copied to EAX, then EAX is shifted right 16 bits. I do not know the reason why the upper word is cleared a second time, but I do know this: It is a superfluous operation. Finally, the separated upper word is compared against an immediate value, in our case a notification ID sent by one of the controls.

Extracting the upper word from a doubleword is more complex, because there are no instructions to access it directly. For the following examples, it is assumed the message parameter is stored at location 0x8C[ESP]. The first possible solution is to preload the upper word like this:

...
movzwl 0x8E(%esp),%edx # DP 4
...
... # 4 clocks distance
...
...
cmpl $0x010B,%edx # DP 1
...

It is recommended to place the MOVZWL at top of your function, after all registers are saved. If you cannot keep the minimum distance of four clock cycles, a better solution might be

...
cmpw $0x010B,0x8E(%esp) # DP 4
...

This one costs four clock cycles and should not be used if a lot of notifications are evaluated. If more than two comparisons are required, preloading saves a lot of clock cycles, even if it takes four clock cycles to preload that register and access its content with the next instruction. All following comparisons are done in one clock cycle, so we save time rather than to spend it.

Optimisation 02

GCC has some bad manners, like the one shown in Conventions:

Wrong:

       movl %eax,%esi   # DP 1
       testl %esi,%esi # DP 1
       je L8            # DP 1

Okay:

       movl %eax,%esi   # DP 1
       testl %eax,%eax # DP 1
       je L8            # DP 1

Even if both snippets look like monozygotic twins, the first one is one clock cycle slower. Both versions copy the content of EAX to ESI, because we need it after calling some other functions. The decisive difference between both versions is the register we use to test if its content is equal to zero. While the first version uses ESI and thus has to wait, until EAX was copied to ESI, the second version uses EAX for testing. Without dependencies, the first two instructions of the second version are executed simultaneously, saving one clock cycle. If this code snippet sits within a loop running 32,768 times, we save 32,768 clock cycles with a simple optimisation.

Optimisation 03

In most cases, the direct path is the shortest way to move from A to B. Here is a snippet taken from the message procedure of ST-Open's V700 skeleton, translated into assembly language by GCC 3.3.5.:

    ...
    movl 12(%ebp),%eax   # DP 3
    movl %eax,-52(%ebp) # DP 3 wait
    cmpl $35,-52(%ebp)   # DP 4 wait
    je L25               # DP 1
    cmpl $35,-52(%ebp)   # DP 4 !
    ja L56               # DP 1
    cmpl $7,-52(%ebp)    # DP 4 !
    je L14               # DP 1
    cmpl $7,-52(%ebp)    # DP 4 !
    ja L57               # DP 1
    cmpl $1,-52(%ebp) # DP 4 !
    je L15               # DP 1
    jmp L54              # DP 1
L57:cmpl $32,-52(%ebp)   # DP 4 !
    je L27               # DP 1
    jmp L54              # DP 1
L56:cmpl $41,-52(%ebp)   # DP 4 !
    je L17               # DP 1
    cmpl $41,-52(%ebp)   # DP 4 !
    ja L58               # DP 1
    cmpl $36,-52(%ebp)   # DP 4 !
    je L51               # DP 1
    jmp L54              # DP 1
L58:cmpl $79,-52(%ebp)   # DP 4 !
    je L24               # DP 1
    jmp L54              # DP 1
    ...

The message is copied from 12[EBP] to EAX, then EAX is copied to -52[EBP] to perform 10 comparisons of -52[EBP] against several message numbers. It is quite stupid to compare message numbers against a memory location, if the message already was loaded into a register, but copying the content of one stack element to another to compare that copy against message numbers is the unbeaten top of stupidity. It seems, these performance brakes were not wasting enough superfluous clock cycles, so GCC 3.3.5. decided to spice the distributor with pointless conditional jumps to force a lot of 'misses' of the branch prediction logic. Each 'miss' is rewarded with at least ten penalty cycles...

Applying some logic, the message distributor could look like this:

    ...
    movl 0x0C(%ebp),%eax     # DP 3
    movl 0x08(%ebp),%edi     # DP 3
    movzwl 0x10(%ebp),%ecx   # DP 4
    cmpl $0x4F,%eax          # DP 1
    je L24                   # DP 1
    cmpl $0x23,%eax          # DP 1
    je L25                   # DP 1
    cmpl $0x07,%eax          # DP 1
    je L14                   # DP 1
    cmpl $0x24,%eax          # DP 1
    je L51                   # DP 1
    cmpl $0x20,%eax          # DP 1
    je L27                   # DP 1
    cmpl $0x29,%eax          # DP 1
    je L17                   # DP 1
    cmpl $0x01,%eax          # DP 1
    jne L54                  # DP 1
    ...                      # WM_CREATE

Code size is reduced to about one third and it takes less than a fourth of the time to execute it. Please notice, that preloading three registers in the second snippet is done in the same time it takes to execute the first line in the first snippet.

Because some hundred message numbers are defined, a jump table did occupy some KB, so we cannot avoid a distributor with conditional jumps. A disadvantage of conditional jumps are possible 'misses' of the branch prediction logic. As mentioned, a 'miss' causes at least ten penalty cycles (the processor has to flush the instruction cache and reload the proper instructions). We cannot avoid the one or other 'miss', so you should place the most often processed messages on top of the distributor. It doesn't matter if a WM_CREATE or WM_CLOSE procedure needs 20 clock cycles more or less, but the user recognises for sure if required updates of the main window (repainting) are delayed.

Optimisation 04

Another snippet taken from ST-Open's V700 skeleton, translated into assembly language by GCC 3.3.5.. Here, some control variables for the Loader are set to initial values before LDinit() is called. If the initialisation of the Loader fails, a message box with the error code is displayed. ST-Open's entire programming system depends on the loader. It manages memory allocation, loading of fields for the database engine and other files. Without running Loader, it is impossible to create a main window, because its position, size and control flags are stored in a file called SystemNumerics. Next, ST-Open's multilingual menu and dialog texts are stored in fields managed by ST-Open's database engine. No fields are available if the Loader is missing. Therefore, if the Loader is not initialised, the program is terminated with the mentioned message box, because it cannot access its data.

   LC3:.ascii "RC LDinit()\0" # CODE segment!

.globl _StInit
_StInit:
       pushl %ebp             # DP 3
       movl %esp,%ebp         # DP 1
       subl $8,%esp           # DP 1
       movl $1,_DEBUG         # DP 3
       movl $0,_DUMPLINE      # DP 3
       movl $0,_USE_LDF       # DP 3
       call _LDinit
       movl %eax,-4(%ebp)     # DP 3   why?
       cmpl $0,-4(%ebp)       # DP 4   why?
       je L56                 # DP 1
       subl $8,%esp           # DP 1   why?
       pushl $LC3             # DP 3
       pushl -4(%ebp)         # DP 3   why?
       call _debug
       addl $16,%esp          # DP 1   great!
       movl -4(%ebp),%eax     # DP 3   why?
       movl %eax,-8(%ebp)     # DP 3   why?
       jmp L55                # DP 1
   L56:movl $1,_OLH_MODE      # DP 3
       movl $0,-8(%ebp)       # DP 3   why?
   L55:movl -8(%ebp),%eax     # DP 3   why?
       leave                  # VP 3
       ret

Without error 24, else 35 clock cycles.

This version of GCC has a pathological tendency to waste as much time as possible. It might be a good therapy for stressed managers to relax, have a drink and smoke a box of cigarettes or two while the machine is counting from one to ten...

The smart version:

       .data                  # DATA segment!
   LC3:.ascii "RC LDinit()\0"
       .text                  # CODE segment!
.globl _StInit
_StInit:
       movl $0x00,_DUMPLINE   # DP 3
       movl $0x00,_USE_LDF    # DP 3
       movl $0x01,_DEBUG      # DP 3
       call _LDinit
       testl %eax,%eax        # DP 1
       je 0f                  # DP 1
       subl $0x08,%esp        # DP 1
       movl %eax,0x00(%esp)   # DP 3
       movl $LC3,0x04(%esp)   # DP 3
       call _debug
       addl $0x08,%esp        # DP 1
     0:movl $0x01,_OLH_MODE   # DP 3
       ret

Without error 8, else 12 clock cycles.

Only one third of the time is required to execute the smart version. Writing code like this reduces code size and executes much faster. Faster programs make users happy. Happy people are much better than stressed and angry people. Hence, Intelligent Design makes all
people very happy, because it is the fastest software design they ever have seen... ;)

Optimisation 05

This is the About... box of many recent ST-Open programs:

.globl DlgAbout
DlgAbout:
       pushl      %ebp              # DP 3
       movl       %esp, %ebp        # DP 1
       pushl      %ebx              # DP 3
       pushl      %eax              # DP 3 ?
       movl       12(%ebp), %eax    # DP 3
       cmpl       $32, %eax         # DP 1
       movl       8(%ebp), %ebx     # DP 3
       movl       16(%ebp), %edx    # DP 3
       movl       20(%ebp), %ecx    # DP 3
       je         L4
       cmpl       $59, %eax         # DP 1
       je         L13

These four instructions are redundant and violate the rule "Never write to stack elements above EBP!":

       movl       %ecx, 20(%ebp)    # DP 3
       movl       %edx, 16(%ebp)    # DP 3
       movl       %eax, 12(%ebp)    # DP 3
   L12:movl       %ebx, 8(%ebp)     # DP 3
       movl       -4(%ebp), %ebx    # DP 3
       leave                        # VP 3
       jmp        DefDP

       .p2align 2,,3
   L13:pushl      $155              # DP 3
       pushl      $151              # DP 3
       pushl      $150              # DP 3
       pushl      %ebx              # DP 3
       call       DLGtxt
       movl       %ebx, (%esp)      # DP 3
       call       CtrWn
   L11:addl       $16, %esp         # DP 1
       xorl       %eax, %eax        # DP 1
       movl       -4(%ebp), %ebx    # DP 3
       leave                        # VP 3
       ret

       .p2align 2,,3
    L4:cmpw       $4623, %dx        # DP 1
       je         L14

Here, GCC 3.3.5. does it again:

       movl       %ecx, 20(%ebp)    # DP 3
       movl       %edx, 16(%ebp)    # DP 3
       movl       $32, 12(%ebp)     # DP 3
       jmp        L12
   L14:subl       $12, %esp         # DP 1
       pushl      %ebx              # DP 3
       call       WinDD
       jmp        L11

This isn't only the worst code a compiler can generate. It also does very mean and dangerous things like overwriting passed parameters. Code like this might be usual in the evil world of virus or malware programmers, but definitely does not belong to the world of proper and sane software.

The smart version:

.globl _DlgAbout
_DlgAbout:
       movl 0x08(%esp),%eax         # DP 3
       cmpl $0x3B,%eax              # DP 1
       je 0f
       cmpl $0x32,%eax              # DP 1
       jne _DefDP
       cmpw $0x1218,0x0C(%esp)      # DP 4
       jne _DefDP
       pushl 0x04(%esp)             # DP 4
       call _WinDD
       addl $0x04,%esp              # DP 1
       ret
     0:pushl $0x9B                  # DP 3
       pushl $0x97                  # DP 3
       pushl $0x96                  # DP 3
       pushl 0x10(%esp)             # DP 4
       call _DLGtxt
       call _CtrWn
       addl $0x10,%esp              # DP 1
       ret

This is the smart version of DlgAbout(). It's not the fastest, but definitely the most compact form. The fall back to five PUSH instructions is one of those few exceptions, where PUSH is superior over MOV, because we can PUSH, but not MOV the content of a memory location to another memory location with a single instruction. If we want to do it with MOV, we have to use a register where the data is stored temporarily. We execute WM_INITDLG only once, and six clock cycles (3 nanoseconds) less or more cannot be recognised by the user.

Exceptions, of course, should not become the default behaviour. Most functions in a program are more complex than popping up a simple dialog with only one task: "Destroy yourself after the user pushed that OK button."

Optimisation 06

Finally, the optimised version of the function we discussed while talking about some Caveats of conventional programming techniques and possible Improvements. You have seen a small snippet of the optimised version while I introduced the Intelligent Design Rules to you. Now, here is the entire code, spiced with some comments why what has to be done exactly at this point and nowhere else.

        .data

        .p2align 4,0x00
    jt0:.long L02
        .long L03
        .long L04
        .long L05
        .long L06
        .long L07
        .long L08
        .long L09
        .long L10
        .long L11
        .long L12
        .long L13
        .long L14
        .long L15
        .long L08
        .long L09
        .long L10
        .long L11
        .long L12
        .long L13
        .long L14

Jump tables never should be placed anywhere in the code segment - they belong to the data segment! All 'dead' jump targets were removed after updating the resource IDs in the resource definition file, reducing the table size by 40 byte (about thirty percent).

        .text

        .align 2,0x90
.globl MoveDlg
MoveDlg:subl   $0x3C,%ebp
        nop
        nop
        movl   %ecx,0x30(%esp)
        movl   %edi,0x34(%esp)
        movl   %esi,0x38(%esp)
        movl   0x40(%ebp),%edi
        movl   0x44(%ebp),%eax
        movzwl 0x48(%ebp),%ecx
        movl   _BNR,%esi
        cmpl   $0x30,%eax
        je     L01
        cmpl   $0x20,%eax
        je     L00
        cmpl   $0x3B,%eax
        jne    L15

Our distributor was taken from the improved C version, the frequently required base address of the global variables (SystemNumerics) is preloaded into ESI. Preloading parameters speeds up execution of subfunctions, because we can access the contents of the corresponding registers without delay whenever they are needed. Placing our loads like listed above keeps all pipes busy. While the first three instructions read data from L1 cache, it is not very likely that the address of _BNR is loaded in L1 cache, yet. We can assume that the first three loads have a latency of three clock cycles, only the last load has to access main memory. Therefore, the message number surely is present in EAX before the first comparison is executed.

        movl   0x1C54(%esi),%eax
        movl   0x01D8(%esi),%ecx
        movl   $0x00,0x28E0(%esi)
        movl   %eax,0x00(%esp)
        movl   $0x0B,0x04(%esp)
        movl   $0x00,0x08(%esp)
        movl   $0x02,0x0C(%esp)
        movl   $0x00,0x10(%esp)
        movl   %edi,0x14(%esp)
        call   _FDacc
        movl   $0x04,0x08(%esp)
        movl   %ecx,0x14(%esp)
        call   _FDacc

Here, the order of instructions was rearranged to keep all execution pipes busy. In our case, there is only one critical instruction. The remaining Instructions don't cause dependencies, so we can place them anywhere in our preload sequence. Placing the line movl 0x1C54(%esi),%eax on top, there still are two other preloads between loading and using EAX. All mov instructions have a latency of three clock cycles, keeping a sufficient distance between load and store. The worst case scenario, 0x01D8[BNR] and 0x28E0[ESI] are present in L1 cache while 0x1C54[ESI] is not, delayed execution for 15 clock cycles, because the 4th instruction blocks one pipe for 24 (27 for line one - 3 for lines two and three) clock cycles, while the following instructions are executed in the two remaining pipes in nine clocks (our stack definitely is present in L1 cache whenever MoveDlg() is called). Before we call FDacc() the second time, only the two changing parameters are updated, saving four superfluous write operations.

        movl   %edi,0x00(%esp)
        movl   $0x1240,0x04(%esp)
        movl   $0x0120,0x08(%esp)
        movl   $0xFFFFFFFF,0x0C(%esp)
        call   _SnDIM
        movl   $0x1244,0x04(%esp)
        call   _SnDIM
        movl   $0x1246,0x04(%esp)
        call   _SnDIM
        movl   $0x124E,0x04(%esp)
        call   _SnDIM
        movl   $0xD2,0x04(%esp)
        movl   $0xD3,0x08(%esp)
        movl   $0xE9,0x0C(%esp)
        call   _DLGtxt
        call   _CtrWn
        call   _DlgShow
        jmp    L16

Performing only really required write operations, the improved version listed above works with just ten instructions - less than a third of GCC's draft with its 26 push instructions and six obligatory corrections of the stack pointer.

    L00:subl   $0x1231,%ecx
        je     0f
        decl   %ecx
        je     1f
        decl   %ecx
        jne    L15
        movl   $0x11,0x00(%esp)
        call   _Help
        jmp    L16
      0:movl   $0x00,0x28E0(%esi)
        jmp    2f
      1:orl    $0x00040000,0x28E0(%esi)
      2:movl   %edi,0x00(%esp)
        call   _WinDD
        jmp    L16

All compares with immediate values were replaced by a sub and two dec instructions. The one byte instruction dec reduces code size and relieves the processor's prefetch mechanism.

    L01:subl   $0x1240,%ecx
        js     L15
        cmpl   $0x14,%ecx
        ja     L15
        movl   0x28E0(%esi),%eax
        jmp    *jt0(, %ecx, 4)

Moving jump tables to the data segment is an obligatory exercise for all functions following Intelligent Design standards. Separating code and data avoids a lot of caveats coming along with mixing them together in the code segment. There's surely a reason why AMD and LETNi advise programmers against storing data of any kind in the code segment. By the way: The GNU assembler AS supports jump tables in the data segment. There's no reason to store them in the code segment as practised by GCC!

    L02:andl   $0xFFFFE1FF,%eax
        orl    $0x1000,%eax
        jmp    0f
    L03:andl   $0xFFFFE1FF,%eax
        orl    $0x0800,%eax
        jmp    0f
    L04:andl   $0xFFFFE1FF,%eax
        orl    $0x0400,%eax
        jmp    0f
    L05:andl   $0xFFFFE1FF,%eax
        orl    $0x0200,%eax
        jmp    0f
    L06:andl   $0xFFFFFE7F,%eax
        orl    $0x0100,%eax
        jmp    0f
    L07:andl   $0xFFFFFE7F,%eax
        orl    $0x80,%eax
        jmp    0f
    L08:andl   $0xFFFFFF80,%eax
        orl    $0x40,%eax
        jmp    0f
    L09:andl   $0xFFFFFE80,%eax
        orl    $0x20,%eax
        jmp    0f
    L10:andl   $0xFFFFFE80,%eax
        orl    $0x10,%eax
        jmp    0f
    L11:andl   $0xFFFFFE80,%eax
        orl    $0x08,%eax
        jmp    0f
    L12:andl   $0xFFFFFE80,%eax
        orl    $0x04,%eax
        jmp    0f
    L13:andl   $0xFFFFFE80,%eax
        orl    $0x02,%eax
        jmp    0f
    L14:andl   $0xFFFFFE80,%eax
        orl    $0x01,%eax
      0:movl   %eax,0x28E0(%esi)
        jmp    L16

Sorry, but the only solution for this problem were to change the definition of all flags to support automated evaluation. Obviously, this is not the case and we are not allowed to replace existing definitions without asking, first...

    L15:movl   0x30(%esp),%ecx
        movl   0x34(%esp),%edi
        movl   0x38(%esp),%esi
        addl   $0x3C,%esp
        jmp    _DefDP

One version of this exit is enough. GCC's draft, once split up into two absolutely identical subfuntions, were reduced to really necessary code, keeping five (27.78 percent) of eighteen instructions. Even if we removed entire thirteen (redundant) instructions, this subfunction still works fine.

    L16:movl   0x30(%esp),%ecx
        movl   0x34(%esp),%edi
        movl   0x38(%esp),%esi
        addl   $0x3C,%esp
        xorl   %eax,%eax
        ret

.comm _GVAR,4

This is at least as fast as three pop and one leave instruction!

Tuesday, April 13, 2010

13 - Appendix 2

No comments:

Post a Comment