Wednesday, April 14, 2010

14 - AT&T Syntax

ST-Open software is developed with GCC/2 (1994) and the GNU assembler AS. AS is a sophisticated assembler, so nothing is ASSUMEd and no hints like SEGMENT, BYTE PTR and companions are required. This saves a lot of typing and markedly improves the readability of source files. A simple .data on top of user data belonging to the DATA segment and a simple .text on top of the code going to the CODE segment is all AS needs to know. On the other hand, AS wants to be fed with code written in AT&T syntax.


Register Set

All register names are written in small letters, and a percent sign precedes the register name as a delimiter. The lists below enumerate the entire register set available on LETNiums and AMD's Athlon64:

 8 bit Registers      16 bit Registers
 LETNi     AT&T       LETNi     AT&T

 AL        %al        AX        %ax
 BL        %bl        BX        %bx
 CL        %cl        CX        %cx
 DL        %dl        DX        %dx
 DIL       %dil       DI        %di
 SIL       %sil       SI        %si
 BPL       %bpl       BP        %bp
 SPL       %spl       SP        %sp
 R8B       %r8b       R8W       %r8w
 R9B       %r9b       R9W       %r9w
 R10B      %r10b      R10W      %r10w
 R11B      %r11b      R11W      %r11w
 R12B      %r12b      R12W      %r12w
 R13B      %r13b      R13W      %r13w
 R14B      %r14b      R14W      %r14w
 R15B      %r15b      R15W      %r15w

32 bit Registers      64 bit Registers
LETNi     AT&T        LETNi     AT&T

EAX       %eax        RAX       %rax
EBX       %ebx        RBX       %rbx
ECX       %ecx        RCX       %rcx
EDX       %edx        RDX       %rdx
EDI       %edi        RDI       %rdi
ESI       %esi        RSI       %rsi
EBP       %ebp        RBP       %rbp
ESP       %esp        RSP       %rsp
R8D       %r8d        R8        %r8
R9D       %r9d        R9        %r9
R10D      %r10d       R10       %r10
R11D      %r11d       R11       %r11
R12D      %r12d       R12       %r12
R13D      %r13d       R13       %r13
R14D      %r14d       R14       %r14
R15D      %r15d       R15       %r15

FP / MMX              SSE / 3Dnow!
LETNi     AT&T        LETNi     AT&T

ST0       %st(0)      XMM0      %xmm0
ST1       %st(1)      XMM1      %xmm1
ST2       %st(2)      XMM2      %xmm2
ST3       %st(3)      XMM3      %xmm3
ST4       %st(4)      XMM4      %xmm4
ST5       %st(5)      XMM5      %xmm5
ST6       %st(6)      XMM6      %xmm6
ST7       %st(7)      XMM7      %xmm7
MM0       %mm0        XMM8      %xmm8
MM1       %mm1        XMM9      %xmm9
MM2       %mm2        XMM10     %xmm10
MM3       %mm3        XMM11     %xmm11
MM4       %mm4        XMM12     %xmm12
MM5       %mm5        XMM13     %xmm13
MM6       %mm6        XMM14     %xmm14
MM7       %mm7        XMM15     %xmm15

 Special              Debug
 LETNi    AT&T        LETNi     AT&T

 CS       %cs         DB0       %db0
 DS       %ds         DB1       %db1
 ES       %es         DB2       %db2
 FS       %fs         DB3       %db3
 GS       %gs         -          -
 SS       %ss         -          -
                      DB6       %db6
                      DB7       %db7
CR0       %cr0        DB8       %db8
CR1       %cr1        DB9       %db9
CR2       %cr2        DB10      %db10
                      DB11      %db11
TR6       %tr6        DB12      %db12
TR7       %tr7        DB13      %db13
                      DB14      %db14
                      DB15      %db15


Appendices

Data sizes of instructions with operands are specified by the suffixes "b" (byte), "w" (word), "l" (doubleword integer), "q" (quadword) and "d" (doubleword, used with MMX and XMM instructions). They replace the hints "byte ptr", "word ptr", "dword ptr" and "qword ptr" used in LETNi syntax:

movb $0x01,%al          # load byte        01 into AL
movw $0x01,%ax          # load word      0001 into AX
movl $0x01,%eax         # load dword 00000001 into EAX

...but:

movsbl %cl,%eax
(load the sign extended byte from CL into EAX; if CL holds 81, EAX holds FFFFFF81, now - note that the sign and zero extension forms require a register or memory source, immediates are not allowed)

movzbl %cl,%eax
(load the zero extended byte from CL into EAX; if CL holds 81, EAX holds 00000081, now)


Numbers And Addresses

Immediate numbers are preceded by a dollar sign '$', addresses are written as plain numbers:

movl $0x01,%eax
(copy 00000001 to EAX)

movl 0x01,%eax
(copy the doubleword found at address 00000001 to EAX; this causes some penalty cycles for accessing an address not divisible by four, then crashes because we try to access protected memory)


Indirect Addressing

The register holding the address is put into parentheses. The offset - in LETNi vocabulary it is called "displacement" - is written in front of the opening parenthesis:

movw 0x04(%esi),%ax
(copy the word found at address [ESI + 0x04] to AX)


Indexed Addressing

The index register follows the register holding the address. The multiplier - LETNi vocabulary uses the term "scale factor" - follows the index register. All three are separated by commas:

movb 0x00(%esi, %edx, 1),%al
(copy the byte at memory location [ESI + 0x00 + (EDX * 1)] to AL)

movl 0x00(, %edx, 4),%eax
(copy the doubleword at memory location [0x00 + (EDX * 4)] to EAX)


Global Variables And Functions

To make a function globally visible, we have to add a .globl declaration in front of the function declaration. To make variables globally visible, we add a .comm in each source file where this variable is required. All global functions and variables must be preceded by an underscore "_".

To access addresses of functions or variables, their name must be preceded by a dollar sign "$". To access the content of a variable (read, write, increment, decrement, compare against, etc.), we write its name "as is":

.align 2,0x90
(only in front of your functions!)

.globl _MyFunction
(make MyFunction globally visible)

_MyFunction:   # declaration
     ...       # function body
     ret       # finished, return to caller

.comm _BNR,4
(reserves 4 byte in the data segment for the global variable _BNR)

movl _BNR,%eax
(copy the content of _BNR to EAX)

movl $_BNR,%eax
(copy the address where _BNR is stored to EAX)

movl $_AllMine,%eax
(copy the address where function _AllMine starts to EAX)

call *%eax
(execute _AllMine)

The instruction call *%eax is equivalent to call _AllMine. However, it wastes one clock cycle loading the address of _AllMine into EAX. On the other hand, loading a return address into a register can save six clock cycles if we use simple JMP instructions instead of the CALL/RET mechanism. This, of course, is limited to a few local helper functions - the usual CALL/RET is more flexible, because we don't need to know where the called function is stored. A minimal sketch of this technique follows below.
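
The following sketch illustrates the idea under the assumption that the helper is a small local routine in the same source file; the names _Helper and Lback are hypothetical, and the helper must leave EDX untouched:

        movl $Lback,%edx      # preload the return address
        jmp _Helper           # plain jump instead of call
Lback:  ...                   # execution continues here
        ...

_Helper:
        ...                   # helper body
        jmp *%edx             # 'return' via the preloaded register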


Calls And Jumps

Calls and jumps can use either (global) labels or registers as operands. If a register is used, its name must be preceded by an asterisk *. While the previous example showed how to use a register together with a CALL instruction, the following example shows how to create a jump table.

    .data
    .align 2,0x00
L99:.long L00            # jump table
    .long L01
    .long L02
    .long L03
    .long L04

    .text
    ...                  # prologue
    movl $0x04,%ebx
    cmpl $0x04,%eax
    cmova %ebx,%eax      # keep valid
    jmp *L99(, %eax, 4)  # indexed jump
L00:nop                  # target
    jmp L05
L01:nop
    jmp L05
L02:nop
    jmp L05
L03:nop
    jmp L05
L04:nop
    jmp L05
L05:nop                  # epilogue
    ...
    ret

This is a C switch{} statement coded in assembler. Using cmova, we save one conditional jump and avoid the ten penalty cycles for a false "guess" of the branch prediction logic.

Please notice that I put the jump table into the .data, not the .text segment. As LETNi and AMD clearly state, this is the place where data belongs. Unfortunately, GCC creates all jump tables in the .text segment. To optimise your code, you should move them to the top of the file and put them into the .data segment as shown above.

Keep in mind that 32 bit jump tables must be aligned to a multiple of 4, while 64 bit jump tables must be aligned to a multiple of 8. This is done by putting an appropriate .align statement in front of the (first) jump table.
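
A 64 bit table might be sketched like this, using .p2align to request an unambiguous 8 byte alignment (the labels jt64 and Q00 through Q02 are hypothetical):

        .data
        .p2align 3,0x00       # align to a multiple of 8
  jt64: .quad Q00             # 64 bit jump table
        .quad Q01
        .quad Q02

        .text
        ...
        jmp *jt64(, %rax, 8)  # indexed jump, 8 byte per entry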


.align

GCC spices source files with tons of .align statements spread all over the text segment. If an .align precedes a function, you should leave it alone - do not remove it! Because modern processors fetch and cache code in fixed-size lines (64 byte on Athlon64 machines), it might be necessary to insert an .align 4,,15 in front of a branch target to support the processor's prefetch mechanisms. However, you should avoid inserting .align statements at places where they might be executed. Each .align statement inserts an appropriate number of nops to move the instruction pointer to the next multiple of the cacheline's size, so the next instruction "sits" at the beginning of a new cache line. This is important if the next instruction is the target of a branch. Because the processor speculatively prefetches the code of branch targets, execution continues at the beginning of a cache line if the branch was taken. Execution is sped up if the processor doesn't have to load the instructions of the branch target before it can continue to execute them.
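
For instance, a frequently taken branch target might be padded with the .align 4,,15 form mentioned above; the labels are hypothetical, and the .align sits behind an unconditional jump, so the inserted nops are never executed:

        ...
        jmp Lcheck            # skip the padding, it is never executed
        .align 4,,15          # insert up to 15 byte of nops
Ltop:   movl 0x00(%esi),%eax  # loop body starts at the aligned boundary
        ...
Lcheck: decl %ecx
        jns Ltop              # the branch target Ltop is nicely aligned
        ...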


nop

The nop instruction puts the next free execution pipeline into idle mode for one clock cycle. If you insert it at the proper places, it can improve performance and speed up execution. However, the benefits can only be determined experimentally. You have to test the runtime of several variants of your code with exceptional care. The rdtsc instruction is a good tool to measure the runtime of test functions with acceptable accuracy. If you write the output to a file, the gathered data might be sufficient to find out which variant is the fastest.
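
A minimal measurement sketch with rdtsc might look like this; _TestFunc is a hypothetical function under test, only the lower 32 bit of the timestamp counter are used, and it is assumed that _TestFunc preserves ESI as stipulated by the rules in this document:

        rdtsc                 # timestamp counter: low dword in EAX, high dword in EDX
        movl %eax,%esi        # keep the start value
        call _TestFunc        # the code we want to measure
        rdtsc
        subl %esi,%eax        # EAX now holds the elapsed clock cycles (roughly)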




Tuesday, April 13, 2010

13 - Appendix 2

The second appendix provides a collection of recommendations on how existing code can be optimised. This collection will be extended whenever I stumble upon another piece of badly designed code...


Optimisation 01

To access compound data stored in doublewords, GCC uses one of these constructs by default:


Lower Word

Version 1:

...
movl %edx,%eax     # DP 1
andl $65535,%eax   # DP 1 + wait for EAX
cmpl $4623,%eax    # DP 1 + wait for EAX
...

Version 2:

...
movl %edx,%eax     # DP 1
cmpw $4623,%ax     # DP 1 + wait for EAX
...                #      + wait for AX

The real task is the comparison of the lower word in a compound datatype - in our case a message parameter - against a given number. The first version copies the message parameter stored in EDX to EAX, clears the upper word via AND and finally compares the result against the ID. It takes three clock cycles to execute this construct, because the processor has to wait for the updated content of EAX after each of the first two instructions. The second version copies the content of EDX to EAX, then compares AX against the ID. That's worse than the first version: loading the 32 bit register EAX and accessing its lower portion AX in the next instruction is rewarded with some extra clock cycles, because some internal processing known as register merging is required to make AX accessible for the comparison.

Actually, EDX already contains the entire message parameter, so there is no reason to copy it anywhere else. We also can assume the message parameter was loaded into EDX several clock cycles before, so no register merging is required. Therefore, the straight way

...
cmpw $0x120F,%dx   # DP 1
...

is the only proper solution, executed in one clock cycle, saving 9 (2) byte.


Upper Word

...
movl %esi,%eax     # DP 1
shrl $16,%eax      # DP 1 + wait for EAX
andl $65535,%eax   # DP 1 + wait for EAX
cmpl $523,%eax     # DP 1 + wait for EAX
...

The message parameter stored in ESI is copied to EAX, then EAX is shifted right by 16 bits. I do not know why the upper word is cleared a second time, but I do know this: it is a superfluous operation. Finally, the separated upper word is compared against an immediate value, in our case a notification ID sent by one of the controls.

Extracting the upper word from a doubleword is more complex, because there are no instructions to access it directly. For the following examples, it is assumed the message parameter is stored at location 0x8C[ESP]. The first possible solution is to preload the upper word like this:

...
movzwl 0x8E(%esp),%edx  # DP 4
...
...                     # 4 clocks distance
...
...
cmpl $0x010B,%edx       # DP 1
...

It is recommended to place the MOVZWL at the top of your function, right after all registers are saved. If you cannot keep the minimum distance of four clock cycles, a better solution might be

...
cmpw $0x010B,0x8E(%esp)     # DP 4
...

This one costs four clock cycles and should not be used if a lot of notifications are evaluated. If more than two comparisons are required, preloading saves a lot of clock cycles, even if it takes four clock cycles before the preloaded register can be accessed by the next instruction. All following comparisons are done in one clock cycle, so we save time rather than spend it.
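
Put together, a distributor that evaluates several notification codes might look like the following sketch; the IDs and target labels are hypothetical:

...
movzwl 0x8E(%esp),%edx  # preload the upper word once (DP 4)
...                     # save registers, do other preloads
cmpl $0x010B,%edx       # every further test costs one clock
je L30
cmpl $0x0101,%edx       # hypothetical notification IDs
je L31
cmpl $0x0300,%edx
je L32
...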


Optimisation 02

GCC has some bad manners, like the one shown in Conventions:

Wrong:

       movl %eax,%esi   # DP 1
       testl %esi,%esi  # DP 1
       je L8            # DP 1

Okay:

       movl %eax,%esi   # DP 1
       testl %eax,%eax  # DP 1
       je L8            # DP 1

Even if both snippets look like monozygotic twins, the first one is one clock cycle slower. Both versions copy the content of EAX to ESI, because we need it after calling some other functions. The decisive difference between both versions is the register we use to test if its content is equal to zero. While the first version uses ESI and thus has to wait until EAX has been copied to ESI, the second version uses EAX for testing. Without dependencies, the first two instructions of the second version are executed simultaneously, saving one clock cycle. If this code snippet sits within a loop running 32,768 times, we save 32,768 clock cycles with a simple optimisation.



Optimisation 03

In most cases, the direct path is the shortest way to move from A to B. Here is a snippet taken from the message procedure of ST-Open's V700 skeleton, translated into assembly language by GCC 3.3.5:

    ...
    movl 12(%ebp),%eax   # DP 3
    movl %eax,-52(%ebp)  # DP 3 wait
    cmpl $35,-52(%ebp)   # DP 4 wait
    je L25               # DP 1
    cmpl $35,-52(%ebp)   # DP 4 !
    ja L56               # DP 1
    cmpl $7,-52(%ebp)    # DP 4 !
    je L14               # DP 1
    cmpl $7,-52(%ebp)    # DP 4 !
    ja L57               # DP 1
    cmpl $1,-52(%ebp)    # DP 4 !
    je L15               # DP 1
    jmp L54              # DP 1
L57:cmpl $32,-52(%ebp)   # DP 4 !
    je L27               # DP 1
    jmp L54              # DP 1
L56:cmpl $41,-52(%ebp)   # DP 4 !
    je L17               # DP 1
    cmpl $41,-52(%ebp)   # DP 4 !
    ja L58               # DP 1
    cmpl $36,-52(%ebp)   # DP 4 !
    je L51               # DP 1
    jmp L54              # DP 1
L58:cmpl $79,-52(%ebp)   # DP 4 !
    je L24               # DP 1
    jmp L54              # DP 1
    ...

The message is copied from 12[EBP] to EAX, then EAX is copied to -52[EBP] to perform 10 comparisons of -52[EBP] against several message numbers. It is quite stupid to compare message numbers against a memory location if the message already was loaded into a register, but copying the content of one stack element to another to compare that copy against message numbers is the unbeaten top of stupidity. It seems these performance brakes were not wasting enough superfluous clock cycles, so GCC 3.3.5 decided to spice the distributor with pointless conditional jumps to force a lot of 'misses' of the branch prediction logic. Each 'miss' is rewarded with at least ten penalty cycles...

Applying some logic, the message distributor could look like this:

    ...
    movl 0x0C(%ebp),%eax     # DP 3
    movl 0x08(%ebp),%edi     # DP 3
    movzwl 0x10(%ebp),%ecx   # DP 4
    cmpl $0x4F,%eax          # DP 1
    je L24                   # DP 1
    cmpl $0x23,%eax          # DP 1
    je L25                   # DP 1
    cmpl $0x07,%eax          # DP 1
    je L14                   # DP 1
    cmpl $0x24,%eax          # DP 1
    je L51                   # DP 1
    cmpl $0x20,%eax          # DP 1
    je L27                   # DP 1
    cmpl $0x29,%eax          # DP 1
    je L17                   # DP 1
    cmpl $0x01,%eax          # DP 1
    jne L54                  # DP 1
    ...                      # WM_CREATE

Code size is reduced to about one third and it takes less than a fourth of the time to execute it. Please notice that preloading three registers in the second snippet is done in the same time it takes to execute the first line of the first snippet.

Because some hundred message numbers are defined, a jump table would occupy several KB, so we cannot avoid a distributor with conditional jumps. A disadvantage of conditional jumps is possible 'misses' of the branch prediction logic. As mentioned, a 'miss' causes at least ten penalty cycles (the processor has to flush the instruction cache and reload the proper instructions). We cannot avoid the one or other 'miss', so you should place the most often processed messages on top of the distributor. It doesn't matter if a WM_CREATE or WM_CLOSE procedure needs 20 clock cycles more or less, but the user will surely notice if required updates of the main window (repainting) are delayed.


Optimisation 04

Another snippet taken from ST-Open's V700 skeleton, translated into assembly language by GCC 3.3.5. Here, some control variables for the Loader are set to initial values before LDinit() is called. If the initialisation of the Loader fails, a message box with the error code is displayed. ST-Open's entire programming system depends on the Loader. It manages memory allocation, loading of fields for the database engine and other files. Without a running Loader, it is impossible to create a main window, because its position, size and control flags are stored in a file called SystemNumerics. Next, ST-Open's multilingual menu and dialog texts are stored in fields managed by ST-Open's database engine. No fields are available if the Loader is missing. Therefore, if the Loader is not initialised, the program is terminated with the mentioned message box, because it cannot access its data.

   LC3:.ascii "RC LDinit()\0" # CODE segment!

.globl _StInit
_StInit:
       pushl %ebp             # DP 3
       movl %esp,%ebp         # DP 1
       subl $8,%esp           # DP 1
       movl $1,_DEBUG         # DP 3
       movl $0,_DUMPLINE      # DP 3
       movl $0,_USE_LDF       # DP 3
       call _LDinit
       movl %eax,-4(%ebp)     # DP 3   why?
       cmpl $0,-4(%ebp)       # DP 4   why?
       je L56                 # DP 1
       subl $8,%esp           # DP 1   why?
       pushl $LC3             # DP 3
       pushl -4(%ebp)         # DP 3   why?
       call _debug
       addl $16,%esp          # DP 1   great!
       movl -4(%ebp),%eax     # DP 3   why?
       movl %eax,-8(%ebp)     # DP 3   why?
       jmp L55                # DP 1
   L56:movl $1,_OLH_MODE      # DP 3
       movl $0,-8(%ebp)       # DP 3   why?
   L55:movl -8(%ebp),%eax     # DP 3   why?
       leave                  # VP 3
       ret

Without an error: 24 clock cycles, otherwise 35.

This version of GCC has a pathological tendency to waste as much time as possible. It might be a good therapy for stressed managers to relax, have a drink and smoke a box of cigarettes or two while the machine is counting from one to ten...

The smart version:

       .data                  # DATA segment!
   LC3:.ascii "RC LDinit()\0"
       .text                  # CODE segment!
.globl _StInit
_StInit:
       movl $0x00,_DUMPLINE   # DP 3
       movl $0x00,_USE_LDF    # DP 3
       movl $0x01,_DEBUG      # DP 3
       call _LDinit
       testl %eax,%eax        # DP 1
       je 0f                  # DP 1
       subl $0x08,%esp        # DP 1
       movl %eax,0x00(%esp)   # DP 3
       movl $LC3,0x04(%esp)   # DP 3
       call _debug
       addl $0x08,%esp        # DP 1
     0:movl $0x01,_OLH_MODE   # DP 3
       ret

Without an error: 8 clock cycles, otherwise 12.

Only one third of the time is required to execute the smart version. Writing code like this reduces code size and executes much faster. Faster programs make users happy. Happy people are much better than stressed and angry people. Hence, Intelligent Design makes all people very happy, because it is the fastest software design they have ever seen... ;)


Optimisation 05

This is the About... box of many recent ST-Open programs:

.globl DlgAbout
DlgAbout:
       pushl      %ebp              # DP 3
       movl       %esp, %ebp        # DP 1
       pushl      %ebx              # DP 3
       pushl      %eax              # DP 3 ?
       movl       12(%ebp), %eax    # DP 3
       cmpl       $32, %eax         # DP 1
       movl       8(%ebp), %ebx     # DP 3
       movl       16(%ebp), %edx    # DP 3
       movl       20(%ebp), %ecx    # DP 3
       je         L4
       cmpl       $59, %eax         # DP 1
       je         L13

These four instructions are redundant
and violate the rule "Never write to stack elements above EBP!":

       movl       %ecx, 20(%ebp)    # DP 3
       movl       %edx, 16(%ebp)    # DP 3
       movl       %eax, 12(%ebp)    # DP 3
   L12:movl       %ebx, 8(%ebp)     # DP 3
       movl       -4(%ebp), %ebx    # DP 3
       leave                        # VP 3
       jmp        DefDP

       .p2align 2,,3
   L13:pushl      $155              # DP 3
       pushl      $151              # DP 3
       pushl      $150              # DP 3
       pushl      %ebx              # DP 3
       call       DLGtxt
       movl       %ebx, (%esp)      # DP 3
       call       CtrWn
   L11:addl       $16, %esp         # DP 1
       xorl       %eax, %eax        # DP 1
       movl       -4(%ebp), %ebx    # DP 3
       leave                        # VP 3
       ret

       .p2align 2,,3
    L4:cmpw       $4623, %dx        # DP 1
       je         L14

Here, GCC 3.3.5 does it again:

       movl       %ecx, 20(%ebp)    # DP 3
       movl       %edx, 16(%ebp)    # DP 3
       movl       $32, 12(%ebp)     # DP 3
       jmp        L12
   L14:subl       $12, %esp         # DP 1
       pushl      %ebx              # DP 3
       call       WinDD
       jmp        L11

This isn't just the worst code a compiler can generate. It also does very mean and dangerous things like overwriting passed parameters. Code like this might be usual in the evil world of virus or malware programmers, but it definitely does not belong to the world of proper and sane software.

The smart version:

.globl _DlgAbout
_DlgAbout:
       movl 0x08(%esp),%eax         # DP 3
       cmpl $0x3B,%eax              # DP 1
       je 0f
       cmpl $0x32,%eax              # DP 1
       jne _DefDP
       cmpw $0x1218,0x0C(%esp)      # DP 4
       jne _DefDP
       pushl 0x04(%esp)             # DP 4
       call _WinDD
       addl $0x04,%esp              # DP 1
       ret
     0:pushl $0x9B                  # DP 3
       pushl $0x97                  # DP 3
       pushl $0x96                  # DP 3
       pushl 0x10(%esp)             # DP 4
       call _DLGtxt
       call _CtrWn
       addl $0x10,%esp              # DP 1
       ret

This is the smart version of DlgAbout(). It's not the fastest, but definitely the most compact form. The fallback to five PUSH instructions is one of those few exceptions where PUSH is superior to MOV, because we can PUSH, but not MOV, the content of a memory location to another memory location with a single instruction. If we want to do it with MOV, we have to use a register where the data is stored temporarily. We execute WM_INITDLG only once, and six clock cycles (3 nanoseconds) more or less cannot be recognised by the user.
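
As a side note, here is the difference in a nutshell - a hedged sketch with arbitrary offsets, not taken from DlgAbout():

       pushl 0x10(%esp)          # memory to memory with a single instruction
                                 # (but ESP changes and must be corrected later)

       movl 0x10(%esp),%eax      # the MOV equivalent needs a scratch register...
       movl %eax,0x00(%esp)      # ...and a second instruction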

Exceptions, of course, should not become the default behaviour. Most functions in a program are more complex than popping up a simple dialog with only one task: "Destroy yourself after the user pushed that OK button."


Optimisation 06

Finally, the optimised version of the function we discussed while talking about some Caveats of conventional programming techniques and possible Improvements. You have seen a small snippet of the optimised version while I introduced the Intelligent Design Rules to you. Now, here is the entire code, spiced with some comments explaining why things have to be done exactly at this point and nowhere else.

        .data

        .p2align 4,0x00
    jt0:.long  L02
        .long  L03
        .long  L04
        .long  L05
        .long  L06
        .long  L07
        .long  L08
        .long  L09
        .long  L10
        .long  L11
        .long  L12
        .long  L13
        .long  L14
        .long  L15
        .long  L08
        .long  L09
        .long  L10
        .long  L11
        .long  L12
        .long  L13
        .long  L14

Jump tables should never be placed anywhere in the code segment - they belong to the data segment! All 'dead' jump targets were removed after updating the resource IDs in the resource definition file, reducing the table size by 40 byte (about thirty percent).

        .text

        .align 2,0x90
.globl MoveDlg
MoveDlg:subl   $0x3C,%esp
        nop
        nop
        movl   %ecx,0x30(%esp)
        movl   %edi,0x34(%esp)
        movl   %esi,0x38(%esp)
        movl   0x40(%esp),%edi
        movl   0x44(%esp),%eax
        movzwl 0x48(%esp),%ecx
        movl   _BNR,%esi
        cmpl   $0x30,%eax
        je     L01
        cmpl   $0x20,%eax
        je     L00
        cmpl   $0x3B,%eax
        jne    L15

Our distributor was taken from the improved C version; the frequently required base address of the global variables (SystemNumerics) is preloaded into ESI. Preloading parameters speeds up execution of subfunctions, because we can access the contents of the corresponding registers without delay whenever they are needed. Placing our loads as listed above keeps all pipes busy. While the first three instructions read data from the L1 cache, it is not very likely that the content of _BNR is present in the L1 cache yet. We can assume that the first three loads have a latency of three clock cycles and only the last load has to access main memory. Therefore, the message number surely is present in EAX before the first comparison is executed.

        movl   0x1C54(%esi),%eax
        movl   0x01D8(%esi),%ecx
        movl   $0x00,0x28E0(%esi)
        movl   %eax,0x00(%esp)
        movl   $0x0B,0x04(%esp)
        movl   $0x00,0x08(%esp)
        movl   $0x02,0x0C(%esp)
        movl   $0x00,0x10(%esp)
        movl   %edi,0x14(%esp)
        call   _FDacc
        movl   $0x04,0x08(%esp)
        movl   %ecx,0x14(%esp)
        call   _FDacc

Here, the order of instructions was rearranged to keep all execution pipes busy. In our case, there is only one critical instruction. The remaining instructions don't cause dependencies, so we can place them anywhere in our preload sequence. Placing the line movl 0x1C54(%esi),%eax on top, there still are two other preloads between loading and using EAX. All mov instructions have a latency of three clock cycles, keeping a sufficient distance between load and store. The worst case scenario - 0x01D8[ESI] and 0x28E0[ESI] are present in L1 cache while 0x1C54[ESI] is not - delays execution by 15 clock cycles, because the 4th instruction blocks one pipe for 24 (27 for line one minus 3 for lines two and three) clock cycles, while the following instructions are executed in the two remaining pipes in nine clocks (our stack definitely is present in L1 cache whenever MoveDlg() is called). Before we call FDacc() the second time, only the two changing parameters are updated, saving four superfluous write operations.

        movl   %edi,0x00(%esp)
        movl   $0x1240,0x04(%esp)
        movl   $0x0120,0x08(%esp)
        movl   $0xFFFFFFFF,0x0C(%esp)
        call   _SnDIM
        movl   $0x1244,0x04(%esp)
        call   _SnDIM
        movl   $0x1246,0x04(%esp)
        call   _SnDIM
        movl   $0x124E,0x04(%esp)
        call   _SnDIM
        movl   $0xD2,0x04(%esp)
        movl   $0xD3,0x08(%esp)
        movl   $0xE9,0x0C(%esp)
        call   _DLGtxt
        call   _CtrWn
        call   _DlgShow
        jmp    L16

Performing only the really required write operations, the improved version listed above gets by with just ten parameter writes - less than a third of GCC's draft with its 26 push instructions and six obligatory corrections of the stack pointer.

    L00:subl   $0x1231,%ecx
        je     0f
        decl   %ecx
        je     1f
        decl   %ecx
        jne    L15
        movl   $0x11,0x00(%esp)
        call   _Help
        jmp    L16
      0:movl   $0x00,0x28E0(%esi)
        jmp    2f
      1:orl    $0x00040000,0x28E0(%esi)
      2:movl   %edi,0x00(%esp)
        call   _WinDD
        jmp    L16

All compares with immediate values were replaced by a sub and two dec instructions. The one byte instruction dec reduces code size and relieves the processor's prefetch mechanism.

    L01:subl   $0x1240,%ecx
        js     L15
        cmpl   $0x14,%ecx
        ja     L15
        movl   0x28E0(%esi),%eax
        jmp    *jt0(, %ecx, 4)

Moving jump tables to the data segment is an obligatory exercise for all functions following Intelligent Design standards. Separating code and data avoids a lot of caveats coming along with mixing them together in the code segment. There's surely a reason why AMD and LETNi advise programmers against storing data of any kind in the code segment. By the way: The GNU assembler AS supports jump tables in the data segment. There's no reason to store them in the code segment as practised by GCC!

    L02:andl   $0xFFFFE1FF,%eax
        orl    $0x1000,%eax
        jmp    0f
    L03:andl   $0xFFFFE1FF,%eax
        orl    $0x0800,%eax
        jmp    0f
    L04:andl   $0xFFFFE1FF,%eax
        orl    $0x0400,%eax
        jmp    0f
    L05:andl   $0xFFFFE1FF,%eax
        orl    $0x0200,%eax
        jmp    0f
    L06:andl   $0xFFFFFE7F,%eax
        orl    $0x0100,%eax
        jmp    0f
    L07:andl   $0xFFFFFE7F,%eax
        orl    $0x80,%eax
        jmp    0f
    L08:andl   $0xFFFFFF80,%eax
        orl    $0x40,%eax
        jmp    0f
    L09:andl   $0xFFFFFE80,%eax
        orl    $0x20,%eax
        jmp    0f
    L10:andl   $0xFFFFFE80,%eax
        orl    $0x10,%eax
        jmp    0f
    L11:andl   $0xFFFFFE80,%eax
        orl    $0x08,%eax
        jmp    0f
    L12:andl   $0xFFFFFE80,%eax
        orl    $0x04,%eax
        jmp    0f
    L13:andl   $0xFFFFFE80,%eax
        orl    $0x02,%eax
        jmp    0f
    L14:andl   $0xFFFFFE80,%eax
        orl    $0x01,%eax
      0:movl   %eax,0x28E0(%esi)
        jmp    L16

Sorry, but the only solution for this problem would have been to change the definition of all flags to support automated evaluation. Obviously, this is not the case, and we are not allowed to replace existing definitions without asking first...

    L15:movl   0x30(%esp),%ecx
        movl   0x34(%esp),%edi
        movl   0x38(%esp),%esi
        addl   $0x3C,%esp
        jmp    _DefDP

One version of this exit is enough. GCC's draft, once split up into two absolutely identical subfunctions, was reduced to the really necessary code, keeping five (27.78 percent) of eighteen instructions. Even though we removed thirteen (redundant) instructions, this subfunction still works fine.

    L16:movl   0x30(%esp),%ecx
        movl   0x34(%esp),%edi
        movl   0x38(%esp),%esi
        addl   $0x3C,%esp
        xorl   %eax,%eax
        ret

.comm _GVAR,4

This is at least as fast as three pop and one leave instruction!




12 - Appendix 1

The first appendix supplies you with a collection of examples showing how to design new functions from scratch. Some of them might be used as templates - just C&P them to your source files. The remaining functions may be used as a hint how to solve some specific problems. Practice is the best teacher you can get. Read the examples, then start coding your own stuff. While debugging it, you will learn much more than a teacher ever could show you. Without errors, you never get a deeper insight into how something really works.


Example 01

No registers, local variables or parameters are required. All three areas therefore have a size of zero and the stack pointer stays untouched:

.globl _Brd
  _Brd:movl 0x04(%esp),%eax   # block address
       addl 0x08(%esp),%eax   # + offset
       movzbl 0x00(%eax),%eax # Byte[block+offset]
       ret

Brd() was one of the few remaining crutches for C programmers provided by ST-Open's libraries. Even if the function is a convincing example of really bad code, this construct still is way faster than any equivalent generated by a C compiler.

Assembly language programmers do not need external functions to access data at an offset from a base address, of course. Loading the block address at least three clock cycles before we start to access data relative to it is all we have to do:

       ...
       movl 0x04(%esp),%ecx    # block address
       ...
       ...                     # 3 clocks distance!
       ...
       movzbl 0x1234(%ecx),%eax # DB @ 0x1234[block]
       ...

Code like this prevents the other two execution pipes from taking a nap while the first pipe is busy with copying the block address to ECX.

Side Note: 'Three clock cycles away' does not mean 'three instructions away'! Not all instructions have a latency of three clock cycles, so it might be necessary to fill the dotted lines with much more than just two instructions. For example, direct manipulations of registers without memory operands - like xorl %eax,%eax - generally are executed within one clock cycle. We would have to insert six of them to feed the other two pipes for the three clock cycles the first pipe is busy with loading the block address into ECX. You have to calculate the latencies of all used instructions and keep them in the proper order to prevent interruption of simultaneous execution in all three pipes.


Example 02

Here we store two registers and pass four parameters to the API. The size of our stack frame therefore is 8 + 16 = 24 byte, rounded up to the basic 0x3C stipulated by rule 2:

.globl _WinPP
_WinPP:subl $0x3C,%esp
       nop
       nop
       movdqu 0x40(%esp),%xmm0
       movl %edx,0x10(%esp)
       movl %ecx,0x14(%esp)
       movdqu %xmm0,0x00(%esp)
       call _WinSetPresParam
       movl 0x10(%esp),%edx
       movl 0x14(%esp),%ecx
       addl $0x3C,%esp
       ret

WinPP() is one of many 'sandboxes', only called to save ECX and EDX and restore them after the API destroyed them. To speed up execution and save six MOV instructions with memory references, two MOVDQU instructions are used. XMM registers can hold four doublewords, so all four parameters can be copied in one gulp.


Example 03

Finally, we store one register, use a 32 byte string and call two functions - the first has three parameters to pass, the second one. Hence, the size of our stack frame is 4 + 32 + 12 = 48 byte. We add a safety gap and subtract 0x3C from ESP, which together with the stored return address gives a full 64 byte frame:

.globl _GetSz
_GetSz:movl 0x04(%esp),%eax
       subl $0x3C,%esp
       nop
       movl %ebx,0x38(%esp)
       leal 0x0C(%esp),%ebx
       movl %eax,0x00(%esp)
       movl $0x1234,0x04(%esp)
       movl %ebx,0x08(%esp)
       call _QEf
       movl %ebx,0x00(%esp)
       call _SLen
       movl 0x38(%esp),%ebx
       addl $0x3C,%esp
       ret

GetSz(h) is called with the dialog's window handle as its only parameter. The content of the entryfield specified by the dialog's window handle and a fixed resource ID (0x1234) is queried via the sandbox QEf(). Our temporary string buffer occupies the area 0x0C[ESP] through 0x3B[ESP]. Entryfields are limited to 32 byte by default (OS/2), so the buffer is large enough to prevent QEf() from overwriting other data. To determine the size of the returned string, SLen(), a function provided by ST-Open's main library, is called. Finally, the string size returned in EAX is passed back to the caller.

We need the window handle for the first call only, so it is copied to EAX before the stack frame is created. This saves extra code and clock cycles to save and restore an additional register. SLen() is a standard function provided by ST-Open's main library. It could be replaced by this alternative:

       ...
       call _QEf
       xorl %eax,%eax
     0:cmpb $0x00,0x00(%ebx)
       je 1f
       incl %ebx
       incl %eax
       jmp 0b
     1:movl 0x38(%esp),%ebx
       addl $0x3C,%esp
       ret

Replacing SLen() with equivalent code again saves the 8 clock cycles for the CALL/RET sequence and for preloading registers in the called function. To reduce overhead, you might consider replacing the tasks of less complex functions like SLen() with your own code to save external calls.


Example 04

This example shows how local variables can be aligned to 16 byte boundaries, as required for XMM instructions like MOVDQA and friends. Actually, it is impossible to align ESP directly, so we have to sacrifice a general purpose register. Because we do not know which multiple of four ESP is currently aligned to, we have to add a 16 byte safety gap to the value we subtract from ESP. In lng.S, a part of ST-Open's libraries, we can find the following code. Please notice that this code does not fully comply with Intelligent Design rules - it creates a stack frame not aligned to a multiple of 64!

       ...
       .align 2,0x90
.globl _MNUtxt
_MNUtxt:
       subl $0x50,%esp
       nop
       nop
       movl %ebp,0x4C(%esp)
       movl %esi,0x48(%esp)
       movl %edi,0x44(%esp)
       movl %ebx,0x40(%esp)
       movl %ecx,0x3C(%esp)
       movl %edx,0x38(%esp)
       movl _BNR,%esi
       leal 0x10(%esp),%edi
       movl 0x58(%esp),%ebx
       movl 0x5C(%esp),%ecx
       movl 0x20(%esi),%edx
       andl $0xFFFFFFF0,%edi
       pxor %xmm0,%xmm0
       subl %ebx,%ecx
       jns 0f
       movl $0x0A,%eax
       jmp L00
       /*
         load field FFFFFF12
       */
     0:andl $0x0F,%edx
       movq %xmm0,0x00(%edi)
       movl $0xFFFFFF12,0x08(%edi)
       movl $0x00000003,0x0C(%edi)
       movdqa %xmm0,0x10(%edi)
       movq %xmm0,0x20(%edi)
       movl %edx,0x20(%esi)
       movl %edi,0x00(%esp)
       call _LDreq
       testl %eax,%eax
       jne L00
       ...

This function passes the address of an LD structure to LDreq(). The LD structure is only required for this call, so we create it in our stack frame on the fly. Because most of the parameters should be set to zero, we clear them with XMM instructions, saving five movs. To align the structure, we consider the following facts: EDI cannot end with any other number than 0, 4, 8 or C - the only possible multiples of four in hexadecimal notation. If it is 0, it already is aligned. If it is any other number x, we have to add the difference (16 - x) to the required offset, moving the offset to the beginning of the next 16 byte boundary. Calculations like this definitely take too much time, so we use a trick: we add the largest possible number to EDI, then AND the new content of EDI with the pattern 0xFFFFFFF0. If ESP currently points to address 0x0003FEC4, the required structure starts at 0x04[ESP] => 0x0003FEC8. We add 0x0C + 0x04 = 0x10 to move the offset to a safe region. Now we do a leal 0x10(%esp),%edi. This loads 0x0003FED4 into EDI. The final andl $0xFFFFFFF0,%edi clears the lowest 4 bits of the address, leaving 0x0003FED0 - a properly aligned address with sufficient safety distance from the parameter at the bottom of our stack frame.

MNUtxt() is a part of ST-Open's language support. Depending on the current language, entries of the corresponding subfield are copied to the menu items specified by their ID. Up to 16 languages can be stored in fields FFFFFF12 (user) and FFFFFF13 (system) and the user can switch between them in the running program.



11 - Rules

Whenever we want to build a greater whole out of many independent parts created by individual contributors, we have to use a common set of rules. Anyone contributing a part to the whole has to obey these rules. If all contributors applied their own rules, we would finally end up with a lot of parts, but none of them would be able to work together with any other part. Using different interfaces, single parts either could not communicate at all or would misunderstand received messages completely. Therefore, we need a set of rules before we start to create parts of a greater whole. This set of rules, like a common language, defines a common interface, forcing any contributor to use standardised communication protocols, so other parts are able to understand all sent messages and do the right work. The big idea behind any set of rules lives or dies with strict obedience.


Rule 1

The instructions enter, leave, push and pop are archaic remains of a past era. They must not be used in Intelligent Design code.

Any of these instructions changes the content of the stack pointer automatically. All Intelligent Design functions depend on one principle: the stack pointer must not change its content for their entire runtime. We can either enjoy the advantages of a static stack pointer or waste precious time with obligatory corrections and repetitive writes of one and the same parameter. Using a 'flying' stack pointer and its urgently required companion called base pointer is mutually exclusive with Intelligent Design. Both concepts use the stack pointer in a very different way - there is no chance to mix both concepts in one and the same function.
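
For illustration, a hedged sketch of passing one parameter both ways; _Func is a hypothetical callee, and the Intelligent Design variant assumes the stack frame was already created in the prologue:

        pushl %eax             # conventional: ESP changes...
        call _Func
        addl $0x04,%esp        # ...and has to be corrected afterwards

        movl %eax,0x00(%esp)   # Intelligent Design: ESP stays untouched
        call _Func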


Rule 2

Stack frames are created by subtracting 0x3C in 32 bit functions or 0x38 in 64 bit functions from the stack pointer ESP/RSP. If larger stack frames are required, the basic value 0x3C or 0x38 can be expanded in 64 byte steps (0x7C, 0xBC, 0xFC, ... in 32 bit functions, 0x78, 0xB8, 0xF8, ... in 64 bit functions).

This rule is valid for IDEOS, only. Any other operating system will not benefit from the advantages of a properly aligned stack. Because the stack generally isn't aligned in old fashioned operating systems, you have to align it with some lines of redundant bloat if you use XMM registers.

If you write IDEOS code, you have to apply this rule. The only exception is the rare case that you do not use registers other than rAX and XMM0 through XMM3. If there are registers to preserve, you need a stack frame. The advantage of this rule is a properly aligned stack pointer, starting at the beginning of a cache line. IDEOS's concept guarantees that the stack pointer is naturally aligned to a multiple of 64 before any function is called. Because the call instruction stores EIP/RIP in the next stack element, subtracting the correct value automatically aligns the stack pointer to a multiple of 64, and the corresponding cache line is preloaded into L1 cache without a single line of additional code.
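
A minimal 32 bit skeleton following rule 2 might look like this; the function name is hypothetical, and 0x3C plus the 4 byte return address yields the 64 byte frame:

.globl _Skeleton
_Skeleton:
        subl $0x3C,%esp        # create the smallest possible stack frame
        ...                    # save used registers, preload parameters, work
        addl $0x3C,%esp        # release the frame
        ret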


Rule 3

All registers used in a function must be saved before they are overwritten and must be restored before returning to the caller.

Any software running on modern processors is most efficient if it can access as many hardware resources as possible. The most valuable resources of an x86 processor are its registers. The more frequently used parameters are held in registers, the less time is required to perform a given task. If we reload parameters with slow memory reads rather than keep them in registers, we slow down our code with superfluous operations and interrupt simultaneous execution completely. Preloading all required parameters into registers in one gulp reduces all dependencies, supporting parallel execution and keeping all three pipes busy all of the time.

Saving precious clock cycles with much faster register operations outweighs the cost of the few clock cycles needed to save and restore all used registers by far. Many functions work with a subset of all available registers. The fewer registers we use, the faster they are saved and restored. Another welcome side-effect is based on the fact that continuous writes to ascending addresses trigger write combining, where up to 16 doublewords are written to a cache line in one gulp. This surely is much faster than 16 single writes. Moreover, pairing of multiple reads or writes is a good opportunity to support simultaneous execution flow as long as possible.


Rule 4

The registers RAX, XMM0, XMM1, XMM2 and XMM3 are reserved for special purposes - they neither are saved nor restored by default.

This rule partially overrides rule 3:

Per definitionem, register RAX is used to pass return values or error codes to the calling function. It does not make sense to save a register that is overwritten in all epilogues by default. If a function declaration defines its return value as VOID, rAX must be cleared with a simple xor %rax,%rax on exit.

Registers XMM0 through XMM3 with their combined size of 4 * 128 bit = 64 byte cover an entire cache line. They are used to clear, move or manipulate large data areas in steps of 64 byte, preferably aligned to a multiple of 64. In general, these operations are executed in small loops where no other functions are called, so we can skip saving and restoring these registers per se. This saves a lot of clock cycles if your application makes heavy use of such operations. If your function calls others while manipulating large memory blocks, the called function might use XMM0 through XMM3 as temporary storage, overwriting whatever was stored in these registers. Hence, it is recommended to use XMM4 through XMM15 if preloaded values must not change while other functions are called. As stipulated by rule 3, you have to save and restore XMM4 through XMM15 if you use them.
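
As an illustration, a hedged sketch that copies one full cache line with XMM0 through XMM3, assuming ESI and EDI point to 16 byte aligned source and destination blocks:

        movdqa 0x00(%esi),%xmm0   # read an entire cache line (64 byte)...
        movdqa 0x10(%esi),%xmm1
        movdqa 0x20(%esi),%xmm2
        movdqa 0x30(%esi),%xmm3
        movdqa %xmm0,0x00(%edi)   # ...and write it back in one gulp
        movdqa %xmm1,0x10(%edi)
        movdqa %xmm2,0x20(%edi)
        movdqa %xmm3,0x30(%edi)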


Rule 5

Accessing registers is much faster than accessing memory locations. The most frequently used parameters in a function should be preloaded into registers as soon as these registers were saved on the stack.

Even if old-fashioned programmers aren't used to working with multiple execution pipes: obeying this rule can turn an ox cart into a rocket (see Rule 3).


Rule 6

Modern processors execute three (LETNi: 1.5) instructions simultaneously. Good code should make use of these capabilities rather than ignore them. Dependency chains interrupt simultaneous execution. While one instruction is executed in one pipe, the other two pipes are waiting for the result. Sending two of three pipes to sleep for some clocks wastes two thirds of the available resources.

This extraordinary product of a classical compiler

        ...
    pushl    %esi
    pushl    $0
    pushl    $2
    pushl    $0
    pushl    $11
    movl     _GVAR, %eax
    movl     7252(%eax), %eax
    pushl    %eax
    call     _FDacc
    movl     _GVAR, %eax
    addl     $24, %esp
    movl     472(%eax), %ebx
    pushl    %ebx
    pushl    $0
    pushl    $2
    pushl    $4
    pushl    $11
    movl     7252(%eax), %ecx
    pushl    %ecx
    call     _FDacc
    addl     $20, %esp
    movl     _GVAR, %eax
    movl     $0, 10464(%eax)
        ...

can be improved with a few useful changes

        ...
        movl _GVAR,%esi    # at least 3 clocks ahead
        ...
        ...
        ...
        movl 0x1C54(%esi),%eax
        movl 0x01D8(%esi),%ecx
        movl $0x00,0x28E0(%esi)
        movl %eax,0x00(%esp)
        movl $0x0B,0x04(%esp)
        movl $0x00,0x08(%esp)
        movl $0x02,0x0C(%esp)
        movl $0x00,0x10(%esp)
        movl %edi,0x14(%esp)
        call _FDacc
        movl $0x04,0x08(%esp)
        movl %ecx,0x14(%esp)
        call _FDacc
        ...

to run at least twice as fast. Removing all superfluous instructions, code size could be reduced from 23 to 14 instructions. Placing all instructions in the proper order removes dependencies and keeps all pipes busy without delays.


Rule 7

Jump tables belong to the data segment. AS is able to build jump tables in the data segment, so there's absolutely no reason to pollute the code segment with data of any kind.

No comment is required for this rule. Just do it!




10 - Summary

I hope I could convince you of the various advantages introduced with the development of Intelligent Design. Compared against conventional programming techniques (as discussed in this document), Intelligent Design wins in all disciplines - be it sheer speed, high code density, simplified development or efficiency. The strict renunciation of the slow leave, pop and push instructions, which come along with many ill side-effects like an unpredictable stack pointer, turns these caveats of conventional software design into some really welcome side-effects, simply by replacing those slow instructions with simple movs. The most welcome side-effect surely is the fact that we get EBP back as a general purpose register. One additional register is quite a lot if only six of them are available because EBP was abused as base pointer. Each additional register is an advantage, because we can pair reads to copy parameters from memory to registers and access these registers rather than reloading single registers with some frequently used parameters over and over again.

Another advantage is that we mov parameters to the right location rather than push them onto the stack. While mov instructions can store changed parameters anywhere, a push always addresses the next lower stack element (a quite limited range). If we call a function repeatedly, and only the last parameter changes from call to call, we would have to push all parameters again for each call just to update this topmost parameter with the current data. Moreover, the correction of ESP is mandatory after each call, blocking the following push for an entire clock cycle. Using mov instructions, all these problems vanish. We save many unnecessary instructions and keep the three execution pipes busy most of the time. Of course, there are a lot of cases where only one or two instructions are required between two calls, but this still is faster than a conventional push orgy.

One more advantage is the fact that we can mov parameters and other data onto the stack anywhere within our source file. Pairing multiple reads in groups, we can avoid direct dependencies completely. Pairing multiple writes, we trigger write combining. Keeping proper distances between reads and writes can eliminate most dependencies, with few exceptions. Applying these tricks as often as possible speeds up code markedly.

Putting it all together, Intelligent Design is an up-to-date alternative to old-fashioned conventional programming techniques. Conventional software design did not change too much in the last thirty years. We can neither create the most simple integrated circuit with stone age tools, nor are we able to create software making full use of the capabilities provided by modern processors with conventional code. Old fashioned code is spiced with mandatory brake pads, and compilers generate tons of counter-productive dependencies, because the convention allows abusing the most valuable resources - our registers - as a garbage pile to save four instructions at the cost of multiple reloads. Intelligent Design replaces these obstacles with straight and clean rules, providing full support for the features and improvements of recent processors. Intelligent Design is an up-to-date concept for the next generation of operating systems and applications, introducing a new quality to software design. Moreover, it is much easier to handle a single stack pointer than to manage two registers, a separate stack and base pointer, simultaneously. Getting rid of the institution called base pointer, we finally can say good bye to positive and negative offsets and error prone corrections of the stack pointer after each of the obligatory push orgies. Another positive side-effect is the return of a valuable resource (EBP) to the very sparse pool of general purpose registers - the worst conceptual flaw of LETNi's x86 architecture.

All in all, Intelligent Design points the way ahead to fastest code with high density, causing fewer headaches for stressed programmers. Best of all: this is not just a theoretical construction floating around in the head of a weird, unworldly developer - it exists for real! ST-Open's libraries as well as DatTools and the SRE editor were ported to Intelligent Design, working as expected: really fast and reliable.



09 - First Steps

While the previous pages introduced the basic concepts of Intelligent Design to you, it is time to fill this abstract theoretical building with life and show how the described methods and techniques are applied to real life applications. To get in touch with Intelligent Design, an example function is developed step by step.

Two things should be stressed: all statements and comments assume that this function is run on a recent operating system with support for recent programming techniques like Intelligent Design, e.g. IDEOS. If you run the example on an old fashioned platform like Linux, OS/2 or Windows, there is a 75 percent chance to encounter a nice crash while the first XMM instruction is executed. Furthermore, Intelligent Design is optimised for the most mature processor design, AMD's Athlon64. It is the only machine with three execution pipes for integer and another three execution pipes for floating point instructions. This makes any Athlon64 superior to LETNi's I-am-so-slowium processors with their one and a half execution pipes for all (integer and FP) instructions. As a matter of fact, Intelligent Design code runs faster on a slower Athlon64 than on a faster I-am-so-slowium.

(Sidenote: Unfortunately, I had to remove all comments in the code snippets because of the limited formatting abilities of the blog engine.)


Problem

Eight entry fields in a dialog are to be filled with four hexadecimal (subfields 00...03, IDs 0x1200...0x1203) and two decimal (subfields 04/05, IDs 0x1204/0x1205) numbers as well as two strings (subfields 06/07, IDs 0x1206/0x1207). All data are stored in the given subfields of datafield 0000001E. Entries are selected via keyboard and evaluated, the selected entry number is stored in a global variable with the symbolic name CURSEL (address 0x0240[BNR]), then our function is called to display all data of the chosen entry in a dialog. The datafield is loaded permanently, its MemoryHandle is stored in a runtime variable with the symbolic name MH_SEL (address 0x2028[BNR]). It has 1,024 entries with 6 subfields (00-05) of type 03 (doubleword) and 2 subfields (06, 07) of type 07 (dynamic strings with garbage collection).


Solution

Even if this is a trivial problem, there are many ways to solve it. The preferred solution in typical C code is a loop executed eight times where some logic switches between three possible output types (hex, decimal, string). It is not the best way to solve the given problem - comparisons and repeated writes are quite slow.

To avoid repetitive work and speed up execution markedly, it's a much better idea to unroll the suggested loop completely. Unfortunately, it is impossible to apply real improvements if we use high level languages like C. Well, we could add some lines of inline code, but that's like using tape to fix a broken screw. The only way to get anywhere is writing our code in assembler. No existing compiler can see connections between called functions, nor is it able to detect how many dependencies are created with the code it generates. It might be possible to write a compiler with capabilities like that, but the result would be bloated and, first of all, extremely slow. A human can solve really complex problems much better, especially if the code is written in pure assembler.


Prologue

The usual conventional prologue

.globl _MyFunction
_MyFunction:
        pushl %ebp
        movl %esp,%ebp
        subl $0xC0,%esp
        pushl %ebx
        pushl %edi
        pushl %esi
        ...

looks like this in Intelligent Design

.globl _MyFunction
_MyFunction:
        subl $0xFC,%esp
        nop
        nop
        movl %ebp,0xE4(%esp)
        movl %esi,0xE8(%esp)
        movl %edi,0xEC(%esp)
        movl %edx,0xF0(%esp)
        movl %ecx,0xF4(%esp)
        movl %ebx,0xF8(%esp)
        ...

The difference between both versions is obvious at first sight. What is going on 'under the hood' cannot be seen at all, but it can be shown in form of a graphical image. Figure 0E shows a snapshot of a conventional stack frame,



figure 0F (right) a snapshot of an Intelligent Design stack frame soon after the above prologues were executed.



Because conventional code is based on a flying stack pointer with random content, it is impossible to determine the current alignment of the stack pointer. Conventional code uses old fashioned push - call - addl $x,%esp sequences as well, so ESP/RSP frequently changes its current content. These disadvantages are the reason why we need a second register to address stack elements with reliable accuracy. The question marks in figure 0E emphasize the undetermined content of the stack pointer - an artificial feature of conventional code. Obviously, the visible difference between conventional and Intelligent Design stack frames is the naturally aligned order of the latter, one of many built-in and cost-free (in terms of additional clock cycles) Intelligent Design features.

The second feature of Intelligent Design code is its prologue. It generally starts with the subtraction of the stack frame's size from ESP/RSP. The nops between the sub and mov instructions are required to feed the other two execution pipes while the content of ESP/RSP is set to the bottom of our stack frame. It is important to keep the remaining pipes busy for exactly this one clock cycle, because we need ESP/RSP to address the stack elements where our registers are stored. Inserting two nops keeps all pipes busy. The simultaneous flow in all execution pipes is not interrupted, so up to three instructions are executed at any time. Hint: Both nops should be seen as placeholders for more useful instructions. For example, we could replace them with prefetch instructions to preload memory areas into the L1 cache, so we can access these areas immediately after saving all used registers.
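
As an illustration, a minimal sketch of such a replacement - _SomeData is a hypothetical label for a memory area we want to access right after the register saves and is not part of the example above:

.globl _MyFunction
_MyFunction:
        subl $0xFC,%esp
        prefetcht0 _SomeData        # replaces the first nop: pull _SomeData into the L1 cache
        prefetcht0 _SomeData+64     # replaces the second nop: prefetch the next cache line
        movl %ebp,0xE4(%esp)
        ...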

Step two: Every Intelligent Design prologue saves all used registers. These are EBX, ECX, EDX, EDI, ESI, EBP and XMM4 through XMM15. The return address to the calling function is written to the topmost element in our stack frame while the call _MyFunction instruction in our caller's code is executed. Storing the content of EIP/RIP in this element automatically loads the topmost cache line of the new stack frame into the L1 cache, so this cache line should be present at the time we start to save our registers there. Writing to the L1 cache is the fastest way to store data, and writing data in ascending order to contiguous locations speeds up execution further, because a mechanism called write combining kicks in, allowing an entire cache line to be written in one gulp. Therefore, we store registers in ascending order to speed up our function. As shown on the next page, we try to fill the topmost stack elements up to the return address without gaps to force the processor to use write combining rather than single movs. With all these improvements applied, the conventional prologue does not gain any huge advantage over the Intelligent Design prologue. Even if it only pushes half of the registers, the conventional prologue is executed a mere two or three clock cycles faster than the Intelligent Design prologue, which moves twice as many registers onto the stack. This small advantage disappears completely if the first parameter must be reloaded because ECX and EDX were destroyed by a called function. If parameters must be reloaded more than once, the advantage turns into a handicap, losing some clocks with every reload. Conventional techniques do not speed up functions - they slow them down.
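
To illustrate the reload problem, this is a sketch of what conventional code typically has to do after a called function destroyed ECX and EDX - _SomeFunction is a hypothetical callee, and the offsets assume the usual cdecl frame with the first two parameters at 0x08(%ebp) and 0x0C(%ebp):

        ...
        call _SomeFunction          # hypothetical call - ECX and EDX may be destroyed here
        movl 0x08(%ebp),%ecx        # reload parameter 1 from the conventional stack frame
        movl 0x0C(%ebp),%edx        # reload parameter 2 - a few clocks lost with every reload
        ...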


Preparation

Let's start with some real work. The standard prologue was modified slightly to suit our needs. It now looks like this:

.globl _ShowUs
_ShowUs:subl $0xFC,%esp
        movl _BNR,%eax
        nop
        movl %ebp,0xE4(%esp)
        movl %esi,0xE8(%esp)
        movl %edi,0xEC(%esp)
        movl %edx,0xF0(%esp)
        movl %ecx,0xF4(%esp)
        movl %ebx,0xF8(%esp)
        movl MH_SEL(%eax),%esi
        movl CURSEL(%eax),%ebp
        movdqa %xmm6,0xC0(%esp)
        movdqa %xmm7,0xD0(%esp)
        ...



We need eight registers to preload all required parameters for later use. The set of general purpose registers is limited, so two XMM registers are used to fill the gap (nice to have them). We just need the lowest 32 bit of XMM6 and XMM7, and the movd instruction used to load them clears all bits above (32 through 127) as well. Whenever the contents of XMM4 through XMM15 change, these registers must be saved initially and restored on exit (like all general purpose registers except EAX). Any of them might hold preloaded parameters of another function, so it would be a bad idea to overwrite them with garbage - this might cause fatal malfunctions or crashes whenever we return to that function. Keep in mind that FP instructions are executed in a separate unit, so both movdqa instructions are executed simultaneously with the six mov instructions. As a matter of fact, they do not cost a single additional clock cycle!

Working with ST-Open's libraries, the base address of the global variables (_BNR or _GVAR) is required in most cases. This datafield is automatically loaded by _LDinit and is available for the entire runtime of the program. When the program is shut down, _LDexit is called during termination and the permanent portion of the datafield (4,096 byte = 1,024 variables) is stored before the datafield is released. In ST-Open slang, this datafield is called 'SystemNumerics'. It allows us to bypass the stack by storing frequently used parameters at defined locations, saving the time spent on passing those parameters on the stack.
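
Just as a sketch of that mechanism (MYVAL is a hypothetical symbolic offset - the real variables are defined in ST-Open's headers), a value is stored once and can be read back in any other function without touching the stack:

        ...
        movl _BNR,%eax              # base address of the global variables
        movl %edi,MYVAL(%eax)       # store a frequently used value once ...
        ...
        movl _BNR,%eax
        movl MYVAL(%eax),%edi       # ... and read it back wherever it is needed
        ...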

To load the eight parameters, some tricks requiring basic knowledge about ST-Open's database engine are used. A less informed programmer (or a compiler) would have to preload the parameters one by one with repeated calls:

        ...
        movl %esi,0x00(%esp)
        movl %ebp,0x04(%esp)
        movl $0x07,0x08(%esp)
        movl $0x07,0x0C(%esp)
        call _FDacc
        movl %eax,%esi
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%edi
        decl 0x08(%esp)
        movl $0x01,0x0C(%esp)
        call _FDacc
        movl %eax,0xBC(%esp)
        decl 0x08(%esp)
        call _FDacc
        movl %eax,0xB8(%esp)
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%edx
        movd 0xB8(%esp),%xmm6
        movd 0xBC(%esp),%xmm7
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%ecx
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%ebx
        leal 0x20(%esp),%ebp
        decl 0x08(%esp)
        call _FDacc
        ...

Compared to conventional code, where all 6 parameters for _FDacc() are pushed onto the stack again for each call, the improvements introduced by Intelligent Design keep the above code dense and fast, but this is not the most elegant solution.

Recapitulating some knowledge about ST-Open's database engine and ST-Open's Loader, there is a much faster way to load parameters stored in a datafield. Okay, we start with a delay because we load an address into ESI which is required for the next and all following load operations. But this still is faster than a single call, let alone the eight calls performed by the above code. The advantage of the code below is obvious - its only caveat is the distance of 4,096 byte between the accessed memory locations. It is not very likely that all these areas are present in the L1 cache, so we should assume long delays, because all data surely are taken from main memory. But that is true for all other ways to read these data as well - the code below is the fastest possible way to load our parameters within the given premises. It would be possible to organise the field in blocks rather than in sequential subfields, so all data of a dataset could be kept in one cache line - but this is beyond the scope of our exercise.

A field's MemoryHandle is a pointer into the Loader table _BNR, where all parameters of a loaded datafield are stored. The first parameter at 00[MemHandle] is the real address (EA) of the allocated memory block. Loading this address into ESI, we can access all required parameters by applying the appropriate offset to ESI. Because we do not know which entry we have to read next, we access them using the indexed, indirect addressing mode provided by the instruction set of any x86 processor. This is a simple exercise, because all subfields have a distance of exactly 0x1000 from each other. There are two slightly more complex actions, of course. String addresses are stored as offsets to the field's base address, so we have to preload two registers with this address before adding the string's offset to them. By definition, empty strings are stored with an offset of zero. Because the first 32 bit of each datafield are reset by default, 00[EA] always holds an empty string...

        ...
        movl 0x00(%esi),%esi              # EA of the datafield (first parameter at 00[MemHandle])
        movl %esi,%edi                    # second copy of the base address for string subfield 06
        movl 0x0100(%esi, %ebp, 4),%eax   # subfield 00 (hex)
        movl 0x1100(%esi, %ebp, 4),%ebx   # subfield 01 (hex)
        movl 0x2100(%esi, %ebp, 4),%ecx   # subfield 02 (hex)
        movl 0x3100(%esi, %ebp, 4),%edx   # subfield 03 (hex)
        movd 0x4100(%esi, %ebp, 4),%xmm6  # subfield 04 (decimal)
        movd 0x5100(%esi, %ebp, 4),%xmm7  # subfield 05 (decimal)
        addl 0x6100(%esi, %ebp, 4),%edi   # EDI = EA + offset of string 06
        addl 0x7100(%esi, %ebp, 4),%esi   # ESI = EA + offset of string 07
        leal 0x20(%esp),%ebp              # EBP now points to the output buffer
        ...

Collecting similar tasks into groups and executing them one after the other is a good way to speed up a function markedly. It often saves a lot of redundant operations, because we can mov a parameter onto the stack once, then call one and the same function repeatedly without having to pass this parameter over and over again. As shown in the first code snippet (the less elegant attempt to solve our problem), only two out of six parameters are updated between seven of the eight _FDacc() calls. Three out of six parameters are static, staying unchanged for all eight calls - it is a waste of time to push all six parameters onto the stack eight times, as practised in conventional code.
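
A stripped-down sketch of this principle - _AnyFunc is a hypothetical function with two static and one changing parameter; only the changing parameter is rewritten between the calls:

        ...
        movl %esi,0x00(%esp)        # static parameter - written once
        movl %ebp,0x04(%esp)        # static parameter - written once
        movl $0x00,0x08(%esp)       # changing parameter
        call _AnyFunc
        movl $0x01,0x08(%esp)       # only the changing parameter is updated
        call _AnyFunc
        movl $0x02,0x08(%esp)
        call _AnyFunc
        ...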

Back to the last line of the improved source code. While EBP was used as the index for eight mov instructions, this index is no longer required after the last parameter has been loaded into ESI. It is recommended to preload all required parameters as soon as possible to keep an appropriate distance from the instructions accessing them. Therefore, we preload the address 20[ESP] into EBP immediately after its previous content is no longer required. EBP now holds the address of the buffer where the converted numeric values of our parameters are stored as strings. This speeds up our function a little bit - simultaneous execution flow in all three pipes is supported if our source code provides intelligently distributed instructions.



Hexadecimal Numbers

ST-Open's libraries provide some functions to convert numeric data into hexadecimal strings. D2str() is the right choice to convert doublewords into formatted strings, the format being 'nnnn nnnn'. D2str() expects two parameters: the number to convert and the address of the buffer where the output string shall be stored.

        ...
        movl %eax,0x00(%esp)
        movl %ebp,0x04(%esp)
        call _D2str
        movl %ebx,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2str
        movl %ecx,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2str
        movl %edx,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2str
        ...

After converting the four numbers into hexadecimal strings, the stack now looks like this:



This is trivial code, so there is no need for lengthy descriptions. There's only one thing deserving some words: have a look at the three lines addl $0x10,0x04(%esp) in the above code snippet. These instructions have a latency of four clocks each. We could replace them with sequences like addl $0x10,%ebp / movl %ebp,0x04(%esp). The replacement has a latency of four clock cycles as well, but the mov instruction has to wait until the content of EBP has been updated. It is obvious that the alternative method interrupts simultaneous execution and adds some bloat to our source code. Another negative side effect is the changed content of EBP. We need the address of the first string later on, so we would have to subtract all added values before we could use EBP to address the output strings. A general rule of thumb: if something can be done with one instruction, we should not split it up into multiple instructions doing the same job.
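
For comparison, the rejected alternative would look like this - the mov depends on the freshly updated EBP, and EBP no longer points to the first output string afterwards:

        ...
        movl %ebx,0x00(%esp)
        addl $0x10,%ebp             # EBP must be updated first ...
        movl %ebp,0x04(%esp)        # ... so this mov has to wait for the result
        call _D2str
        ...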



Decimal Numbers

To convert a doubleword into a formatted decimal string, we use the function D2dec() provided by ST-Open's libraries. This function expects a doubleword to convert, the address of a buffer where the output shall be stored and a doubleword defining the output format. Only the lower two bytes of this doubleword are recognised. The code used is 0000ffii, where ii is the number of integer digits and ff the number of pseudo floating point digits. Beware - the decimal point is just a character inserted into the string, not real floating point arithmetic! We do not need a floating point here, so the second byte is set to zero. The integer digits should cover the full range between 0 and 4,294,967,295, so the appropriate value is ten. Because D2dec() treats a formatting dword with the value of zero as 0x0000000A, we may pass either zero or ten as the third parameter.

        ...
        movd %xmm6,0x00(%esp)
        addl $0x10,0x04(%esp)
        movl $0x0A,0x08(%esp)
        call _D2dec
        movl 0x0100(%esp),%edx      # preload the WindowHandle (HWND) for later use
        movd %xmm7,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2dec
        ...

This is trivial code, so we skip lengthy explanations. EDX is loaded between both calls, because we need the WindowHandle (HWND) later on to write our strings to the corresponding entry fields. Preloading registers a few instructions before they are used supports parallel execution in all three pipes, and we should place these loads right after the previous content of a register is no longer required. The more this is practised, the faster the resulting code.

Well, our stack looks like this after all conversions are done:




Strings

To avoid redundant copy operations, the addresses of both strings were stored in EDI and ESI while we preloaded our eight parameters. It does not make sense to copy a string from memory to the stack first if we have to copy it from the stack to another memory location later on. Passing a string always means passing the address where its first byte can be found - it is a waste of time to copy a string into a buffer instead of just passing the address where this string actually is stored.

At this point, all conversions are done and we can start to fill those entry fields with some data. Because our strings are passed via their address, the stack remains untouched.



Setting The Entry Fields

All data are converted and accessible now. It's time to fill our eight entry fields with the corresponding strings:

        ...
        movl %edx,0x00(%esp)
        movl $0x1200,0x04(%esp)
        movl %ebp,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        movl %edi,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        movl %esi,0x08(%esp)
        call _SEf
        ...


This is the most dense code. It would be possible to replace each incl 0x04(%esp) with a movl $0x120*,0x04(%esp) instruction, where the asterisk '*' stands for the current ID. This alternative does not speed up execution, but blows up our code by 32 byte. Why? Well, each addl $0x10,0x08(%esp) has a latency of four clock cycles. Pairing a three clock latency with a four clock latency does not result in higher speed: all three pipes are busy for those four clock cycles, while the last pipe idles all of that time. If we replace the incl with the alternative instruction, the first pipe - together with the last pipe - idles for one clock cycle instead. Hence, we win nothing but 32 byte of additional code.
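
For completeness, the first two calls of the rejected variant - each movl $0x120*,0x04(%esp) carries a 32 bit immediate and therefore needs an 8 byte encoding instead of the 4 byte incl:

        ...
        movl %edx,0x00(%esp)
        movl $0x1200,0x04(%esp)
        movl %ebp,0x08(%esp)
        call _SEf
        movl $0x1201,0x04(%esp)     # twice the size of incl 0x04(%esp), no speed gained
        addl $0x10,0x08(%esp)
        call _SEf
        ...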

With this step, all tasks of our exercise are done; only the epilogue is missing.


Epilogue

Before we return to the calling function, we restore all used registers. This includes the garbage pile of conventional code (ECX and EDX) as well as XMM4 through XMM15. When all registers contain what they contained before our prologue was executed, we might want to set EAX to a specific return value, if the function declaration says so. Whether we set EAX or not, the next step is to add the same value we subtracted with the very first instruction of our function's prologue. ESP now points to the stack element holding the return address - the address of the instruction following call _ShowUs. After the final ret is executed, EIP contains that address and the processor continues to execute the code of the function calling ShowUs().


        ...
        movdqa 0xC0(%esp),%xmm6
        movdqa 0xD0(%esp),%xmm7
        movl 0xE4(%esp),%ebp
        movl 0xE8(%esp),%esi
        movl 0xEC(%esp),%edi
        movl 0xF0(%esp),%edx
        movl 0xF4(%esp),%ecx
        movl 0xF8(%esp),%ebx
        addl $0xFC,%esp
        xorl %eax,%eax
        ret



With the correction of the stack pointer ESP, all stack elements below the address stored in ESP are 'no-man's-land' again. The next function is free to reserve some bytes for its private stack frame as required.


Notes

It still might not be clear why grouping similar tasks is much better than doing everything step by step, as we are taught by some gurus preaching the old fashioned methods and techniques. The advantages of Intelligent Design are based on a lot of small improvements. Each of them supports the others. In the end, all of them add up to a new quality, making as much use as possible of the full power of modern processors.

As a matter of fact, most parameters never change in repetitive calls to one and the same function. Hence, we can save a lot of redundant writes if we just mov changing parameters to their proper positions rather than push all parameters onto the stack over and over again. In our example, parameter 1 is a MemoryHandle for the first call, the address of a buffer for the second call and a WindowHandle for the third call, so conventional code would have to push even these parameters onto the stack again and again. Parameters 2 and 3 also change with every call. All in all, conventional code would have to perform 6*8 + 2*4 + 3*2 + 3*8 = 86 push instructions (6 parameters for each of the 8 _FDacc() calls, 2 for each of the 4 D2str() calls, 3 for each of the 2 D2dec() calls and 3 for each of the 8 _SEf() calls) to solve the given problem. If we count all mov instructions in the listed example code, we get a result of 38 - less than half of the instructions required by conventional code. But that's just counting instructions. Having a closer look at them, there is much more than the pure count. Analysing both versions, conventional code is not as efficient as Intelligent Design. Conventional techniques not only prevent a processor from executing multiple instructions simultaneously, because they create many avoidable dependencies; they also create obstacles by default, because their rules stipulate omitting important things like preserving ECX and EDX. Putting it all together, conventional programming techniques are as outdated as ox carts, while Intelligent Design is the programming technique of the 21st century.

Go to the next post (10 - Summary)