As mentioned before, conventions, methods and techniques introduced with the programming language C have a lot of advantages over monolithic code we used to write applications for DOS and other ancient operating systems. All functions easily are portable to other operating systems or processor architectures and can be used multiple times, leading to versatile applications for multiple platforms. Unfortunately, after those C conventions were established, they never were updated or revised to keep pace with the development of processor architectures. If we compare an 8086 against a recent quad-core Athlon, a picturesque comparison were a cart dragged by a tired ox versus an Airbus A380. While no person with sane mind ever wasted one thought about equipping an A380 with a harness and motivate it to lift off with reins, whip and loud 'hee' and 'ho' shouts, software designers all over the world practise such weird things every day. They still create 'flying' stack frames, use slow leave, pop and push instructions with the obligatory update of the stack pointer instead of the faster mov and continue to abuse valuable resources by using EBP as base pointer. This reduces the too small register set of the x86 architecture by 1/7th, forcing the programmer to use slow memory reads and writes instead of much faster register operations.
To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. DLGtxt() sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.
To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. DLGtxt() sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.
.text
.p2align 4,,15
.globl MoveDlg
MoveDlg:
pushl %ebp
movl %esp, %ebp
pushl %esi
pushl %ebx
movl 12(%ebp), %edx
movl 8(%ebp), %esi
movl 16(%ebp), %ecx
movl 20(%ebp), %ebx
cmpl $48, %edx
je L12
ja L39
cmpl $32, %edx
je L4
L37:movl %ebx, 20(%ebp)
movl %ecx, 16(%ebp)
movl %edx, 12(%ebp)
movl %esi, 8(%ebp)
leal -8(%ebp), %esp
popl %ebx
popl %esi
popl %ebp
jmp _DefDP
Compared against GCC 2.6.1. (1993), GCC 3.3.5. (2006) generates much worse code, spiced with a lot of counterproductive, superfluous instructions. Writing back parameters to the stack violates some basic rules. First, there's absolutely no need to write data back to the locations we took them from some instructions before. Secondly, writes to stack locations above ESP violate all rules of proper programming. If any function starts to write data to the stack frames of other functions, it is just a question of time until we encounter desastrous malfunctions.
.p2align 4,,7
L4:movl %ecx, %eax
andl $65535, %eax
cmpl $4658, %eax
je L7
jg L11
cmpl $4657, %eax
je L6
Because we only need the low word stored in message parameter 1, it was a good idea to extract this lower word with the instruction movzwl 0x10(%ebp),%ecx rather than to waste valuable clock cycles with the code sequence shown above. Recent processors have three, not just one execution pipe(s). The choosen way sends two execution pipes to sleep while one pipe is busy to extract data from a register. This is repeated two times. We could switch off two execution pipes while this code is executed, because we created two avoidable dependencies.
L9:movl %ebx, 20(%ebp)
movl %ecx, 16(%ebp)
movl %edx, 12(%ebp)
movl %esi, 8(%ebp)
leal -8(%ebp), %esp
popl %ebx
popl %esi
popl %ebp
jmp _DefDP
Obviously, L37 and L9 provide identical code. Using our brain, these eighteen redundant (partially pointless) lines can be reduced to five really necessary instructions.
L6:movl _GVAR, %eax # [1]
subl $12, %esp
movl $0, 10464(%eax)
L42:pushl %esi
call _WinDD
L41:addl $16, %esp
.p2align 4,,7
Up to seven nops are executed every time we branch to this part of code. It is a good way to slow down execution flow as much as possible.
L2:leal -8(%ebp), %esp
xorl %eax, %eax
popl %ebx
popl %esi
popl %ebp
ret
L11:cmpl $4659, %eax
jne L9
subl $12, %esp
pushl $17
call _Help
jmp L41
L7:pushl %edx
pushl %edx
pushl $18
movl _GVAR, %eax # [1]
addl $10464, %eax
pushl %eax
call _FlgS
popl %eax
jmp L42
The branch prediction logic assumes every first branch as false if there is no entry in its internal table. Almost all of the above branches trigger the obligatory ten penalty cycles, because the wrong branch instructions were chosen.
.p2align 4,,7
L39:cmpl $59, %edx
jne L37
pushl $233
pushl $211
pushl $210
pushl %esi
call _DLGtxt
movl $0, (%esp)
pushl $-1
pushl $288
pushl $4672
pushl %esi
call _SnDIM
addl $20, %esp
pushl $0
pushl $-1
pushl $288
pushl $4680
pushl %esi
call _SnDIM
addl $20, %esp
pushl $0
pushl $-1
pushl $288
pushl $4688
pushl %esi
call _SnDIM
addl $20, %esp
pushl $0
pushl $-1
pushl $288
pushl $4696
pushl %esi
call _SnDIM
addl $24, %esp
Only the second parameter changes for the four consecutive calls of SnDIM(). Twelve of these twenty push instructions (60 percent) are redundant.
pushl %esi
pushl $0
pushl $2
pushl $0
pushl $11
movl _GVAR, %eax # [1]
movl 7252(%eax), %eax
pushl %eax
call _FDacc
movl _GVAR, %eax # [1]
addl $24, %esp
movl 472(%eax), %ebx
pushl %ebx
pushl $0
pushl $2
pushl $4
pushl $11
movl 7252(%eax), %ecx
pushl %ecx
call _FDacc
Only parameters 3 and 6 change for the both calls of FDacc(). While we are bound to push instructions, there is no way to change just these parameters. We have to push all six parameters, again, because the parameter on top must be changed.
addl $20, %esp
movl _GVAR, %eax # [1]
movl $0, 10464(%eax)
pushl %esi
call _CtrWn
movl %esi, (%esp)
call _DlgShow
jmp L41
.p2align 4,,7
L12:movl %ecx, %eax
andl $65535, %eax
subl $4672, %eax
cmpl $30, %eax
ja L37
jmp *L36(,%eax,4)
Again, it were better to extract the low word via one movzwl 0x10(%ebp),%ecx rather than to use the chosen way. Probably, parallel execution was considered to be too fast?
.p2align 2
.align 2,0xcc
I don't know what the second .align is good for. Any hints? My jump table is too large, because C programmers do not think about side effects like blowing up code while assigning resource IDs to 'straight' numbers. Due to the gaps between those IDs, there are a lot of superfluous entries in this jump table.
L36:.long L14
.long L15
.long L16
.long L17
.long L2
.long L37
.long L37
.long L37
.long L18
.long L19
.long L37
.long L37
.long L37
.long L37
.long L37
.long L37
.long L21
.long L23
.long L25
.long L27
.long L29
.long L31
.long L33
.long L37
.long L21
.long L23
.long L25
.long L27
.long L29
.long L31
.long L33
By the way: Jump tables belong to the .data, not to the .code segment. AS supports jump tables in the .data segment, so there's no need to violate the recommendations of AMD and LETNi as all versions of GCC do...
L14:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andb $225, %ah
orb $16, %ah
.p2align 4,,7
L40:movl %eax, 10464(%edx)
jmp L2
L15:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andb $225, %ah
orb $8, %ah
jmp L40
L16:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andb $225, %ah
orb $4, %ah
jmp L40
L17:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andb $225, %ah
orb $2, %ah
jmp L40
L18:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-385, %eax
orb $1, %ah
jmp L40
L19:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-385, %eax
orb $-128, %al
jmp L40
L21:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-128, %eax
orl $64, %eax
jmp L40
L23:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-384, %eax
orl $32, %eax
jmp L40
L25:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-384, %eax
orl $16, %eax
jmp L40
L27:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-384, %eax
orl $8, %eax
jmp L40
L29:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-384, %eax
orl $4, %eax
jmp L40
L31:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-384, %eax
orl $2, %eax
jmp L40
L33:movl _GVAR, %edx # [1], [2]
movl 10464(%edx), %eax
andl $-384, %eax
orl $1, %eax
jmp L40
Many redundant instructions unnecessarily blow up code size. The first two lines of all jump targets could be reduced to a common read before the distributor branches to the selected table entry.
.comm _hab,16
.comm _DEBUG,16
.comm _USE_LDF,16
.comm _LDR_AVAIL,16
.comm _MSGLD,16
.comm _BMM,16
.comm _BNR,16
.comm _GVAR,16
.comm _BST,16
.comm _BBF,16
.comm _TST,16
.comm _MHSTR,16
.comm _LDF,16
.comm _DUMPLINE,16
.comm _DUMPCNT,16
.comm _OLH_MODE,16
.comm _SEC,16
.comm _XXX,16
.comm _FLD_XXX,16
.comm _FLD_SEC,16
We only use _GVAR in this file - enumerating all global variables is quite stupid. All globals are defined as doublewords (32 bit), so we might ask why GCC expands all these variables to a size of 128 bit, filling up the .data segment with garbage.
Footnotes
[1] This is the typical side effect of the C convention saying 'Thou shalt not save ECX and EDX'. Using two of six registers as a sanitary fill for temporary data, we reduce the set of safe registers to three, forcing us to reload frequently required parameters from memory over and over again. Reloading parameters more than two times eats up the advantage of omitting the two push and pop instructions to save and restore the content of ECX and EDX. After the third reload operation, we are on the losing side. Adding unnecessary loads of parameters to our code slows down our functions and does not speed them up.
[2] Loading _GVAR into EDX and the flags MVP_FLAGS into EAX could be done in L12. Applying some brain didn't just save 24 lines of code, it also sped up the 13 target functions, because loading and evaluating MVP_FLAGS were separated and the dependency chain were reduced to two single rather than three consecutive dependencies.
Go to the next post (05 - Improvements).
No comments:
Post a Comment