Obviously, the code generated by GCC 3.3.5 is anything but optimised. Even if you are not used to reading source code, you should be able to grasp what the comments say. This document is not an introduction to programming, so you will have to take my word for it, but rest assured: I know what I am talking about.
Re-arranging some parts and reducing code sequences to the instructions that are really required shrinks GCC's draft markedly. With some human thought applied, the remaining (optimised) code should run about 30 percent faster.
.data
.p2align 4,0x00
jt0:.long L02
.long L03
.long L04
.long L05
.long L16
.long L15
.long L15
.long L15
.long L06
.long L07
.long L15
.long L15
.long L15
.long L15
.long L15
.long L15
.long L08
.long L09
.long L10
.long L11
.long L12
.long L13
.long L14
.long L15
.long L08
.long L09
.long L10
.long L11
.long L12
.long L13
.long L14
Following the recommendations of AMD and LETNi, the jump table was moved to the .data segment. This is much better than mixing code and data in the code segment (.text).
.text
.align 2,0x90
.globl MoveDlg
MoveDlg:pushl %ebp
movl %esp,%ebp
pushl %edi
pushl %esi
movl 0x08(%ebp),%edi
movl 0x0C(%ebp),%eax
movzwl 0x10(%ebp),%ecx
movl _GVAR,%esi
cmpl $0x30,%eax
je L01
cmpl $0x20,%eax
je L00
cmpl $0x3B,%eax
jne L15
The distributor was optimised for the branch prediction logic. WM_CONTROL was put on top of the distributor because most messages sent are WM_CONTROL messages. WM_COMMAND is only sent when the user pushes a button. No user is able to notice delays of about 5 ns, so we can live with a ten-cycle penalty if a branch is mispredicted. WM_INITDLG is sent only once. As long as the 1st comparison is 'guessed' as not taken, the branch does not trigger a penalty. The 2nd and all following comparisons are assumed to be taken, so the branch to the default routine (DefDP()) does not trigger a penalty, either.
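For readers who prefer C, the test order of the distributor can be sketched as below. The message numbers are the constants compared in the listing and the names follow the text; the `route` enum and the function name `distribute` are hypothetical and only serve to make the control flow visible. This is a sketch of the ordering idea, not the original C source.

```c
#include <stdint.h>

/* Message numbers as compared in the listing; names follow the text. */
enum { WM_COMMAND = 0x20, WM_CONTROL = 0x30, WM_INITDLG = 0x3B };

/* Hypothetical result codes, one per branch target. */
enum route { ROUTE_CONTROL, ROUTE_COMMAND, ROUTE_INITDLG, ROUTE_DEFAULT };

/* Same test order as the distributor: most frequent message first. */
enum route distribute(uint32_t msg)
{
    if (msg == WM_CONTROL) return ROUTE_CONTROL;  /* cmpl $0x30 / je  L01 */
    if (msg == WM_COMMAND) return ROUTE_COMMAND;  /* cmpl $0x20 / je  L00 */
    if (msg != WM_INITDLG) return ROUTE_DEFAULT;  /* cmpl $0x3B / jne L15 */
    return ROUTE_INITDLG;                         /* falls through        */
}
```

Every message that is none of the three ends up on the default path, which corresponds to the jump to DefDP() in the listing.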
pushl $0xE9
pushl $0xD3
pushl $0xD2
pushl %edi
call _DLGtxt
pushl $0x00
pushl $-0x01
pushl $0x0120
pushl $0x1240
pushl %edi
call _SnDIM
addl $0x08,%esp
pushl $0x1248
pushl %edi
call _SnDIM
addl $0x08,%esp
pushl $0x1250
pushl %edi
call _SnDIM
addl $0x08,%esp
pushl $0x1258
pushl %edi
call _SnDIM
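The three parameters pushed first (0x00, -0x01 and 0x0120) are pushed for the first call only and stay on the stack for the following three calls. Each later call just replaces the two topmost parameters. This saves nine redundant push instructions.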
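The saving can be checked with a little arithmetic. The sketch below assumes what the listing suggests: four calls of _SnDIM with five parameters each, three of them identical across all calls. The function names are made up for this illustration.

```c
/* Counts read off the listing (assumption: _SnDIM takes five parameters). */
enum { CALLS = 4, PARAMS = 5, SHARED = 3 };

/* Pushes needed if every call pushes all of its parameters. */
int pushes_naive(void)
{
    return CALLS * PARAMS;
}

/* Pushes needed if the shared parameters are pushed only once. */
int pushes_shared(void)
{
    return PARAMS + (CALLS - 1) * (PARAMS - SHARED);
}

/* Bytes still on the stack after the last call (4 bytes per parameter). */
int final_cleanup_bytes(void)
{
    return PARAMS * 4;
}
```

The difference between both counts is nine pushes, and the final cleanup of 20 bytes matches the `addl $0x14,%esp` that follows the last call.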
addl $0x14,%esp
movl 0x1C54(%esi),%ecx
movl 0x01D8(%esi),%edx
Both parameters can be preloaded at this point because FDacc() is a function taken from ST-Open's library. Functions in my libraries restore all registers (including ECX and EDX) by default: they are 'clean'. But watch out: MoveDlg() is a function following the C conventions. ECX and EDX are neither saved nor restored, so MoveDlg() is a 'dirty' function.
pushl %edi
pushl $0x00
pushl $0x02
pushl $0x00
pushl $0x0B
pushl %ecx
call _FDacc
pushl %edx
pushl $0x00
pushl $0x02
pushl $0x04
pushl $0x0B
pushl %ecx
call _FDacc
addl $0x30,%esp /* two _FDacc calls x six 4-byte parameters */
movl $0x00,0x28E0(%esi)
pushl %edi
call _CtrWn
call _DlgShow
jmp 3f
The next distributor was optimised for code size. Because none of the three buttons is usually pushed more than once, we can live with a ten-cycle penalty for one or two mispredicted branches. The delay added by the penalty is at least six orders of magnitude shorter than anything human senses could perceive.
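The subtract-and-decrement chain used by this distributor may be easier to follow in C. The sketch below mirrors the instruction sequence of the listing that follows; the `action` names are hypothetical labels for the three branch targets and the default path, not identifiers from the original source.

```c
#include <stdint.h>

/* Hypothetical labels for the four possible outcomes. */
enum action { ACT_CLEAR_FLAGS, ACT_SET_FLAG, ACT_HELP, ACT_DEFAULT };

/* One subtraction and two decrements replace three separate compares. */
enum action command_action(uint32_t id)
{
    id -= 0x1231;                         /* subl $0x1231,%ecx        */
    if (id == 0) return ACT_CLEAR_FLAGS;  /* je 0f                    */
    if (--id == 0) return ACT_SET_FLAG;   /* decl %ecx / je 1f        */
    if (--id != 0) return ACT_DEFAULT;    /* decl %ecx / jne L15      */
    return ACT_HELP;                      /* pushl $0x11 / call _Help */
}
```

Only the IDs 0x1231, 0x1232 and 0x1233 reach one of the three handlers; every other value takes the default path.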
L00:subl $0x1231,%ecx
je 0f
decl %ecx
je 1f
decl %ecx
jne L15
pushl $0x11
call _Help
jmp 3f
0:movl $0x00,0x28E0(%esi)
jmp 2f
1:orl $0x00040000,0x28E0(%esi)
2:pushl %edi
call _WinDD
3:addl $0x04,%esp
jmp L16
L01:subl $0x1240,%ecx
js L15
cmpl $0x1E,%ecx
ja L15
movl 0x28E0(%esi),%eax
jmp *jt0(, %ecx, 4)
The jump table was moved to the .data segment. To keep an overview, your symbols for jump tables should generally be marked with special names. ST-Open uses the symbol 'jtX', where X is the number of the current jump table.
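The range check in front of the indirect jump can be sketched in C as follows. The constants 0x1240 and 0x1E are taken from the listing; the function name `table_index` and the return value -1 for the default path are assumptions made for this illustration. Note that an unsigned comparison (like `ja`) also catches IDs below 0x1240, because the subtraction wraps around to a very large value, so the sketch gets by with a single test.

```c
#include <stdint.h>

enum { FIRST_ID = 0x1240, LAST_INDEX = 0x1E };

/* Returns the jump table index for a control ID, or -1 for the default path. */
int table_index(uint16_t id)
{
    uint32_t index = (uint32_t)id - FIRST_ID;  /* wraps if id < FIRST_ID   */

    if (index > LAST_INDEX)                    /* unsigned test, like 'ja' */
        return -1;
    return (int)index;                         /* 0 ... 0x1E: 31 entries   */
}
```

The 31 valid indices correspond to the 31 `.long` entries of jt0.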
L02:andl $0xFFFFE1FF,%eax
orl $0x1000,%eax
jmp 0f
L03:andl $0xFFFFE1FF,%eax
orl $0x0800,%eax
jmp 0f
L04:andl $0xFFFFE1FF,%eax
orl $0x0400,%eax
jmp 0f
L05:andl $0xFFFFE1FF,%eax
orl $0x0200,%eax
jmp 0f
L06:andl $0xFFFFFE7F,%eax
orl $0x0100,%eax
jmp 0f
L07:andl $0xFFFFFE7F,%eax
orl $0x80,%eax
jmp 0f
L08:andl $0xFFFFFF80,%eax
orl $0x40,%eax
jmp 0f
L09:andl $0xFFFFFE80,%eax
orl $0x20,%eax
jmp 0f
L10:andl $0xFFFFFE80,%eax
orl $0x10,%eax
jmp 0f
L11:andl $0xFFFFFE80,%eax
orl $0x08,%eax
jmp 0f
L12:andl $0xFFFFFE80,%eax
orl $0x04,%eax
jmp 0f
L13:andl $0xFFFFFE80,%eax
orl $0x02,%eax
jmp 0f
L14:andl $0xFFFFFE80,%eax
orl $0x01,%eax
0:movl %eax,0x28E0(%esi)
jmp L16
This part surely could be reduced further if I could remember what all those flags are good for...
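All handlers above follow the same pattern: clear a group of bits, then set exactly one bit of that group. A minimal C sketch of the pattern is shown below; the function name `select_in_group` is hypothetical. The group masks are simply the complements of the AND masks in the listing: 0x1E00 for L02 to L05, 0x0180 for L06 and L07, 0x007F for L08. L09 to L14 use 0xFFFFFE80, which clears 0x017F; the listing does not say whether that difference is intended.

```c
#include <stdint.h>

/* Clear every bit of the group, then set the one selected bit. */
uint32_t select_in_group(uint32_t flags, uint32_t group, uint32_t bit)
{
    return (flags & ~group) | bit;
}
```

For example, `select_in_group(flags, 0x1E00, 0x1000)` does the same as the two instructions at L02.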
L15:popl %esi
popl %edi
popl %ebp
jmp _DefDP
L16:xorl %eax, %eax
popl %esi
popl %edi
popl %ebp
ret
'Exits' belong at the bottom of a function. First, the processor does not have to jump back and forth to random locations within the instruction chain. Secondly, human senses perceive structured (sorted) input much faster than random patterns spread all over the screen.
.comm _GVAR,4
One (used) out of all (unused). The size of the global variable was reset to its proper size of 32 bits, so four variables, rather than one, fit into one paragraph (16 bytes) again.
Go to the next post (06 - Analysis).