ST-Open software is developed with GCC/2 (1994) and the GNU assembler AS. AS is a sophisticated assembler, so nothing is ASSUMEd and no hints like SEGMENT, BYTE PTR and companions are required. This saves a lot of typing and markedly improves the readability of source files. A simple .data on top of the user data belonging to the DATA segment and a simple .text on top of the code going to the CODE segment are all AS needs to know. On the other hand, AS wants to be fed with code written in AT&T syntax.
Register Set
All register names are written in lower case, and a percent sign precedes the register name as a delimiter. The lists below enumerate the entire register set available for LETNiums and AMD's Athlon64:
8 bit Registers 16 bit Registers
LETNi AT+T LETNi AT+T
AL %al AX %ax
BL %bl BX %bx
CL %cl CX %cx
DL %dl DX %dx
DIL %dil DI %di
SIL %sil SI %si
BPL %bpl BP %bp
SPL %spl SP %sp
R8B %r8b R8W %r8w
R9B %r9b R9W %r9w
R10B %r10b R10W %r10w
R11B %r11b R11W %r11w
R12B %r12b R12W %r12w
R13B %r13b R13W %r13w
R14B %r14b R14W %r14w
R15B %r15b R15W %r15w
32 bit Registers 64 bit Registers
LETNi AT+T LETNi AT+T
EAX %eax RAX %rax
EBX %ebx RBX %rbx
ECX %ecx RCX %rcx
EDX %edx RDX %rdx
EDI %edi RDI %rdi
ESI %esi RSI %rsi
EBP %ebp RBP %rbp
ESP %esp RSP %rsp
R8D %r8d R8 %r8
R9D %r9d R9 %r9
R10D %r10d R10 %r10
R11D %r11d R11 %r11
R12D %r12d R12 %r12
R13D %r13d R13 %r13
R14D %r14d R14 %r14
R15D %r15d R15 %r15
FP / MMX SSE / 3Dnow!
LETNi AT+T LETNi AT+T
ST0 %st(0) XMM0 %xmm0
ST1 %st(1) XMM1 %xmm1
ST2 %st(2) XMM2 %xmm2
ST3 %st(3) XMM3 %xmm3
ST4 %st(4) XMM4 %xmm4
ST5 %st(5) XMM5 %xmm5
ST6 %st(6) XMM6 %xmm6
ST7 %st(7) XMM7 %xmm7
MM0 %mm0 XMM8 %xmm8
MM1 %mm1 XMM9 %xmm9
MM2 %mm2 XMM10 %xmm10
MM3 %mm3 XMM11 %xmm11
MM4 %mm4 XMM12 %xmm12
MM5 %mm5 XMM13 %xmm13
MM6 %mm6 XMM14 %xmm14
MM7 %mm7 XMM15 %xmm15
Special Debug
LETNi AT+T LETNi AT+T
CS %cs DB0 %db0
DS %ds DB1 %db1
DB2 %db2
ES %es DB3 %db3
FS %fs - -
GS %gs - -
SS %ss DB6 %db6
DB7 %db7
CR0 %cr0 DB8 %db8
CR1 %cr1 DB9 %db9
CR2 %cr2 DB10 %db10
DB11 %db11
TR6 %tr6 DB12 %db12
TR7 %tr7 DB13 %db13
DB14 %db14
DB15 %db15
Size Suffixes
Data sizes of instructions with operands are specified by the suffixes "b" (byte), "w" (word), "l" (doubleword, integer instructions), "d" (doubleword, MMX or XMM instructions) and "q" (quadword). They replace the hints "byte ptr", "word ptr", "dword ptr" and "qword ptr" used in LETNi syntax:
movb $0x01,%al # load byte 01 into AL
movw $0x01,%ax # load word 0001 into AX
movl $0x01,%eax # load dword 00000001 into EAX
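...but:
movb $0x81,%bl # load byte 81 into BL
movsbl %bl,%eax
(load the sign extended byte 81 from BL into EAX, so EAX holds FFFFFF81 now)
movzbl %bl,%eax
(load the zero extended byte 81 from BL into EAX, so EAX holds 00000081 now)
(the source operand of movsbl and movzbl must be a register or a memory location, not an immediate value)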
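The difference between sign extension and zero extension can also be modelled in C. The following is only a sketch: the function names sign_extend_byte and zero_extend_byte are made up for illustration, and the code mimics the effect of the two instructions rather than showing what any compiler emits.

```c
#include <stdint.h>

/* Models movsbl: widen a byte to 32 bits by copying bit 7
   into bits 8..31 (hypothetical helper name). */
uint32_t sign_extend_byte(uint8_t value)
{
    uint32_t wide = value;
    if (value & 0x80u) {
        wide |= 0xFFFFFF00u;   /* bit 7 set: fill the upper 24 bits with ones */
    }
    return wide;
}

/* Models movzbl: widen a byte to 32 bits by clearing
   bits 8..31 (hypothetical helper name). */
uint32_t zero_extend_byte(uint8_t value)
{
    return (uint32_t)value;
}
```

Calling sign_extend_byte(0x81) yields FFFFFF81, while zero_extend_byte(0x81) yields 00000081, matching the register contents described above.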
Numbers And Addresses
Numbers are preceded by a dollar sign '$', addresses are written as plain numbers:
movl $0x01,%eax
(copy 00000001 to EAX)
movl 0x01,%eax
(copy the doubleword found at address 00000001 to EAX; this causes some penalty cycles for accessing an address not divisible by four, then crashes because we try to access protected memory)
Indirect Addressing
The register holding the address is put into round brackets. The offset (called "displacement" in LETNi vocabulary) is written in front of the opening bracket:
movw 0x04(%esi),%ax
(copy the word found at address [ESI + 0x04] to AX)
Indexed Addressing
The index register follows the register holding the address. The multiplier (LETNi vocabulary uses the term "scale factor") follows the index register. All three are separated by commas:
movb 0x00(%esi, %edx, 1),%al
(copy the byte at memory location [ESI + 0x00 + (EDX * 1)] to AL)
movl 0x00(, %edx, 4),%eax
(copy the doubleword at memory location [0x00 + (EDX * 4)] to EAX)
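The address arithmetic behind displacement(base, index, scale) can be sketched in C. This is an illustrative model only; the function name effective_address and its parameter names are assumptions, not part of any assembler or library.

```c
#include <assert.h>
#include <stdint.h>

/* Computes displacement + base + index * scale, as the processor does
   for an operand written displacement(base, index, scale).
   An omitted base register is modelled by passing 0. */
uint32_t effective_address(uint32_t displacement, uint32_t base,
                           uint32_t index, uint32_t scale)
{
    /* the hardware only supports these four scale factors */
    assert(scale == 1u || scale == 2u || scale == 4u || scale == 8u);

    /* unsigned arithmetic wraps modulo 2^32, like a 32 bit address */
    return displacement + base + index * scale;
}
```

With ESI = 0x1000 and EDX = 3, the operand 0x04(%esi, %edx, 4) corresponds to effective_address(0x04, 0x1000, 3, 4), which is 0x1010.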
Global Variables And Functions
To make a function globally visible, we have to add a .globl declaration in front of the function declaration. To make variables globally visible, we add a .comm in each source file where this variable is required. All global functions and variables must be preceded by an underscore "_".
To access addresses of functions or variables, their name must be preceded by a dollar sign "$". To access the content of a variable (read, write, increment, decrement, compare against, etc.), we write its name "as is":
.align 2,0x90
(only in front of your functions!)
.globl _MyFunction
(make MyFunction globally visible)
_MyFunction: # declaration
... # function body
ret # finished, return to caller
.comm _BNR,4
(reserves 4 bytes in the data segment for the global variable _BNR)
movl _BNR,%eax
(copy the content of _BNR to EAX)
movl $_BNR,%eax
(copy the address where _BNR is stored to EAX)
movl $_AllMine,%eax
(copy the address where function _AllMine starts to EAX)
call *%eax
(execute _AllMine)
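For readers coming from C, the distinction between a symbol's content and its address, and the indirect call through a register, map roughly onto the sketch below. The C names BNR and AllMine stand in for _BNR and _AllMine (on targets like the one described here, the compiler adds the leading underscore itself); the helper names and the return value 42 are invented for illustration.

```c
#include <stdint.h>

int32_t BNR = 0;                 /* global variable, compare .comm _BNR,4 */

int32_t AllMine(void)            /* hypothetical global function */
{
    return 42;                   /* arbitrary value for the demonstration */
}

int32_t read_content(void)       /* compare: movl _BNR,%eax */
{
    return BNR;
}

int32_t *read_address(void)      /* compare: movl $_BNR,%eax */
{
    return &BNR;
}

int32_t call_indirect(void)
{
    int32_t (*target)(void) = AllMine;   /* compare: movl $_AllMine,%eax */
    return target();                     /* compare: call *%eax */
}
```

After setting BNR to 7, read_content() returns 7, read_address() returns the location where that 7 is stored, and call_indirect() returns whatever AllMine returns.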
The instruction call *%eax is equivalent to call _AllMine. However, it wastes one clock cycle by loading the address of _AllMine into EAX. On the other hand, loading a return address into a register can save six clock cycles if we use simple JMP instructions instead of the CALL/RET mechanism. This, of course, is limited to a few local helper functions - the usual CALL/RET is more flexible, because we don't need to know where the called function is stored.
Calls And Jumps
Calls and jumps can use either (global) labels or registers as operands. If a register is used, its name must be preceded by an asterisk *. While the previous example showed us how to use a register together with a CALL instruction, the following example shows us how to create a jump table.
.data
.align 2,0x00
L99:.long L00 # jump table
.long L01
.long L02
.long L03
.long L04
.text
... # prologue
movl $0x04,%ebx
cmpl $0x04,%eax
cmova %ebx,%eax # keep valid
jmp *L99(, %eax, 4) # indexed jump
L00:nop # target proc
jmp L05
L01:nop
jmp L05
L02:nop
jmp L05
L03:nop
jmp L05
L04:nop
jmp L05
L05:nop # epilogue
...
ret
This is a C switch{} statement coded in assembler. Using cmova, we save one conditional jump and avoid the ten penalty cycles for a false "guess" of the branch prediction logic.
Please note that I put the jump table into the .data segment, not the .text segment. As LETNi and AMD clearly state, this is the place where data belongs. Unfortunately, GCC creates all jump tables in the .text segment. To optimise your code, you should move them to the top of the file and put them into the .data segment as shown above.
Keep in mind that 32 bit jump tables must be aligned to a multiple of 4, while 64 bit jump tables must be aligned to a multiple of 8. This is done by putting an appropriate .align statement in front of the (first) jump table.
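For comparison, the clamped dispatch of the jump table example can be modelled in C with a table of function pointers. This is a sketch under assumptions: the handler functions and the name dispatch are hypothetical, and whether a compiler turns the clamp into a conditional move depends on the compiler and its options.

```c
/* Hypothetical handlers standing in for the targets L00 to L04. */
static int handler0(void) { return 0; }
static int handler1(void) { return 1; }
static int handler2(void) { return 2; }
static int handler3(void) { return 3; }
static int handler4(void) { return 4; }   /* also serves as the default case */

typedef int (*handler_fn)(void);

/* The jump table: one code address per case, stored as data. */
static const handler_fn jump_table[5] = {
    handler0, handler1, handler2, handler3, handler4
};

int dispatch(unsigned int selector)
{
    /* Unsigned clamp, like cmpl $0x04,%eax followed by cmova %ebx,%eax:
       every selector above 4 is mapped to entry 4. */
    unsigned int index = (selector > 4u) ? 4u : selector;

    /* Indexed indirect call, the counterpart of jmp *L99(, %eax, 4). */
    return jump_table[index]();
}
```

Because the clamp uses an unsigned comparison, a selector of 99 or of (unsigned)-1 ends up in the last entry, just like any out-of-range value in EAX does in the assembler version.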
.align
GCC spices source files with tons of .align statements spread all over the text segment. If an .align precedes a function, you should leave it alone - do not remove it! Because modern processors work with quite small instruction caches (32 bytes on Athlon64 machines), it might be necessary to insert an .align 4,,15 in front of a branch target to support the processor's prefetch mechanisms. However, you should avoid inserting .align statements at places where they might be executed. Each .align statement inserts an appropriate number of nops to move the instruction pointer to the next multiple of the cache line's size, so the next instruction "sits" at the beginning of a new cache line. This is important if the next instruction is the target of a branch. Because the processor speculatively prefetches the code of branch targets, execution continues at the beginning of a cache line if the branch was taken. Execution is sped up if the processor doesn't have to load the instructions of the branch target before it can continue to execute them.
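How many fill bytes an .align directive inserts can be estimated with a small C model. The sketch assumes that the first operand is a power-of-two exponent (so .align 4 means a 16 byte boundary, as GAS treats it on a.out-style i386 targets) and that the third operand is the maximum number of bytes that may be skipped; the function name align_padding is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the number of fill bytes needed to advance `address` to the
   next multiple of 2^exponent. If more than `max_skip` bytes would be
   required, the directive is modelled as doing nothing and 0 is returned. */
uint32_t align_padding(uint32_t address, uint32_t exponent, uint32_t max_skip)
{
    assert(exponent < 32u);                      /* keep the shift defined */

    uint32_t boundary = (uint32_t)1 << exponent;
    uint32_t mask     = boundary - 1u;
    uint32_t padding  = (boundary - (address & mask)) & mask;

    return (padding <= max_skip) ? padding : 0u;
}
```

For an instruction pointer at 0x1003, align_padding(0x1003, 4, 15) gives 13 fill bytes, so the next instruction starts at 0x1010. An address that is already aligned needs no padding at all.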
nop
The nop instruction puts the next free execution pipeline into idle mode for one clock cycle. If you insert it at the proper places, it can improve performance and speed up execution. However, the benefits can only be determined experimentally. You have to test the runtime of several variants of your code with exceptional care. The rdtsc instruction is a good tool to measure the runtime of test functions with acceptable accuracy. If you write the output to a file, the gathered data might be sufficient to find out which variant is the fastest.
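A measurement along these lines might look like the following C sketch. It assumes a GCC or Clang compiler on an x86 target, where the __rdtsc() intrinsic from x86intrin.h reads the time stamp counter; on other targets it falls back to clock(), which is an assumption made only to keep the sketch portable. The names read_ticks, min_ticks, variant_a and variant_b are hypothetical.

```c
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Read the processor's time stamp counter (the rdtsc instruction). */
uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 targets (assumption): processor time from clock(). */
uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/* Two hypothetical code variants to compare. The volatile sink keeps
   the compiler from removing the loops. */
static volatile uint32_t sink;

void variant_a(void)
{
    for (uint32_t i = 0; i < 1000u; ++i) {
        sink = sink + i;
    }
}

void variant_b(void)
{
    for (uint32_t i = 0; i < 1000u; i += 2u) {
        sink = sink + i + i + 1u;
    }
}

/* Runs fn `runs` times and returns the smallest tick count observed.
   Returns UINT64_MAX if runs is 0. Taking the minimum filters out runs
   that were disturbed by interrupts or cache misses. */
uint64_t min_ticks(void (*fn)(void), unsigned int runs)
{
    uint64_t best = UINT64_MAX;
    for (unsigned int r = 0; r < runs; ++r) {
        uint64_t start   = read_ticks();
        fn();
        uint64_t elapsed = read_ticks() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}
```

Comparing min_ticks(variant_a, 1000) with min_ticks(variant_b, 1000) over several program runs gives a first impression of which variant is faster; the absolute numbers differ from machine to machine.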