Intelligent Design: 14

ST-Open software is developed with GCC/2 (1994) and the GNU assembler AS. AS is a sophisticated assembler, so nothing is ASSUMEd and no hints like SEGMENT, BYTE PTR and companions are required. This saves a lot of typing and markedly improves the readability of source files. A simple .data on top of user data belonging to the DATA segment and a simple .text on top of the code going to the CODE segment is all AS needs to know. On the other hand, AS wants to be fed with code written in AT&T syntax.

Register Set

All register names are written in lower case, and a percent sign precedes the register name as a delimiter. The lists below enumerate the entire register set available for LETNiums and AMD's Athlon64:

8 bit Registers    16 bit Registers
LETNi     AT+T       LETNi     AT+T

AL        %al        AX        %ax
BL        %bl        BX        %bx
CL        %cl        CX        %cx
DL        %dl        DX        %dx
DIL       %dil       DI        %di
SIL       %sil       SI        %si
BPL       %bpl       BP        %bp
SPL       %spl       SP        %sp
R8B       %r8b       R8W       %r8w
R9B       %r9b       R9W       %r9w
R10B      %r10b      R10W      %r10w
R11B      %r11b      R11W      %r11w
R12B      %r12b      R12W      %r12w
R13B      %r13b      R13W      %r13w
R14B      %r14b      R14W      %r14w
R15B      %r15b      R15W      %r15w

32 bit Registers      64 bit Registers
LETNi     AT+T        LETNi     AT+T

EAX       %eax        RAX       %rax
EBX       %ebx        RBX       %rbx
ECX       %ecx        RCX       %rcx
EDX       %edx        RDX       %rdx
EDI       %edi        RDI       %rdi
ESI       %esi        RSI       %rsi
EBP       %ebp        RBP       %rbp
ESP       %esp        RSP       %rsp
R8D       %r8d        R8        %r8
R9D       %r9d        R9        %r9
R10D      %r10d       R10       %r10
R11D      %r11d       R11       %r11
R12D      %r12d       R12       %r12
R13D      %r13d       R13       %r13
R14D      %r14d       R14       %r14
R15D      %r15d       R15       %r15

FP / MMX              SSE / 3Dnow!
LETNi     AT+T        LETNi     AT+T

ST0       %st(0)      XMM0      %xmm0
ST1       %st(1)      XMM1      %xmm1
ST2       %st(2)      XMM2      %xmm2
ST3       %st(3)      XMM3      %xmm3
ST4       %st(4)      XMM4      %xmm4
ST5       %st(5)      XMM5      %xmm5
ST6       %st(6)      XMM6      %xmm6
ST7       %st(7)      XMM7      %xmm7
MM0       %mm0        XMM8      %xmm8
MM1       %mm1        XMM9      %xmm9
MM2       %mm2        XMM10     %xmm10
MM3       %mm3        XMM11     %xmm11
MM4       %mm4        XMM12     %xmm12
MM5       %mm5        XMM13     %xmm13
MM6       %mm6        XMM14     %xmm14
MM7       %mm7        XMM15     %xmm15

Special               Debug
LETNi     AT+T        LETNi     AT+T

CS        %cs         DB0       %db0
DS        %ds         DB1       %db1
                      DB2       %db2
ES        %es         DB3       %db3
FS        %fs         -         -
GS        %gs         -         -
SS        %ss         DB6       %db6
                      DB7       %db7
CR0       %cr0        DB8       %db8
CR1       %cr1        DB9       %db9
CR2       %cr2        DB10      %db10
                      DB11      %db11
TR6       %tr6        DB12      %db12
TR7       %tr7        DB13      %db13
                      DB14      %db14
                      DB15      %db15

Suffixes

The operand size of an instruction is specified by a suffix: "b" (byte), "w" (word), "l" (doubleword, integer instructions), "d" (doubleword, MMX or XMM instructions) or "q" (quadword). These suffixes replace the hints "byte ptr", "word ptr", "dword ptr" and "qword ptr" used in LETNi syntax:

movb $0x01,%al          # load byte        01 into AL
movw $0x01,%ax          # load word      0001 into AX
movl $0x01,%eax         # load dword 00000001 into EAX

...but:

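movb $0x81,%cl
movsbl %cl,%eax
(load the sign extended byte 81 from CL into EAX, so EAX now holds FFFFFF81; the source must be a register or a memory location, an immediate is not accepted)

movzbl %cl,%eax
(load the zero extended byte 81 from CL into EAX, so EAX now holds 00000081)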
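
The difference between the sign extending and the zero extending load can be modelled in a few lines of C. This is only a sketch to illustrate what ends up in the upper 24 bits; the function names sign_extend_byte and zero_extend_byte are made up for this example and are not part of any library:

```c
#include <stdint.h>

/* Models movsbl: widen a byte to 32 bits and copy bit 7
   into the upper 24 bits. (Hypothetical helper name.) */
uint32_t sign_extend_byte(uint8_t b)
{
    return (b & 0x80u) ? (0xFFFFFF00u | b) : (uint32_t)b;
}

/* Models movzbl: widen a byte to 32 bits and clear
   the upper 24 bits. (Hypothetical helper name.) */
uint32_t zero_extend_byte(uint8_t b)
{
    return (uint32_t)b;
}
```

Calling sign_extend_byte(0x81) yields FFFFFF81, while zero_extend_byte(0x81) yields 00000081. Bytes below 0x80 give the same result in both cases.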

Numbers And Addresses

Numbers are preceded by a dollar sign '$'; addresses are written as plain numbers:

movl $0x01,%eax
(copy 00000001 to EAX)

movl 0x01,%eax

(copy the doubleword found at address 00000001 to EAX; this causes some penalty cycles for accessing an address not divisible by four, then crashes because we try to access protected memory)

Indirect Addressing

The register holding the address is put into round brackets. The offset, called "displacement" in LETNi vocabulary, is written in front of the opening bracket:

movw 0x04(%esi),%ax
(copy the word found at address [ESI + 0x04] to AX)

Indexed Addressing

The index register follows the register holding the address. The multiplier, called "scale factor" in LETNi vocabulary, follows the index register. All three are separated by commas:

movb 0x00(%esi, %edx, 1),%al
(copy the byte at memory location [ESI + 0x00 + (EDX * 1)] to AL)

movl 0x00(, %edx, 4),%eax
(copy the doubleword at memory location [0x00 + (EDX * 4)] to EAX)
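
The address arithmetic behind the operand form displacement(base, index, scale) can be written down as a small C function. This is a sketch for 32 bit addresses only; the name effective_address and its parameter order are assumptions made for this example:

```c
#include <stdint.h>

/* disp(base, index, scale)  ->  base + index * scale + disp
   A missing base register, as in 0x00(, %edx, 4), is modelled
   by passing 0 for base. (Hypothetical helper name.) */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, int32_t disp)
{
    return base + index * scale + (uint32_t)disp;
}
```

With base 0x1000, index 3, scale 4 and displacement 4, the function returns 0x1010. This is the address an operand like 0x04(%esi, %edx, 4) would touch if ESI held 0x1000 and EDX held 3.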

Global Variables And Functions

To make a function globally visible, we have to add a .globl declaration in front of the function declaration. To make variables globally visible, we add a .comm in each source file where this variable is required. All global functions and variables must be preceded by an underscore "_".

To access addresses of functions or variables, their name must be preceded by a dollar sign "$". To access the content of a variable (read, write, increment, decrement, compare against, etc.), we write its name "as is":

.align 2,0x90
(only in front of your functions!)

.globl _MyFunction
(make MyFunction globally visible)

_MyFunction: # declaration
... # function body
ret # finished, return to caller

.comm _BNR,4

(reserves 4 bytes in the data segment for the global variable _BNR)

movl _BNR,%eax
(copy the content of _BNR to EAX)

movl $_BNR,%eax

(copy the address where _BNR is stored to EAX)

movl $_AllMine,%eax

(copy the address where function _AllMine starts to EAX)

call *%eax
(execute _AllMine)

The instruction call *%eax is equivalent to call _AllMine. However, it wastes one clock cycle by loading the address of _AllMine into EAX. On the other hand, loading a return address into a register can save six clock cycles if we use simple JMP instructions instead of the CALL/RET mechanism. This, of course, is limited to a few local helper functions - the usual CALL/RET is more flexible, because we don't need to know where the called function is stored.
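
For readers coming from C, the distinction between the content of a variable, its address and an indirect call looks roughly like this. It is a sketch only: the initial value 42 and the body of AllMine are invented for the example, and the three wrapper functions are hypothetical names that merely pair each C expression with the assembler line it corresponds to:

```c
#include <stdint.h>

uint32_t BNR = 42;                  /* example value, chosen arbitrarily */

uint32_t AllMine(void)              /* example body, not from the text  */
{
    return BNR + 1;
}

uint32_t read_content(void)         /* movl _BNR,%eax   */
{
    return BNR;
}

uint32_t *take_address(void)        /* movl $_BNR,%eax  */
{
    return &BNR;
}

uint32_t call_indirect(void)        /* movl $_AllMine,%eax ; call *%eax */
{
    uint32_t (*target)(void) = AllMine;
    return target();
}
```

The C compiler prepends the underscore on targets that use this convention, which is why BNR in C and _BNR in the assembler source name the same object there.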

Calls And Jumps

Calls and jumps can use either (global) labels or registers as operands. If a register is used, its name must be preceded by an asterisk *. While the previous example showed us how to use a register together with a CALL instruction, the following example shows us how to create a jump table.

    .data
    .align 2,0x00
L99:.long L00            # jump table
    .long L01
    .long L02
    .long L03
    .long L04

    .text
    ...                  # prologue
    movl $0x04,%ebx
    cmpl $0x04,%eax
    cmova %ebx,%eax      # keep valid
    jmp *L99(, %eax, 4)  # indexed jump
L00:nop                  # target proc
    jmp L05
L01:nop
    jmp L05
L02:nop
    jmp L05
L03:nop
    jmp L05
L04:nop
    jmp L05
L05:nop                  # epilogue
    ...
    ret

This is a C switch{} statement coded in assembler. Using cmova, we save one conditional jump and avoid the ten penalty cycles for a false "guess" of the branch prediction logic.

Please note that I put the jump table into the .data, not the .text segment. As LETNi and AMD clearly state, this is where data belongs. Unfortunately, GCC creates all jump tables in the .text segment. To optimise your code, you should move them to the top of the file and put them into the .data segment as shown above.

Keep in mind that 32 bit jump tables must be aligned to a multiple of 4, while 64 bit jump tables must be aligned to a multiple of 8. This is done by putting an appropriate .align statement in front of the (first) jump table.
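
For comparison, here is a minimal C sketch of the dispatch logic used in the jump table example: the selector is clamped to the last table entry with an unsigned comparison (the job cmova does there) and then used as an index into a table of targets. The handler names and their return values are made up for this example, and a table of function pointers stands in for the table of labels. A compiler is free to turn the clamp into a conditional move or into a branch, so this shows the logic, not the generated code:

```c
/* Hypothetical handlers; the return values only identify the target. */
static int case0(void) { return 100; }
static int case1(void) { return 101; }
static int case2(void) { return 102; }
static int case3(void) { return 103; }
static int case4(void) { return 104; }

typedef int (*handler_t)(void);

/* The table plays the role of L99 in the assembler listing. */
static const handler_t table[5] = { case0, case1, case2, case3, case4 };

int dispatch(unsigned selector)
{
    if (selector > 4u)          /* unsigned "above", like cmova */
        selector = 4u;          /* keep valid                   */
    return table[selector]();   /* indexed jump                 */
}
```

Because the comparison is unsigned, a selector that would be negative as a signed number is also mapped to the last entry.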

.align

GCC spices source files with tons of .align statements spread all over the text segment. If an .align precedes a function, you should leave it alone - do not remove it! Because modern processors work with quite small instruction caches (32 byte on Athlon64 machines), it might be necessary to insert an .align 4,,15 in front of a branch target to support the processor's prefetch mechanisms. However, you should avoid inserting .align statements at places where they might be executed. Each .align statement inserts an appropriate number of nops to move the instruction pointer to the next multiple of the cache line's size, so the next instruction "sits" at the beginning of a new cache line. This is important if the next instruction is the target of a branch. Because the processor speculatively prefetches the code of branch targets, execution continues at the beginning of a cache line if the branch was taken. Execution is sped up if the processor doesn't have to load the instructions of the branch target before it can continue to execute them.
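
How many filler bytes an .align inserts can be worked out with a little arithmetic. The following C sketch assumes that the first argument is a power of two exponent (so 4 means 16 bytes; whether AS reads it that way depends on the target and object format) and that the third argument is the maximum number of bytes to skip, with the alignment not being done at all if more would be needed. The function names are invented for this example:

```c
#include <stdint.h>

/* Filler bytes needed to move addr up to the next multiple
   of 2^log2_align. (Hypothetical helper name.) */
uint32_t align_padding(uint32_t addr, unsigned log2_align)
{
    uint32_t mask = (1u << log2_align) - 1u;
    return (0u - addr) & mask;
}

/* Models .align log2_align,,max_skip: pad only if the required
   padding does not exceed max_skip, otherwise leave addr alone. */
uint32_t align_address(uint32_t addr, unsigned log2_align,
                       uint32_t max_skip)
{
    uint32_t pad = align_padding(addr, log2_align);
    return (pad <= max_skip) ? addr + pad : addr;
}
```

For an instruction pointer at 0x1003, align_padding(0x1003, 4) gives 13 filler bytes and align_address(0x1003, 4, 15) gives 0x1010. With a maximum of 7 the address stays at 0x1003, because 13 bytes would have to be skipped.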

nop

The nop instruction puts the next free execution pipeline into idle mode for one clock cycle. If you insert it at the proper places, it can improve performance and speed up execution. However, the benefits can only be determined experimentally. You have to test the runtime of several variants of your code with exceptional care. The rdtsc instruction is a good tool to measure the runtime of test functions with acceptable accuracy. If you write the output to a file, the gathered data might be sufficient to find out which variant is the fastest.
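
A measuring loop for such experiments could look like the following C sketch. Note the substitution: to stay portable it uses the standard clock() function instead of rdtsc, so it counts CPU time ticks rather than clock cycles and is far too coarse for single instructions; it only shows the "run each variant several times and keep the best result" pattern. All names (variant_t, best_ticks, sample_variant) are invented for this example:

```c
#include <time.h>

typedef void (*variant_t)(void);

static volatile unsigned sink;      /* keeps the loop from being optimised away */

/* Stand-in for one code variant under test. */
void sample_variant(void)
{
    unsigned i;
    for (i = 0; i < 1000u; i++)
        sink += i;
}

/* Runs fn `runs` times and returns the smallest elapsed tick count,
   or -1 if runs is not positive. clock() replaces rdtsc here. */
long best_ticks(variant_t fn, int runs)
{
    long best = -1;
    int i;
    for (i = 0; i < runs; i++) {
        clock_t start = clock();
        long elapsed;
        fn();
        elapsed = (long)(clock() - start);
        if (best < 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Taking the minimum over several runs filters out most disturbances caused by interrupts and task switches, which only ever make a run slower.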

Wednesday, April 14, 2010

14 - AT&T Syntax
