Tuesday, April 13, 2010

12 - Appendix 1

The first appendix supplies you with a collection of examples showing how to design new functions from scratch. Some of them might be used as templates - just copy and paste them into your source files. The remaining functions may serve as hints on how to solve some specific problems. Practice is the best teacher you can get. Read the examples, then start coding your own stuff. While debugging it, you will learn much more than a teacher could ever show you. Without errors, you never gain a deeper insight into how something really works.


Example 01

No registers, local variables or parameters are required. All three areas therefore have a size of zero and the stack pointer stays untouched:

.globl _Brd
  _Brd:movl 0x04(%esp),%eax   # block address
       addl 0x08(%esp),%eax   # + offset
       movzb 0x00(%eax),%eax  # Byte[block+offset]
       ret

Brd() was one of the few remaining crutches for C programmers provided by ST-Open's libraries. Even if the function is a convincing example of really bad code, this construct is still much faster than any equivalent generated by a C compiler.
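For readers coming from C, the job done by Brd() can be modelled roughly as follows. This is only a sketch: the name Brd_c and the parameter types are assumptions, because the original shows nothing but the assembly version.

```c
#include <stdint.h>

/* Hypothetical C counterpart of Brd(). The name and the parameter
   types are assumptions made for illustration only. */
uint32_t Brd_c(const void *block, uint32_t offset)
{
    /* Read one byte at block + offset and zero-extend it to 32 bit,
       which is what movzb does in the assembly version. */
    return ((const uint8_t *)block)[offset];
}
```

The zero extension matters: a byte holding 0xFF comes back as 0x000000FF, not as a negative number.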

Assembly language programmers do not need external functions to access data from an offset to a base address, of course. Loading the block address at least three clock cycles before we start to access data relative to it is all we have to do:

       ...
       movl 0x04(%esp),%ecx    # block address
       ...
       ...                     # 3 clocks distance!
       ...
       movzb 0x1234(%ecx),%eax # DB @ 0x1234[block]
       ...

Code like this prevents the other two execution pipes from taking a nap while the first pipe is busy with copying the block address to ECX.

Side Note: 'Three clock cycles away' does not mean 'three instructions away'! Not all instructions have a latency of three clock cycles, so it might be necessary to fill the dotted lines with many more than just two instructions. For example, direct manipulations of registers without memory operands - like xorl %eax,%eax - are generally executed within one clock cycle. We would have to insert six of them (two pipes times three clock cycles) to feed the other two pipes while the first pipe is busy loading the block address into ECX. You have to calculate the latencies of all instructions used and keep them in the proper order to prevent interruption of simultaneous execution in all three pipes.


Example 02

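Here we store two registers and pass four parameters to the API. The size of our stack frame therefore is 8 + 16 = 24 bytes. The listing reserves 0x3C (60) bytes; together with the 4 byte return address, ESP moves by 64 bytes in total: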

.globl _WinPP
_WinPP:subl $0x3C,%esp
       nop
       nop
       movdqu 0x40(%esp),%xmm0
       movl %edx,0x10(%esp)
       movl %ecx,0x14(%esp)
       movdqu %xmm0,0x00(%esp)
       call _WinSetPresParam
       movl 0x10(%esp),%edx
       movl 0x14(%esp),%ecx
       addl $0x3C,%esp
       ret

WinPP() is one of many 'sandboxes', only called to save ECX and EDX and restore them after the API destroyed them. To speed up execution and save six MOV instructions with memory references, two MOVDQU instructions are used. XMM registers can hold four doublewords, so all four parameters can be copied in one gulp.
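The offsets used in the listing can be checked with a small C model of the stack as seen from ESP after the subl instruction. This is a sketch only: the structure and its field names are hypothetical, and the offsets are taken from the listing above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the stack after "subl $0x3C,%esp". */
struct WinPPFrame {
    uint32_t api_params[4];        /* 0x00..0x0F: copy passed to the API */
    uint32_t saved_edx;            /* 0x10                               */
    uint32_t saved_ecx;            /* 0x14                               */
    uint8_t  unused[0x3C - 0x18];  /* remainder of the 0x3C byte frame   */
    uint32_t return_address;       /* 0x3C: pushed by the caller's CALL  */
    uint32_t caller_params[4];     /* 0x40..0x4F: the four parameters    */
};
```

The caller's parameters begin at offset 0x40, which is why the first MOVDQU reads from 0x40(%esp) and the second one writes to 0x00(%esp).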


Example 03

Finally, we store one register, use a 32 byte string and call two functions - the first has three parameters to pass, the second one. Hence, the size of our stack frame is 4 + 32 + 12 = 48 bytes. We add a 16 byte safety gap, which gives 64 bytes. The listing subtracts 0x3C (60) from ESP; together with the 4 byte return address, that is 64 bytes:

.globl _GetSz
_GetSz:movl 0x04(%esp),%eax
       subl $0x3C,%esp
       nop
       movl %ebx,0x38(%esp)
       leal 0x0C(%esp),%ebx
       movl %eax,0x00(%esp)
       movl $0x1234,0x04(%esp)
       movl %ebx,0x08(%esp)
       call _QEf
       movl %ebx,0x00(%esp)
       call _SLen
       movl 0x38(%esp),%ebx
       addl $0x3C,%esp
       ret

GetSz(h) is called with the dialog's window handle as its only parameter. The content of the entryfield specified by the dialog's window handle and a fixed resource ID (0x1234) is queried via the sandbox QEf(). Our temporary string buffer starts at 0x0C[ESP]; the area up to 0x37[ESP] is free, because EBX is saved at 0x38[ESP] through 0x3B[ESP]. Entryfields are limited to 32 bytes by default (OS/2), so the buffer is large enough to prevent QEf() from overwriting other data. To determine the size of the returned string, SLen(), a function provided by ST-Open's main library, is called. Finally, the string size returned in EAX is passed back to the caller.

We need the window handle for the first call only, so it is copied to EAX before the stack frame is created. This saves the extra code and clock cycles needed to save and restore an additional register. SLen() could be replaced by this alternative:

       ...
       call _QEf
       xorl %eax,%eax
     0:cmpb $0x00,0x00(%ebx)   # compare one byte: strings end with a single zero byte
       je 1f
       incl %ebx
       incl %eax
       jmp 0b
     1:movl 0x38(%esp),%ebx
       addl $0x3C,%esp
       ret

Replacing SLen() with equivalent code saves the 8 clock cycles needed for the CALL/RET sequence and for preloading registers in the called function. To reduce overhead, consider replacing calls to simple functions like SLen() with inline code of your own.
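What the inline loop computes can be modelled in C like this. It is a sketch, not ST-Open code: str_len_inline is a hypothetical name, and the comments map each step to the corresponding instruction of the loop.

```c
#include <stddef.h>

/* Hypothetical C model of the inline replacement for SLen(). */
size_t str_len_inline(const char *p)
{
    size_t n = 0;          /* xorl %eax,%eax                   */
    while (*p != 0) {      /* test one byte for the terminator */
        ++p;               /* incl %ebx                        */
        ++n;               /* incl %eax                        */
    }
    return n;              /* byte count without the zero byte */
}
```

The pointer is advanced inside the loop, just as EBX is in the assembly version. That is harmless there, because EBX is restored from the stack frame right after the loop.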


Example 04

This example shows how local variables can be aligned to 16 byte boundaries, as required for XMM instructions like MOVDQA and friends. Actually, it is impossible to align ESP directly, so we have to sacrifice a general purpose register. Because we do not know which multiple of four ESP is currently aligned to, we have to add a 16 byte safety gap to the value we subtract from ESP. In lng.S, a part of ST-Open's libraries, we can find the following code. Please note that this code does not fully comply with Intelligent Design rules - it creates a stack frame not aligned to a multiple of 64!

       ...
       .align 2,0x90
.globl _MNUtxt
_MNUtxt:
       subl $0x50,%esp
       nop
       nop
       movl %ebp,0x4C(%esp)
       movl %esi,0x48(%esp)
       movl %edi,0x44(%esp)
       movl %ebx,0x40(%esp)
       movl %ecx,0x3C(%esp)
       movl %edx,0x38(%esp)
       movl _BNR,%esi
       leal 0x10(%esp),%edi
       movl 0x58(%esp),%ebx
       movl 0x5C(%esp),%ecx
       movl 0x20(%esi),%edx
       andl $0xFFFFFFF0,%edi
       pxor %xmm0,%xmm0
       subl %ebx,%ecx
       jns 0f
       movl $0x0A,%eax
       jmp L00
       /*
         load field FFFFFF12
       */
     0:andl $0x0F,%edx
       movq %xmm0,0x00(%edi)
       movl $0xFFFFFF12,0x08(%edi)
       movl $0x00000003,0x0C(%edi)
       movdqa %xmm0,0x10(%edi)
       movq %xmm0,0x20(%edi)
       movl %edx,0x20(%esi)
       movl %edi,0x00(%esp)
       call _LDreq
       testl %eax,%eax
       jne L00
       ...

This function passes the address of an LD structure to LDreq(). The LD structure is only required for this call, so we create it in our stack frame on the fly. Because most of the parameters should be set to zero, we clear them with XMM instructions, saving five movs.

To align the structure, we consider the following facts: EDI, like ESP, is a multiple of four, so its last hexadecimal digit cannot be anything other than 0, 4, 8 or C. If it is 0, the address is already aligned. If it is any other digit x, we have to add the difference (16 - x) to the required offset, moving the offset to the beginning of the next 16 byte boundary. Calculations like this definitely take too much time, so we use a trick: we add the largest possible number to the offset, then AND the content of EDI with the pattern 0xFFFFFFF0. If ESP currently points to address 0x0003FEC4, the earliest possible start of the structure is 0x04[ESP] => 0x0003FEC8. We add 0x0C + 0x04 = 0x10 to move the offset to a safe region, so we do a leal 0x10(%esp),%edi. This loads 0x0003FED4 into EDI. The final andl $0xFFFFFFF0,%edi clears the lowest 4 bits of the address, leaving 0x0003FED0 - a properly aligned address with sufficient safety distance from the parameter at the bottom of our stack frame.
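The alignment trick can be checked for all four possible positions of ESP with a few lines of C. This is only a model of the two instructions involved, and align16_above is a hypothetical name.

```c
#include <stdint.h>

/* Model of "leal 0x10(%esp),%edi" followed by
   "andl $0xFFFFFFF0,%edi". The function name is hypothetical. */
uint32_t align16_above(uint32_t esp)
{
    uint32_t edi = esp + 0x10;   /* add the largest possible distance */
    return edi & 0xFFFFFFF0u;    /* clear the lowest four bits        */
}
```

For ESP = 0x0003FEC4 the function returns 0x0003FED0. Whatever multiple of four ESP holds, the result lies between ESP + 0x04 and ESP + 0x10, so the parameter at 0x00(%esp) is never overwritten. In the listing the structure is written up to 0x27(%edi), so even in the worst case it ends just below the register saved at 0x38(%esp).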

MNUtxt() is a part of ST-Open's language support. Depending on the current language, entries of the corresponding subfield are copied to the menu items specified by their ID. Up to 16 languages can be stored in fields FFFFFF12 (user) and FFFFFF13 (system) and the user can switch between them in the running program.


