Wednesday, November 17, 2010

ST-Open's Wrappers for 64 Bit Windoze

Wrappers are generally used to confine calls to external functions to one place. In former times, this was done because all external calls were far calls to other code segments. Things changed a little with the flat memory model, but API functions still reside in higher privileged code segments or are call gates to OS functions running in ring 0, 1 or 2.

Another important issue is the encapsulation of dirty functions, which are known to overwrite registers without restoring them. As shown in my Intelligent Design paper, a dirty environment slows down execution markedly, because frequently used parameters must be reloaded after each call to a dirty function.

Unfortunately, the dirtiness of Windoze and Linux grew with the switch to their 64 bit ABI/API. While only two registers were used as a garbage pile in 32 bit programming environments (ECX, EDX), we now have to face the abuse of 8 registers in Windoze environments (RCX, RDX, R08, R09, R10, R11, XMM4 and XMM5) or even 11 registers in Linux (RDI, RSI, RCX, RDX, R08, R09, R10 and XMM4...XMM7). Whenever you pass the mentioned registers to an API function, you can be sure they are returned with changed content. To avoid changed registers, you have to use wrappers. If you do not, you have to reload your parameters over and over again.

Wrappers are a must if you want to keep your programming environment clean. As a positive side effect, you can customize API calls. Doing so can drastically reduce the number of arguments to pass - a wrapper can manage things like retrieving the HWND for a dialog item much faster than the programmer, because it already has all registers preserved, which saves the time to preserve and restore them more than once.

Windoze 64 Bit Wrapper

My programming environment is the TDM/MinGW64 package, date 2010-05-09. Later packages are broken and refuse to work properly. All code snippets are AS sources taken from ST-Open's system library. All functions in this environment must be defined as follows:
.text
.globl   _MyFunction
.def     _MyFunction; .scl 2; .type 32; .endef
 _MyFunction:
To tell GCC to put the following output into the code segment, we have to insert a .text statement on top of our code. If your code references static data in the data or DR segment, their definitions should precede the .text statement. Never put data into the code segment - it can cause a drastic performance loss.

Next, .globl statements tell the linker that this function is globally available (visible) and that external functions can call it. If the .globl is missing, the function can only be called from functions residing in the current source file.

A .def statement provides additional symbol information for the linker and debugger. The .scl defines the storage class (2 stands for external), .type 32 means it is a function. This statement probably is superfluous in pure assembler programs, but nevertheless is required if the function is to be callable from external HLL functions.

Finally, the real code of MyFunction() begins at _MyFunction. That is: The address of the first instruction following the label _MyFunction is the start address of _MyFunction. Whenever the linker stumbles upon a call _MyFunction, it replaces the label with a reference to that start address.

So much for the HLL compatible gobbledygook. Let's continue with some basic thoughts about the internal organisation of a wrapper.

Wrapper Designs

There are two different ways to organise a wrapper - either you provide separate functions with endless repetitions of one and the same prologue and epilogue (stand-alone), or you use one prologue and epilogue for all functions (collection).

Stand-Alone Wrappers

Providing a separate prologue and epilogue for each function has one advantage: The linker can cut the function's code out of a library and add just that code to the program where the function is called. The downside is a library with tons of redundant repetitions of one and the same prologue and epilogue. Hence, programs using the library are kept smaller, while the size of the library is quite large. Depending on the size of the prologue and epilogue, there is a point where the advantage is eaten up by their repetition. For example, the payload of an API wrapper is about ten percent of the entire wrapper, while the remaining 90 percent are occupied by preserving and restoring clobbered registers.

Let us assume our library has 20 functions, resulting in 20 * 0.1 units of payload and 20 * 0.9 units of redundant repetitions. This pays off if only one library function is called, because only that one wrapper is linked into the program. Every further function drags along its own copy of the prologue and epilogue, though: with two, four or twenty functions, the linked code still consists of 10 percent payload and 90 percent overhead. As you can see, this concept looks better at first glance, but turns out to be a bad design for daily use.

Collected Wrappers

Here we go the other way around. This concept adds bloat if only one function is used, but pays off if we call multiple functions. Both payload and overhead now have fixed sizes, because the whole collection is linked in. The more functions we use, the smaller the share of overhead becomes. Counting only the payload that is actually used against the shared 0.9 units of prologue and epilogue, the ratio is roughly 18 to 82 percent with two functions, roughly 31 to 69 percent with four functions, and so on. In the end, we can reduce the size of our program by some bytes if we use a collected wrapper. Therefore, most ST-Open libraries meanwhile use collected rather than stand-alone functions.
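
The trade-off between both designs can be sketched with a little arithmetic. The following Python snippet is only a simplified size model, not a measurement: it assumes the figures used above (a stand-alone wrapper consists of 10 percent payload and 90 percent prologue and epilogue, the library holds 20 functions) and further assumes that a collection is always linked in as a whole. All names are made up for this illustration.

```python
# Sizes are counted in tenths of one stand-alone wrapper, so everything stays an integer.
PAYLOAD = 1             # assumed: 10 percent of a stand-alone wrapper is payload
OVERHEAD = 9            # assumed: 90 percent is prologue and epilogue
LIBRARY_FUNCTIONS = 20  # assumed library size, as in the example above


def standalone_size(used):
    """Linked code if every used wrapper brings its own prologue and epilogue."""
    return used * (PAYLOAD + OVERHEAD)


def collected_size(used):
    """Linked code if all wrappers share one prologue and epilogue.

    The whole collection is pulled in, so the result does not depend on `used`.
    """
    return LIBRARY_FUNCTIONS * PAYLOAD + OVERHEAD


def break_even():
    """Smallest number of used functions for which the collection is smaller."""
    for used in range(1, LIBRARY_FUNCTIONS + 1):
        if collected_size(used) < standalone_size(used):
            return used
    return None


if __name__ == "__main__":
    for used in (1, 2, 3, 4, 20):
        print(used, standalone_size(used), collected_size(used))
    print("collection wins from", break_even(), "functions on")
```

Under these assumptions the collection costs 29 tenths no matter what, while stand-alone wrappers cost 10 tenths per used function, so the collection is the smaller choice as soon as a program uses three or more functions.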

Windoze API

Prologue And Epilogue

With the change to 64 bit, Microsoft decided to follow in the footsteps of Loonix and abuse more registers as a garbage pile. In addition to ECX and EDX in 32 bit code, we now have to preserve R08, R09, R10, R11, XMM4 and XMM5 as well, if we do not want to reload the contents of these registers after each API call. I have seen a lot of register dumps throughout the last weeks, so I can tell you those registers definitely are destroyed after each API call. The most important parts of any API wrapper therefore are its prologue and epilogue. The prologue looks like this:
    ...

    .p2align 4,,15
  0:subq     $0xB8,%rsp
    nop
    nop
    movdqa   %xmm4,0x60(%rsp)
    movdqa   %xmm5,0x70(%rsp)
    movq     %rcx, 0x88(%rsp)
    movq     %rdx, 0x90(%rsp)
    movq     %r8,  0x98(%rsp)
    movq     %r9,  0xA0(%rsp)
    movq     %r10, 0xA8(%rsp)
    movq     %r11, 0xB0(%rsp)
    jmp      *%rax

    ...
where 0 is the entry point for the function declarations placed above the prologue. RAX is set to the address of the real function code in each declaration, right before the jump to 0. Even if it looks quite lengthy, saving all registers is done in about 5 clock cycles (RSP correction plus two write combining sequences). Using six pushes for the GPRs and two movdqas for the XMM registers took about 15 clock cycles, because push works with decreasing addresses, so no write combining is triggered (two clocks for the movdqas, 13 clocks for six pushes - RSP is available after 2 clocks, only the last push needs all three clock cycles).
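
The offsets used in the prologue can be checked with a few lines of Python. This is just a sketch of the frame layout as I read it from the listing: the slot table is copied from the prologue, while the entry condition (RSP is 8 bytes off a 16 byte boundary, because the caller's call pushed a return address onto an aligned stack) is the usual 64 bit Windoze convention and an assumption about the caller. The function names are made up.

```python
FRAME_SIZE = 0xB8        # bytes reserved by "subq $0xB8,%rsp"
ENTRY_MISALIGNMENT = 8   # assumed: return address pushed onto a 16 byte aligned stack

# (register, offset from RSP, size in bytes), copied from the prologue listing
SAVE_SLOTS = [
    ("xmm4", 0x60, 16), ("xmm5", 0x70, 16),
    ("rcx", 0x88, 8), ("rdx", 0x90, 8), ("r8", 0x98, 8),
    ("r9", 0xA0, 8), ("r10", 0xA8, 8), ("r11", 0xB0, 8),
]


def stack_aligned_after_prologue():
    """True if RSP is a multiple of 16 again after the frame is reserved.

    The stack grows downwards, so the frame size is subtracted.
    """
    return (ENTRY_MISALIGNMENT - FRAME_SIZE) % 16 == 0


def slots_are_valid():
    """Every slot is naturally aligned, inside the frame and overlaps no other slot."""
    end_of_previous = 0
    for _name, offset, size in SAVE_SLOTS:
        if offset % size or offset < end_of_previous or offset + size > FRAME_SIZE:
            return False
        end_of_previous = offset + size
    return True


def call_area_bytes():
    """Bytes between RSP and the first save slot, left free for the API call."""
    return SAVE_SLOTS[0][1]


if __name__ == "__main__":
    print(stack_aligned_after_prologue(), slots_are_valid(), hex(call_area_bytes()))
```

The check shows why 0xB8 is a sensible frame size: it re-aligns RSP to 16 bytes, which in turn keeps both movdqa targets (0x60 and 0x70) on the 16 byte boundaries that instruction requires. The 0x60 bytes below the save area are presumably reserved for the 32 bytes of home space every API function expects plus stack arguments; that reading is mine, the listing itself does not say so.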

The epilogue is quite similar to the prologue, except target and source of the move instructions are exchanged:
    ...

    .p2align 4,,15
XIT:movdqa   0x60(%rsp),%xmm4
    movdqa   0x70(%rsp),%xmm5
    movq     0x88(%rsp),%rcx
    movq     0x90(%rsp),%rdx
    movq     0x98(%rsp),%r8
    movq     0xA0(%rsp),%r9
    movq     0xA8(%rsp),%r10
    movq     0xB0(%rsp),%r11
    addq     $0xB8,%rsp
    ret
This is trivial code and holds no mysteriously hidden secrets. In contrast to the prologue, reads from memory have no accelerating mechanism like write combining, so it takes about 10 clock cycles until the final return is executed - memory reads and writes are limited to one access per clock cycle. With pops, the epilogue was executed in 15 clock cycles (two for the movdqas, 13 for the pops, where only the last one needs 3 cycles, the others are ready after 2 clocks).

All mentioned latencies are valid for the Phenom II (family 10) only. For older Athlons (family 8), the push version needs 21 and the pop version 27 clock cycles. Latencies for the Intelligent Design versions are the same for both processor families.
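
To see how the 15 clock cycles of the push and pop versions add up, the figures given above can be tallied in a few lines of Python. Nothing here is measured: the snippet only re-adds the Phenom II numbers quoted in the text, and the function names are made up for the illustration.

```python
# Clock counts quoted in the text for the Phenom II; nothing here is measured.
XMM_MOVES = 2        # the two movdqa instructions together
MOV_PROLOGUE = 5     # prologue built from plain moves
MOV_EPILOGUE = 10    # epilogue built from plain moves


def stack_op_chain(count, handover=2, last=3):
    """Clocks for `count` dependent pushes or pops.

    Each push or pop hands RSP on after `handover` clocks,
    only the last one has to run for its full `last` clocks.
    """
    return (count - 1) * handover + last


def push_pop_round_trip():
    """Prologue with six pushes plus epilogue with six pops."""
    prologue = XMM_MOVES + stack_op_chain(6)
    epilogue = XMM_MOVES + stack_op_chain(6)
    return prologue + epilogue


def mov_round_trip():
    """Prologue plus epilogue of the version shown above."""
    return MOV_PROLOGUE + MOV_EPILOGUE


if __name__ == "__main__":
    print(stack_op_chain(6), push_pop_round_trip(), mov_round_trip())
```

With these figures, saving and restoring through pushes and pops costs 30 clock cycles per wrapper call, the version shown above 15.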

Function Declarations

All function declarations follow a stereotyped pattern, where only the function names change from declaration to declaration. The following snippet is just an excerpt from my original file:
          .text

          .p2align 4,,15
          .globl   _RegClass
          .def     _RegClass; .scl 2; .type 32; .endef
_RegClass:movq     $rclass,%rax
          jmp      0f

          .p2align 4,,15
          .globl   _RgClassX
          .def     _RgClassX; .scl 2; .type 32; .endef
_RgClassX:movq     $rclssx,%rax
          jmp      0f

          .p2align 4,,15
          .globl   _LdIcon
          .def     _LdIcon; .scl 2; .type 32; .endef
  _LdIcon:movq     $ldicon,%rax
          jmp      0f

          ...
I use symbolic names for all local labels, but you could use GCC-style labels like L00...Lxx, as well. Symbolic names are faster to find if the file includes 60 functions like cap.S, though...
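
Because the pattern never changes, the declarations could also be generated instead of typed. The following Python sketch is a hypothetical helper and not part of ST-Open's tool chain. It emits one stanza per (global name, local label) pair, using the three pairs from the excerpt above. Unlike the original file, it puts the label on a line of its own, which makes no difference to the assembler.

```python
# Pairs of (exported name, local label), taken from the excerpt above.
WRAPPERS = [
    ("_RegClass", "rclass"),
    ("_RgClassX", "rclssx"),
    ("_LdIcon", "ldicon"),
]


def declaration(global_name, local_label):
    """Return the stanza for one exported wrapper entry point."""
    return "\n".join([
        "    .p2align 4,,15",
        f"    .globl   {global_name}",
        f"    .def     {global_name}; .scl 2; .type 32; .endef",
        f"{global_name}:",
        f"    movq     ${local_label},%rax",   # address of the real function code
        "    jmp      0f",                     # continue with the shared prologue
    ])


def declarations(wrappers):
    """Return the complete declaration block, starting with .text."""
    stanzas = "\n\n".join(declaration(name, label) for name, label in wrappers)
    return "    .text\n\n" + stanzas + "\n"


if __name__ == "__main__":
    print(declarations(WRAPPERS))
```

Adding a wrapper then means adding one pair to the list plus the function body itself.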

Functions

The functions themselves handle all required tasks, call the corresponding API function and pass the API return code (RC) back to the caller:
          ...

          .p2align 4,,15
   rclass:call     *__imp__RegisterClassA(%rip)
          jmp XIT

          .p2align 4,,15
   rclssx:call     *__imp__RegisterClassExA(%rip)
          jmp XIT

          .p2align 4,,15
   ldicon:call     *__imp__LoadIconA(%rip)
          jmp XIT

          ...
The functions shown just pass the received parameters in RCX, RDX, R08 and R09 on to the API, but a wrapper may also pre-process those parameters before it finally calls the API function. An example:
          ...

          .p2align 4,,15
   ctlshw:call     *__imp__GetDlgItem(%rip)
          movq     %rax,%rcx                 # HWND
          movq     0x98(%rsp),%rdx           # flag
          call     *__imp__ShowWindow(%rip)
          jmp XIT

          ...
A call to CtlSh(HWND, id, bool); shows or hides the control specified by its ID and the handle of the control's parent window (probably a dialog). To speed up execution, the wrapper function first retrieves the control's window handle with GetDlgItem(), then calls ShowWindow(), the API function to show or hide general windows, rather than calling a Windoze macro (which does nothing other than what CtlSh() does, but probably takes the long-winded way). The flag is reloaded from the slot where the prologue saved R08, because the register itself may have been destroyed by the first API call.
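
The data flow of this wrapper can be modelled in a few lines of Python. This is a toy model under loud assumptions, not the real thing: registers are dictionary entries, fake_get_dlg_item() and fake_show_window() are hypothetical stand-ins for the API functions, and every stand-in deliberately trashes the registers the prologue saves.

```python
VOLATILE = ("rcx", "rdx", "r8", "r9", "r10", "r11")


def clobber(regs):
    """Stand-in for what an API call may do to the saved registers."""
    for name in VOLATILE:
        regs[name] = 0xDEAD


def fake_get_dlg_item(regs, controls):
    """Hypothetical GetDlgItem: parent in RCX, id in RDX, handle returned in RAX."""
    regs["rax"] = controls[(regs["rcx"], regs["rdx"])]
    clobber(regs)


def fake_show_window(regs, shown):
    """Hypothetical ShowWindow: handle in RCX, flag in RDX, made-up RC in RAX."""
    shown[regs["rcx"]] = regs["rdx"]
    regs["rax"] = 1
    clobber(regs)


def ctl_show(regs, controls, shown):
    """Model of the wrapper: prologue, both API calls, epilogue."""
    saved = {name: regs[name] for name in VOLATILE}  # prologue: save registers
    fake_get_dlg_item(regs, controls)
    regs["rcx"] = regs["rax"]        # movq %rax,%rcx         (HWND)
    regs["rdx"] = saved["r8"]        # movq 0x98(%rsp),%rdx   (flag)
    fake_show_window(regs, shown)
    regs.update(saved)               # epilogue: restore, RAX keeps the RC
    return regs["rax"]


if __name__ == "__main__":
    regs = {"rax": 0, "rcx": 0x100, "rdx": 42, "r8": 5, "r9": 0, "r10": 0, "r11": 0}
    shown = {}
    ctl_show(regs, {(0x100, 42): 0x200}, shown)
    print(shown, regs)
```

Taking the flag from the live R08 instead of the saved copy would hand 0xDEAD to the second call in this model, which is exactly the kind of reload problem the wrapper is meant to hide from its caller.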

A Real Epilogue...

I hope I could impart some knowledge about the pros and cons of old fashioned C-style programming techniques and modern alternatives. The entire file cap.S can be downloaded here: wrappers.7z.