Wrappers are generally used to confine calls to external functions to one place. In former times, this was done because all external calls were far calls to other code segments. Things changed a little with the flat memory model, but API functions still reside in higher privileged code segments or are call gates to OS functions running in ring 0, 1 or 2.
Another important issue is the encapsulation of dirty functions, known to overwrite registers without restoring them. As shown in my Intelligent Design paper, a dirty environment slows down execution markedly, because frequently used parameters must be reloaded after each call to a dirty function.
Unfortunately, the dirtiness of Windoze and Linux grew with the switch to their 64 bit ABI/API. While only two registers were used as garbage pile in 32 bit programming environments (ECX, EDX), we now have to face the abuse of 8 registers in Windoze environments (RCX, RDX, R08, R09, R10, R11, XMM4 and XMM5) or even 11 registers in Linux (RDI, RSI, RCX, RDX, R08, R09, R10 and XMM4...XMM7). The official calling conventions go even further: besides the return register RAX, Windoze also declares XMM0...XMM3 volatile, and Linux does the same with R11 and every XMM register. Whenever you pass the mentioned registers to an API function, you have to expect them to come back with changed content. To avoid changed registers, you have to use wrappers. If you do not, you have to reload your parameters over and over again.
Wrappers are a must if you want to keep your programming environment clean. As a positive side effect, you can customize API calls. Doing so can drastically reduce the number of arguments to pass: a wrapper can manage things like retrieving the HWND for a dialog item much faster than the programmer, because it already has all registers preserved, which saves the time to preserve and restore them more than once.
Windoze 64 Bit Wrapper
My programming environment is the TDM/MinGW64 package, date 2010-05-09. Later packages are broken and refuse to work properly. All code snippets are AS sources taken from ST-Open's system library. All functions in this environment must be defined as follows:

        .text
        .globl _MyFunction
        .def   _MyFunction; .scl 2; .type 32; .endef
    _MyFunction:

To tell GCC to put the following output into the code segment, we have to insert a .text statement on top of our code. If your code references static data in the data or DR segment, their definitions should precede the .text statement. Never put data into the code segment, because it can cause a drastic performance loss.
Next, .globl statements tell the linker this function is globally available (visible) and external functions can call it. If the .globl is missing, that function can only be called from functions residing in the current source file.
A .def statement adds debugging information about the symbol to the object file. The .scl defines the storage class (2 means external), .type 32 means it is a function. This statement probably is superfluous in pure assembler programs, but nevertheless is required if the function shall be callable from external HLL functions.
Finally, the real code of MyFunction() begins at _MyFunction. That is: The address of the first instruction following the label _MyFunction is the start address of _MyFunction. Whenever the linker stumbles upon a call _MyFunction, it replaces the label with a reference to that start address.
So much for the HLL compatible gobbledygook. Let's continue with some basic thoughts about the internal organisation of a wrapper.
Wrapper Designs
There are two different ways to organise a wrapper: either you provide separate functions with endless repetitions of one and the same prologue and epilogue (stand-alone), or you use one prologue and epilogue for all functions (collection).
Stand-Alone Wrappers
Providing a separate prologue and epilogue for each function has one advantage: The linker can cut the function's code out of a library and add just that code to the program where the function is called. The downside is a library with tons of redundant repetitions of one and the same prologue and epilogue. Hence, programs using the library are kept smaller, while the size of the library is quite large. Depending on the size of the prologue and epilogue, there is a point where the advantage is eaten up by their repetition. For example, the payload of an API wrapper is about ten percent of the entire wrapper, while the remaining 90 percent are occupied by preserving and restoring clobbered registers.
Let us assume our library has 20 functions, resulting in 20 * 0.1 units of payload and 20 * 0.9 units of redundant repetitions. Measured against the entire library, the payload of a single function is just 0.5 percent, while the remaining 99.5 percent are overhead. Stand-alone wrappers pay off if only one library function is called. With two functions linked in, a single payload shrinks to five percent of the wrapper code against 95 percent overhead, with four functions to 2.5 percent against 97.5 percent, and so on. As you can see, this concept looks better at first glance, but turns out to be a bad design for daily use.
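To put rough numbers on this trade-off, here is a minimal C sketch that compares the linked size of stand-alone wrappers with that of a single shared prologue and epilogue (the collected design). Everything in it is an assumption made for illustration: the size units, the names, the 10/90 split taken from the example above, and the premise that a collected wrapper file is always linked as a whole. The small per-function declaration stubs of a collected wrapper are ignored.

```c
#include <assert.h>   /* static_assert (C11) */

/* Hypothetical size units: one stand-alone wrapper is 10 units,
   1 unit of payload plus 9 units of prologue and epilogue. */
enum { PAYLOAD = 1, OVERHEAD = 9 };
static_assert(PAYLOAD + OVERHEAD == 10, "10/90 split assumed in the text");

/* Stand-alone: only the wrappers actually used are linked in,
   each one carrying its own prologue and epilogue. */
unsigned standalone_size(unsigned used)
{
    return used * (PAYLOAD + OVERHEAD);
}

/* Collected: assumed to be linked as one block, so the payload of
   every library function plus one shared prologue/epilogue is
   present, no matter how many functions the program calls. */
unsigned collected_size(unsigned total)
{
    return total * PAYLOAD + OVERHEAD;
}

/* Smallest number of used functions at which the collected design
   is no larger than the stand-alone one (0 if that never happens). */
unsigned break_even(unsigned total)
{
    for (unsigned used = 1; used <= total; ++used)
        if (standalone_size(used) >= collected_size(total))
            return used;
    return 0;
}
```

Under these assumptions a library of 20 functions costs 29 units as a collected wrapper, while stand-alone wrappers cost 10 units per function used, so break_even(20) yields 3: from the third function on, the shared prologue and epilogue win.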
Collected Wrappers
Here we go the other way around. This concept adds bloat if only one function is used, but pays off if we call multiple functions. Both payload and overhead now have fixed sizes. The more functions we use, the smaller our overhead becomes. With two functions, the ratio is 20 to 80 percent, with four functions 40 to 60 percent, and so on. In the end, we can reduce the size of our program by some bytes if we use a collected wrapper. Therefore, most ST-Open libraries by now use collected rather than stand-alone functions.
Windoze API
Prologue And Epilogue
With the change to 64 bit, Microsoft decided to follow the footsteps of Loonix and abuse more registers as garbage pile. In addition to ECX and EDX in 32 bit code, we now have to preserve R08, R09, R10, R11, XMM4 and XMM5 as well, if we do not want to reload the contents of these registers after each API call. I have seen a lot of register dumps throughout the last weeks, so I can tell you those registers definitely are destroyed after each API call. The most important parts of any API wrapper therefore are its prologue and epilogue. The prologue looks like this...

        .p2align 4,,15
    0:  subq   $0xB8,%rsp
        nop
        nop
        movdqa %xmm4,0x60(%rsp)
        movdqa %xmm5,0x70(%rsp)
        movq   %rcx, 0x88(%rsp)
        movq   %rdx, 0x90(%rsp)
        movq   %r8,  0x98(%rsp)
        movq   %r9,  0xA0(%rsp)
        movq   %r10, 0xA8(%rsp)
        movq   %r11, 0xB0(%rsp)
        jmp    *%rax

...where 0 is the entry point for the function declarations placed above the prologue. RAX is set to the address of the real function code at the end of the declaration, preceding the jump to 0. Even if it looks quite lengthy, saving all registers is done in about 5 clock cycles (RSP correction plus two write combining sequences). Using six pushes for the GPRs and two movdqas for the XMM registers would take about 15 clock cycles, because push works with decreasing addresses, so no write combining is triggered (two clocks for the movdqas, 13 clocks for six pushes: RSP is available after 2 clocks, only the last push needs all three clock cycles).
The epilogue is quite similar to the prologue, except target and source of the move instructions are exchanged:

        ...
        .p2align 4,,15
    XIT:movdqa 0x60(%rsp),%xmm4
        movdqa 0x70(%rsp),%xmm5
        movq   0x88(%rsp),%rcx
        movq   0x90(%rsp),%rdx
        movq   0x98(%rsp),%r8
        movq   0xA0(%rsp),%r9
        movq   0xA8(%rsp),%r10
        movq   0xB0(%rsp),%r11
        addq   $0xB8,%rsp
        ret

This is trivial code and holds no mysteriously hidden secrets. In contrast to the prologue, register reads have no accelerating mechanisms like write combining, so it takes about 10 clock cycles until the final return is executed, because memory reads and writes are limited to one access per clock cycle. With pops, the epilogue would be executed in 15 clock cycles (two for the movdqas, 13 for the pops, where only the last one needs 3 cycles, the others are ready after 2 clocks).
All mentioned latencies are valid for the Phenom II (family 10) only. For older Athlons (family 8), the push version needs 21 and the pop version 27 clock cycles. Latencies for the Intelligent Design versions are the same for both processor families.
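The offsets used by the prologue and epilogue can be cross-checked with a small C sketch that mirrors the 0xB8 byte frame as a struct. The struct and all field names are hypothetical. Only the offsets 0x60 to 0xB0 and the frame size come from the listings. The first 0x20 bytes are the home space the Windoze 64 bit calling convention reserves for the called API function. That the area from 0x20 to 0x5F holds stack arguments for the API, and that 0x80 is an unused gap, are assumptions.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical C view of the frame created by "subq $0xB8,%rsp". */
struct wrapper_frame {
    uint64_t home[4];        /* 0x00: home space for the called API function */
    uint64_t stack_args[8];  /* 0x20: presumably API stack arguments (assumption) */
    uint8_t  xmm4[16];       /* 0x60 */
    uint8_t  xmm5[16];       /* 0x70 */
    uint64_t unused;         /* 0x80: gap, not touched by the listings */
    uint64_t rcx;            /* 0x88 */
    uint64_t rdx;            /* 0x90 */
    uint64_t r8;             /* 0x98 */
    uint64_t r9;             /* 0xA0 */
    uint64_t r10;            /* 0xA8 */
    uint64_t r11;            /* 0xB0 */
};

static_assert(sizeof(struct wrapper_frame) == 0xB8, "frame size");
static_assert(offsetof(struct wrapper_frame, xmm4) == 0x60, "xmm4 slot");
static_assert(offsetof(struct wrapper_frame, xmm5) == 0x70, "xmm5 slot");
static_assert(offsetof(struct wrapper_frame, rcx)  == 0x88, "rcx slot");
static_assert(offsetof(struct wrapper_frame, r11)  == 0xB0, "r11 slot");

/* movdqa faults on addresses that are not 16 byte aligned. */
static_assert(offsetof(struct wrapper_frame, xmm4) % 16 == 0, "xmm4 aligned");
static_assert(offsetof(struct wrapper_frame, xmm5) % 16 == 0, "xmm5 aligned");

/* On entry RSP is 8 modulo 16, because the call pushed the return
   address onto a 16 byte aligned stack. Subtracting a frame size that
   is also 8 modulo 16 makes RSP a multiple of 16 again. */
int frame_keeps_alignment(unsigned frame_size)
{
    return frame_size % 16 == 8;
}
```

With frame_keeps_alignment(0xB8) returning 1, the two XMM slots at 0x60 and 0x70 end up on 16 byte boundaries as long as the caller respects the stack alignment rule of the ABI.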
Function Declarations
All function declarations use a stereotype pattern, where only the function names change from declaration to declaration. The following snippet is just an excerpt from my original file:

        .text
        .p2align 4,,15
        .globl _RegClass
        .def   _RegClass; .scl 2; .type 32; .endef
    _RegClass:
        movq   $rclass,%rax
        jmp    0f
        .p2align 4,,15
        .globl _RgClassX
        .def   _RgClassX; .scl 2; .type 32; .endef
    _RgClassX:
        movq   $rclssx,%rax
        jmp    0f
        .p2align 4,,15
        .globl _LdIcon
        .def   _LdIcon; .scl 2; .type 32; .endef
    _LdIcon:
        movq   $ldicon,%rax
        jmp    0f
        ...

I use symbolic names for all local labels, but you could use GCC-style labels like L00...Lxx, as well. Symbolic names are faster to find if the file includes 60 functions like cap.S, though...
Functions
The functions themselves handle all required tasks, call the corresponding API function and pass the API return code (RC) to the caller:

        ...
        .p2align 4,,15
    rclass:
        call   *__imp__RegisterClassA(%rip)
        jmp    XIT
        .p2align 4,,15
    rclssx:
        call   *__imp__RegisterClassExA(%rip)
        jmp    XIT
        .p2align 4,,15
    ldicon:
        call   *__imp__LoadIconA(%rip)
        jmp    XIT
        ...

The shown functions just pass the received parameters in RCX, RDX, R08 and R09 to the API, but they may pre-process those parameters before they finally call the API function. An example:

        ...
        .p2align 4,,15
    ctlshw:
        call   *__imp__GetDlgItem(%rip)
        movq   %rax,%rcx            # HWND
        movq   0x98(%rsp),%rdx      # flag
        call   *__imp__ShowWindow(%rip)
        jmp    XIT
        ...

A call to CtlSh(HWND, id, bool); shows or hides the control specified by its ID and the handle of the control's parent window (probably a dialog). To speed up execution, the wrapper function first retrieves the control's window handle, then calls the API function to show or hide general windows rather than calling a Windoze macro (which does nothing other than what CtlSh() does, but probably takes the long-winded way).
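For readers more at home in HLLs, the control flow of a collected wrapper (declaration, shared prologue, function body, shared epilogue) can be mimicked in C. This is only a rough analogy, not the library's code: all names are hypothetical, the two counters merely stand in for saving and restoring registers, and the trivial bodies replace real API calls.

```c
#include <assert.h>

/* Type of a function body; corresponds to the address loaded into RAX. */
typedef long (*body_fn)(long arg1, long arg2);

int saves;     /* counts runs of the shared "prologue" */
int restores;  /* counts runs of the shared "epilogue" */

/* Shared prologue and epilogue: present once, however many
   functions the collection contains. */
long wrap_call(body_fn body, long arg1, long arg2)
{
    long rc;

    assert(body != 0);
    ++saves;                 /* prologue: registers would be saved here */
    rc = body(arg1, arg2);   /* corresponds to "jmp *%rax" */
    ++restores;              /* epilogue (XIT): registers would be restored */
    return rc;               /* return code is passed through to the caller */
}

/* Function bodies: the payload, here placeholders for API calls. */
long body_add(long a, long b) { return a + b; }
long body_sub(long a, long b) { return a - b; }

/* Declarations: one small stub per exported name, each selecting
   its body and entering the shared code. */
long Add(long a, long b) { return wrap_call(body_add, a, b); }
long Sub(long a, long b) { return wrap_call(body_sub, a, b); }
```

Calling Add(2, 3) and Sub(7, 4) runs the shared prologue and epilogue once per call, while each additional exported function only adds its stub and its body.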