Tuesday, April 13, 2010

09 - First Steps

While the previous pages introduced the basic concepts of Intelligent Design to you, it is time to fill this abstract theortical building with life and show, how the described methods and techniques are applied to real life applications. To get in touch with Intelligent Design, an example function is developed step for step.

Two things should be stressed: All statements and comments assume that this function is run on a recent operating system with support for recent programming techniques like Intelligent Design, e.g. IDEOS. If you run the example on an old fashioned platform like Linux, OS/2 or Windows, there is a 75 percent chance to encounter a nice crash while the first XMM instruction is executed. Furthermore, Intelligent Design is optimised for the most mature processor design, AMD's Athlon64. It is the only machine with three execution pipes for integer and another three execution pipes for floating point instructions. This makes any Athlon64 superior to LETNi's I-am-so-slowium processors with their one and a half execution pipes for all (integer and FP) instructions. As a matter of fact, Intelligent Design code runs faster on a slower Athlon64 than on a faster I-am-so-slowium.

(Sidenote: Unfortunately, I had to remove all comments in the code snippets because of the limited formatting abitlities of the blog engine.)


Problem

Eight entry fields in a dialog are to fill with four hexadecimal (subfields 00...03, IDs 0x1200...0x1203) and two decimal (subfield 04/05, IDs 0x1204/0x1205) numbers as well as two strings (subfield 06/07, IDs 0x1206/0x1207). All data are stored in the given subfields of datafield 0000001E. Entries are selected via keyboard, evaluated and the selected entry number is stored in a global variable with the symbolic name CURSEL (adress 0x0240[BNR]), then our function is called to display all data of the chosen entry in a dialog. The datafield is loaded permanently, its MemoryHandle is stored in a runtime variable with the symbolic name MH_SEL (adress 0x2028[BNR]). It has 1,024 entries with 6 subfields (00-05) of type 03 (doubleword) and 2 subfields (06, 07) of type 07 (dynamic strings with garbage collection).


Solution

Even if this is a trivial problem, there are many ways to solve it. The preferred solution in typical C code is a loop executed eight times where some logic switches between three possible output types (hex, decimal, string). It is not the best way to solve the given problem - comparisons and repeated writes are quite slow.

To avoid repetitive work and speed up execution markably, it's a much better idea to unroll the suggested loop completely. Unfortunately, it is impossible to apply real improvements if we use high level languages like C. Well, we could add some lines inline code, but that's like using tape to fix a broken screw. The only way to get anywhere is writing our code in assembler. No existing compiler can see connections between called functions nor is it able to detect how many dependencies are created with the code it generates. It were possible to write a compiler with capabilities like that, but the result were bloated and, first of all, extremely slow. A human can solve really complex problems much better, especially if the code is written in pure assembler.


Prologue

The usual conventional prologue

.globl _MyFunction
_MyFunction:
        pushl %ebp
        movl %esp,%ebp
        subl $0xC0,%esp
        pushl %ebx
        pushl %edi
        pushl %esi
        ...

looks like this in Intelligent Design

.globl _MyFunction
_MyFunction:
        subl $0xFC,%esp
        nop
        nop
        movl %ebp,0xE4(%esp)
        movl %esi,0xE8(%esp)
        movl %edi,0xEC(%esp)
        movl %edx,0xF0(%esp)
        movl %ecx,0xF4(%esp)
        movl %ebx,0xF8(%esp)
        ...

The difference between both versions is obvious on the first sight. What is going on 'under the hood' cannot be seen at all, but it can be shown in form of a grahical image. Figure 0E shows the snapshot of a conventional stack frame



figure 0F (right) a snapshot of an Intelligent Design stack frame soon after the above  prologues were executed



Because conventional code is based on a flying stack pointer with random content, it is impossible to determine the current alignment of the stack pointer. Conventional code uses old fashioned push - call -addl $x,%esp sequences, as well, so ESP/RSP frequently changes its current content. These disadvantages are the reason why we need a second register to address stack elements with reliable accuracy. The question marks in figure 0E emphasize the undetermined content of the stack pointer - an artificial feature of conventional code. Obviously, the visible difference between conventional and Intelligent Design stack frames is the naturally aligned order of the latter ones, one of many built-in and cost-free (in terms of additional clock cycles) Intelligent Design features.

The second feature of Intelligent Design code is its prologue. It generally starts with the subtraction of the stack frame's size from ESP/RSP. The nops between the sub and mov instructions are required to feed the other two execution pipes while the content of ESP/RSP is set to the bottom of our stack frame. It is important to keep the remaining pipes busy for exactly this one clock cycle, because we need ESP/RSP to address the stack elements where our registers are stored. Inserting two nops keeps all pipes busy. The simultaneous flow in all execution pipes is not interrupted, so up to three instructions are executed at any time. Hint: Both nops should be seen as placeholder for more useful instructions. For example, we could replace them with prefetch n instructions to preload memory areas to the L1 cache, so we can access these areas immediately after saving all used registers.

Step two: Every Intelligent Design prologue saves all used registers. These are EBX, ECX, EDX, EDI, ESI, EBP and XMM4 through XMM15. The return address to the calling function is written to the topmost element in our stack frame while the call _MyFunction instruction in our callers code is executed. Storing the content of EIP/RIP in this element automatically loads the topmost cache line of the new stack frame into the L1 cache, so this cache line should be present at the time we start to save our registers there. Writing to L1 cache is the fastest way to store data, writing data in ascending order to continuous locations speeds up execution further, because a mechanism called write combining is forced, allowing to write an entire cache line in one gulp. Therefore, we store registers in ascending order to speed up our function. As shown on the next page, we try to fill the topmost stack elements up to the return address without gaps to force the processor to use write combining rather than single movs. Applying all these improvements, the conventional prologue is not gaining huge advantages over the Intelligent Design prologue. Even if we only push one half of the registers, the conventional prologue is executed mere two or three clock cycles faster than the Intelligent Design prologue, Moving twice as much registers onto the stack. This small advantage disappears completely if the first parameter must be reloaded, because ECX and EDX were destroyed by a called function. If parameters must be reloaded more than once, the advantage turns into a handicap, losing some clocks with every reload. Conventional techniques do not speed up functions - they slow them down.


Preparation

Let's start with some real work. The standard prologue was modified slightly to suit our needs. It looks like this, now:

.globl _ShowUs
_ShowUs:subl $0xFC,%esp
        movl _BNR,%eax
        nop
        movl %ebp,0xE4(%esp)
        movl %esi,0xE8(%esp)
        movl %edi,0xEC(%esp)
        movl %edx,0xF0(%esp)
        movl %ecx,0xF4(%esp)
        movl %ebx,0xF8(%esp)
        movl MH_SEL(%eax),%esi
        movl CURSEL(%eax),%ebp
        movdqa %xmm6,0xC0(%esp)
        movdqa %xmm7,0xD0(%esp)
        ...



We need eight registers to preload all required parameters for later use. The set of general purpose registers is limited, so two XMM registers are used to fill the gap (nice to have them). We just need the lowest 32 bit of XMM6 and XMM7, but the used movd instruction clears all bits above (32 through 127), as well. Whenever the contents of XMM4 through XMM15 change, these registers must be saved initially and restored on exit (like all general purpose registers except EAX). Any of them might hold preloaded parameters of another function, so it were a bad idea to overwrite them with garbage, because this might cause fatal malfunctions or crashes whenever we return to that function. Keep in mind that FP instructions are executed in a separate unit, so both movdqa instructions are executed simultaneously with the six mov instructions. As a matter of fact - they do not cost any additional clock cycle!

Working with ST-Open's libraries, the base address of the global variables (_BNR or _GVAR) is required in most cases. This datafield is automatically loaded by _LDinit and is available for the entire runtime of the program. If it is shut down, _LDexit is called during termination and the permanent portion of the datafield (4,096 byte = 1,024 variables) is stored before the datafield is released. In ST-Open slang, this datafield is called 'SystemNumerics'. It allows to bypass the stack by storing frequently used parameters at defined locations, so we save some time for passing those parameters on the stack.

To load the eight parameters, some tricks requiring basic knowledge about ST-Open's database engine are used. A less informed programmer (or a compiler) had to preload parameter for parameter with repeated calls:

        ...
        movl %esi,0x00(%esp)
        movl %ebp,0x04(%esp)
        movl $0x07,0x08(%esp)
        movl $0x07,0x0C(%esp)
        call _FDacc
        movl %eax,%esi
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%edi
        decl 0x08(%esp)
        movl $0x01,0x0C(%esp)
        call _FDacc
        movl %eax,0xBC(%esp)
        decl 0x08(%esp)
        call _FDacc
        movl %eax,0xB8(%esp)
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%edx
        movd 0xB8(%esp),%xmm6
        movd 0xBC(%esp),%xmm7
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%ecx
        decl 0x08(%esp)
        call _FDacc
        movl %eax,%ebx
        leal 0x20(%esp),%ebp
        decl 0x08(%esp)
        call _FDacc
        ...

Compared to conventional code, where all 6 parameters for _FDacc() repeatedly werepushed onto the stack for each call, the improvements introduced by Intelligent Design still keep the above code dense and fast, but this is not the most elegant solution.

Recapitulating some knowledge about ST-Open's database engine and ST-Open's Loader, there is a much faster way to load parameters stored in a datafield. Okay, we start with a delay because we load an address into ESI which is required for the next and all following load operations. But this still is faster than a single call, not to talk of the eight calls performed by the above code. The advantage of the below code is obvious - its only caveat is the distance of 4,096 byte between any of the accessed memory locations. It is not very likely that all these areas are present in the L1 cache - we should assume long delays, because all data surely are taken from main memory. But that is true for all other ways to read these data, as well - the below code is the fastest possible way to load our parameters within the given premises. It were possible to organise the field in blocks rather than in sequential subfields, so all data of a dataset could be kept in a cache line - but this is out of the frame of our exercise.

A field's MemoryHandle is a pointer into the Loader table _BNR, where all parameters of a loaded datafield are stored. The first parameter at 00[MemHandle] is the real address (EA) of the allocated memory block. Loading this address into ESI, we now can access all required parameters by using the appropriate offset to ESI. Because we do not know, which entry we have to read next, we access them using the indexed, indirect addressing mode provided by the instruction set of any x86 processor. This is a simple exercise, because all subfields have a distance of exactly 0x1000 from each other. Oh, there are two more complex actions, of course. String addresses are stored as offsets to the field's base address, so we have to preload two registers with this address prior to adding the offset of the string to them. Following the definition, empty strings are stored with an offset of zero. Because the first 32 bit of each datafield are reset by default, 00[EA] always holds an empty string...

        ...
        movl 0x00(%esi),%esi
        movl %esi,%edi
        movl 0x0100(%esi, %ebp, 4),%eax
        movl 0x1100(%esi, %ebp, 4),%ebx
        movl 0x2100(%esi, %ebp, 4),%ecx
        movl 0x3100(%esi, %ebp, 4),%edx
        movd 0x4100(%esi, %ebp, 4),%xmm6
        movd 0x5100(%esi, %ebp, 4),%xmm7
        addl 0x6100(%esi, %ebp, 4),%edi
        addl 0x7100(%esi, %ebp, 4),%esi
        leal 0x20(%esp),%ebp
        ...

Collecting similar tasks to groups and executing them one after the other is a good way to speed up a function markably. It often saves a lot of redundant operations, because we can mov a parameter onto the stack, once, then perform multiple calls to one and the same function repeatedly without having to pass this parameter over and over again. As shown in the first code snippet (the less elegant try to solve our problem), only two out of six parameters are updated between seven out of eight _FDacc() calls. Three out of six parameters are static, staying unchanged for all eight calls - it is a waste of time to push all six parameters onto the stack eight times as practised in conventional code.

Back to the last line of the improved source code. While EBP was uses as index for eight mov instructions, this index is not required any longer after the last parameter was loaded to ESI. It's recommended to preload all required parameters as soon as possible to keep an appropriate distance from instructions accessing them. Therefore, we preload the address 20[ESP] into EBP immediately after its previous content is not required any longer. EBP now holds the address of the buffer where the converted numeric values of our parameters are stored as strings. This speeds up our function a little bit - simultaneouos execution flow in all three pipes is supported if our source code provides intelligently distributed instructions.



Hexadecimal Numbers

ST-Open's libraries provide some functions to convert numeric data into hexadecimal strings. D2str() is the right choice to convert doublewords into formatted strings, the format is 'nnnn nnnn'. D2str() awaits two parameters: The number to convert and the address of the buffer where the output string shall be stored.

        ...
        movl %eax,0x00(%esp)
        movl %ebp,0x04(%esp)
        call _D2str
        movl %ebx,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2str
        movl %ecx,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2str
        movl %edx,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2str
        ...

After converting the four numbers into hexadecimal strings, the stack now looks like this:



This is trivial code, so there is no need for lengthy descriptions. There's only one thing deserving some words: Have a look at the three lines addl $0x10,0x04(%esp) in the above code snippet. These instructions have a latency of four clocks, each. We could replace them with sequences like addl $0x10,%ebp / movl %ebp,0x04(%esp). The replacements have a latency of four clock cycles, as well, but the mov instruction has to wait until the content of EBP was updated. It is obvious that the alternative method interrupts simultaneous execution and adds some bloat to our source code. Another negative side-effect is the changed content of EBP. We need the address of the first string later on, so we had to subtract all added values before we could use EBP to address the output strings. A general rule of thumb: If something can be done with one instruction, we should not split it up into multiple instructions doing the same job.



Decimal Numbers

To convert a doubleword into a formatted decimal string, we use the function D2dec() provided by ST-Open's libraries. This function awaits a doubleword to convert, the address of a buffer where the output shall be stored and a doubleword defining the output format. Only the lower two bytes of this doubleword are recognised. The used code is 0000ffii, where ii is the amount of integer and ff the amount of pseudo floating point digits. Beware - this is just a char inserted somewhere in a string, not a real floating point! We do not need a floating point, so the second byte is set to zero. The integer digits should cover the full range between 0 and 4 294 967 295, so the appropriate value is ten. Because D2dec() treats a formatting dword with the value of zero as 0x0000000A, we either may pass zero or ten as third parameter.

        ...
        movd %xmm6,0x00(%esp)
        addl $0x10,0x04(%esp)
        movl $0x0A,0x08(%esp)
        call _D2dec
        movl 0x0100(%esp),%edx
        movd %xmm7,0x00(%esp)
        addl $0x10,0x04(%esp)
        call _D2dec
        ...

This is trivial code, so we skip lengthy explanations. EDX is loaded between both calls, because we need the WindowHandle (HWND) later on to write our strings to the corresponding entry fields. Preloading registers some instructions before they are used supports parallel execution in all three pipes and we should place these loads soon after the content of one of our registers is not required any longer. The more practised, the faster the resulting code.

Well, our stack looks like this after all conversions are done:




Strings

To avoid redundant copy operations, the addresses of both strings were stored in EDI and ESI while we preloaded our eight parameters. It does not make sense to copy a string from memory to the stack, first, if we have to copy it from the stack to another memory location later on. Passing strings always means to pass the address where the first byte of the string can be found - it is a waste of time to copy a string to a buffer instead of just passing the address where this string actually is stored.

At this point, all conversions are done and we can start to fill those entry fields with some data. Because our strings are passed via their address, the stack remains untouched.



Setting The Entry Fields

All data are converted and accessible, now. It's time to fill our eight entry fields with the corresponding strings:

        ...
        movl %edx,0x00(%esp)
        movl $0x1200,0x04(%esp)
        movl %ebp,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        addl $0x10,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        movl %edi,0x08(%esp)
        call _SEf
        incl 0x04(%esp)
        movl %esi,0x08(%esp)
        call _SEf
        ...


This is the most dense code. It were possible to replace all incl 0x04(%esp) with movl $0x120*,0x04(%esp) instructions, where the asterisk '*' stands for the current ID. This alternative does not speed up execution, but blows up our code by 32 byte. Why that? Well, each addl $0x10,0x08(%esp) has a latency of four clock cycles. Pairing a three clocks with a four clocks latency does not result in higher speed. All three pipes are engaged for four clock cycles and the last pipe is idling all of the time. If we replace the first instruction with the alternative instruction, the first pipe - together with the last pipe - is idling for one clock cycle. Hence, we win nothing than 32 byte additional code.

With this step, all tasks of our exercise are done, only the epilogue is missing.


Epilogue

Before we return to the calling function, we restore all used registers. This includes the garbage pile of conventional code (ECX and EDX) as well as XMM4-XMM15. When all registers contain what they contained before our prologue was executed, we might want to set EAX to a specific return value, if the function declaration says so. Whether we set EAX or not, the next step is to add the same value we subtracted with the very first instruction in our function's prologue. ESP should point to the address of the instruction following call _ShowUs, now. After executing the final ret, EIP should contain that address and the processor should continue to execute the code of the function calling ShowUs().


        ...
        movl 0xC0(%esp),%xmm6
        movl 0xD0(%esp),%xmm7
        movl 0xE4(%esp),%ebp
        movl 0xE8(%esp),%esi
        movl 0xEC(%esp),%edi
        movl 0xF0(%esp),%edx
        movl 0xF4(%esp),%ecx
        movl 0xF8(%esp),%ebx
        addl $0xFC,%esp
        xorl %eax,%eax
        ret



With the correction of the stack pointer ESP, all stack elements below the address stored in ESP are 'no-man's-land', again. The next function is free to reserve some byte for its private stack frame as required.


Notes

It still might not be clear, why grouping of similar tasks is much better than doing everything step by step, as we are taught by some gurus preaching the old fashioned methods and techniques. The advantages of Intelligent Design are based on a lot of small improvements. Each of them supports the other. In the end, all of them sum up to a new quality, making as much as possible use of the full power of modern processors.

As a matter of fact, most parameters never change in repetitive calls to one and the same function. Hence, we can save a lot of redundant writes if we just mov changing parameters to their proper position rather than push all parameters onto the stack over and over, again. In our example, parameter 1 is a MemoryHandle for the first call, the address of a buffer for the second call and a WindowHandle for the third call. Hence, we had to push these changing parameters onto the stack eight times. Parameters 2 and 3 also change with every call. We finally had to  perform 6*8 + 4*2 + 2*3 + 3*8 = 86 push instructions to solve the given problem. If we count all mov instructions in the listed example code, we get a result of 38 - less than one half of the instructions required for conventional code. But: That's just counting instructions. Having a closer look at them, there is much more than the pure count. Analysing both versions, conventional code is not as efficient as Intelligent Design. Conventional techniques not only prevent a processor to execute multiple instructions simultaneously, because they create many avoidable dependencies. They also create obstacles by default, because their rules stipulate us to omit important things like preserving ECX and EDX. Putting it all together, conventional programming techniques are as outdated as ox carts, while Intelligent Design is the programming technique of the 21st century.

Go to the next post (10 - Summary)


No comments:

Post a Comment