Tuesday, April 13, 2010

08 - Intelligent Design

If we put all discussed aspects together, we can state that conventional programming techniques ignore improvements introduced with modern processor design completely. Well, conclusions of our analysis suggested to replace all leave, pop and push with mov instructions, because three of them can be executed simultaneously. The advantages of these replacements are a static stack pointer, an additional general purpose register (EBP), a naturally aligned stack (if supported by the operating system), read and write access to any stack element via ESP, and much more. Based on the facts we worked out until now, a concept was developed and tested for usability in everday's applications.

Because we use no base pointer any longer, the stack is addressed via ESP, only. As shown before, we can reserve a stack area for our private use by subtracting the required size from the stack pointer ESP. The reserved area is safe from being used by called functions (except those compiled by GCC 3.3.5.). Let us recapitulate how the stack looks like after a function is called:



The current stack element holds the return address - the address of the instruction following the call in the calling function. If the calling function has to pass parameters to our funktion, they are stored above the current stack element in ascending order. Because we want to use mov instructions, only, a sufficient stack area must be reserved where we can save used registers, store local variables and pass parameters to called functions. Working with a static stack pointer, we just have to add the required sizes of these three parts (registers, local area and parameters), then subtract the sum from ESP:



After this subtraction, the content of ESP does not change until our function is terminated. Therefore, ESP can be used to address any element within our current stack frame. Elements outside our private area are taboo - you may read them whenever you want, but you must not write to them under any circumstances!


Calculating The Required Size

The size of our stack frame depends on the amount of registers we have to save, the size for our local variables and the amount of parameters we have to pass.


Registers

The contents of all used registers are stored at the top of our private area. All Intelligent Design functions do save ECX and EDX, as well! The following table shows the required size for 16, 32 and 64 bit functions:


Register16 Bit32 Bit64 Bit
10x020x040x08
20x040x080x10
30x060x0C0x18
40x080x100x20
50x0A0x140x28
60x0C0x180x30
7--0x38
8--0x40
9--0x48
10--0x50
11--0x58
12--0x60
13--0x68
14--0x70
15--0x78


If no registers are used, the required size is zero, of course. Rows 7 through 15 are not available in 16 and 32 bit mode, because these modes do not recognise the extended 64 bit register set. Eventually, you might need to save some MMX or XMM registers. Just add 8 for each MMX and 16 for each XMM register. It is recommended to avoid using MMX registers, because they internally are mirrored on the original FPU registers ST(0) through ST(7). Older software libraries and operating systems do not know anything about MMX registers, so you might encounter weird problems if you do not write an explicit emms or femms before you call external functions. Well, emms and femms generally destroy the contents of all MMX registers, so it is a good idea to use XMM instead of MMX registers.


Local Variables

Between saved registers at the top of our stack and the area where parameters for called functions are stored, we have to reserve the space required for local data (variables, structures, strings). If the size is no multiple of the standard data size, it is recommended to expand the required size to the next higher multiple of the standard data size. It's very important to keep the stack aligned, because odd addresses in ESP trigger penalty cycles for every unaligned memory access. If you don't keep care and add an even size in your epilogue, the program definitely will crash, because EIP probably is set to a faulty return address.

If you want to store strings on the stack, it is recommended to define a sufficient size, so the function reading characters from the input device has no chance to overwrite parameters or register contents stored in the stack elements above.


Parameters

Parameters are stored at the bottom of our stack frame. They are moved to the stack elements in ascending order, starting at 00[ESP]. Because we mov parameters to the corresponding stack elements rather than to push them any longer, we should reserve an area equal to the size required by the function with the largest amount of parameters. If we have three functions awaiting three parameters and one function awaiting five parameters, we need a size of 20 byte (32 bit code) to pass five parameters - the three parameters for the other three functions automatically fit into the reserved 20 byte. The required size can be taken from the above table for registers. If more than 6 (15) parameters must be passed, the size easily can be calculated with the formula parameters * datasize, where data size is 2 (16 bit), 4 (32 bit) or 8 (64 bit). For example, the required size for the 13 parameters we have to pass to WinCreateWindow() is 13 * 4 = 52 byte. The bottom of the corresponding stack frame looks like this:



Watch out: Parameter 1 in figure 08 is identical with the leftmost parameter we are writing within the paranthesis of a C or C++ function. In general, parameters are moved onto the stack in ascending order: Parameter 1 @ 00[ESP], parameter 2 @ 04[ESP] and so on, until the last parameter was moved. If any parameter does not change between two or more consecutive call instructions, it just has to be moved the very first time it is used. For following calls, the parameter already is stored at the right place, so we do not need to store it at the same location, again - it is stored there until we overwrite it!


Adding Sizes
After we determined the required sizes of all three parts (registers, local area, parameters to pass), we calculate their sum, then round the result up to a size ending with 3C, 7C, BC or FC (32 bit mode), respective 38, 78, B8 or F8 (64 bit mode). Applying this trick, our stack frame automatically is aligned to the beginning of the next cache line. Finally, we create our new stack frame by subtracting the rounded up sum from rSP. As described in detail, this subtraction reserves the created stack frame from being overwritten by other functions. Writing data to stack locations outside the current stack frame violates the rules of conventional programming as well as the rulework defined for Intelligent Design.

Setting the first stack element to the beginning of a cache line, we benefit from a bunch of advantages. First, no time consuming workarounds are required to align stack locations to store or load XMM registers. It is just one side effect of the Intelligent Design prologue that rSP is naturally aligned by default without a single line of additional code. Secondly, Intelligent Design code is able to benefit from bandwith and accellerating mechanisms provided by modern processors. At the latest, the first access to any stack element loads the stack frame into the L1 cache, moving all following accesses to the fastest area in the machine's memory hierarchy. Thirdly, writing data to continuous memory locations in ascending order triggers a mechanism called write combining, where up to 16(!) doublewords are stored on the stack in one gulp. Much more advantages like avoiding repetitive stores for never changed parameters come along with this fresh and modern design. Some of them are shown later on, some of them are not revealed, yet...

If all Intelligent Design rules were applied properly, the smallest possible stack frame automatically looks like this in 32 bit code



or like this in 64 bit code


The real size can be varied by adding multiples of 64 to the smallest value of 0x3C (0x38 in 64 bit code) to match your individual requirements perfectly.


Due to the subtraction of a predefined size from rSP, the stack frames of all Intelligent Design functions automatically are set to the beginning of a cache line. In general, functions are called. Looking at the mechanism of the call instruction, the return address is pushed onto the stack before execution is continued with the first instruction of the called function. Hence, rSP points to an address ending with 3C, 7C, BC or FC in 32 bit code, respective 38, 78, B8 or F8 in 64 bit code - exactly the size we have to subtract from rSP.


Destroying The Stack Frame

After our function's code was executed, all registers must be restored to the content they had as they were passed to us. The last step before returning to the calling function is to add the size we subtracted to create our stack frame to rSP. With the final ret, rIP is taken from the stack and execution is continued with the instruction following the call of our function. At this point, our stack looks like this, again:


Because the area between stack bottom and the address currently loaded into rSP is 'no-mans-land', the next function is free to reserve a part of it to save used registers, to store its local variables and structures or to put some parameters for called functions onto the stack. On principle, the content of a stack element is undefined until something was written to it. Writing data into untouched stack elements is called 'initialisation' in C and C++ terminology.


Side Notes About Alignment

Using XMM instructions, most 128 bit stores and loads explicitely expect addresses aligned to 16 byte boundaries, in other words: The last digit of the address must be zero. No existing operating system cares about aligned addresses at all. Using excessive push orgies to put the parameters for called functions onto the stack, the chance to enter that function with an aligned stack pointer is about 25 percent. Because 16 / 4 = 4, the chance ESP ends with 04, 08 or 0C is about 75 percent. It's left to the programmer or compiler to handle this problem properly, so you have to add redundant, time consuming code to align the stack pointer in every function working with XMM instructions.

As shown, Intelligent Design provides some mechanisms to naturally align all stack frames to the beginning of a cache line. But - even the best mechanism is of no use if the operating system does not support it. In most cases, an ID compliant function is called by the message transporting system within the operating system. Whenever a message is sent or posted to our application's message loop, we do not know which instance of which function of the operating system called our message loop. Hence, we do not know how many things were pushed onto our stack nor can we rely on something like an aligned stack pointer. Even if we aligned the stack pointer at the begin of our main() function, it surely is not aligned any longer if the code in our message loop is executed the very first time. Therefore, one of the most important improvents introduced with Intelligent Design is not available as long as we run our code on any existing platform. Because all
 existing operating systems put dirt into our gears, no machine ever will run at full speed.

No application programmer is able to do anything against this counter-productive behaviour of existing operating systems. Either we bow our heads, obey and apply all those pointless work-arounds recommended by processor manufacturers, or we give up programming to get rid of them. Because it is a fact that work-arounds are not very practicable in many cases, there's a third way as an alternative for this 'take it or leave it' choice. Instead of searching for ways to bypass 'built in' restrictions, speed brakes and other obstacles of existing operating systems, it might be a much better idea to throw away these remains of old fashioned software design and create something new. Something supporting innovations and capabilities of modern processors instead of ignoring them. Something making use of the immense computational power and speed of recent quad core machines. Something called IDEOS (the Intelligent Design Easy2Use Operating System)...

Go to the next post (09 - First Steps)


No comments:

Post a Comment