Copyright Note
The programming techniques introduced in this paper are mental property of Bernhard Schornak. They are protected by international copyrights, published under the terms of the FT4FP-License. Any commercial use, trade or other forms of exploitation to gain profit are strictly prohibited. Knowledge is a common property and should be freely available for every human. It must not be abused as a proprietary ware, only available for those who can afford to feed a few greedy individuals with money.
This document was written for the ST-Open homepage in 2006. It was slightly modified for this blog.
Compared against conventional programming techniques, Intelligent Design resembles a quality leap. However, some knowledge about the creation and management of a conventional stack is required to understand the important difference between old fashioned programming techniques and Intelligent Design. To impart the knowledge about conventional methods to the reader, the next pages offer a detailed introduction to stacks, stack frames and how they are managed. Without this knowledge, it probably is impossible to understand the alternative methods and techniques introduced with Intelligent Design. Old fashioned programming techniques never kept pace with recent processors - the standards of so called high level languages are designed to work with the first generation of microprocessors as well as most recent quad-core machines. Unfortunatelly, computational power of processors grew by several powers of ten, while software standards never followed any technical evolution. Today, we have mature high speed processors driven by never grown software toddlers. It is quite counter-productive to slow down high speed devices because their 'drivers' cannot handle most of the controls.
Whenever you compile a program, you pass a definition file with the extension .def to the linker (LINK.EXE or similar). The definition file holds some important information about the program for the session manager of your operating system. While the compiled program is started, the operating system reserves three independent memory blocks (code, data, stack) for the new process and the segment registers CS, DS+ES and SS are set to the address of one of these blocks. Whatever you defined as STACKSIZE in the definition file, exactly that size is allocated for your stack segment. After allocating the required memory blocks for those three segments, the program code is copied to the code segment, all defined global variables are written to the data segment - they are 'initialised' - and rSP is set to the top of the stack segment. Finally, the address of the array with the command line parameters and the argument count are pushed onto the virgin stack before the session manager calls the main() function of our program. Entering main(), the processor starts to execute the code found there, until it stumbles upon the final ret instruction and passes control back to the session manager.
Entering our program's main() function, the stack looks like this:
This document was written for the ST-Open homepage in 2006. It was slightly modified for this blog.
Introduction
Most people probably associate the term Intelligent Design with the movement of Creationism rather than a new, revolutionary programming technique. The usurpation of this term is an intended sidesweep. Whatever invented gods and godesses were able to do - a smart programmer can do it much better. This paper is an introduction to the next generation of programming, superior to old fashioned conventions and programming techniques.Compared against conventional programming techniques, Intelligent Design resembles a quality leap. However, some knowledge about the creation and management of a conventional stack is required to understand the important difference between old fashioned programming techniques and Intelligent Design. To impart the knowledge about conventional methods to the reader, the next pages offer a detailed introduction to stacks, stack frames and how they are managed. Without this knowledge, it probably is impossible to understand the alternative methods and techniques introduced with Intelligent Design. Old fashioned programming techniques never kept pace with recent processors - the standards of so called high level languages are designed to work with the first generation of microprocessors as well as most recent quad-core machines. Unfortunatelly, computational power of processors grew by several powers of ten, while software standards never followed any technical evolution. Today, we have mature high speed processors driven by never grown software toddlers. It is quite counter-productive to slow down high speed devices because their 'drivers' cannot handle most of the controls.
Basics
The assembly language dialect used in this paper is not the one commonly used in the world of Windows, also known as LETNi syntax. It is AT&T syntax, known in the world of Linux and Unix. When I began to write code for the x86 platform in 1993, GCC was the only free development tool one could get, so I had to use AT&T syntax. If you only know LETNi syntax, there is a short introduction to AT&T syntax to learn the difference between the both. If you worked with AS for a short time, you don't want to return to the complicated and perversed (or was that reversed?) LETNi syntax anymore. The programming techniques introduced in this paper do not rely on a specific syntax. However, knowing AT&T syntax might help you to understand the sample code. Reaching the goal is what really counts. How we get there is another, slightly different problem.The Stack
Every x86 processor works as a stack machine. Parameters we have to pass to called functions, return addresses to calling functions, local variables and structures are put onto or read from the stack. Unfortunately, most of today's programmers just are used to click together some of those prefabricated code fragments coming along with the daily 20 GB version upgrade for their favourite VisualXYZ(plus-minus-dotnet) development suite. If you ask any of them 'What is a stack?', they probably tell you something about hay or - of course! - money. Sad, but true. Exceptionally sad, because the mechanisms of a software stack are that simple, one were attempted to call the big idea behind nothing else than 'brilliant'.Whenever you compile a program, you pass a definition file with the extension .def to the linker (LINK.EXE or similar). The definition file holds some important information about the program for the session manager of your operating system. While the compiled program is started, the operating system reserves three independent memory blocks (code, data, stack) for the new process and the segment registers CS, DS+ES and SS are set to the address of one of these blocks. Whatever you defined as STACKSIZE in the definition file, exactly that size is allocated for your stack segment. After allocating the required memory blocks for those three segments, the program code is copied to the code segment, all defined global variables are written to the data segment - they are 'initialised' - and rSP is set to the top of the stack segment. Finally, the address of the array with the command line parameters and the argument count are pushed onto the virgin stack before the session manager calls the main() function of our program. Entering main(), the processor starts to execute the code found there, until it stumbles upon the final ret instruction and passes control back to the session manager.
Entering our program's main() function, the stack looks like this:
The top stack element holds the argument vector argv[] (parameter 2), the next lower stack element holds the argument count argc (parameter 1). The current stack element, the one ESP points to, holds the return address to the session manager. Whenever the program is terminated, the instruction pointer is loaded with this address and the terminating sequence of the session manager is executed. It first frees those resources our program eventually reserved for itself, e.g. allocated memory blocks, open files, open devices, and so on. Finally, it releases the allocated segments and cleans up all structures holding control data of our program.
In the running program, the content of the stack pointer is decreased with every push instruction, the creation of a stack frame or the call to another function. It is increased whenever we pop data from the stack, release a stack frame or return to a calling function. In 32 bit functions, four byte are subtracted from ESP with every push or call, while four byte are added to ESP with every pop or ret. All subtractions or additions are done by the processor automatically, because they are an integral part of the mentioned instructions. Using a picturesque language, we might state: The stack grows with every push or call and shrinks with every pop or ret.
Using The Stack
To make sensible use of the stack, any x86 processor provides the instructions call, enter, leave, pop, push and ret. All these instructions update the content of ESP automatically. Besides these special instructions, ESP can be used like any other register, as well. We are free to add or subtract immediate values or the content of another register to/from the stack pointer and do other funny things with ESP. However, the most utilised action probably is the subtraction of an immediate value from ESP to create a stack frame and the addition of exactly the same value to release that stack frame if we do not need it any longer. A detailed description follows later on, see About Stackframes. For now, we focus our attention on some more basic things like the instructions mentioned above. To understand the concept of conventional programming methods, it is very important to know how these instructions work and how they manipulate the stack and ESP.Call
If the processor encounters a call instruction, it first subtracts two, four or eight (depending on the standard datasize) from ESP, then stores the address of the instruction following the call on the stack. Next, the address passed as a part of the call instruction is loaded into the instruction pointer. Execution now is continued with the instructions found at the new location, until the processor stumbles upon a ret.... call DoNothing # call function DoNothing xorl %eax,%eax # <- the address of this instruction ... # is stored on the stack and loaded ... # into EIP with the RET instruction DoNothing: # local function DoNothing ret # return to caller...
Enter
Please do not use this vector path instruction - it blocks the processor for entire 13 clock cycles. The usual replacement... pushl %ebp # save EBP movl %esp,%ebp # save ESP subl $0x10,%esp # create stack frame ...is executed in four clock cycles. Saving nine clock cycles is a speed improvement of 325 percent. You should prefer the replacement code over enter under any cicumstances! The 16 byte are just a randomly chosen example to keep the sample code valid. The real size you have to subtract from ESP depends on the amount of local variables and other temporary data your function has to store on the stack.
Leave
The leave instruction is equivalent with the following two instructions:... movl %ebp,%esp # restore ESP (DP 1, 2 byte) popl %ebp # restore EBP (VP 4, 1 byte) ...Like pop, leave is a vector path instruction. Because pop blocks the processor for four clock cycles and also has to wait for the valid result of the preceeding mov instruction, leave actually is two clock cycles faster. Moreover, the one byte leave is shorter than its three byte replacement. Hence, you should prefer leave over the alternative method.
Pop
Pop copies the content of the current stack element to a register or memory location, then adds, depending on the processor mode and an optional prefix, two, four or eight to the content of the stack pointer. Unlike the direct path push, pop is a vector path instruction. It is executed in pipes 0 and 1, while pipe 2 is blocked for the time the instruction is processed. Every pop instruction lasts four clock cycles. Pop is a special vector path mov instruction, where ESP automatically is updated after it was used to address the source of a copy operation. In general, pop is used to restore the content of a register. You should keep track of the stack pointer, because it is quite difficult to find errors caused by asymmetrically executed push and pop instructions. Especially, if some of them are inside a loop while others are not.
... popl %ebx # restore EBX [=> ESP + 4(!)] ...
Push
Push first adds, depending on the processor mode and an optional prefix, two, four or eight to the stack pointer, then copies an imediate value, the content of a register or the content of a memory location into the stack element the updated ESP points to. Push is a special direct path mov instruction, where ESP automatically is updated before it is used to address the target of a copy operation. While the entire execution time of each push instruction is 3 clock cycles, ESP is available one clock earlier (after 2 cycles) for the following instructions.
In general, push is used to put parameters or register contents onto the stack. It is a good idea to remove 'used' parameters from the stack as soon as possible, because they decrease the available stack size, cause avoidable stack pointer arithmetics, and so on. To remove them, we do not have to pop them - beware! - from the stack. We just have to add the appropriate amount of byte to the stack pointer. If we pushed two parameters in a 32 bit function as shown in the below example, we add 8 byte to ESP. If it were eight parameters, we had to add 8 * 4 = 32 byte to ESP, and so on.
Regardless of the lazy behavior of HLL (high level language) compilers, it is a good idea to correct the content of ESP after each call instruction if you passed parameters to the called function. Humans have minor problems to keep track of the current content of the stack pointer, especially, if the function body exceeds the size of their display. Compilers can do that much better. However, their output is less optimised than the code written by a human.
... pushl %eax # put parameter 2 onto the stack pushl %ebx # put parameter 1 onto the stack call _helpling # call another function addl $0x08,%esp # correct ESP directly after CALL ...
Ret(urn)
Whenever the processor stumbles upon a ret instruction, it copies the address stored in the current stack element into the instruction pointer EIP, then adds the standard datasize (2, 4 or 8 byte) to the stack pointer. After ESP was updated, execution continues with the instruction found at the address EIP now points to.... ret # return to caller ...
Stack Management
As mentioned in the descriptions of the single instructions, keeping track of the current content of the stack pointer is a must with highest priority. Because passing of parameters generally is done with push instructions, while parameters are taken from the stack by adding the appropriate amount of byte to the stack pointer ESP, it is very important to handle these operations with exceptional care. Especially the necessary corrections of the stack pointer bear some potential to be erroneous. Unfortunately, such errors cause unexpected crashes and malfunctions, and it is quite hard to track them down until the culprit causing the mess is found. This reaches a new dimension with the creation of a stack frame. What is this already mentioned sinister stack frame?Stack Frames
When the session manager passes control to our main() function, we surely create a message queue and a frame window, but we also might play some music or those mega-cool animations which are a must for all recent applications swimming along the main stream. Whatever programs might do when they get control over the computer, the first lines of any function, including main(), always follow the set routine. First: A stack frame is built. Second: All used registers are saved on the stack. Actually, ECX and EDX never are saved because the C-conventions say so (see Conventions). Third: The code found in the function body performs all tasks the function is coded for. Fourth: All saved registers are restored. Fifth: The stack frame is destroyed (released). Sixth: The final ret is executed and the function returns to its caller. One thing should be mentioned explicitely: We only need a stack frame if we have to store local variables or other temporary data structures on the stack. Functions without local variables do not need a stack frame at all. Building a stack frame lasts at least 6 clock cycles and occupies 10 byte of code. To save superfluous activities, you should switch on the fomit-frame-pointer option (GCC) by default. It skips building stack frames where they are not required. Grasping how a base pointer is used and how it works is the key to understand conventional programming techniques. Because this is a very important issue, all explanations are very detailed. Some may find them too lengthy, but we should give others the chance to get in touch with all aspects, so (hopefully) anyone is able to gather the knowledge required to apply the learned stuff in real life.Example
The following example shows how conventional stack frames are created. It is trivial code used in every program. Even huge monster applications like operating systems, browsers, audio studios, et cetera use the same pattern. They only differ in the size of their stack frames.pushl %ebp # save old base pointer movl %esp,%ebp # load new base pointer subl $0x08,%esp # reserve 2 local variables pushl %ebx # save EBX movl 0x08(%ebp),%eax # EAX = argument count movl 0x0C(%ebp),%ebx # EBX = argument vector movl $0x00,-0x04(%ebp) # local variable 1 = 0 movl $0x20,-0x08(%ebp) # local variable 2 = 32 pushl %eax # copy EAX to -0x10(%ebp) pushl %ebx # copy EBX to -0x14(%ebp) ...
The Stack
After execution of the above instructions, the stack looks like this:ESP always points to the current stack element. In our case, this is the address where the content of EBX was copied to. Depending on the instructions in the function body, ESP moves down towards stack bottom or back towards stack top. In properly designed functions, the content of ESP never can be greater than or equal to the content of EBP. This tells us that the base pointer not only is used to address stack elements. It also marks the border between the current stack frame and the stack frame of the calling function.
pushl %ebp
The first step to create a new stack frame is saving the content of the old base pointer of the calling function on the stack. The content of the old base pointer must not be changed under any circumstances! Otherwise, the calling function uses an invalid basis to address its local variables. It will crash if it tries to read or write data to an address stored in a local variable, because that address, formerly stored at -0x0C[EBP], now might be found at 0x3C[EBP]. At the latest, the calling function commits suicide while trying to return to its caller. It fills the base pointer with random data, then loads other random data into the instruction pointer. The attempt to execute that 'code' raises one of those exceptions and an 'access violation' is reported on the user's screen.movl %esp,%ebp
In the second step, we copy the address of the current stack element to EBP. The new base pointer now contains the address, where the content of the old base pointer is stored. The next stack element below the new base pointer is the topmost stack element we are allowed to write to. All stack elements above EBP - including 0x00[EBP]! - only should be used to read from, but never to write to! From here on, we use the base pointer to address local variables and parameters passed to our function. In other words, these data are addressed with offsets to the 'frozen' base pointer. As shown in figure 03, the old base pointer is stored at offset zero to the current base pointer. We write this as 0x00[EBP], where the 0x tells us this is a hexadecimal number. Everything related to programming should be written in hexadecimal notation. It is much easier to read, numbers are shorter and always formatted correctly. For example, 0xF100 tells us at first sight: 'It's the second page in memory block 0xF000.' Its decimal equivalent 61696 does not tell us anything. We have to start a calculator to translate it into 0xF100, wasting precious time. Back to the base pointer. Passed parameters and the return address to the calling function are stored in stack elements above the one the base pointer points to. Therefore, they are addressed with positive offsets. Our return address is stored at 0x04[EBP], followed by one or more optional parameters, where parameter 1 always is stored at 0x08[EBP], parameter 2 at 0x0C[EBP], and so on. The 'top down' order is caused by a C-convention, saying that all parameters are pushed onto the stack back to forth, starting with the last parameter the compiler finds inside those round brackets following the name of the called function.subl $0x08,%esp
The real creation of a stack frame is done with the third and last step. First, we calculate the required size for all variables and structures, then we subtract the result from ESP. Of course, the result must be rounded up to the next multiple of the standard datasize, before we subtract it from the stack pointer. If we subtracted an 'odd' number, the stack became corrupted and was not valid any longer. Everything addressed via ESP, e.g. call or push, was misaligned, causing a lot of penalty cycles. If ESP accidentally was set to 0xFFE2 instead of 0xFFE4 (e.g. by subtracting 0x0A rather than the proper value 0x08) in our sample code, then we push two local variables 0x89ABCDEF and 0x01234567 onto the stack and copy variable 2 from -0x08[EBP] to EBX, EBX finally contained the invalid number 0xCDEF0123 - the doubleword currently stored at address 0xFFE4. With this subtraction, we reserve the subtracted amount of byte on the stack. This reserved area is safe from being overwritten by any following call or push instruction, because ESP always is set to the next lower stack element before a write operation. Because all local variables are lying below the stack element where the old base pointer is stored, they are addressed with negative offsets. The first is stored at address -0x04[EBP], the 2nd at -0x08[EBP], and so on. It neither matters how you name your local variables nor if you count them from top to bottom or vice versa. The only important thing is to remember where which of them is stored.pushl %ebx
Saves the content of EBX on the stack. Following the C-conventions, the content of EBX, EDI and ESI must be saved before overwriting them and must be restored before returning to the calling function. In other words: The content of these registers must be preserved by all functions (see Conventions).Urgent Needs
In conventional code, the content of EBP never must be changed. If you cannot avoid to use EBP for general purposes, because you urgently need an extra register, you have to save it before using it. You should restore EBP immediately after passing the bottleneck. Not worth mentioning, but, nevertheless: EBP cannot be used to adress local variables while it holds your private data...The Function Body
ESP always points to the address of the current stack element. Looking at our example, it is quite obvious that we are going to call another function. Our local variable 1 might be a counter which is incremented each time the called function returns TRUE. Local variable 2 might be a loop counter, so the function might be called 32 times, counting how many times a specific condition was met.Stack Frame Destruction
When all instructions in the function body are executed, we first have to restore the saved registers. Next, we have to destroy the stack frame to release the area we reserved for our private use. Finally, we return to the calling function. There are two possible ways to perform these tasks:... addl $0x08,%esp # correction after 2 PUSH instructions popl %ebx # restore EBX leave # destroy stack frame (VP 3, 1 byte) ret # return to calleror
... addl $0x08,%esp # correction after 2 PUSH instructions popl %ebx # restore EBX movl %ebp,%esp # restore ESP (DP 1, 2 byte) popl %ebp # restore EBP (VP 4, 1 byte) ret # return to callerBoth variants restore ESP and EBP. However, leave is two clock cycles faster and two byte smaller. The correction of ESP is required to restore the preserved content of EBX. Because we pushed two parameters onto the stack after we pushed EBX, ESP is eight byte below the address where the content of EBX is stored. To POP the proper content into EBX, we have to add these eight byte to ESP. If no registers were saved on the stack, no correction of ESP is required and leave and ret are the only instructions we need to restore ESP and EBP before we return to the caller.
Return To Caller
In our example, ret copies the return address to the session manager of the operating system to EIP and continues execution of its code. The session manager closes our program, frees still allocated resources and passes the content we stored in EAX to the command interpreter.What Did We Learn?
All introduced mechanisms are, except the size of the stack frame and the amount of preserved registers, identical for all functions following the C-conventions. As far as I know, all of the known operating systems follow these coventions. The advantages are obvious. Once coded functions are portable from one version to the next or from one operating system to another. In the latter case, minor changes must be applied if functions of the target OS have other names or await parameters in different order. Putting it all together, conventions simplify re-use and porting of existing code. But - every object in our universe has at least two (or more) sides. The disadvantages of the introduced methods, mechanisms and conventions are not as obvious as the advantages, because they are hidden in the deepest and darkest parts of the machine room. Only experienced, well trained technicians are able to find out why the machine chokes and does not run as fast as expected. Time to analyse what C-conventions do on assembly language level.Caveats
As mentioned before, conventions, methods and techniques introduced with the programming language C have a lot of advantages over monolithic code we used to write applications for DOS and other ancient operating systems. All functions easily are portable to other operating systems or processor architectures and can be used multiple times, leading to versatile applications for multiple platforms. Unfortunately, after the C conventions were established, they never were updated or revised to keep pace with the development of processor architectures. If we compare an 8086 against a recent quad-core Athlon, a picturesque comparison were a cart dragged by a tired ox versus an Airbus A380. While no person with sane mind ever wasted one thought about equipping an A380 with a harness and motivate it to lift off with reins, whip and loud 'hee' and 'ho' shouts, software designers all over the world practise such weird things every day. They still create 'flying' stack frames, use slow leave, pop and push instructions with the obligatory update of the stack pointer instead of the faster mov and continue to abuse valuable resources by using EBP as base pointer. This reduces the too small register set of the x86 architecture by 1/7th, forcing the programmer to use slow memory reads and writes instead of much faster register operations.To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. DLGtxt() sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.
.text .p2align 4,,15 .globl MoveDlg MoveDlg:pushl %ebp movl %esp, %ebp pushl %esi pushl %ebx movl 12(%ebp), %edx movl 8(%ebp), %esi movl 16(%ebp), %ecx movl 20(%ebp), %ebx cmpl $48, %edx je L12 ja L39 cmpl $32, %edx je L4 L37:movl %ebx, 20(%ebp) movl %ecx, 16(%ebp) movl %edx, 12(%ebp) movl %esi, 8(%ebp) leal -8(%ebp), %esp popl %ebx popl %esi popl %ebp jmp _DefDPCompared against GCC 2.6.1. (1993), GCC 3.3.5. (2006) generates much worse code, spiced with a lot of counterproductive, superfluous instructions. Writing back parameters to the stack violates some basic rules. First, there's absolutely no need to write data back to the locations we took them from some instructions before. Secondly, writes to stack locations above ESP violate all rules of proper programming. If any function starts to write data to the stack frames of other functions, it is just a question of time until we encounter desastrous malfunctions.
.p2align 4,,7 L4:movl %ecx, %eax andl $65535, %eax cmpl $4658, %eax je L7 jg L11 cmpl $4657, %eax je L6Because we only need the low word stored in message parameter 1, it was a good idea to extract this lower word with the instruction movzwl 0x10(%ebp),%ecx rather than to waste valuable clock cycles with the code sequence shown above. Recent processors have three, not just one execution pipe(s). The choosen way sends two execution pipes to sleep while one pipe is busy to extract data from a register. This is repeated two times. We could switch off two execution pipes while this code is executed, because we created two avoidable dependencies.
L9:movl %ebx, 20(%ebp) movl %ecx, 16(%ebp) movl %edx, 12(%ebp) movl %esi, 8(%ebp) leal -8(%ebp), %esp popl %ebx popl %esi popl %ebp jmp _DefDPObviously, L37 and L9 provide identical code. Using our brain, these eighteen redundant (potentially pointless) lines could be reduced to five really necessary instructions.
L6:movl _GVAR, %eax # [1] subl $12, %esp movl $0, 10464(%eax) L42:pushl %esi call _WinDD L41:addl $16, %esp .p2align 4,,7Up to seven nops are executed every time we branch to this part of code. It is a good way to slow down execution flow as much as possible.
L2:leal -8(%ebp), %esp xorl %eax, %eax popl %ebx popl %esi popl %ebp ret L11:cmpl $4659, %eax jne L9 subl $12, %esp pushl $17 call _Help jmp L41 L7:pushl %edx pushl %edx pushl $18 movl _GVAR, %eax # [1] addl $10464, %eax pushl %eax call _FlgS popl %eax jmp L42The branch prediction logic assumes every first branch as false if there is no entry in its internal table. Almost all of the above branches trigger the obligatory ten penalty cycles, because the wrong branch instructions were chosen.
.p2align 4,,7 L39:cmpl $59, %edx jne L37 pushl $233 pushl $211 pushl $210 pushl %esi call _DLGtxt movl $0, (%esp) pushl $-1 pushl $288 pushl $4672 pushl %esi call _SnDIM addl $20, %esp pushl $0 pushl $-1 pushl $288 pushl $4680 pushl %esi call _SnDIM addl $20, %esp pushl $0 pushl $-1 pushl $288 pushl $4688 pushl %esi call _SnDIM addl $20, %esp pushl $0 pushl $-1 pushl $288 pushl $4696 pushl %esi call _SnDIM addl $24, %espOnly the second parameter changes for the four consecutive calls of SnDIM(). Twelve of these twenty push instructions (60 percent) are redundant.
pushl %esi pushl $0 pushl $2 pushl $0 pushl $11 movl _GVAR, %eax # [1] movl 7252(%eax), %eax pushl %eax call _FDacc movl _GVAR, %eax # [1] addl $24, %esp movl 472(%eax), %ebx pushl %ebx pushl $0 pushl $2 pushl $4 pushl $11 movl 7252(%eax), %ecx pushl %ecx call _FDaccOnly parameters 3 and 6 change for the both calls of FDacc(). While we are bound to push instructions, there is no way to change just these parameters. We have to push all six parameters, again, because the parameter on top must be changed.
addl $20, %esp movl _GVAR, %eax # [1] movl $0, 10464(%eax) pushl %esi call _CtrWn movl %esi, (%esp) call _DlgShow jmp L41 .p2align 4,,7 L12:movl %ecx, %eax andl $65535, %eax subl $4672, %eax cmpl $30, %eax ja L37 jmp *L36(,%eax,4)Again, it were better to extract the low word via one movzwl 0x10(%ebp),%ecx rather than to use the chosen way. Probably, parallel execution was considered to be too fast?
.p2align 2 .align 2,0xccI don't know what the second .align is good for. Any hints? My jump table is too large, because C programmers do not think about side effects like blowing up code while assigning resource IDs to 'straight' numbers. Due to the gaps between those IDs, there are a lot of superfluous entries in this jump table.
L36:.long L14 .long L15 .long L16 .long L17 .long L2 .long L37 .long L37 .long L37 .long L18 .long L19 .long L37 .long L37 .long L37 .long L37 .long L37 .long L37 .long L21 .long L23 .long L25 .long L27 .long L29 .long L31 .long L33 .long L37 .long L21 .long L23 .long L25 .long L27 .long L29 .long L31 .long L33By the way: Jump tables belong to the .data, not to the .code segment. AS supports jump tables in the .data segment, so there's no need to violate the recommendations of AMD and LETNi as all versions of GCC do...
L14:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andb $225, %ah orb $16, %ah .p2align 4,,7 L40:movl %eax, 10464(%edx) jmp L2 L15:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andb $225, %ah orb $8, %ah jmp L40 L16:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andb $225, %ah orb $4, %ah jmp L40 L17:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andb $225, %ah orb $2, %ah jmp L40 L18:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-385, %eax orb $1, %ah jmp L40 L19:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-385, %eax orb $-128, %al jmp L40 L21:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-128, %eax orl $64, %eax jmp L40 L23:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-384, %eax orl $32, %eax jmp L40 L25:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-384, %eax orl $16, %eax jmp L40 L27:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-384, %eax orl $8, %eax jmp L40 L29:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-384, %eax orl $4, %eax jmp L40 L31:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-384, %eax orl $2, %eax jmp L40 L33:movl _GVAR, %edx # [1], [2] movl 10464(%edx), %eax andl $-384, %eax orl $1, %eax jmp L40Many redundant instructions unnecessarily blow up code size. The first two lines of all jump targets could be reduced to a common read before the distributor branches to the selected table entry.
.comm _hab, 16 .comm _DEBUG, 16 .comm _USE_LDF, 16 .comm _LDR_AVAIL, 16 .comm _MSGLD, 16 .comm _BMM, 16 .comm _BNR, 16 .comm _GVAR, 16 .comm _BST, 16 .comm _BBF, 16 .comm _TST, 16 .comm _MHSTR, 16 .comm _LDF, 16 .comm _DUMPLINE, 16 .comm _DUMPCNT, 16 .comm _OLH_MODE, 16 .comm _SEC, 16 .comm _XXX, 16 .comm _FLD_XXX, 16 .comm _FLD_SEC, 16We only use _GVAR in this file - enumerating all global variables is quite stupid. All globals are defined as doublewords (32 bit), so we might ask why GCC expands all these variables to a size of 128 bit, filling up the .data segment with garbage.
Footnotes
[1] This is the typical side effect of the C convention saying 'Thou shalt not save ECX and EDX'. Using two of six registers as a sanitary fill for temporary data, we reduce the set of safe registers to three, forcing us to reload frequently required parameters from memory over and over again. Reloading parameters more than two times eats up the advantage of omitting the two push and pop instructions to save and restore the content of ECX and EDX. After the third reload operation, we are on the losing side. Adding unnecessary loads of parameters to our code slows down our functions and does not speed them up.[2] Loading _GVAR into EDX and the flags MVP_FLAGS into EAX could be done in L12. Applying some brain didn't just save 24 lines of code, it also sped up the 13 target functions, because loading and evaluating MVP_FLAGS were separated and the dependency chain were reduced to two single rather than three consecutive dependencies.
Improvements
Obviously, the code generated by GCC 3.3.5. is anything else than optimised. Even if you do not know anything about reading source code, you surely are able to grasp what those comments say. This document is not an introduction to programming, so you have to rely on my words, but you can be sure: I definitely know what I am talking about. Rearranging some parts and reducing code sequences to really required instructions does shrink GCC's draft markably. Applying some human brain, the remaining (optimised) code should run about 30 percent faster now..data .p2align 4,0x00 jt0:.long L02 .long L03 .long L04 .long L05 .long L16 .long L15 .long L15 .long L15 .long L06 .long L07 .long L15 .long L15 .long L15 .long L15 .long L15 .long L15 .long L08 .long L09 .long L10 .long L11 .long L12 .long L13 .long L14 .long L15 .long L08 .long L09 .long L10 .long L11 .long L12 .long L13 .long L14Following recommendations of AMD and LETNi, the jump table is moved to the .data segment. This is much better than mixing code and data in the .code segment.
.text .p2align 4,,7 .globl MoveDlg MoveDlg:pushl %ebp movl %esp,%ebp pushl %edi pushl %esi movl 0x08(%ebp),%edi movl 0x0C(%ebp),%eax movzwl 0x10(%ebp),%ecx movl _GVAR,%esi cmpl $0x30,%eax je L01 cmpl $0x20,%eax je L00 cmpl $0x3B,%eax jne L15The distributor was optimised for the branch prediction logic. WM_CONTROL was put on top of the distributor, because most sent messages are WM_CONTROL messages. WM_COMMAND only is sent if the user pushes a button. No user is able to recognise delays of about 5 ns, so we can live with a ten cycles penalty if a branch target is misprediced. WM_INITDLG is sent only once. While the 1st comparison is 'guessed' as not taken, the branch does not trigger a penalty. The 2nd and all following comparisons are assumed to be taken, so the branch to the default routine (DefDP()) does not trigger penalties, as well.
pushl $0xE9 pushl $0xD3 pushl $0xD2 pushl %edi call _DLGtxt pushl $0x00 pushl $-0x01 pushl $0x0120 pushl $0x1240 pushl %edi call _SnDIM addl $0x08,%esp pushl $0x1248 pushl %edi call _SnDIM addl $0x08,%esp pushl $0x1250 pushl %edi call _SnDIM addl $0x08,%esp pushl $0x1258 pushl %edi call _SnDIMThe three parameters on top are pushed for the first call, only. This saves nine redundant instructions.
addl $0x14,%esp movl 0x1C54(%esi),%ecx movl 0x01D8(%esi),%edxBoth parameters can be preloaded at this point, because FDacc() is a function taken from ST-Open's library. Functions in my libraries restore all registers (including ECX and EDX) by default - they are 'clean'. But - watch out: MoveDlg() is a function following the C conventions. ECX and EDX neither are saved nor restored - MoveDlg() is a 'dirty' function.
pushl %edi pushl $0x00 pushl $0x02 pushl $0x00 pushl $0x0B pushl %ecx call _FDacc pushl %edx pushl $0x00 pushl $0x02 pushl $0x04 pushl $0x0B pushl %ecx call _FDacc addl $0x36,%esp movl $0x00,0x28E0(%esi) pushl %edi call _CtrWn call _DlgShow jmp 3fThe next distributor was optimised for code reduction. Because none of the three buttons is pushed more than once (in general..), we can live with a ten cycles penalty for one or two mispredicted branch(es). The delay added by the penalty is at least six powers of ten faster than anything human senses could perceive.
L00:subl $0x1231,%ecx je 0f decl %ecx je 1f decl %ecx jne L15 pushl $0x11 call _Help jmp 3f 0:movl $0x00,0x28E0(%esi) jmp 2f 1:orl $0x00040000,0x28E0(%esi) 2:pushl %edi call _WinDD 3:addl $0x04,%esp jmp L16 L01:subl $0x1240,%ecx js L15 cmpl $0x1E,%ecx ja L15 movl 0x28E0(%esi),%eax jmp *jt0(, %ecx, 4)The jump table was moved to the .data segment. To keep an overwiew, your symbols for jump tables generally should be marked with special names. ST-Open uses the symbol 'jtX', where X is the number of the current jump table.
L02:andl $0xFFFFE1FF,%eax orl $0x1000,%eax jmp 0f L03:andl $0xFFFFE1FF,%eax orl $0x0800,%eax jmp 0f L04:andl $0xFFFFE1FF,%eax orl $0x0400,%eax jmp 0f L05:andl $0xFFFFE1FF,%eax orl $0x0200,%eax jmp 0f L06:andl $0xFFFFFE7F,%eax orl $0x0100,%eax jmp 0f L07:andl $0xFFFFFE7F,%eax orl $0x80,%eax jmp 0f L08:andl $0xFFFFFF80,%eax orl $0x40,%eax jmp 0f L09:andl $0xFFFFFE80,%eax orl $0x20,%eax jmp 0f L10:andl $0xFFFFFE80,%eax orl $0x10,%eax jmp 0f L11:andl $0xFFFFFE80,%eax orl $0x08,%eax jmp 0f L12:andl $0xFFFFFE80,%eax orl $0x04,%eax jmp 0f L13:andl $0xFFFFFE80,%eax orl $0x02,%eax jmp 0f L14:andl $0xFFFFFE80,%eax orl $0x01,%eax 0:movl %eax,0x28E0(%esi) jmp L16This part surely could be reduced further if I could remember what all those flags are good for...
L15:popl %ebx popl %edi popl %esi popl %ebp jmp _DefDP L16:xorl %eax, %eax popl %ebx popl %edi popl %esi popl %ebp ret'Exits' belong to the bottom of a function. First, the processor does not have to jump back and forth to random locations within the instruction chain. Secondly, human senses perceive structured (sorted) input much faster than random patterns spread all over the screen.
.comm _GVAR, 4One (used) out of all (unused). The size of the global variable was reset to the proper size of 32 bit, so four - rather than one - variable(s) will fit into one paragraph (16 byte), again.
Analysis
The source code generated by GCC 3.3.5 perfectly reveals the greatest weaknesses of the C conventions. The first step to slow down C code markably is the abuse of EBP as base pointer, reducing our set of available registers to 6: EAX, EBX, ECX, EDX, EDI and ESI. One of these remaining registers, EAX, is used to pass results or error codes from a called function to the calling function, reducing our register set to five. By default, ECX and EDX neither are saved nor restored, so every time we call another function their content is sent to the great formatter, or, in other words: If we stored frequently used parameters in ECX or EDX prior to the call, they probably will be overwritten by the called function. Due to counterproductive conventions, only three registers are left to store our frequently used parameters. If four or more parameters are required throughout a function, parameters 4 and up must be reloaded whenever we call another function, because the content of EAX, ECX and EDX is changed by the called function. Reloading parameters from the slower memory subsystem instead of preloading them in registers to perform much faster operations is quite time consuming. It is not about those five additional clock cycles we waste with reloading a parameter over and over again. The most negative side-effect is the immediate interruption of parallel execution, forcing two execution pipes to idle until the required parameter was reloaded, again. It's one of the reasons why C and C++ applications are that slow. An exemplatory sequence is the following code snippet taken from GCC's output. It demonstrates how strict implementation of the C conventions slows down the execution of standard C and C++ code:... movl _GVAR, %eax movl 7252(%eax), %eax pushl %eax ...This sequence turns any parallel execution off. Instead of executing up to 3 instructions in one of the 3 available execution pipes, the code listed above is executed sequentially, or, in other words: Two execution pipes have to wait until the previous instruction is executed and its result is available. If all data are present in L1 cache, it takes (approximately) nine clock cycles to execute the three lines listed above. Two execution pipes are idling while _GVAR (_GVAR is BNR defined as an array of dwords to satisfy GCC) is loaded into EAX. Next, two pipes are idling while 1C54[BNR] is loaded into EAX. Finally, EAX is pushed onto the stack, blocking ESP for two clock cycles. If some data were not loaded into L1 cache, yet, execution time is extended markably. Reading data from the L2 cache costs about six clock cycles, loads from main memory consume about 27 clock cycles. In the end, our primary intention, saving 14 clock cycles to push and pop ECX and EDX in every function's prologue and epilogue, turns out to be a bad idea. In the best case, the assumed advantage is eaten up after the second reload of a frequently used parameter. In general, there is no advantage at all - The saved 14 clock cycles are wasted with the first reloading from main memory. This is the main reason why conventional software is creeping through the execution pipes of John Doe's hypermodern Gigantium processor like thick dough. Perhaps it were a good idea to purify the sap called software to give John Doe the chance to partially enjoy the overwhelming computational powers of his expensive Gigantium machine? Another obstacle to unleash the powers of recent processors is the flying stack pointer ESP. Modern software design should use the capabilities of modern processors instead of blocking them with outdated mechanisms and ancient techniques. A pointlessly floating stack pointer with random content cannot be used seriously to address stack elements.
Considerations
As discussed in great detail, any conventional stack frame generally looks like this:Programmers and compilers are free to write any kind of data to stack locations between the stack's bottom and the stack element below ESP (-0x04[ESP]). To reserve a part of the stack for the private use of the current function, we have to subtract the required size from ESP. Moving ESP towards stack bottom, we reserve the corresponding area for our private use. No function (except it is compiled with GCC 3.3.5.) ever writes data to stack locations above -04[ESP]. It is very important to understand the connection between the subtraction of the required stack frame size from ESP and exclusive ownership of the reserved area. My alternative method introduced in this paper as well as conventional programming techniques are based on this principle. You just break an absolute taboo if you write to stack locations above -04[ESP]! Because conventional programming methods push data onto the stack, respectively pop them from the stack, the content of ESP is changing continuously. Due to its randomly changing content, it is impossible to use ESP as base for adressing specific stack elements directly. An additional register, the base pointer EBP, is required for this task. It points to the current top of the stack all of the time, so we safely can adress all stack elements regardless of the current content of ESP. If we analyse the described process and have a closer look at its details, one thing is quite obvious: We could save all time consuming contortions with additional registers if we replaced the usual push - call - add n,%esp sequenzes with something else. A short look into the data sheets of modern processors reveals us some additional caveats. Let us recapitulate what we know about push and pop instructions and compare them against a bunch of simple mov instructions:
PUSH
Depending on the standard data size, either two, four or eight is subtracted from the stack pointer rSP, setting it to the next lower stack element. After updating rSP, the content of a register, a memory location or an immediate value is copied to the current stack element (00[rSP]). A push reg or push imm instruction is executed within three clock cycles. ESP is blocked for the time it needs to update its content and deliver the stack location where the given data shall be stored. This requires two clock cycles, so consecutive push instructions have a latency of at least two clock cycles. As an alternative, we might replace one push with three mov instructions. If none of them depends on the result of one of the other two movs, all of them are executed simultaneously in three clock cycles, as well. If we write to continuous memory locations, all writes are done in one gulp, because a mechanism called write combining is triggered.
POP
The content of the current stack element is copied to the register or memory location specified by the pop instruction, then two, four or eight is added to the stack pointer rSP. The current stack element was 'taken from the stack' and the stack pointer now points to one stack element above (our new stack bottom). Other than the direct path push, pop is a vector path instruction. These instructions always are fed to execution pipes 0 and 1, while execution pipe 2 is blocked while the instruction is executed. The latency of each pop instruction is 4 clock cycles. As an alternative, we might replace one pop with four mov instructions. Three of them are executed simultanuously in three clock cycles, the fourth mov starts execution in the fourth clock cycle.
Conclusions
As the analysis of both instructions shows, the only rational consequence is to replace all pop and push with mov instructions. Doing so not only speeds up passing of parameters, it also turns the flying stack pointer into a static one. As a positive side-effect of the frozen stack pointer, we do not need EBP as base pointer any longer, freeing one valuable register for general purposes. Using no base pointer, we get rid of that mess with positive and negative offsets to EBP - a source of errors, making source code less readable.Addendum
With Athlon64, family 10h (aka Phenom), the latency for pop instructions was reduced to three clock cycles direct path single. Nonetheless, replacing push and pop with mov instructions is a much better improvement, because of the expected positive side-effects coming along with the replacement automatically. Even if Intelligent Design functions will win any race around clock cycles with ease, its concept offers much more improvements than just counting clocks.Intelligent Design
If we put all discussed aspects together, we can state that conventional programming techniques ignore improvements introduced with modern processor design completely. Well, conclusions of our analysis suggested to replace all leave, pop and push with mov instructions, because three of them can be executed simultaneously. The advantages of these replacements are a static stack pointer, an additional general purpose register (EBP), a naturally aligned stack (if supported by the operating system), read and write access to any stack element via ESP, and much more. Based on the facts we worked out until now, a concept was developed and tested for usability in everday's applications. Because we use no base pointer any longer, the stack is addressed via ESP, only. As shown before, we can reserve a stack area for our private use by subtracting the required size from the stack pointer ESP. The reserved area is safe from being used by called functions (except those compiled by GCC 3.3.5.). Let us recapitulate how the stack looks like after a function is called:The current stack element holds the return address - the address of the instruction following the call in the calling function. If the calling function has to pass parameters to our funktion, they are stored above the current stack element in ascending order. Because we want to use mov instructions, only, a sufficient stack area must be reserved where we can save used registers, store local variables and pass parameters to called functions. Working with a static stack pointer, we just have to add the required sizes of these three parts (registers, local area and parameters), then subtract the sum from ESP:
After this subtraction, the content of ESP does not change until our function is terminated. Therefore, ESP can be used to address any element within our current stack frame. Elements outside our private area are taboo - you may read them whenever you want, but you must not write to them under any circumstances!
Calculating The Required Size
The size of our stack frame depends on the amount of registers we have to save, the size for our local variables and the amount of parameters we have to pass.Registers
The contents of all used registers are stored at the top of our private area. All Intelligent Design functions do save ECX and EDX, as well! The following table shows the required size for 16, 32 and 64 bit functions:Registers | 16 Bit | 32 Bit | 64 Bit |
---|---|---|---|
1 | 0x02 | 0x04 | 0x08 |
2 | 0x04 | 0x08 | 0x10 |
3 | 0x06 | 0x0C | 0x18 |
4 | 0x08 | 0x10 | 0x20 |
5 | 0x0A | 0x14 | 0x28 |
6 | 0x0C | 0x18 | 0x30 |
7 | 0x0E | 0x1C | 0x38 |
8 | - | - | 0x40 |
9 | - | - | 0x48 |
10 | - | - | 0x50 |
11 | - | - | 0x58 |
12 | - | - | 0x60 |
13 | - | - | 0x68 |
14 | - | - | 0x70 |
15 | - | - | 0x78 |
If no registers are used, the required size is zero, of course. Rows 7 through 15 are not available in 16 and 32 bit mode, because these modes do not recognise the extended 64 bit register set. Eventually, you might need to save some MMX or XMM registers. Just add 8 for each MMX and 16 for each XMM register. It is recommended to avoid using MMX registers, because they internally are mirrored on the original FPU registers ST(0) through ST(7). Older software libraries and operating systems do not know anything about MMX registers, so you might encounter weird problems if you do not write an explicit emms or femms before you call external functions. Well, emms and femms generally destroy the contents of all MMX registers, so it is a good idea to use XMM instead of MMX registers.
Local Variables
Between saved registers at the top of our stack and the area where parameters for called functions are stored, we have to reserve the space required for local data (variables, structures, strings). If the size is no multiple of the standard data size, it is recommended to expand the required size to the next higher multiple of the standard data size. It's very important to keep the stack aligned, because odd addresses in ESP trigger penalty cycles for every unaligned memory access. If you don't keep care and add an even size in your epilogue, the program definitely will crash, because EIP probably is set to a faulty return address. If you want to store strings on the stack, it is recommended to define a sufficient size, so the function reading characters from the input device has no chance to overwrite parameters or register contents stored in the stack elements above.Parameters
Parameters are stored at the bottom of our stack frame. They are moved to the stack elements in ascending order, starting at 00[ESP]. Because we mov parameters to the corresponding stack elements rather than to push them any longer, we should reserve an area equal to the size required by the function with the largest amount of parameters. If we have three functions awaiting three parameters and one function awaiting five parameters, we need a size of 20 byte (32 bit code) to pass five parameters - the three parameters for the other three functions automatically fit into the reserved 20 byte. The required size can be taken from the above table for registers. If more than 6 (15) parameters must be passed, the size easily can be calculated with the formula parameters * datasize, where data size is 2 (16 bit), 4 (32 bit) or 8 (64 bit). For example, the required size for the 13 parameters we have to pass to WinCreateWindow() is 13 * 4 = 52 byte. The bottom of the corresponding stack frame looks like this:Watch out: Parameter 1 in figure 08 is identical with the leftmost parameter we are writing within the paranthesis of a C or C++ function. In general, parameters are moved onto the stack in ascending order: Parameter 1 @ 00[ESP], parameter 2 @ 04[ESP] and so on, until the last parameter was moved. If any parameter does not change between two or more consecutive call instructions, it just has to be moved the very first time it is used. For following calls, the parameter already is stored at the right place, so we do not need to store it at the same location, again - it is stored there until we overwrite it!
Adding Sizes
After we determined the required sizes of all three parts (registers, local area, parameters to pass), we calculate their sum, then round the result up to a size ending with 3C, 7C, BC or FC (32 bit mode), respective 38, 78, B8 or F8 (64 bit mode). Applying this trick, our stack frame automatically is aligned to the beginning of the next cache line. Finally, we create our new stack frame by subtracting the rounded up sum from rSP. As described in detail, this subtraction reserves the created stack frame from being overwritten by other functions. Writing data to stack locations outside the current stack frame violates the rules of conventional programming as well as the rulework defined for Intelligent Design. Setting the first stack element to the beginning of a cache line, we benefit from a bunch of advantages. First, no time consuming workarounds are required to align stack locations to store or load XMM registers. It is just one side effect of the Intelligent Design prologue that rSP is naturally aligned by default without a single line of additional code. Secondly, Intelligent Design code is able to benefit from bandwith and accellerating mechanisms provided by modern processors. At the latest, the first access to any stack element loads the stack frame into the L1 cache, moving all following accesses to the fastest area in the machine's memory hierarchy. Thirdly, writing data to continuous memory locations in ascending order triggers a mechanism called write combining, where up to 16(!) doublewords are stored on the stack in one gulp. Much more advantages like avoiding repetitive stores for never changed parameters come along with this fresh and modern design. Some of them are shown later on, some of them are not revealed, yet... If all Intelligent Design rules were applied properly, the smallest possible stack frame automatically looks like this in 32 bit codeor like this in 64 bit code
The real size can be varied by adding multiples of 64 to the smallest value of 0x3C (0x38 in 64 bit code) to match your individual requirements perfectly. Due to the subtraction of a predefined size from rSP, the stack frames of all Intelligent Design functions automatically are set to the beginning of a cache line. In general, functions are called. Looking at the mechanism of the call instruction, the return address is pushed onto the stack before execution is continued with the first instruction of the called function. Hence, rSP points to an address ending with 3C, 7C, BC or FC in 32 bit code, respective 38, 78, B8 or F8 in 64 bit code - exactly the size we have to subtract from rSP.
Destroying The Stack Frame
After our function's code was executed, all registers must be restored to the content they had as they were passed to us. The last step before returning to the calling function is to add the size we subtracted to create our stack frame to rSP. With the final ret, rIP is taken from the stack and execution is continued with the instruction following the call of our function. At this point, our stack looks like this, again:Because the area between stack bottom and the address currently loaded into rSP is 'no-mans-land', the next function is free to reserve a part of it to save used registers, to store its local variables and structures or to put some parameters for called functions onto the stack. On principle, the content of a stack element is undefined until something was written to it. Writing data into untouched stack elements is called 'initialisation' in C and C++ terminology.
Side Notes About Alignment
Using XMM instructions, most 128 bit stores and loads explicitely expect addresses aligned to 16 byte boundaries, in other words: The last digit of the address must be zero. No existing operating system cares about aligned addresses at all. Using excessive push orgies to put the parameters for called functions onto the stack, the chance to enter that function with an aligned stack pointer is about 25 percent. Because 16 / 4 = 4, the chance ESP ends with 04, 08 or 0C is about 75 percent. It's left to the programmer or compiler to handle this problem properly, so you have to add redundant, time consuming code to align the stack pointer in every function working with XMM instructions.As shown, Intelligent Design provides some mechanisms to naturally align all stack frames to the beginning of a cache line, but - even the best mechanism is of no use if the operating system does not support it. In most cases, an ID compliant function is called by the message transporting system within the operating system. Whenever a message is sent or posted to our application's message loop, we do not know which instance of which function of the operating system called our message loop. Hence, we do not know how many things were pushed onto our stack nor can we rely on something like an aligned stack pointer. Even if we aligned the stack pointer at the begin of our main() function, it surely is not aligned any longer if the code in our message loop is executed the very first time. Therefore, one of the most important improvents introduced with Intelligent Design is not available as long as we run our code on any existing platform. Because all existing operating systems put dirt into our gears, no machine ever will run at full speed.
No application programmer is able to do anything against this counter-productive behaviour of existing operating systems. Either we bow our heads, obey and apply all those pointless work-arounds recommended by processor manufacturers, or we give up programming to get rid of them. Because it is a fact that work-arounds are not very practicable in many cases, there's a third way as an alternative for this 'take it or leave it' choice. Instead of searching for ways to bypass 'built in' restrictions, speed brakes and other obstacles of existing operating systems, it might be a much better idea to throw away these remains of old fashioned software design and create something new. Something supporting innovations and capabilities of modern processors instead of ignoring them. Something making use of the immense computational power and speed of recent quad core machines. Something called IDEOS (the Intelligent Design Easy2Use Operating System)...
Summary
I hope I could convince you of various advantages introduced with the development of Intelligent Design. Compared against conventional programming techniques (as discussed in this document), Intelligent Design wins in all disciplines - be it sheer speed, high code density, simplified development or efficiency. Due to the strict renunciation of slow leave, pop and push instructions, coming along with many ill side-effects like an unpredictable stack pointer, these caveats of conventional software design easily are turned into some really welcome side-effects by replacing those slow instructions with simple moves.The most welcome side-effect surely is the fact that we get EBP back as a general purpose register. One additional register is quite a lot if only six of them are available, because EBP was abused as base pointer. Each additional register is an advantage, because we can pair reads to copy parameters from memory to registers and access these registers rather than reloading single registers with some frequently used parameters over and over again.
Another advantage is that we mov parameters to the right location rather than to push them onto the stack. While move instructions can store changed parameters anywhere, a push always addresses the next lower stack element (a quite limited range). If we call a function repeatedly, and only the last parameter changes from call to call, we had to push all parameters after each call to update this topmost parameter with the current data. Moreover, the correction of ESP is mandatory after each call, blocking the following push for an entire clock cycle. Using move instructions, all these problems vanish. We save many unnecessary instructions and keep the three execution pipes busy most of the time. Of course, there are a lot of cases where only one or two instructions are required between two calls, but this still is faster than a conventional push orgy.
Yet another advantage is the fact we can move parameters and other data onto the stack anywhere within our source file. Pairing multiple reads in groups, we can avoid direct dependencies completely. Pairing multiple writes, we trigger write combining. Keeping proper distances between reads and writes can eliminate most dependencies with few exceptions. Applying these tricks as often as possible speeds up code markably.
Putting it all together, Intelligent Design is the up-to-date alternative to some old-fashioned conventional programming techniques. Conventional software design did not change too much in the last thirty years. We neither can create the most simple integrated circuit with stone age tools, nor are we able to create software making full use of the capabilities provided by modern processors with conventional code. While old fashioned code is spiced with mandatory brake pads and compilers generate tons of counter-productive dependencies, because the convention allows to abuse the most valuable resources - our registers - as garbage pile to save four instructions at the cost of multiple reloads, Intelligent Design replaces these obstacles with straight and clean rules, providing full support for features and improvements of recent processors.
Intelligent Design is an up-to-date concept for the next generation of operating systems and applications, introducing a new quality to software design. Moreover, it is much easier to handle a single stack pointer than to manage two registers, a separate stack and base pointer, simultaneously. Getting rid of the institution called base pointer, we finally can say good bye to positive and negative offsets and error prone corrections of the stack pointer after each of the obligatory push orgies. Another positive side-effect is the return of a valuable resource (EBP) to the very sparse pool of general purpose registers - the worst conceptual flaw of LETNi's x86 architecture. All in all, Intelligent Design points the way ahead to fastest code with high density, causing less head aches for stressed programmers. Best of all: This is not just a theoretical construction, floating around in the head of a weird, unwordly developer - it exists for real! ST-Open's libraries as well as DatTools and the SRE editor were ported to Intelligent Design, working as expected: Really fast and reliable.
Rules
Whenever we want to build a greater whole out of many independent parts created by individual contributors, we have to use a common set of rules. Anyone contributing a part to the whole has to obey these rules. If all contributors applied their own rules, we finally ended up with a lot of parts, but none of them were able to work together with any other part. Using different interfaces, single parts either could not communicate at all or misunderstood received messages completely. Therefore, we need a set of rules before we start to create parts of a greater whole. This set of rules, like a common language, defines a common interface, forcing any contributor to use standardised communication protocols, so other parts are able to understand all sent messages and do the right work. The big idea behind any set of rules lives or dies with strict obedience.Rule 1
The instructions enter, leave, push and pop are archaic remains of the past era. They must not be used in Intelligent Design code.Any of these instructions change the content of the stack pointer automatically. All Intelligent Design functions depend on one principle: The stack pointer must not change its content for their entire runtime. We either can enjoy the advantages of a static stack pointer or we can waste precious time with obligatory corrections and repetitive writes of one and the same parameter. Using a 'flying' stack pointer and its urgently required compagnion called base pointer is mutual exclusive with Intelligent Design. Both concepts use the stack pointer in a very different way - there is no chance to use a mix of both concepts in one and the same function.
Rule 2
Stack frames are created by subtracting 0x3C in 32 bit functions or 0x38 in 64 bit functions from the stack pointer ESP/RSP. If larger stack frames are required, the basic value 0x3C or 0x38 can be expanded in 64 byte steps (0x7C, 0xBC, 0xFC, ... in 32 bit functions, 0x78, 0xB8, 0xF8, ... in 64 bit functions).This rule is valid for IDEOS, only. Any other operating system will not benefit from the advantages of a properly aligned stack. Because the stack generally is not aligned in old fashioned operating systems, you have to align it with some lines of redundant bloat if you use XMM registers. If you write IDEOS code, you have to apply this rule. The only exception is the rare case that you do not use other registers than rAX and XMM0 through XMM3. If there are registers to preserve, you need a stack frame. The advantage of this rule is a properly aligned stack pointer, starting at the beginning of a cache line. IDEOS's concept guarantees that the stack pointer is naturally aligned to a multiple of 64 before any function is called. Because the call instruction stores EIP/RIP in the next stack element, subtracting the correct value automatically aligns the stack pointer to a multiple of 64 and the corresponding cache line is preloaded into L1 cache without a single line of additional code.
Rule 3
All registers used in a function must be saved before they are overwritten and must be restored before returning to the caller.Any software running on modern processors is most efficient if it can access as much hardware resources as possible. The most valuable resources of a x86 processor are its registers. The more frequently used parameters are held in registers, the less time is required to perform a given task. If we reload parameters with slow memory reads rather than to keep them in registers, we slow down our code with superfluous operations and interrupt simultaneous execution completely. Preloading all required parameters into registers in one gulp reduces all dependencies, supporting parallel execution, keeping all three pipes busy all of the time. Saving precious clock cycles with much faster register operations outweighs the cost of the few clock cycles needed to save and restore all used registers by far. Many functions work with a subset of all available registers. The less registers we use, the faster they are saved and restored. Another welcome side-effekt is based on the fact that continuous writes to ascending addresses trigger write combining, where up to 16 doublewords are written to a cache line in one gulp. This surely is much faster than 16 single writes. Moreover, pairing of multiple reads or writes is a good opportunity to support simultaneous execition flow as long as possible.
Rule 4
The registers RAX, XMM0, XMM1, XMM2 and XMM3 are reserved for special purposes - they neither are saved nor restored by default.This rule partially overrides rule 3: Per definitionem, Register RAX is used to pass return values or errorcodes to the calling function. It does not make sense to save a register that is overwritten in all epilogues by default. If a function declaration defines its return value as VOID, rAX must be cleared with a simple xor %rAX,%rAX on exit.
Registers XMM0 through XMM3 with their size of 4 * 128 bit = 64 byte cover an entire cache line. They are used to clear, move or manipulate large data areas in steps of 64 byte, preferably aligned to a multiple of 64. In general, these operations are executed in small loops where no other functions are called, so we can skip to save and restore these registers per se. This saves a lot of clock cycles if your application makes heavy use of such operations. If your function calls others while manipulating large memory blocks, the called function might use XMM0 through XMM3 as temporary storage, overwriting whatever was stored in these registers. Hence, it is recommended to use XMM4 through XMM15 if preloaded values must not change while other functions are called. As stipulated by rule 3, you have to save and restore XMM4 through XMM15 if you use them.
Rule 5
Accessing registers is much faster than accessing memory locations. The most frequently used parameters in a function should be preloaded into registers as soon as these registers were saved on the stack.Even if old-fashioned programmers aren't used to work with multiple execution pipes: Obeying to this rule can turn an ox cart into a rocket (see Rule 3).
Rule 6
Modern processors execute three (LETNi: 1.5) instructions simultaneously. Good code should make use of these capabilities rather than to ignore them. Dependency chains interrupt simultaneous execution. While one instruction is executed in one pipe, the other two pipes are waiting for the result. Sending two of three pipes to sleep for some clocks wastes two thirds of the available resources.This extraordinary product of a classical compiler
... pushl %esi pushl $0 pushl $2 pushl $0 pushl $11 movl _GVAR, %eax movl 7252(%eax), %eax pushl %eax call _FDacc movl _GVAR, %eax addl $24, %esp movl 472(%eax), %ebx pushl %ebx pushl $0 pushl $2 pushl $4 pushl $11 movl 7252(%eax), %ecx pushl %ecx call _FDacc addl $20, %esp movl _GVAR, %eax movl $0, 10464(%eax) ...can be improved with a few useful changes
... movl _GVAR,%esi # at least 3 clocks ahead ... ... ... movl 0x1C54(%esi),%eax movl 0x01D8(%esi),%ecx movl $0x00,0x28E0(%esi) movl %eax,0x00(%esp) movl $0x0B,0x04(%esp) movl $0x00,0x08(%esp) movl $0x02,0x0C(%esp) movl $0x00,0x10(%esp) movl %edi,0x14(%esp) call _FDacc movl $0x04,0x08(%esp) movl %ecx,0x14(%esp) call _FDacc ...to run at least twice as fast, now. Removing all superfluous instructions, code size could be reduced from 23 to 14 instructions. Placing all instructions in the proper order removes dependencies and keeps all pipes busy without delays.
Rule 7
Jump tables belong to the data segment. AS is able to build jump tables in the data segment, so there's absolutely no reason to pollute the code segment with data of any kind.No comment is required for this rule. Just do it!
Appendices
AT&T-SyntaxAppendix 1 - Examples
Appendix 2 - Optimisations
No comments:
Post a Comment