<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4412535206273251260</id><updated>2012-02-16T19:59:34.236+01:00</updated><category term='intelligent design'/><category term='improvement'/><category term='optimisation'/><category term='assembler'/><category term='programming'/><title type='text'>Intelligent Design</title><subtitle type='html'>Intelligent Design is an advanced programming technique, utilising many acceleration mechanisms provided by modern processors.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>17</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-1390845217413686297</id><published>2010-11-17T21:28:00.000+01:00</published><updated>2010-11-17T21:28:29.521+01:00</updated><title 
type='text'>ST-Open's Wrappers for 64 Bit Windoze</title><content type='html'>&lt;div align="justify"&gt;Wrappers generally are used to reduce calls to external functions to one place. In former times, this was done because all external calls were &lt;i&gt;far calls&lt;/i&gt; to other &lt;b&gt;code segments&lt;/b&gt;. Things changed a little bit with the &lt;b&gt;flat memory&lt;/b&gt; model, but API functions still reside in higher privileged &lt;b&gt;code segments&lt;/b&gt; or are &lt;b&gt;call gates&lt;/b&gt; to OS functions running in ring 0, 1 or 2.&lt;br /&gt;&lt;br /&gt;Another important issue is the encapsulation of &lt;i&gt;dirty&lt;/i&gt; functions, known to overwrite registers without restoring them. As shown in my &lt;a href="http://st-intelligentdesign.blogspot.com/2010/11/intelligent-design-in-one-piece.html"&gt;Intelligent Design&lt;/a&gt; paper, a &lt;i&gt;dirty&lt;/i&gt; environment slows down execution markedly, because frequently used parameters must be reloaded after each call to a &lt;i&gt;dirty&lt;/i&gt; function.&lt;br /&gt;&lt;br /&gt;Unfortunately, the &lt;i&gt;dirtiness&lt;/i&gt; of Windoze and Linux grew with the switch to their 64 bit ABI/API. While only two registers were used as a garbage pile in 32 bit programming environments (ECX, EDX), we now have to face the abuse of 8 registers in Windoze environments (RCX, RDX, R08, R09, R10, R11, XMM4 and XMM5) or even 11 registers in Linux (RDI, RSI, RCX, RDX, R08, R09, R10 and XMM4...XMM7). Whenever you pass the mentioned registers to an API function, you can be sure they are returned with changed content. To avoid changed registers, you have to use wrappers. If you do not, you have to reload your parameters over and over again.&lt;br /&gt;&lt;br /&gt;Wrappers are a must if you want to keep your programming environment &lt;i&gt;clean&lt;/i&gt;. As a positive side effect, you can customize API calls. 
Doing so can reduce the number of arguments to pass drastically - a wrapper can manage things like retrieving the HWND for a dialog item much faster than the programmer, because it already has all registers preserved, saving the time to preserve and restore them more than once.&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Windoze 64 Bit Wrapper&lt;/h1&gt;My programming environment is the TDM/MinGW64 package, date 2010-05-09. Later packages are broken and refuse to work properly. All code snippets are AS sources taken from ST-Open's system library. All functions in this environment must be defined as follows:&lt;br /&gt;&lt;pre&gt;.text&lt;br /&gt;.globl   _MyFunction&lt;br /&gt;.def     _MyFunction; .scl 2; .type 32; .endef&lt;br /&gt; _MyFunction:&lt;br /&gt;&lt;/pre&gt;To tell GCC to put the following output into the &lt;b&gt;code segment&lt;/b&gt;, we have to insert a &lt;i&gt;.text&lt;/i&gt; statement on top of our code. If your code references static data in the &lt;b&gt;data&lt;/b&gt; or &lt;b&gt;DR&lt;/b&gt; segment, their definitions should precede the &lt;i&gt;.text&lt;/i&gt; statement. &lt;b&gt;Never&lt;/b&gt; put data into the &lt;b&gt;code segment&lt;/b&gt; - it can cause a drastic performance loss.&lt;br /&gt;&lt;br /&gt;Next, &lt;i&gt;.globl&lt;/i&gt; statements tell the linker this function is globally available (visible) and external functions can call it. If the &lt;i&gt;.globl&lt;/i&gt; is missing, that function can only be called from functions residing in the current source file.&lt;br /&gt;&lt;br /&gt;A &lt;i&gt;.def&lt;/i&gt; statement provides additional debugging information for the compiler. The &lt;i&gt;.scl&lt;/i&gt; defines the storage class, &lt;i&gt;.type 32&lt;/i&gt; means it is a function. This statement probably is superfluous in pure assembler programs, but nevertheless is required if the function shall be callable from external HLL functions.&lt;br /&gt;&lt;br /&gt;Finally, the real code of MyFunction() begins at &lt;b&gt;_MyFunction&lt;/b&gt;. 
That is: The address of the first instruction following the label &lt;b&gt;_MyFunction&lt;/b&gt; is the start address of &lt;b&gt;_MyFunction&lt;/b&gt;. Whenever the linker stumbles upon a &lt;i&gt;call _MyFunction&lt;/i&gt;, it replaces the label with a reference to that start address.&lt;br /&gt;&lt;br /&gt;That much about HLL compatible gobbledygook. Let's continue with some basic thoughts about the internal organisation of a wrapper.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Wrapper Designs&lt;/h2&gt;There are two different ways to organise a wrapper - either you provide separate functions with endless repetitions of one and the same prologue and epilogue (stand-alone), or you use one prologue and epilogue for all functions (collection).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Stand-Alone Wrappers&lt;/h3&gt;Providing a separate prologue and epilogue for each function has one advantage: The linker can cut the function's code out of a library and add just that code to the program where the function is called. The downside is a library with tons of redundant repetitions of one and the same prologue and epilogue. Hence, programs using the library are kept smaller, while the size of the library is quite large. Depending on the size of the prologue and epilogue, there is a point where the advantage is eaten up by their repetition. For example, the payload of an API wrapper is about ten percent of the entire wrapper, while the remaining 90 percent are occupied by preserving and restoring clobbered registers.&lt;br /&gt;&lt;br /&gt;Let us assume our library has 20 functions, resulting in 20 * 0.1 payload and 20 * 0.9 redundant repetitions. As a result, we have 0.5 percent payload and 99.5 percent overhead. This pays off if only one library function is called. With two calls, we have five percent payload and 95 percent overhead, with four functions 2.5 percent payload and 97.5 percent overhead, and so on. 
As you can see, this concept looks better at first glance, but turns out to be a bad design for daily use.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Collected Wrappers&lt;/h3&gt;Here we go the other way around. This concept adds bloat if only one function is used, but pays off if we call multiple functions. Both payload and overhead now have fixed sizes. The more functions we use, the smaller our relative overhead becomes. With two functions, the ratio is 20 to 80 percent, with four functions 40 to 60 percent, and so on. In the end, we can reduce the size of our program by &lt;i&gt;some&lt;/i&gt; bytes if we use a collected wrapper. Therefore, most ST-Open libraries meanwhile use collected rather than stand-alone functions.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Windoze API&lt;/h2&gt;&lt;h3&gt;Prologue And Epilogue&lt;/h3&gt;With the change to 64 bit, &lt;small&gt;microsoft&lt;/small&gt; decided to follow the footsteps of Loonix and abuse more registers as a garbage pile. In addition to rCX and rDX in 32 bit code, we now have to preserve R08, R09, R10, R11, XMM4 and XMM5 as well, if we do not want to reload the contents of these registers after each API call. I have seen a lot of register dumps throughout the last weeks, so I can tell you those registers definitely are destroyed after each API call. The most important parts of any API wrapper therefore are its prologue and epilogue. 
The prologue looks like this&lt;br /&gt;&lt;pre&gt;    ...&lt;br /&gt;&lt;br /&gt;    .p2align 4,,15&lt;br /&gt;  0:subq     $0xB8,%rsp&lt;br /&gt;    nop&lt;br /&gt;    nop&lt;br /&gt;    movdqa   %xmm4,0x60(%rsp)&lt;br /&gt;    movdqa   %xmm5,0x70(%rsp)&lt;br /&gt;    movq     %rcx, 0x88(%rsp)&lt;br /&gt;    movq     %rdx, 0x90(%rsp)&lt;br /&gt;    movq     %r8,  0x98(%rsp)&lt;br /&gt;    movq     %r9,  0xA0(%rsp)&lt;br /&gt;    movq     %r10, 0xA8(%rsp)&lt;br /&gt;    movq     %r11, 0xB0(%rsp)&lt;br /&gt;    jmp      *%rax&lt;br /&gt;&lt;br /&gt;    ...&lt;br /&gt;&lt;/pre&gt;where &lt;b&gt;0&lt;/b&gt; is the entry point for the function declarations placed above the prologue. RAX is set to the address of the real function code at the end of the declaration, preceding the jump to &lt;b&gt;0&lt;/b&gt;. Even if it looks quite lengthy, saving all registers is done in about 5 clock cycles (RSP correction plus two write combining sequences). Using six &lt;i&gt;push&lt;/i&gt;es for the GPRs and two &lt;i&gt;movdqa&lt;/i&gt;s for the XMM registers took about 15 clock cycles, because &lt;i&gt;push&lt;/i&gt; works with decreasing addresses, so no write combining is triggered. 
(Two clocks for the &lt;i&gt;movdqa&lt;/i&gt;s, 13 clocks for six &lt;i&gt;push&lt;/i&gt;es - RSP is available after 2 clocks, only the last &lt;i&gt;push&lt;/i&gt; needs all three clock cycles.)&lt;br /&gt;&lt;br /&gt;The epilogue is quite similar to the prologue, except target and source of the &lt;i&gt;mov&lt;/i&gt;e instructions are exchanged:&lt;br /&gt;&lt;pre&gt;    ...&lt;br /&gt;&lt;br /&gt;    .p2align 4,,15&lt;br /&gt;XIT:movdqa   0x60(%rsp),%xmm4&lt;br /&gt;    movdqa   0x70(%rsp),%xmm5&lt;br /&gt;    movq     0x88(%rsp),%rcx&lt;br /&gt;    movq     0x90(%rsp),%rdx&lt;br /&gt;    movq     0x98(%rsp),%r8&lt;br /&gt;    movq     0xA0(%rsp),%r9&lt;br /&gt;    movq     0xA8(%rsp),%r10&lt;br /&gt;    movq     0xB0(%rsp),%r11&lt;br /&gt;    addq     $0xB8,%rsp&lt;br /&gt;    ret&lt;br /&gt;&lt;/pre&gt;This is trivial code and holds no mysteriously hidden secrets. In contrast to the prologue, register reads have no accelerating mechanisms like write combining, so it takes about 10 clock cycles until the final &lt;i&gt;ret&lt;/i&gt;urn is executed - memory reads and writes are limited to one access per clock cycle. With &lt;i&gt;pop&lt;/i&gt;s, the epilogue was executed in 15 clock cycles (two for the &lt;i&gt;movdqa&lt;/i&gt;s, 13 for the &lt;i&gt;pop&lt;/i&gt;s, where only the last one needs 3 cycles, the others are ready after 2 clocks).&lt;br /&gt;&lt;br /&gt;All mentioned latencies are valid for PhenomII (family 10) only. For older Athlons (family 8), the &lt;i&gt;push&lt;/i&gt; version needs 21 and the &lt;i&gt;pop&lt;/i&gt; version 27 clock cycles. Latencies for the Intelligent Design versions are the same for both processor families.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Function Declarations&lt;/h3&gt;All function declarations use a stereotyped pattern, where only the function names change from declaration to declaration. 
The following snippet is just an excerpt from my original file:&lt;br /&gt;&lt;pre&gt;          .text&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;          .globl   _RegClass&lt;br /&gt;          .def     _RegClass; .scl 2; .type 32; .endef&lt;br /&gt;_RegClass:movq     $rclass,%rax&lt;br /&gt;          jmp      0f&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;          .globl   _RgClassX&lt;br /&gt;          .def     _RgClassX; .scl 2; .type 32; .endef&lt;br /&gt;_RgClassX:movq     $rclssx,%rax&lt;br /&gt;          jmp      0f&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;          .globl   _LdIcon&lt;br /&gt;          .def     _LdIcon; .scl 2; .type 32; .endef&lt;br /&gt;  _LdIcon:movq     $ldicon,%rax&lt;br /&gt;          jmp      0f&lt;br /&gt;&lt;br /&gt;          ...&lt;br /&gt;&lt;/pre&gt;I use symbolic names for all local labels, but you could use GCC-style labels like L00...Lxx, as well. Symbolic names are faster to find if the file includes 60 functions like &lt;i&gt;cap.S&lt;/i&gt;, though...&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Functions&lt;/h3&gt;The functions themselves handle all required tasks, call the corresponding API function and pass the API returncode (RC) to the caller:&lt;br /&gt;&lt;pre&gt;          ...&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;   rclass:call     *__imp__RegisterClassA(%rip)&lt;br /&gt;          jmp XIT&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;   rclssx:call     *__imp__RegisterClassExA(%rip)&lt;br /&gt;          jmp XIT&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;   ldicon:call     *__imp__LoadIconA(%rip)&lt;br /&gt;          jmp XIT&lt;br /&gt;&lt;br /&gt;          ...&lt;br /&gt;&lt;/pre&gt;The shown functions just pass the received parameters in RCX, RDX, R08 and R09 to the API, but they may pre-process those parameters, before they finally call the API function. 
An example:&lt;br /&gt;&lt;pre&gt;          ...&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;   ctlshw:call     *__imp__GetDlgItem(%rip)&lt;br /&gt;          movq     %rax,%rcx                 # HWND&lt;br /&gt;          movq     0x98(%rsp),%rdx           # flag&lt;br /&gt;          call     *__imp__ShowWindow(%rip)&lt;br /&gt;          jmp XIT&lt;br /&gt;&lt;br /&gt;          ...&lt;br /&gt;&lt;/pre&gt;A call to CtlSh(HWND, id, bool); shows or hides the control specified by its ID and the handle of the control's parent window (probably a dialog). To speed up execution, the wrapper function first retrieves the control's window handle, then calls the API function to show or hide general windows rather than calling a Windoze macro (which does nothing other than what CtlSh() does, but probably takes the long-winded way).&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;A Real Epilogue...&lt;/h2&gt;I hope I could impart some knowledge about the pros and cons of old-fashioned C-style programming techniques and modern alternatives. 
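&lt;br /&gt;&lt;br /&gt;To see how the pieces fit together, here is a minimal sketch that merely stitches the snippets shown above into one file holding a single wrapper. It assumes the same TDM/MinGW64 setup and the same symbol names as the excerpts; only the comments are new. The real &lt;i&gt;cap.S&lt;/i&gt; places many more declarations and functions around the same prologue and epilogue:&lt;br /&gt;&lt;pre&gt;          .text&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;          .globl   _LdIcon&lt;br /&gt;          .def     _LdIcon; .scl 2; .type 32; .endef&lt;br /&gt;  _LdIcon:movq     $ldicon,%rax              # declaration: address of the real code&lt;br /&gt;          jmp      0f                        # continue with the shared prologue&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;        0:subq     $0xB8,%rsp                # shared prologue: save clobbered registers&lt;br /&gt;          nop&lt;br /&gt;          nop&lt;br /&gt;          movdqa   %xmm4,0x60(%rsp)&lt;br /&gt;          movdqa   %xmm5,0x70(%rsp)&lt;br /&gt;          movq     %rcx, 0x88(%rsp)&lt;br /&gt;          movq     %rdx, 0x90(%rsp)&lt;br /&gt;          movq     %r8,  0x98(%rsp)&lt;br /&gt;          movq     %r9,  0xA0(%rsp)&lt;br /&gt;          movq     %r10, 0xA8(%rsp)&lt;br /&gt;          movq     %r11, 0xB0(%rsp)&lt;br /&gt;          jmp      *%rax                     # dispatch to the selected function&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;   ldicon:call     *__imp__LoadIconA(%rip)   # payload: the API call itself&lt;br /&gt;          jmp      XIT&lt;br /&gt;&lt;br /&gt;          .p2align 4,,15&lt;br /&gt;      XIT:movdqa   0x60(%rsp),%xmm4          # shared epilogue: restore registers&lt;br /&gt;          movdqa   0x70(%rsp),%xmm5&lt;br /&gt;          movq     0x88(%rsp),%rcx&lt;br /&gt;          movq     0x90(%rsp),%rdx&lt;br /&gt;          movq     0x98(%rsp),%r8&lt;br /&gt;          movq     0xA0(%rsp),%r9&lt;br /&gt;          movq     0xA8(%rsp),%r10&lt;br /&gt;          movq     0xB0(%rsp),%r11&lt;br /&gt;          addq     $0xB8,%rsp&lt;br /&gt;          ret                                # RAX still holds the API return code&lt;br /&gt;&lt;/pre&gt;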
The entire file &lt;i&gt;cap.S&lt;/i&gt; can be downloaded here: &lt;a href="http://code.google.com/p/st-open/downloads/list"&gt;wrappers.zip&lt;/a&gt;.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-1390845217413686297?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/1390845217413686297/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/11/st-opens-wrappers-for-64-bit-windoze.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/1390845217413686297'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/1390845217413686297'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/11/st-opens-wrappers-for-64-bit-windoze.html' title='ST-Open&apos;s Wrappers for 64 Bit Windoze'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-1988357940673642572</id><published>2010-11-14T11:20:00.001+01:00</published><updated>2010-11-15T04:29:31.788+01:00</updated><title type='text'>De-mystifying Windoze UpDown-Controls</title><content type='html'>&lt;div align="justify"&gt;Have you ever asked yourself why things are &lt;i&gt;that&lt;/i&gt; complicated? 
If so, you might want to read further...&lt;br /&gt;&lt;h1&gt;Windoze UpDown-Controls&lt;/h1&gt;Well, &lt;small&gt;microsoft&lt;/small&gt; had better called them DownDown-Controls, because they are a cheap imitation of OS/2's sophisticated Spinbutton Controls. When I saw &lt;small&gt;microsoft&lt;/small&gt;'s documentation the very first time, I just skipped the implementation of my advanced spinbuttons after reading two or three pages. Meanwhile, I started to recreate DatTools, because I urgently need a tool to manage my datafields. I surely can do it with a hex editor, but - it is quite time consuming to enter strings this way...&lt;br /&gt;&lt;br /&gt;Being forced to dive deeper into the complicated matter of Windoze' controlled Ups and Downs, I started to insert some experimental code into my raw DatTools skeleton. After writing a tool to retrieve all received messages and put them into a dump file, I finally got a clue how these controls really work. As it turned out, there is absolutely &lt;i&gt;nothing&lt;/i&gt; complicated about it - except the HLL hocus pocus which hides really simple stuff behind important sounding gobbledygook to keep normal people away from using these controls. Actually, Windoze' UpDown-Controls are less complicated than OS/2's Spinbuttons!&lt;br /&gt;&lt;br /&gt;Umm ... this introduction is getting too large, let us start to learn something new. There are several ways to create an UpDown-Control. You can do it the really complicated way and use &lt;b&gt;CreateUpDownControl()&lt;/b&gt; to create the control, but it is a mess to retrieve the proper x, y, w and h values for this function. The least complicated way to retrieve coarse parameters is to query the rectangle of the associated edit control, add a few pixels to &lt;b&gt;(x + w)&lt;/b&gt; as new &lt;b&gt;x&lt;/b&gt; position, use the same &lt;b&gt;y&lt;/b&gt; and &lt;b&gt;h&lt;/b&gt; and take &lt;b&gt;h&lt;/b&gt; plus a few pixels as &lt;b&gt;w&lt;/b&gt; parameter. 
The easiest way (for really lazy sods like me...) is to add one (or several) line(s) like this&lt;br /&gt;&lt;pre&gt;CONTROL "", 0x138B, "msctls_updown32", 0x50010044,  85,  18,  15,  12&lt;br /&gt;&lt;/pre&gt;to your &lt;i&gt;whatever.dlg&lt;/i&gt; file. The hexadecimal &lt;b&gt;0x50010044&lt;/b&gt; is a short version of&lt;br /&gt;&lt;pre&gt;UDS_ALIGNRIGHT | UDS_HORZ | WS_CHILD | WS_VISIBLE | WS_TABSTOP&lt;br /&gt;&lt;/pre&gt;These styles are defined for UpDown-Controls:&lt;br /&gt;&lt;pre&gt;UDS_WRAP                        0x0001&lt;br /&gt;UDS_SETBUDDYINT                 0x0002&lt;br /&gt;UDS_ALIGNRIGHT                  0x0004&lt;br /&gt;UDS_ALIGNLEFT                   0x0008&lt;br /&gt;UDS_AUTOBUDDY                   0x0010&lt;br /&gt;UDS_ARROWKEYS                   0x0020&lt;br /&gt;UDS_HORZ                        0x0040&lt;br /&gt;UDS_NOTHOUSANDS                 0x0080&lt;br /&gt;UDS_HOTTRACK                    0x0100&lt;br /&gt;&lt;/pre&gt;Having done that, we can leave HeLL and start with some real work. Everything begins with processing the WM_INITDIALOG message. In general, you might want to set the upper and lower limits via something like&lt;br /&gt;&lt;pre&gt;xorl     %eax,   %eax&lt;br /&gt;xorl     %r9d,   %r9d&lt;br /&gt;movq     %rdi,   %rcx&lt;br /&gt;movl     $0x138B,%edx&lt;br /&gt;incl     %eax&lt;br /&gt;movl     $0x046F,%r8d&lt;br /&gt;decl     %r9d&lt;br /&gt;movq     %rax,   0x20(%rsp)&lt;br /&gt;call     _SnDIM&lt;br /&gt;&lt;/pre&gt;SnDIM() is a wrapper for SendDlgItemMessageA(), preserving and restoring the eight registers destroyed by Win's API. RDI is my fixed storage for the dialog's HWND - I used RCX previously to pass parameters to other functions not shown here. 
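&lt;br /&gt;&lt;br /&gt;For readers who prefer names over bare numbers, the same call can be written in a plain, unscheduled form. This is only a sketch: &lt;i&gt;ID_UPDOWN&lt;/i&gt; is a hypothetical name for the resource ID used in the CONTROL line, the caller's stack frame is assumed to provide the parameter area at 0x20(%rsp) exactly as in the snippet above, and no attempt is made to interleave the instructions. With UDM_SETRANGE32, WPARAM carries the lower and LPARAM the upper limit, so this sets a range of -1...+1:&lt;br /&gt;&lt;pre&gt;.equ     ID_UPDOWN,      0x138B             # hypothetical name for the control's resource ID&lt;br /&gt;.equ     UDM_SETRANGE32, 0x046F&lt;br /&gt;&lt;br /&gt;movq     %rdi,            %rcx              # 1st argument: dialog HWND&lt;br /&gt;movl     $ID_UPDOWN,      %edx              # 2nd argument: control ID&lt;br /&gt;movl     $UDM_SETRANGE32, %r8d              # 3rd argument: message&lt;br /&gt;movl     $-1,             %r9d              # 4th argument: WPARAM = lower limit&lt;br /&gt;movq     $1,              0x20(%rsp)        # 5th argument: LPARAM = upper limit&lt;br /&gt;call     _SnDIM&lt;br /&gt;&lt;/pre&gt;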
After setting the lower and upper limits, you might want to tell the UpDown-Control where to start:&lt;br /&gt;&lt;pre&gt;incl     %r9d                  # R09 = 0&lt;br /&gt;addl     $0x02,%r8d            # UDM_SETRANGE32 + 2 = UDM_SETPOS32&lt;br /&gt;movq     %r9,  0x20(%rsp)      # LPARAM = start position&lt;br /&gt;call     _SnDIM&lt;br /&gt;&lt;/pre&gt;As you can see, wrappers save reloading all destroyed registers after each API call. Take care to pass a zero in R09 (WPARAM), as well! The hexadecimal in RDX is the resource ID of the UpDown, the hexadecimal in R08 the message sent to the UpDown-Control. These are all messages you can send to an UpDown:&lt;br /&gt;&lt;pre&gt;UDM_SETRANGE                    0x0465&lt;br /&gt;UDM_GETRANGE                    0x0466&lt;br /&gt;UDM_SETPOS                      0x0467&lt;br /&gt;UDM_GETPOS                      0x0468&lt;br /&gt;UDM_SETBUDDY                    0x0469&lt;br /&gt;UDM_GETBUDDY                    0x046A&lt;br /&gt;UDM_SETACCEL                    0x046B&lt;br /&gt;UDM_GETACCEL                    0x046C&lt;br /&gt;UDM_SETBASE                     0x046D&lt;br /&gt;UDM_GETBASE                     0x046E&lt;br /&gt;UDM_SETRANGE32                  0x046F&lt;br /&gt;UDM_GETRANGE32                  0x0470&lt;br /&gt;UDM_SETPOS32                    0x0471&lt;br /&gt;UDM_GETPOS32                    0x0472&lt;br /&gt;UDM_SETUNICODEFORMAT            0x2005&lt;br /&gt;UDM_GETUNICODEFORMAT            0x2006&lt;br /&gt;&lt;/pre&gt;This is almost all of the complicated stuff to do. Oh, yes, not to forget - the only thing we have to evaluate after initialising the dialog is the &lt;b&gt;WM_NOTIFY&lt;/b&gt; message. Whenever RDX holds &lt;b&gt;0x4E&lt;/b&gt;, we should check if the low word of R08 (WPARAM) is the ID of one of our UpDown-Controls. 
If so, R09 (LPARAM) holds the address of a 32 byte wide stack location where the following parameters are parked:&lt;br /&gt;&lt;pre&gt;00   DQ   hwndFrom  control HWND&lt;br /&gt;08   DQ   idFrom            ID&lt;br /&gt;10   DQ   code      0xFFFFFD2E (UDN_DELTAPOS)&lt;br /&gt;18   SD   iPos      position current&lt;br /&gt;1C   SD   iDelta             new&lt;br /&gt;&lt;/pre&gt;You either can use &lt;i&gt;iPos&lt;/i&gt; directly or add &lt;i&gt;iDelta&lt;/i&gt; to the current value of your own parameter. Converting the new value to a hexadecimal or decimal string, selecting an entry from a string table or whatever else you want to do with the returned result is trivial and might be discussed in another post. There are a few messages UpDown-Controls may send in the &lt;i&gt;code&lt;/i&gt; quadword:&lt;br /&gt;&lt;pre&gt;UDN_FIRST                   0xFFFFFD2F&lt;br /&gt;UDN_LAST                    0xFFFFFD27&lt;br /&gt;UDN_DELTAPOS                0xFFFFFD2E&lt;br /&gt;&lt;/pre&gt;Now that you have learned &lt;i&gt;all&lt;/i&gt; important facts about UpDowns, it is time for some remarks.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Tips And Tricks&lt;/h2&gt;You might associate your UpDowns with the previous window (&lt;i&gt;buddy window&lt;/i&gt;) in the dialog's window hierarchy, which probably is an edit control. If you want to implement some kind of true control over your controls, I strongly recommend &lt;b&gt;not&lt;/b&gt; to associate your UpDowns with a &lt;i&gt;buddy window&lt;/i&gt;. Let them work as simple controls with the ability to send &lt;b&gt;WM_NOTIFY&lt;/b&gt; messages repeatedly as long as one of the arrow keys is pressed.&lt;br /&gt;&lt;br /&gt;The most interesting detail is the &lt;i&gt;iDelta&lt;/i&gt; parameter. As a matter of fact, this is the only useful thing an UpDown-Control emits. It keeps us informed about which of the buttons is held down currently. 
It is &lt;b&gt;FFFFFFFF&lt;/b&gt; for DOWN and &lt;b&gt;00000001&lt;/b&gt; for UP, when the UpDown starts spinning, and might be increased if one of the arrow keys is held down for a while. You can control this behaviour by setting the &lt;b&gt;ACCEL&lt;/b&gt; structures associated with the UpDown to values suiting your needs. There are at least three &lt;b&gt;ACCEL&lt;/b&gt; structures for each UpDown-Control. With these few parameters, it is easy to create very complex structures controlling multiple 'slave' windows with a single UpDown-Control.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;The Future Is Less Than A Picosecond Away&lt;/h2&gt;It will take more than a few picoseconds to port my advanced spinbuttons to Windoze, but, nevertheless - let me introduce ST-Open's Spinbutton Library. Like all ST-Open Libraries, spinbuttons are controlled via datafields and provide an easy to use interface for assembler programmers as well as for C-style coders. All libraries follow standard C calling conventions. The entire programming interface consists of a single function, one common structure and a few line-filling definitions to keep C(+-*#?) programmers happy. What would we do if we were not forced to scroll the text on our 250 * 15,000 pixel screen sidewards? Quite unbelievable that any line of simple code was not sufficient to fill even 30,000 pixel wide screens, isn't it? But: Such code definitely exists, see above!&lt;br /&gt;&lt;br /&gt;Back to business. 
The old function expected four parameters&lt;br /&gt;&lt;pre&gt;spin number&lt;br /&gt;SPN_* command&lt;br /&gt;numeric input    (optional)&lt;br /&gt;address in/out   (optional)&lt;br /&gt;&lt;/pre&gt;where &lt;i&gt;command&lt;/i&gt; was defined as one of these:&lt;br /&gt;&lt;pre&gt;SPN_SET             0x08&lt;br /&gt;SPN_GETCUR          0x07&lt;br /&gt;SPN_GETID           0x06&lt;br /&gt;SPN_GETSTRUC        0x05&lt;br /&gt;SPN_QUERY           0x04&lt;br /&gt;SPN_END             0x03&lt;br /&gt;SPN_DN              0x02&lt;br /&gt;SPN_UP              0x01&lt;br /&gt;SPN_INIT            0x00&lt;br /&gt;&lt;/pre&gt;As mentioned above, the UpDown delivers all necessary parameters without lengthy requests to API functions. You can pass R08 and R09 through as you got them for SPN_NOTIFY and SPN_EDITED, while you have to provide one parameter for the other commands, which are reduced to&lt;br /&gt;&lt;pre&gt;SPN_SETSTRUC        0x06&lt;br /&gt;SPN_GETSTRUC        0x05&lt;br /&gt;SPN_QUERY           0x04&lt;br /&gt;SPN_SET             0x03&lt;br /&gt;SPN_EDITED          0x02&lt;br /&gt;SPN_NOTIFY          0x01&lt;br /&gt;SPN_INIT            0x00&lt;br /&gt;&lt;/pre&gt;where SPN_GETSTRUC and SPN_SETSTRUC are only required for HLL freaks (real programmers know how to read memory blocks without contorted manoeuvres). 
The positive side effects of datafield driven spinbuttons:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Due to the advanced conceptual design, code is greatly reduced and most tasks are performed by the highly optimised library function.&lt;/li&gt;&lt;li&gt;The spinbutton datafield is loaded automatically with the first SPN_INIT command if it is not present yet.&lt;/li&gt;&lt;li&gt;All spinbuttons automatically start with the values of the last session without any line of extra code.&lt;/li&gt;&lt;li&gt;You have to set minimum and maximum values for each spinbutton only once.&lt;/li&gt;&lt;li&gt;Changing parameters, including the spinbutton type, is comfortably done with DatTools' spinbutton editor.&lt;/li&gt;&lt;/ul&gt;That much about the bread and butter side of ST-Open's Spinbuttons. Now the honey pot - the currently available spinbutton types:&lt;br /&gt;&lt;pre&gt;SPN_STR             0x08&lt;br /&gt;SPN_DATE            0x07&lt;br /&gt;SPN_TIME            0x06&lt;br /&gt;SPN_HEX64           0x05&lt;br /&gt;SPN_HEX32           0x04&lt;br /&gt;SPN_HEX16           0x03&lt;br /&gt;SPN_HEX08           0x02&lt;br /&gt;SPN_DEC64           0x01&lt;br /&gt;SPN_DEC32           0x00&lt;br /&gt;&lt;/pre&gt;Are there any wishes left? Oh, well, there are no floating point spinbuttons, yet. As long as there are no FP conversions in my libraries, there are no FP spinbuttons. I will never code one more wrapper to call functions of a C library. Keep dirty programming where it belongs (e.g. &lt;small&gt;microsoft&lt;/small&gt;...). Full stop.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Final Words&lt;/h2&gt;ST-Open's Libraries Version 8.0.0 (64 bit Win) will be available in a few months. 
Once all libraries are tested, they will be uploaded to &lt;a href="http://code.google.com/p/st-open/downloads/list"&gt;Google Code&lt;/a&gt; and I finally can start the development of IDEOS, the 21st century's operating system.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Addendum&lt;/h3&gt;Corrected some errors caused by mixed use of decimals and hexadecimals in some header files...&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-1988357940673642572?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/1988357940673642572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/11/de-mystifying-windoze-updown-controls.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/1988357940673642572'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/1988357940673642572'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/11/de-mystifying-windoze-updown-controls.html' title='De-mystifying Windoze UpDown-Controls'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-3458883301139700788</id><published>2010-11-01T16:45:00.000+01:00</published><updated>2010-11-01T22:07:23.050+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='assembler'/><category scheme='http://www.blogger.com/atom/ns#' term='optimisation'/><category scheme='http://www.blogger.com/atom/ns#' term='improvement'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='intelligent design'/><title type='text'>Intelligent Design in one piece</title><content type='html'>&lt;h3&gt;Copyright Note&lt;/h3&gt;&lt;div align="justify"&gt;The programming techniques introduced in this paper are intellectual property of &lt;b&gt;Bernhard Schornak&lt;/b&gt;. They are protected by international copyrights, published under the terms of the &lt;b&gt;&lt;a href="http://ft4fp.blogspot.com/p/ft4fp-license.html"&gt;FT4FP&lt;/a&gt;-License&lt;/b&gt;. Any commercial use, trade or other forms of exploitation to gain profit are strictly prohibited. Knowledge is a common property and should be freely available for every human. It must not be abused as proprietary ware, only available for those who can afford to feed a few greedy individuals with money.&lt;br /&gt;&lt;br /&gt;This document was written for the ST-Open homepage in 2006. It was slightly modified for this blog.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Introduction&lt;/h1&gt;Most people probably associate the term &lt;i&gt;Intelligent Design&lt;/i&gt; with the movement of &lt;i&gt;Creationism&lt;/i&gt; rather than a new, revolutionary programming technique. The usurpation of this term is an intentional sideswipe. Whatever invented gods and goddesses were able to do - a smart programmer can do it much better. This paper is an introduction to the next generation of programming, superior to old-fashioned conventions and programming techniques.&lt;br /&gt;&lt;br /&gt;Compared against conventional programming techniques, &lt;i&gt;Intelligent Design&lt;/i&gt; represents a leap in quality. 
However, some knowledge about the creation and management of a conventional stack is required to understand the important difference between old-fashioned programming techniques and &lt;i&gt;Intelligent Design&lt;/i&gt;. To impart the knowledge about conventional methods to the reader, the next pages offer a detailed introduction to stacks, stack frames and how they are managed. Without this knowledge, it probably is impossible to understand the alternative methods and techniques introduced with &lt;i&gt;Intelligent Design&lt;/i&gt;. Old-fashioned programming techniques never kept pace with recent processors - the standards of so-called &lt;i&gt;high level languages&lt;/i&gt; are designed to work with the first generation of microprocessors as well as the most recent quad-core machines. Unfortunately, the computational power of processors grew by several powers of ten, while software standards never followed any technical evolution. Today, we have mature high speed processors driven by software toddlers that never grew up. It is quite counter-productive to slow down high speed devices because their 'drivers' cannot handle most of the controls.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Basics&lt;/h1&gt;The assembly language dialect used in this paper is not the one commonly used in the world of Windows, also known as &lt;i&gt;LETNi&lt;/i&gt; syntax. It is &lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html"&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax, known in the world of Linux and Unix. When I began to write code for the x86 platform in 1993, &lt;i&gt;GCC&lt;/i&gt; was the only free development tool one could get, so I had to use &lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html"&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax. 
If you only know &lt;i&gt;LETNi&lt;/i&gt; syntax, there is a short introduction to &lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html"&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax to learn the difference between the two. Once you have worked with &lt;i&gt;AS&lt;/i&gt; for a short time, you will not want to return to the complicated and perversed (or was that reversed?) &lt;i&gt;LETNi&lt;/i&gt; syntax anymore. The programming techniques introduced in this paper do not rely on a specific syntax. However, knowing &lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html"&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax might help you to understand the sample code. Reaching the goal is what really counts. How we get there is another, slightly different problem.&lt;br /&gt;&lt;h2&gt;The Stack&lt;/h2&gt;Every &lt;i&gt;x86&lt;/i&gt; processor makes heavy use of a stack. Parameters we have to pass to called functions, return addresses to calling functions, local variables and structures are put onto or read from the stack. Unfortunately, most of today's programmers are just used to clicking together some of those prefabricated code fragments coming along with the daily 20 GB version upgrade for their favourite VisualXYZ(plus-minus-dotnet) development suite. If you ask any of them 'What is a stack?', they probably tell you something about hay or - of course! - money. Sad, but true. Exceptionally sad, because the mechanisms of a software stack are so simple that one is tempted to call the big idea behind them nothing less than 'brilliant'.&lt;br /&gt;&lt;br /&gt;Whenever you compile a program, you pass a definition file with the extension &lt;i&gt;.def&lt;/i&gt; to the linker (LINK.EXE or similar). The definition file holds some important information about the program for the session manager of your operating system. 
When the compiled program is started, the operating system reserves three independent memory blocks (code, data, stack) for the new process, and each of the segment registers CS, DS+ES and SS is set to the address of one of these blocks. Whatever you defined as &lt;b&gt;STACKSIZE&lt;/b&gt; in the definition file, exactly that size is allocated for your stack segment. After allocating the required memory blocks for those three segments, the program code is copied to the code segment, all defined global variables are written to the data segment - they are 'initialised' - and ESP is set to the top of the stack segment. Finally, the address of the array with the command line parameters and the argument count are pushed onto the virgin stack before the session manager calls the &lt;i&gt;main()&lt;/i&gt; function of our program. Entering &lt;i&gt;main()&lt;/i&gt;, the processor starts to execute the code found there, until it stumbles upon the final &lt;i&gt;ret&lt;/i&gt; instruction and passes control back to the session manager.&lt;br /&gt;&lt;br /&gt;On entry to our program's main() function, the stack looks like this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8Ovd6SnBxI/AAAAAAAAAFw/BKBVYRLUuFk/stack4.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The top stack element holds the &lt;i&gt;argument vector&lt;/i&gt; argv[] (parameter 2), the next lower stack element holds the &lt;i&gt;argument count&lt;/i&gt; argc (parameter 1). The current stack element, the one ESP points to, holds the return address to the session manager. Whenever the program is terminated, the instruction pointer is loaded with this address and the terminating sequence of the session manager is executed. It first frees those resources our program may have reserved for itself, e.g. allocated memory blocks, open files, open devices, and so on. 
Finally, it releases the allocated segments and cleans up all structures holding control data of our program.&lt;br /&gt;&lt;br /&gt;In the running program, the content of the stack pointer is decreased with every &lt;i&gt;push&lt;/i&gt; instruction, the creation of a stack frame or the call to another function. It is increased whenever we &lt;i&gt;pop&lt;/i&gt; data from the stack, release a stack frame or &lt;i&gt;ret&lt;/i&gt;urn to a calling function. In 32 bit functions, four bytes are subtracted from ESP with every &lt;i&gt;push&lt;/i&gt; or &lt;i&gt;call&lt;/i&gt;, while four bytes are added to ESP with every &lt;i&gt;pop&lt;/i&gt; or &lt;i&gt;ret&lt;/i&gt;. All subtractions or additions are done by the processor automatically, because they are an integral part of the mentioned instructions. Put figuratively: The stack grows with every &lt;i&gt;push&lt;/i&gt; or &lt;i&gt;call&lt;/i&gt; and shrinks with every &lt;i&gt;pop&lt;/i&gt; or &lt;i&gt;ret&lt;/i&gt;.&lt;br /&gt;&lt;h3&gt;Using The Stack&lt;/h3&gt;To make sensible use of the stack, any &lt;i&gt;x86&lt;/i&gt; processor provides the instructions &lt;i&gt;call&lt;/i&gt;, &lt;i&gt;enter&lt;/i&gt;, &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt;, &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;ret&lt;/i&gt;. All these instructions update the content of ESP automatically. Besides these special instructions, ESP can be used like any other register, as well. We are free to add or subtract immediate values or the content of another register to/from the stack pointer and do other funny things with ESP. However, the most common operation probably is the subtraction of an immediate value from ESP to create a stack frame and the addition of exactly the same value to release that stack frame when we do not need it any longer. A detailed description follows later on, see &lt;i&gt;Stack Frames&lt;/i&gt;. For now, we focus our attention on some more basic things like the instructions mentioned above. 
To understand the concept of conventional programming methods, it is very important to know how these instructions work and how they manipulate the stack and ESP.&lt;br /&gt;&lt;h4&gt;Call&lt;/h4&gt;If the processor encounters a &lt;i&gt;call&lt;/i&gt; instruction, it first subtracts two, four or eight (depending on the standard datasize) from ESP, then stores the address of the instruction following the &lt;i&gt;call&lt;/i&gt; on the stack. Next, the address passed as a part of the &lt;i&gt;call&lt;/i&gt; instruction is loaded into the instruction pointer. Execution now continues with the instructions found at the new location, until the processor stumbles upon a &lt;i&gt;ret&lt;/i&gt;.&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;          call DoNothing  # call function DoNothing&lt;br /&gt;          xorl %eax,%eax  # &amp;lt;- the address of this instruction&lt;br /&gt;          ...             # is stored on the stack and loaded&lt;br /&gt;          ...             # into EIP with the RET instruction&lt;br /&gt;DoNothing:                # local function DoNothing&lt;br /&gt;          ret             # return to caller...&lt;br /&gt;&lt;/pre&gt;&lt;h4&gt;Enter&lt;/h4&gt;Please do not use this vector path instruction - it blocks the processor for a full 13 clock cycles. The usual replacement&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;pushl %ebp       # save EBP&lt;br /&gt;movl %esp,%ebp   # save ESP&lt;br /&gt;subl $0x10,%esp  # create stack frame&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;is executed in four clock cycles. Cutting 13 clock cycles down to four makes the replacement 3.25 times as fast (13 / 4 = 3.25), a speed improvement of 225 percent. You should prefer the replacement code over &lt;i&gt;enter&lt;/i&gt; under all circumstances! The 16 bytes are just a randomly chosen example to keep the sample code valid. The real size you have to subtract from ESP depends on the number of local variables and other temporary data your function has to store on the stack.  
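&lt;br /&gt;&lt;br /&gt;How such a size is worked out can be sketched as follows. This is only an illustration, not code from ST-Open: the function name &lt;i&gt;_sample&lt;/i&gt; and its two local variables (a 4 byte counter and a 10 byte buffer) are made up, and 32 bit code is assumed. 4 + 10 = 14 bytes, rounded up to the next multiple of four, gives the 16 bytes subtracted from ESP.&lt;br /&gt;&lt;pre&gt;_sample:                   # hypothetical function, 32 bit code&lt;br /&gt;pushl %ebp                 # save EBP&lt;br /&gt;movl %esp,%ebp             # save ESP&lt;br /&gt;subl $0x10,%esp            # 14 byte of locals, rounded up to 16&lt;br /&gt;movl $0x00,-0x04(%ebp)     # counter: -0x04(%ebp)...-0x01(%ebp)&lt;br /&gt;leal -0x10(%ebp),%eax      # EAX = address of the 10 byte buffer&lt;br /&gt;movb $0x00,(%eax)          # buffer: -0x10(%ebp)...-0x07(%ebp)&lt;br /&gt;movl -0x04(%ebp),%eax      # EAX = counter (return value 0)&lt;br /&gt;leave                      # release the stack frame&lt;br /&gt;ret                        # return to caller&lt;br /&gt;&lt;/pre&gt;The two bytes at -0x06(%ebp) and -0x05(%ebp) are padding introduced by the rounding.  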
&lt;br /&gt;&lt;h4&gt;Leave&lt;/h4&gt;The &lt;i&gt;leave&lt;/i&gt; instruction is equivalent to the following two instructions:&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;movl %ebp,%esp   # restore ESP (DP 1, 2 byte)&lt;br /&gt;popl %ebp        # restore EBP (VP 4, 1 byte)&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;Like &lt;i&gt;pop&lt;/i&gt;, &lt;i&gt;leave&lt;/i&gt; is a vector path instruction. Because &lt;i&gt;pop&lt;/i&gt; blocks the processor for four clock cycles and also has to wait for the valid result of the preceding &lt;i&gt;mov&lt;/i&gt; instruction, &lt;i&gt;leave&lt;/i&gt; actually is two clock cycles faster. Moreover, the one byte &lt;i&gt;leave&lt;/i&gt; is shorter than its three byte replacement. Hence, you should prefer &lt;i&gt;leave&lt;/i&gt; over the alternative method.  &lt;br /&gt;&lt;h4&gt;Pop&lt;/h4&gt;&lt;img src="http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8OvdQoL8LI/AAAAAAAAAFw/fi9qhf5hGrg/stack1.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Pop&lt;/i&gt; copies the content of the current stack element to a register or memory location, then adds, depending on the processor mode and an optional prefix, two, four or eight to the content of the stack pointer. Unlike the direct path &lt;i&gt;push&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; is a vector path instruction. It is executed in pipes 0 and 1, while pipe 2 is blocked for the time the instruction is processed. Every &lt;i&gt;pop&lt;/i&gt; instruction takes four clock cycles. &lt;i&gt;Pop&lt;/i&gt; is a special vector path &lt;i&gt;mov&lt;/i&gt; instruction, where ESP automatically is updated &lt;i&gt;after&lt;/i&gt; it was used to address the source of a copy operation.  In general, &lt;i&gt;pop&lt;/i&gt; is used to restore the content of a register. You should keep track of the stack pointer, because it is quite difficult to find errors caused by asymmetrically executed &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; instructions. 
This is especially true if some of them are inside a loop while others are not.&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;popl %ebx        # restore EBX [=&amp;gt; ESP + 4(!)]&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;h4&gt;Push&lt;/h4&gt;&lt;img src="http://lh5.ggpht.com/_Z2WbH3F-E_Q/S8OvdGDhGyI/AAAAAAAAAFw/vjfwM4Eb5lo/stack0.png" style="max-width: 800px;" /&gt;  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Push&lt;/i&gt; first subtracts, depending on the processor mode and an optional prefix, two, four or eight from the stack pointer, then copies an immediate value, the content of a register or the content of a memory location into the stack element the updated ESP points to. &lt;i&gt;Push&lt;/i&gt; is a special direct path &lt;i&gt;mov&lt;/i&gt; instruction, where ESP automatically is updated &lt;i&gt;before&lt;/i&gt; it is used to address the target of a copy operation. While the entire execution time of each &lt;i&gt;push&lt;/i&gt; instruction is 3 clock cycles, ESP is available one clock earlier (after 2 cycles) for the following instructions.&lt;br /&gt;&lt;br /&gt;In general, &lt;i&gt;push&lt;/i&gt; is used to put parameters or register contents onto the stack. It is a good idea to remove 'used' parameters from the stack as soon as possible, because they decrease the available stack size, cause avoidable stack pointer arithmetic, and so on. To remove them, we do not have to &lt;i&gt;pop&lt;/i&gt; them - beware! - from the stack. We just have to add the appropriate number of bytes to the stack pointer. If we &lt;i&gt;push&lt;/i&gt;ed two parameters in a 32 bit function as shown in the example below, we add 8 bytes to ESP. If it were eight parameters, we would have to add 8 * 4 = 32 bytes to ESP, and so on.&lt;br /&gt;&lt;br /&gt;Regardless of the lazy behavior of HLL (high level language) compilers, it is a good idea to correct the content of ESP after each &lt;i&gt;call&lt;/i&gt; instruction if you passed parameters to the called function. 
Humans have some trouble keeping track of the current content of the stack pointer, especially if the function body exceeds the size of their display. Compilers can do that much better. However, their output is less optimised than the code written by a human.&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;pushl %eax       # put parameter 2 onto the stack&lt;br /&gt;pushl %ebx       # put parameter 1 onto the stack&lt;br /&gt;call _helpling   # call another function&lt;br /&gt;addl $0x08,%esp  # correct ESP directly after CALL&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;h4&gt;Ret(urn)&lt;/h4&gt;Whenever the processor stumbles upon a &lt;i&gt;ret&lt;/i&gt; instruction, it copies the address stored in the current stack element into the instruction pointer EIP, then adds the standard datasize (2, 4 or 8 bytes) to the stack pointer. After ESP was updated, execution continues with the instruction found at the address EIP now points to.&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;ret              # return to caller&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;h3&gt;Stack Management&lt;/h3&gt;As mentioned in the descriptions of the individual instructions, keeping track of the current content of the stack pointer is an absolute &lt;i&gt;must&lt;/i&gt;. Because passing of parameters generally is done with &lt;i&gt;push&lt;/i&gt; instructions, while parameters are taken from the stack by adding the appropriate number of bytes to the stack pointer ESP, it is very important to handle these operations with exceptional care. The necessary corrections of the stack pointer are especially error-prone. Unfortunately, such errors cause unexpected crashes and malfunctions, and it is quite hard to track them down until the culprit causing the mess is found. This reaches a new dimension with the creation of a stack frame. 
So what is this sinister stack frame we keep mentioning?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Stack Frames&lt;/h1&gt;When the session manager passes control to our main() function, we surely create a message queue and a frame window, but we also might play some music or those mega-cool animations which are a must for all recent applications swimming along the main stream. Whatever programs might do when they get control over the computer, the first lines of any function, including &lt;i&gt;main()&lt;/i&gt;, always follow the same set routine. First: A stack frame is built. Second: All used registers are saved on the stack. Actually, ECX and EDX never are saved because the C-conventions say so (see &lt;i&gt;Conventions&lt;/i&gt;). Third: The code found in the function body performs all tasks the function is coded for. Fourth: All saved registers are restored. Fifth: The stack frame is destroyed (released). Sixth: The final &lt;i&gt;ret&lt;/i&gt; is executed and the function returns to its caller.  One thing should be mentioned explicitly: We only need a stack frame if we have to store local variables or other temporary data structures on the stack. Functions without local variables do not need a stack frame at all. Building a stack frame takes at least 6 clock cycles and occupies 10 bytes of code. To save superfluous activities, you should switch on the &lt;b&gt;-fomit-frame-pointer&lt;/b&gt; option (&lt;i&gt;GCC&lt;/i&gt;) by default. It skips building stack frames where they are not required.  Grasping how a base pointer is used and how it works is the key to understanding conventional programming techniques. Because this is a very important issue, all explanations are very detailed. Some may find them too lengthy, but we should give others the chance to get in touch with all aspects, so (hopefully) anyone is able to gather the knowledge required to apply the learned stuff in real life.  
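&lt;br /&gt;&lt;br /&gt;A function without local variables can address its parameters relative to ESP and skip the stack frame entirely. The following is a minimal sketch under that assumption, not code taken from a real program: &lt;i&gt;_add2&lt;/i&gt; is a made-up name, and the 32 bit C-conventions (parameters on the stack, result in EAX) are assumed. On entry, the return address sits at 0x00(%esp), so parameter 1 is found at 0x04(%esp) and parameter 2 at 0x08(%esp).&lt;br /&gt;&lt;pre&gt;_add2:                     # hypothetical function, no stack frame&lt;br /&gt;movl 0x04(%esp),%eax       # EAX = parameter 1&lt;br /&gt;addl 0x08(%esp),%eax       # EAX = parameter 1 + parameter 2&lt;br /&gt;ret                        # return to caller, result in EAX&lt;br /&gt;&lt;/pre&gt;Because ESP is not moved inside the function body, these offsets stay valid until the final &lt;i&gt;ret&lt;/i&gt;. Every &lt;i&gt;push&lt;/i&gt; in the body would shift them by four bytes.  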
&lt;br /&gt;&lt;h2&gt;Example&lt;/h2&gt;The following example shows how conventional stack frames are created. It is trivial code used in every program. Even huge monster applications like operating systems, browsers, audio studios, et cetera use the same pattern. They only differ in the size of their stack frames.&lt;br /&gt;&lt;pre&gt;pushl %ebp               # save old base pointer&lt;br /&gt;movl %esp,%ebp           # load new base pointer&lt;br /&gt;subl $0x08,%esp          # reserve 2 local variables&lt;br /&gt;pushl %ebx               # save EBX&lt;br /&gt;movl 0x08(%ebp),%eax     # EAX = argument count&lt;br /&gt;movl 0x0C(%ebp),%ebx     # EBX = argument vector&lt;br /&gt;movl $0x00,-0x04(%ebp)   # local variable 1 =  0&lt;br /&gt;movl $0x20,-0x08(%ebp)   # local variable 2 = 32&lt;br /&gt;pushl %eax               # copy EAX to -0x10(%ebp)&lt;br /&gt;pushl %ebx               # copy EBX to -0x14(%ebp)&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;h3&gt;The Stack&lt;/h3&gt;After execution of the above instructions, the stack looks like this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ovd2D9lgI/AAAAAAAAAFw/l56NA8zCofU/stack2.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;ESP always points to the current stack element. In our case, this is the address where the content of EBX was copied to. Depending on the instructions in the function body, ESP moves down towards stack bottom or back towards stack top. In properly designed functions, the content of ESP never can be greater than or equal to the content of EBP. This tells us that the base pointer not only is used to address stack elements. It also marks the border between the current stack frame and the stack frame of the calling function.  &lt;br /&gt;&lt;h4&gt;pushl %ebp&lt;/h4&gt;The first step to create a new stack frame is saving the content of the old base pointer of the calling function on the stack. 
The content of the old base pointer &lt;i&gt;must not&lt;/i&gt; be changed under any circumstances! Otherwise, the calling function uses an invalid basis to address its local variables. It will crash if it tries to read or write data to an address stored in a local variable, because that address, formerly stored at -0x0C[EBP], now might be found at 0x3C[EBP]. At the latest, the calling function commits suicide while trying to return to its caller. It fills the base pointer with random data, then loads other random data into the instruction pointer. The attempt to execute that 'code' raises one of those exceptions and an 'access violation' is reported on the user's screen.  &lt;br /&gt;&lt;h4&gt;movl %esp,%ebp&lt;/h4&gt;In the second step, we copy the address of the current stack element to EBP. The new base pointer now contains the address, where the content of the old base pointer is stored. The next stack element below the new base pointer is the topmost stack element we are allowed to write to. All stack elements above EBP - including 0x00[EBP]! - only should be used to read from, but never to write to! From here on, we use the base pointer to address local variables and parameters passed to our function. In other words, these data are addressed with offsets to the 'frozen' base pointer.  As shown in &lt;i&gt;figure 03&lt;/i&gt;, the old base pointer is stored at offset zero to the current base pointer. We write this as 0x00[EBP], where the &lt;b&gt;0x&lt;/b&gt; tells us this is a hexadecimal number. Everything related to programming should be written in hexadecimal notation. It is much easier to read, numbers are shorter and always formatted correctly. For example, 0xF100 tells us at first sight: 'It's the second page in memory block 0xF000.' Its decimal equivalent 61696 does not tell us anything. We have to start a calculator to translate it into 0xF100, wasting precious time.  Back to the base pointer. 
Passed parameters and the return address to the calling function are stored in stack elements above the one the base pointer points to. Therefore, they are addressed with positive offsets. Our return address is stored at 0x04[EBP], followed by one or more optional parameters, where parameter 1 always is stored at 0x08[EBP], parameter 2 at 0x0C[EBP], and so on. The 'top down' order is caused by a C-convention, saying that all parameters are &lt;i&gt;push&lt;/i&gt;ed onto the stack back to front, starting with the last parameter the compiler finds inside those round brackets following the name of the called function.  &lt;br /&gt;&lt;h4&gt;subl $0x08,%esp&lt;/h4&gt;The real creation of a stack frame is done with the third and last step. First, we calculate the required size for all variables and structures, then we subtract the result from ESP. Of course, the result must be rounded up to the next multiple of the standard datasize before we subtract it from the stack pointer. If we subtracted an 'odd' number, the stack would become corrupted and would not be valid any longer. Everything addressed via ESP, e.g. by &lt;i&gt;call&lt;/i&gt; or &lt;i&gt;push&lt;/i&gt;, would be misaligned, causing a lot of penalty cycles. If ESP accidentally were set to 0xFFE2 instead of 0xFFE4 (e.g. by subtracting 0x0A rather than the proper value 0x08) in our sample code, and we then &lt;i&gt;push&lt;/i&gt;ed two local variables 0x89ABCDEF and 0x01234567 onto the stack and copied variable 2 from -0x08[EBP] to EBX, EBX would finally contain the invalid number 0xCDEF0123 - the doubleword currently stored at address 0xFFE4.  With this subtraction, we reserve the subtracted number of bytes on the stack. This reserved area is safe from being overwritten by any following &lt;i&gt;call&lt;/i&gt; or &lt;i&gt;push&lt;/i&gt; instruction, because ESP always is set to the next lower stack element before a write operation. 
Because all local variables lie below the stack element where the old base pointer is stored, they are addressed with negative offsets. The first is stored at address -0x04[EBP], the second at -0x08[EBP], and so on. It matters neither how you name your local variables nor whether you count them from top to bottom or vice versa. The only important thing is to remember which of them is stored where.  &lt;br /&gt;&lt;h4&gt;pushl %ebx&lt;/h4&gt;Saves the content of EBX on the stack. Following the C-conventions, the content of EBX, EDI and ESI must be saved before overwriting them and must be restored before returning to the calling function. In other words: The content of these registers must be preserved by all functions (see &lt;i&gt;Conventions&lt;/i&gt;).  &lt;br /&gt;&lt;h5&gt;Urgent Needs&lt;/h5&gt;In conventional code, the content of EBP must never be changed. If you cannot avoid using EBP for general purposes, because you urgently need an extra register, you have to save it before using it. You should restore EBP immediately after passing the bottleneck. It hardly needs mentioning, but nevertheless: EBP cannot be used to address local variables while it holds your private data...  &lt;br /&gt;&lt;h5&gt;The Function Body&lt;/h5&gt;ESP always points to the current stack element. Looking at our example, it is quite obvious that we are going to call another function. Our local variable 1 might be a counter which is incremented each time the called function returns TRUE. Local variable 2 might be a loop counter, so the function might be called 32 times, counting how many times a specific condition was met.  &lt;br /&gt;&lt;h4&gt;Stack Frame Destruction&lt;/h4&gt;When all instructions in the function body are executed, we first have to restore the saved registers. Next, we have to destroy the stack frame to release the area we reserved for our private use. Finally, we return to the calling function. 
There are two possible ways to perform these tasks:&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;addl $0x08,%esp   # correction after 2 PUSH instructions&lt;br /&gt;popl %ebx         # restore EBX&lt;br /&gt;leave             # destroy stack frame (VP 3, 1 byte)&lt;br /&gt;ret               # return to caller&lt;br /&gt;&lt;/pre&gt;or &lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;addl $0x08,%esp   # correction after 2 PUSH instructions&lt;br /&gt;popl %ebx         # restore EBX&lt;br /&gt;movl %ebp,%esp    # restore ESP (DP 1, 2 byte)&lt;br /&gt;popl %ebp         # restore EBP (VP 4, 1 byte)&lt;br /&gt;ret               # return to caller&lt;br /&gt;&lt;/pre&gt;Both variants restore ESP and EBP. However, &lt;i&gt;leave&lt;/i&gt; is two clock cycles faster and two bytes smaller. The correction of ESP is required to restore the preserved content of EBX. Because we &lt;i&gt;push&lt;/i&gt;ed two parameters onto the stack after we &lt;i&gt;push&lt;/i&gt;ed EBX, ESP is eight bytes below the address where the content of EBX is stored. To POP the proper content into EBX, we have to add these eight bytes to ESP.  If no registers were saved on the stack, no correction of ESP is required and &lt;i&gt;leave&lt;/i&gt; and &lt;i&gt;ret&lt;/i&gt; are the only instructions we need to restore ESP and EBP before we return to the caller.  &lt;br /&gt;&lt;h4&gt;Return To Caller&lt;/h4&gt;In our example, &lt;i&gt;ret&lt;/i&gt; copies the return address, which leads back to the session manager of the operating system, into EIP, and execution continues with the session manager's code. The session manager closes our program, frees still allocated resources and passes the content we stored in EAX to the command interpreter.  &lt;br /&gt;&lt;h3&gt;What Did We Learn?&lt;/h3&gt;All introduced mechanisms are, except for the size of the stack frame and the number of preserved registers, identical for all functions following the C-conventions. As far as I know, all of the known operating systems follow these conventions. The advantages are obvious. 
Once coded functions are portable from one version to the next or from one operating system to another. In the latter case, minor changes must be applied if functions of the target OS have other names or await parameters in different order. Putting it all together, conventions simplify re-use and porting of existing code.  But - every object in our universe has at least two (or more) sides. The disadvantages of the introduced methods, mechanisms and conventions are not as obvious as the advantages, because they are hidden in the deepest and darkest parts of the machine room. Only experienced, well trained technicians are able to find out why the machine chokes and does not run as fast as expected. Time to analyse what C-conventions do on assembly language level.   &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Caveats&lt;/h1&gt;As mentioned before, conventions, methods and techniques introduced with the programming language C have a lot of advantages over monolithic code we used to write applications for DOS and other ancient operating systems. All functions easily are portable to other operating systems or processor architectures and can be used multiple times, leading to versatile applications for multiple platforms. Unfortunately, after the C conventions were established, they never were updated or revised to keep pace with the development of processor architectures. If we compare an 8086 against a recent quad-core Athlon, a picturesque comparison were a cart dragged by a tired ox versus an Airbus A380. While no person with sane mind ever wasted one thought about equipping an A380 with a harness and motivate it to lift off with reins, whip and loud 'hee' and 'ho' shouts, software designers all over the world practise such weird things every day. 
They still create 'flying' stack frames, use slow &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; and &lt;i&gt;push&lt;/i&gt; instructions with the obligatory update of the stack pointer instead of the faster &lt;i&gt;mov&lt;/i&gt; and continue to abuse valuable resources by using EBP as base pointer. This reduces the too small register set of the x86 architecture by 1/7th, forcing the programmer to use slow memory reads and writes instead of much faster register operations.&lt;br /&gt;&lt;br /&gt;To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. &lt;i&gt;DLGtxt()&lt;/i&gt; sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.&lt;br /&gt;&lt;pre&gt;.text&lt;br /&gt;        .p2align 4,,15&lt;br /&gt;        .globl   MoveDlg&lt;br /&gt;MoveDlg:pushl    %ebp&lt;br /&gt;        movl     %esp, %ebp&lt;br /&gt;        pushl    %esi&lt;br /&gt;        pushl    %ebx&lt;br /&gt;        movl     12(%ebp), %edx&lt;br /&gt;        movl     8(%ebp), %esi&lt;br /&gt;        movl     16(%ebp), %ecx&lt;br /&gt;        movl     20(%ebp), %ebx&lt;br /&gt;        cmpl     $48, %edx&lt;br /&gt;        je       L12&lt;br /&gt;        ja       L39&lt;br /&gt;        cmpl     $32, %edx&lt;br /&gt;        je       L4&lt;br /&gt;&lt;br /&gt;    L37:movl     %ebx, 20(%ebp)&lt;br /&gt;        movl     %ecx, 16(%ebp)&lt;br /&gt;        movl     %edx, 12(%ebp)&lt;br /&gt;        movl     %esi, 8(%ebp)&lt;br /&gt;        leal     -8(%ebp), %esp&lt;br /&gt;        popl     %ebx&lt;br /&gt;        popl     %esi&lt;br /&gt;        popl     %ebp&lt;br /&gt;        jmp      _DefDP&lt;br /&gt;&lt;/pre&gt;Compared against 
GCC 2.6.1. (1993), GCC 3.3.5. (2006) generates much worse code, spiced with a lot of counterproductive, superfluous instructions. Writing back parameters to the stack violates some basic rules. First, there's absolutely no need to write data back to the locations we took them from a few instructions before. Secondly, writes to stack locations above EBP violate all rules of proper programming. If any function starts to write data to the stack frames of other functions, it is just a question of time until we encounter disastrous malfunctions.&lt;br /&gt;&lt;pre&gt;.p2align 4,,7&lt;br /&gt;     L4:movl     %ecx, %eax&lt;br /&gt;        andl     $65535, %eax&lt;br /&gt;        cmpl     $4658,  %eax&lt;br /&gt;        je       L7&lt;br /&gt;        jg       L11&lt;br /&gt;        cmpl     $4657, %eax&lt;br /&gt;        je       L6&lt;br /&gt;&lt;/pre&gt;Because we only need the low word stored in message parameter 1, it would be a better idea to extract this low word with the instruction &lt;i&gt;movzwl 0x10(%ebp),%ecx&lt;/i&gt; rather than to waste valuable clock cycles with the code sequence shown above. Recent processors have three execution pipes, not just one. The chosen way sends two execution pipes to sleep while one pipe is busy extracting data from a register. This is repeated two times. We could switch off two execution pipes while this code is executed, because we created two avoidable dependencies.&lt;br /&gt;&lt;pre&gt;     L9:movl     %ebx, 20(%ebp)&lt;br /&gt;        movl     %ecx, 16(%ebp)&lt;br /&gt;        movl     %edx, 12(%ebp)&lt;br /&gt;        movl     %esi, 8(%ebp)&lt;br /&gt;        leal     -8(%ebp), %esp&lt;br /&gt;        popl     %ebx&lt;br /&gt;        popl     %esi&lt;br /&gt;        popl     %ebp&lt;br /&gt;        jmp      _DefDP&lt;br /&gt;&lt;/pre&gt;Obviously, L37 and L9 provide identical code. 
Using our brain, these eighteen redundant (potentially pointless) lines could be reduced to five really necessary instructions.&lt;br /&gt;&lt;pre&gt;L6:movl     _GVAR, %eax        # [1]&lt;br /&gt;        subl     $12, %esp&lt;br /&gt;        movl     $0, 10464(%eax)&lt;br /&gt;&lt;br /&gt;    L42:pushl    %esi&lt;br /&gt;        call     _WinDD&lt;br /&gt;&lt;br /&gt;    L41:addl     $16, %esp&lt;br /&gt;        .p2align 4,,7&lt;br /&gt;&lt;/pre&gt;Up to seven &lt;i&gt;nop&lt;/i&gt;s are executed every time we branch to this part of code. It is a good way to slow down execution flow as much as possible.&lt;br /&gt;&lt;pre&gt;L2:leal     -8(%ebp), %esp&lt;br /&gt;        xorl     %eax, %eax&lt;br /&gt;        popl     %ebx&lt;br /&gt;        popl     %esi&lt;br /&gt;        popl     %ebp&lt;br /&gt;        ret&lt;br /&gt;&lt;br /&gt;    L11:cmpl     $4659, %eax&lt;br /&gt;        jne      L9&lt;br /&gt;        subl     $12, %esp&lt;br /&gt;        pushl    $17&lt;br /&gt;        call     _Help&lt;br /&gt;        jmp      L41&lt;br /&gt;        &lt;br /&gt;     L7:pushl    %edx&lt;br /&gt;        pushl    %edx&lt;br /&gt;        pushl    $18&lt;br /&gt;        movl     _GVAR, %eax        # [1]&lt;br /&gt;        addl     $10464, %eax&lt;br /&gt;        pushl    %eax&lt;br /&gt;        call     _FlgS&lt;br /&gt;        popl     %eax&lt;br /&gt;        jmp      L42&lt;br /&gt;&lt;/pre&gt;The branch prediction logic assumes every first branch as &lt;i&gt;false&lt;/i&gt; if there is no entry in its internal table. 
Almost all of the above branches trigger the obligatory ten penalty cycles, because the wrong branch instructions were chosen.&lt;br /&gt;&lt;pre&gt;.p2align 4,,7&lt;br /&gt;    L39:cmpl    $59, %edx&lt;br /&gt;        jne      L37&lt;br /&gt;        pushl    $233&lt;br /&gt;        pushl    $211&lt;br /&gt;        pushl    $210&lt;br /&gt;        pushl    %esi&lt;br /&gt;        call     _DLGtxt&lt;br /&gt;        movl     $0, (%esp)&lt;br /&gt;        pushl    $-1&lt;br /&gt;        pushl    $288&lt;br /&gt;        pushl    $4672&lt;br /&gt;        pushl    %esi&lt;br /&gt;        call     _SnDIM&lt;br /&gt;        addl     $20, %esp&lt;br /&gt;        pushl    $0&lt;br /&gt;        pushl    $-1&lt;br /&gt;        pushl    $288&lt;br /&gt;        pushl    $4680&lt;br /&gt;        pushl    %esi&lt;br /&gt;        call     _SnDIM&lt;br /&gt;        addl     $20, %esp&lt;br /&gt;        pushl    $0&lt;br /&gt;        pushl    $-1&lt;br /&gt;        pushl    $288&lt;br /&gt;        pushl    $4688&lt;br /&gt;        pushl    %esi&lt;br /&gt;        call     _SnDIM&lt;br /&gt;        addl     $20, %esp&lt;br /&gt;        pushl    $0&lt;br /&gt;        pushl    $-1&lt;br /&gt;        pushl    $288&lt;br /&gt;        pushl    $4696&lt;br /&gt;        pushl    %esi&lt;br /&gt;        call     _SnDIM&lt;br /&gt;        addl     $24, %esp&lt;br /&gt;&lt;/pre&gt;Only the second parameter changes for the four consecutive calls of SnDIM(). 
Twelve of these twenty &lt;i&gt;push&lt;/i&gt; instructions (60 percent) are redundant.&lt;br /&gt;&lt;pre&gt;pushl    %esi&lt;br /&gt;        pushl    $0&lt;br /&gt;        pushl    $2&lt;br /&gt;        pushl    $0&lt;br /&gt;        pushl    $11&lt;br /&gt;        movl     _GVAR, %eax        # [1]&lt;br /&gt;        movl     7252(%eax), %eax&lt;br /&gt;        pushl    %eax&lt;br /&gt;        call     _FDacc&lt;br /&gt;        movl     _GVAR, %eax        # [1]&lt;br /&gt;        addl     $24, %esp&lt;br /&gt;        movl     472(%eax), %ebx&lt;br /&gt;        pushl    %ebx&lt;br /&gt;        pushl    $0&lt;br /&gt;        pushl    $2&lt;br /&gt;        pushl    $4&lt;br /&gt;        pushl    $11&lt;br /&gt;        movl     7252(%eax), %ecx&lt;br /&gt;        pushl    %ecx&lt;br /&gt;        call     _FDacc&lt;br /&gt;&lt;/pre&gt;Only parameters 3 and 6 differ between the two calls of FDacc(). As long as we are bound to &lt;i&gt;push&lt;/i&gt; instructions, there is no way to change just these parameters. We have to push all six parameters again, because the parameter on top must be changed.&lt;br /&gt;&lt;pre&gt;addl     $20, %esp&lt;br /&gt;        movl     _GVAR, %eax        # [1]&lt;br /&gt;        movl     $0, 10464(%eax)&lt;br /&gt;        pushl    %esi&lt;br /&gt;        call     _CtrWn&lt;br /&gt;        movl     %esi, (%esp)&lt;br /&gt;        call     _DlgShow&lt;br /&gt;        jmp      L41&lt;br /&gt;&lt;br /&gt;        .p2align 4,,7&lt;br /&gt;    L12:movl     %ecx, %eax&lt;br /&gt;        andl     $65535, %eax&lt;br /&gt;        subl     $4672, %eax&lt;br /&gt;        cmpl     $30, %eax&lt;br /&gt;        ja       L37&lt;br /&gt;        jmp      *L36(,%eax,4)&lt;br /&gt;&lt;/pre&gt;Again, it would have been better to extract the low word via one &lt;i&gt;movzwl 0x10(%ebp),%ecx&lt;/i&gt; rather than to use the chosen way. 
Probably, parallel execution was considered to be too fast?&lt;br /&gt;&lt;pre&gt;.p2align 2&lt;br /&gt;        .align   2,0xcc&lt;br /&gt;&lt;/pre&gt;I don't know what the second &lt;i&gt;.align&lt;/i&gt; is good for. Any hints? My jump table is too large, because C programmers do not think about side effects like blowing up code while assigning resource IDs to 'straight' numbers. Due to the gaps between those IDs, there are a lot of superfluous entries in this jump table.&lt;br /&gt;&lt;pre&gt;L36:.long    L14&lt;br /&gt;        .long    L15&lt;br /&gt;        .long    L16&lt;br /&gt;        .long    L17&lt;br /&gt;        .long    L2&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L18&lt;br /&gt;        .long    L19&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L21&lt;br /&gt;        .long    L23&lt;br /&gt;        .long    L25&lt;br /&gt;        .long    L27&lt;br /&gt;        .long    L29&lt;br /&gt;        .long    L31&lt;br /&gt;        .long    L33&lt;br /&gt;        .long    L37&lt;br /&gt;        .long    L21&lt;br /&gt;        .long    L23&lt;br /&gt;        .long    L25&lt;br /&gt;        .long    L27&lt;br /&gt;        .long    L29&lt;br /&gt;        .long    L31&lt;br /&gt;        .long    L33&lt;br /&gt;&lt;/pre&gt;By the way: Jump tables belong to the &lt;i&gt;.data&lt;/i&gt;, not to the &lt;i&gt;.code&lt;/i&gt; segment. 
AS supports jump tables in the &lt;i&gt;.data&lt;/i&gt; segment, so there's no need to violate the recommendations of AMD and LETNi as all versions of GCC do...&lt;br /&gt;&lt;pre&gt;L14:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andb     $225, %ah&lt;br /&gt;        orb      $16, %ah&lt;br /&gt;        &lt;br /&gt;        .p2align 4,,7&lt;br /&gt;    L40:movl     %eax, 10464(%edx)&lt;br /&gt;        jmp      L2&lt;br /&gt;        &lt;br /&gt;    L15:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andb     $225, %ah&lt;br /&gt;        orb      $8, %ah&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L16:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andb     $225, %ah&lt;br /&gt;        orb      $4, %ah&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L17:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andb     $225, %ah&lt;br /&gt;        orb      $2, %ah&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L18:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-385, %eax&lt;br /&gt;        orb      $1, %ah&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L19:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-385, %eax&lt;br /&gt;        orb      $-128, %al&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L21:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-128, %eax&lt;br /&gt;        orl      $64, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L23:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br 
/&gt;        andl     $-384, %eax&lt;br /&gt;        orl      $32, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L25:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-384, %eax&lt;br /&gt;        orl      $16, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L27:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-384, %eax&lt;br /&gt;        orl      $8, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L29:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-384, %eax&lt;br /&gt;        orl      $4, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;&lt;br /&gt;    L31:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-384, %eax&lt;br /&gt;        orl      $2, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;        &lt;br /&gt;    L33:movl     _GVAR, %edx        # [1], [2]&lt;br /&gt;        movl     10464(%edx), %eax&lt;br /&gt;        andl     $-384, %eax&lt;br /&gt;        orl      $1, %eax&lt;br /&gt;        jmp      L40&lt;br /&gt;&lt;/pre&gt;Many redundant instructions unnecessarily blow up code size. 
The first two lines of all jump targets could be reduced to a common read before the distributor branches to the selected table entry.&lt;br /&gt;&lt;pre&gt;.comm   _hab,       16&lt;br /&gt;         .comm   _DEBUG,     16&lt;br /&gt;         .comm   _USE_LDF,   16&lt;br /&gt;         .comm   _LDR_AVAIL, 16&lt;br /&gt;         .comm   _MSGLD,     16&lt;br /&gt;         .comm   _BMM,       16&lt;br /&gt;         .comm   _BNR,       16&lt;br /&gt;         .comm   _GVAR,      16&lt;br /&gt;         .comm   _BST,       16&lt;br /&gt;         .comm   _BBF,       16&lt;br /&gt;         .comm   _TST,       16&lt;br /&gt;         .comm   _MHSTR,     16&lt;br /&gt;         .comm   _LDF,       16&lt;br /&gt;         .comm   _DUMPLINE,  16&lt;br /&gt;         .comm   _DUMPCNT,   16&lt;br /&gt;         .comm   _OLH_MODE,  16&lt;br /&gt;         .comm   _SEC,       16&lt;br /&gt;         .comm   _XXX,       16&lt;br /&gt;         .comm   _FLD_XXX,   16&lt;br /&gt;         .comm   _FLD_SEC,   16&lt;br /&gt;&lt;/pre&gt;We only use _GVAR in this file - enumerating all global variables is quite stupid. All globals are defined as doublewords (32 bit), so we might ask why GCC expands all these variables to a size of 128 bit, filling up the &lt;i&gt;.data&lt;/i&gt; segment with garbage.  &lt;br /&gt;&lt;h2&gt;Footnotes&lt;/h2&gt;&lt;b&gt;[1]&lt;/b&gt; This is the typical side effect of the C convention saying &lt;i&gt;'Thou shalt not save ECX and EDX'&lt;/i&gt;. Using two of six registers as a sanitary fill for temporary data, we reduce the set of safe registers to three, forcing us to reload frequently required parameters from memory over and over again. Reloading parameters more than two times eats up the advantage of omitting the two &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; instructions to save and restore the content of ECX and EDX. After the third reload operation, we are on the losing side. 
Adding unnecessary loads of parameters to our code slows our functions down rather than speeding them up.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;[2]&lt;/b&gt; Loading _GVAR into EDX and the flags MVP_FLAGS into EAX &lt;i&gt;could&lt;/i&gt; be done in L12. Applying some brain didn't just save 24 lines of code, it also sped up the 13 target functions, because loading and evaluating MVP_FLAGS were separated and the dependency chain was reduced to two single dependencies rather than three consecutive ones.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Improvements&lt;/h1&gt;Obviously, the code generated by GCC 3.3.5 is anything but optimised. Even if you do not know anything about reading source code, you surely are able to grasp what those comments say. This document is not an introduction to programming, so you have to rely on my words, but you can be sure: I definitely know what I am talking about.  Rearranging some parts and reducing code sequences to the instructions that are really required shrinks GCC's draft markedly. 
Applying some human brain, the remaining (optimised) code should run about 30 percent faster now.&lt;br /&gt;&lt;pre&gt;.data&lt;br /&gt;         .p2align 4,0x00&lt;br /&gt;     jt0:.long    L02&lt;br /&gt;         .long    L03&lt;br /&gt;         .long    L04&lt;br /&gt;         .long    L05&lt;br /&gt;         .long    L16&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L06&lt;br /&gt;         .long    L07&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L08&lt;br /&gt;         .long    L09&lt;br /&gt;         .long    L10&lt;br /&gt;         .long    L11&lt;br /&gt;         .long    L12&lt;br /&gt;         .long    L13&lt;br /&gt;         .long    L14&lt;br /&gt;         .long    L15&lt;br /&gt;         .long    L08&lt;br /&gt;         .long    L09&lt;br /&gt;         .long    L10&lt;br /&gt;         .long    L11&lt;br /&gt;         .long    L12&lt;br /&gt;         .long    L13&lt;br /&gt;         .long    L14&lt;br /&gt;&lt;/pre&gt;Following recommendations of AMD and LETNi, the jump table is moved to the .data segment. 
This is much better than mixing code and data in the .code segment.&lt;br /&gt;&lt;pre&gt;.text&lt;br /&gt;         .p2align 4,,7&lt;br /&gt;         .globl   MoveDlg&lt;br /&gt; MoveDlg:pushl    %ebp&lt;br /&gt;         movl     %esp,%ebp&lt;br /&gt;         pushl    %edi&lt;br /&gt;         pushl    %esi&lt;br /&gt;         movl     0x08(%ebp),%edi&lt;br /&gt;         movl     0x0C(%ebp),%eax       # message&lt;br /&gt;         movzwl   0x10(%ebp),%ecx       # low word of message parameter 1&lt;br /&gt;         movl     _GVAR,%esi&lt;br /&gt;         cmpl     $0x30,%eax            # WM_CONTROL&lt;br /&gt;         je       L01&lt;br /&gt;         cmpl     $0x20,%eax            # WM_COMMAND&lt;br /&gt;         je       L00&lt;br /&gt;         cmpl     $0x3B,%eax            # WM_INITDLG&lt;br /&gt;         jne      L15&lt;br /&gt;&lt;/pre&gt;The distributor was optimised for the branch prediction logic. WM_CONTROL was put on top of the distributor, because most sent messages are WM_CONTROL messages. WM_COMMAND is only sent if the user pushes a button. No user is able to notice delays of about 5 ns, so we can live with a ten-cycle penalty if a branch target is mispredicted. WM_INITDLG is sent only once. Because the 1st comparison is 'guessed' as not taken, the branch does not trigger a penalty. 
The 2nd and all following comparisons are assumed to be taken, so the branch to the default routine (DefDP()) does not trigger penalties, as well.&lt;br /&gt;&lt;pre&gt;pushl    $0xE9&lt;br /&gt;         pushl    $0xD3&lt;br /&gt;         pushl    $0xD2&lt;br /&gt;         pushl    %edi&lt;br /&gt;         call     _DLGtxt&lt;br /&gt;         pushl    $0x00&lt;br /&gt;         pushl    $-0x01&lt;br /&gt;         pushl    $0x0120&lt;br /&gt;         pushl    $0x1240&lt;br /&gt;         pushl    %edi&lt;br /&gt;         call     _SnDIM&lt;br /&gt;         addl     $0x08,%esp&lt;br /&gt;         pushl    $0x1248&lt;br /&gt;         pushl    %edi&lt;br /&gt;         call     _SnDIM&lt;br /&gt;         addl     $0x08,%esp&lt;br /&gt;         pushl    $0x1250&lt;br /&gt;         pushl    %edi&lt;br /&gt;         call     _SnDIM&lt;br /&gt;         addl     $0x08,%esp&lt;br /&gt;         pushl    $0x1258&lt;br /&gt;         pushl    %edi&lt;br /&gt;         call     _SnDIM&lt;br /&gt;&lt;/pre&gt;The three parameters on top are pushed for the first call, only. This saves nine redundant instructions.&lt;br /&gt;&lt;pre&gt;addl     $0x14,%esp&lt;br /&gt;         movl     0x1C54(%esi),%ecx&lt;br /&gt;         movl     0x01D8(%esi),%edx&lt;br /&gt;&lt;/pre&gt;Both parameters can be preloaded at this point, because FDacc() is a function taken from ST-Open's library. Functions in my libraries restore all registers (including ECX and EDX) by default - they are 'clean'. But - watch out: MoveDlg() is a function following the C conventions. 
ECX and EDX are neither saved nor restored - MoveDlg() is a 'dirty' function.&lt;br /&gt;&lt;pre&gt;pushl    %edi&lt;br /&gt;         pushl    $0x00&lt;br /&gt;         pushl    $0x02&lt;br /&gt;         pushl    $0x00&lt;br /&gt;         pushl    $0x0B&lt;br /&gt;         pushl    %ecx&lt;br /&gt;         call     _FDacc&lt;br /&gt;         pushl    %edx&lt;br /&gt;         pushl    $0x00&lt;br /&gt;         pushl    $0x02&lt;br /&gt;         pushl    $0x04&lt;br /&gt;         pushl    $0x0B&lt;br /&gt;         pushl    %ecx&lt;br /&gt;         call     _FDacc&lt;br /&gt;         addl     $0x40,%esp            # drop the 16 dwords still on the stack (4 + 6 + 6)&lt;br /&gt;         movl     $0x00,0x28E0(%esi)&lt;br /&gt;         pushl    %edi&lt;br /&gt;         call     _CtrWn&lt;br /&gt;         call     _DlgShow&lt;br /&gt;         jmp      3f&lt;br /&gt;&lt;/pre&gt;The next distributor was optimised for code reduction. Because none of the three buttons is pushed more than once (in general), we can live with a ten-cycle penalty for one or two mispredicted branches. 
The delay added by the penalty is at least six orders of magnitude shorter than anything human senses could perceive.&lt;br /&gt;&lt;pre&gt;L00:subl     $0x1231,%ecx&lt;br /&gt;         je       0f                    # ID 0x1231&lt;br /&gt;         decl     %ecx&lt;br /&gt;         je       1f                    # ID 0x1232&lt;br /&gt;         decl     %ecx&lt;br /&gt;         jne      L15                   # anything but ID 0x1233&lt;br /&gt;         pushl    $0x11&lt;br /&gt;         call     _Help&lt;br /&gt;         jmp      3f&lt;br /&gt;       0:movl     $0x00,0x28E0(%esi)&lt;br /&gt;         jmp      2f&lt;br /&gt;       1:orl      $0x00040000,0x28E0(%esi)&lt;br /&gt;       2:pushl    %edi&lt;br /&gt;         call     _WinDD&lt;br /&gt;       3:addl     $0x04,%esp&lt;br /&gt;         jmp      L16&lt;br /&gt;&lt;br /&gt;     L01:subl     $0x1240,%ecx&lt;br /&gt;         js       L15&lt;br /&gt;         cmpl     $0x1E,%ecx&lt;br /&gt;         ja       L15&lt;br /&gt;         movl     0x28E0(%esi),%eax&lt;br /&gt;         jmp      *jt0(,%ecx,4)&lt;br /&gt;&lt;/pre&gt;The jump table was moved to the .data segment. To keep an overview, the symbols of your jump tables should generally be marked with special names. 
ST-Open uses the symbol 'jtX', where X is the number of the current jump table.&lt;br /&gt;&lt;pre&gt;L02:andl     $0xFFFFE1FF,%eax&lt;br /&gt;         orl      $0x1000,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L03:andl     $0xFFFFE1FF,%eax&lt;br /&gt;         orl      $0x0800,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L04:andl     $0xFFFFE1FF,%eax&lt;br /&gt;         orl      $0x0400,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L05:andl     $0xFFFFE1FF,%eax&lt;br /&gt;         orl      $0x0200,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L06:andl     $0xFFFFFE7F,%eax&lt;br /&gt;         orl      $0x0100,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L07:andl     $0xFFFFFE7F,%eax&lt;br /&gt;         orl      $0x80,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L08:andl     $0xFFFFFF80,%eax&lt;br /&gt;         orl      $0x40,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L09:andl     $0xFFFFFE80,%eax&lt;br /&gt;         orl      $0x20,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L10:andl     $0xFFFFFE80,%eax&lt;br /&gt;         orl      $0x10,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L11:andl     $0xFFFFFE80,%eax&lt;br /&gt;         orl      $0x08,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L12:andl     $0xFFFFFE80,%eax&lt;br /&gt;         orl      $0x04,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L13:andl     $0xFFFFFE80,%eax&lt;br /&gt;         orl      $0x02,%eax&lt;br /&gt;         jmp      0f&lt;br /&gt;     L14:andl     $0xFFFFFE80,%eax&lt;br /&gt;         orl      $0x01,%eax&lt;br /&gt;       0:movl     %eax,0x28E0(%esi)&lt;br /&gt;         jmp      L16&lt;br /&gt;&lt;/pre&gt;This part surely could be reduced further if I could remember what all those flags are good for...&lt;br /&gt;&lt;pre&gt;L15:popl     %esi                  # restore in reverse order of the prologue's pushes&lt;br /&gt;         popl     %edi&lt;br /&gt;         popl     %ebp&lt;br /&gt;         jmp      _DefDP&lt;br /&gt;&lt;br 
/&gt;     L16:xorl     %eax, %eax&lt;br /&gt;         popl     %esi                  # restore in reverse order of the prologue's pushes&lt;br /&gt;         popl     %edi&lt;br /&gt;         popl     %ebp&lt;br /&gt;         ret&lt;br /&gt;&lt;/pre&gt;'Exits' belong to the bottom of a function. First, the processor does not have to jump back and forth to random locations within the instruction chain. Secondly, human senses perceive structured (sorted) input much faster than random patterns spread all over the screen.&lt;br /&gt;&lt;pre&gt;.comm    _GVAR, 4&lt;br /&gt;&lt;/pre&gt;One (used) out of all (unused). The size of the global variable was reset to the proper size of 32 bit, so four variables, rather than one, fit into one paragraph (16 bytes) again.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Analysis&lt;/h1&gt;The source code generated by GCC 3.3.5 perfectly reveals the greatest weaknesses of the C conventions. The first step to slow down C code markedly is the abuse of EBP as base pointer, reducing our set of available registers to 6: EAX, EBX, ECX, EDX, EDI and ESI. One of these remaining registers, EAX, is used to pass results or error codes from a called function to the calling function, reducing our register set to five. By default, ECX and EDX are neither saved nor restored, so every time we call another function their content is sent to the great formatter, or, in other words: If we stored frequently used parameters in ECX or EDX prior to the call, they probably will be overwritten by the called function. Due to counterproductive conventions, only three registers are left to store our frequently used parameters. If four or more parameters are required throughout a function, parameters 4 and up must be reloaded whenever we call another function, because the content of EAX, ECX and EDX is changed by the called function. Reloading parameters from the slower memory subsystem instead of preloading them in registers to perform much faster operations is quite time-consuming. 
It is not about those five additional clock cycles we waste with reloading a parameter over and over again. The most negative side-effect is the immediate interruption of parallel execution, forcing two execution pipes to idle until the required parameter has been reloaded. It's one of the reasons why C and C++ applications are that slow.  A typical example is the following code snippet taken from GCC's output. It demonstrates how strict implementation of the C conventions slows down the execution of standard C and C++ code:&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;         movl     _GVAR, %eax&lt;br /&gt;         movl     7252(%eax), %eax&lt;br /&gt;         pushl    %eax&lt;br /&gt;         ...&lt;br /&gt;&lt;/pre&gt;This sequence turns any parallel execution off. Instead of executing up to 3 instructions in parallel in the 3 available execution pipes, the code listed above is executed sequentially, or, in other words: Two execution pipes have to wait until the previous instruction is executed and its result is available. If all data are present in L1 cache, it takes (approximately) nine clock cycles to execute the three lines listed above. Two execution pipes are idling while _GVAR (_GVAR is BNR defined as an array of dwords to satisfy GCC) is loaded into EAX. Next, two pipes are idling while 1C54[BNR] is loaded into EAX. Finally, EAX is pushed onto the stack, blocking ESP for two clock cycles. If some data have not been loaded into L1 cache yet, execution time is extended markedly. Reading data from the L2 cache costs about six clock cycles, loads from main memory consume about 27 clock cycles. In the end, our primary intention, saving 14 clock cycles to push and pop ECX and EDX in every function's prologue and epilogue, turns out to be a bad idea. In the best case, the assumed advantage is eaten up after the second reload of a frequently used parameter. 
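&lt;br /&gt;&lt;br /&gt;The break-even point can be estimated with a few lines of arithmetic. The following Python sketch is only a rough model built from the approximate cycle counts quoted above (14 cycles saved, about six cycles for a read from L2 cache, about nine for the three-instruction sequence, about 27 for a load from main memory). The name &lt;i&gt;reloads_until_loss&lt;/i&gt; is made up for this illustration, and real processors will behave differently:&lt;br /&gt;&lt;pre&gt;
```python
# Rough model: after how many reloads of a parameter is the saving of
# 14 clock cycles (no push/pop of ECX and EDX) used up?
# All cycle counts are the approximate figures quoted in the text.

SAVED_CYCLES = 14  # push and pop of ECX and EDX omitted by the C convention


def reloads_until_loss(reload_cycles, saved_cycles=SAVED_CYCLES):
    """Return the first reload count whose total cost exceeds the saving."""
    # n reloads cost n * reload_cycles; the smallest n that costs more
    # than saved_cycles is saved_cycles // reload_cycles + 1.
    return saved_cycles // reload_cycles + 1


if __name__ == "__main__":
    cases = (
        ("read from L2 cache", 6),
        ("three-instruction sequence in L1", 9),
        ("load from main memory", 27),
    )
    for label, cycles in cases:
        print(label, reloads_until_loss(cycles))
```
&lt;/pre&gt;Under these assumptions the saving is gone with the third, second or first reload, respectively.&lt;br /&gt;&lt;br /&gt;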
In general, there is no advantage at all - the saved 14 clock cycles are wasted with the first reload from main memory. This is the main reason why conventional software is creeping through the execution pipes of John Doe's hypermodern Gigantium processor like thick dough. Perhaps it would be a good idea to purify the sap called software to give John Doe the chance to partially enjoy the overwhelming computational powers of his expensive Gigantium machine?  Another obstacle to unleashing the powers of recent processors is the flying stack pointer ESP. Modern software design should use the capabilities of modern processors instead of blocking them with outdated mechanisms and ancient techniques. A pointlessly floating stack pointer with random content cannot be used seriously to address stack elements.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Considerations&lt;/h1&gt;As discussed in great detail, any conventional stack frame generally looks like this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ovd2D9lgI/AAAAAAAAAFw/l56NA8zCofU/stack2.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Programmers and compilers are free to write any kind of data to stack locations between the stack's bottom and the stack element below ESP (-0x04[ESP]). To reserve a part of the stack for the private use of the current function, we have to subtract the required size from ESP. Moving ESP towards stack bottom, we reserve the corresponding area for our private use. No function (unless it is compiled with GCC 3.3.5) ever writes data to stack locations above -0x04[ESP]. It is very important to understand the connection between the subtraction of the required stack frame size from ESP and exclusive ownership of the reserved area. My alternative method introduced in this paper as well as conventional programming techniques are based on this principle. You just break an absolute taboo if you write to stack locations above -0x04[ESP]!  
Because conventional programming methods push data onto the stack or pop them from the stack, the content of ESP is changing continuously. Due to its randomly changing content, it is impossible to use ESP as base for addressing specific stack elements directly. An additional register, the base pointer EBP, is required for this task. It points to the current top of the stack all of the time, so we can safely address all stack elements regardless of the current content of ESP.  If we analyse the described process and have a closer look at its details, one thing is quite obvious: We could save all time-consuming contortions with additional registers if we replaced the usual push - call - add n,%esp sequences with something else. A short look into the data sheets of modern processors reveals some additional caveats. Let us recapitulate what we know about push and pop instructions and compare them against a bunch of simple mov instructions:  &lt;br /&gt;&lt;h3&gt;PUSH&lt;/h3&gt;&lt;img src="http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8OvoAXjwKI/AAAAAAAAAFw/ubRrR_lKbF8/stack6.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Depending on the standard data size, either two, four or eight is subtracted from the stack pointer rSP, setting it to the next lower stack element. After updating rSP, the content of a register, a memory location or an immediate value is copied to the current stack element (00[rSP]). A push reg or push imm instruction is executed within three clock cycles. ESP is blocked for the time it needs to update its content and deliver the stack location where the given data shall be stored. This requires two clock cycles, so consecutive push instructions have a latency of at least two clock cycles.  As an alternative, we might replace one push with three mov instructions. If none of them depends on the result of one of the other two movs, all of them are executed simultaneously in three clock cycles, as well. 
If we write to contiguous memory locations, all writes are done in one gulp, because a mechanism called write combining is triggered.  &lt;br /&gt;&lt;h3&gt;POP&lt;/h3&gt;&lt;img src="http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8OvoMWeG_I/AAAAAAAAAFw/Gax8w-zxoME/stack7.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The content of the current stack element is copied to the register or memory location specified by the pop instruction, then two, four or eight is added to the stack pointer rSP. The current stack element was 'taken from the stack' and the stack pointer now points to one stack element above (our new stack bottom). Unlike the direct path push, pop is a vector path instruction. These instructions are always fed to execution pipes 0 and 1, and execution pipe 2 is blocked while the instruction is executed. The latency of each pop instruction is 4 clock cycles.  As an alternative, we might replace one pop with four mov instructions. Three of them are executed simultaneously in three clock cycles, the fourth mov starts execution in the fourth clock cycle.  &lt;br /&gt;&lt;h2&gt;Conclusions&lt;/h2&gt;As the analysis of both instructions shows, the only rational consequence is to replace all pop and push with mov instructions. Doing so not only speeds up passing of parameters, it also turns the flying stack pointer into a static one. As a positive side-effect of the frozen stack pointer, we do not need EBP as base pointer any longer, freeing one valuable register for general purposes. Using no base pointer, we get rid of that mess with positive and negative offsets to EBP - a source of errors, making source code less readable.   &lt;br /&gt;&lt;h3&gt;Addendum&lt;/h3&gt;With Athlon64, family 10h (aka Phenom), the latency for pop instructions was reduced to three clock cycles &lt;i&gt;direct path single&lt;/i&gt;. 
Nonetheless, replacing push and pop with mov instructions is a much better improvement, because of the expected positive side-effects coming along with the replacement automatically. Even if Intelligent Design functions will win any race around clock cycles with ease, the concept offers many more improvements than just counting clocks.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Intelligent Design&lt;/h1&gt;If we put all discussed aspects together, we can state that conventional programming techniques completely ignore improvements introduced with modern processor design. The conclusions of our analysis suggested replacing all leave, pop and push with mov instructions, because three of them can be executed simultaneously. The advantages of these replacements are a static stack pointer, an additional general purpose register (EBP), a naturally aligned stack (if supported by the operating system), read and write access to any stack element via ESP, and much more. Based on the facts we worked out until now, a concept was developed and tested for usability in everyday applications.  Because we use no base pointer any longer, the stack is addressed via ESP only. As shown before, we can reserve a stack area for our private use by subtracting the required size from the stack pointer ESP. The reserved area is safe from being used by called functions (except those compiled by GCC 3.3.5). Let us recapitulate what the stack looks like after a function is called:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ovn3yQcGI/AAAAAAAAAFw/GUB2WM54Ezg/stack5.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The current stack element holds the return address - the address of the instruction following the call in the calling function. If the calling function has to pass parameters to our function, they are stored above the current stack element in ascending order. 
Because we want to use mov instructions, only, a sufficient stack area must be reserved where we can save used registers, store local variables and pass parameters to called functions. Working with a static stack pointer, we just have to add the required sizes of these three parts (registers, local area and parameters), then subtract the sum from ESP:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ovol5tSyI/AAAAAAAAAFw/lA16c-081QA/stack9.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;After this subtraction, the content of ESP does not change until our function is terminated. Therefore, ESP can be used to address any element within our current stack frame. Elements outside our private area are taboo - you may read them whenever you want, but you must not write to them under any circumstances!  &lt;br /&gt;&lt;h2&gt;Calculating The Required Size&lt;/h2&gt;The size of our stack frame depends on the number of registers we have to save, the size for our local variables and the number of parameters we have to pass.  &lt;br /&gt;&lt;h3&gt;Registers&lt;/h3&gt;The contents of all used registers are stored at the top of our private area. All Intelligent Design functions do save ECX and EDX, as well! 
The following table shows the required size for 16, 32 and 64 bit functions:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th width="25%"&gt;&lt;b&gt;Registers&lt;/b&gt;&lt;/th&gt;&lt;th width="25%"&gt;&lt;b&gt;16 Bit&lt;/b&gt;&lt;/th&gt;&lt;th width="25%"&gt;&lt;b&gt;32 Bit&lt;/b&gt;&lt;/th&gt;&lt;th width="25%"&gt;&lt;b&gt;64 Bit&lt;/b&gt;&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;1&lt;/td&gt;&lt;td&gt;0x02&lt;/td&gt;&lt;td&gt;0x04&lt;/td&gt;&lt;td&gt;0x08&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;2&lt;/td&gt;&lt;td&gt;0x04&lt;/td&gt;&lt;td&gt;0x08&lt;/td&gt;&lt;td&gt;0x10&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;3&lt;/td&gt;&lt;td&gt;0x06&lt;/td&gt;&lt;td&gt;0x0C&lt;/td&gt;&lt;td&gt;0x18&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;4&lt;/td&gt;&lt;td&gt;0x08&lt;/td&gt;&lt;td&gt;0x10&lt;/td&gt;&lt;td&gt;0x20&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;5&lt;/td&gt;&lt;td&gt;0x0A&lt;/td&gt;&lt;td&gt;0x14&lt;/td&gt;&lt;td&gt;0x28&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;6&lt;/td&gt;&lt;td&gt;0x0C&lt;/td&gt;&lt;td&gt;0x18&lt;/td&gt;&lt;td&gt;0x30&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;7&lt;/td&gt;&lt;td&gt;0x0E&lt;/td&gt;&lt;td&gt;0x1C&lt;/td&gt;&lt;td&gt;0x38&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;8&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x40&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;9&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x48&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;11&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x60&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;13&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x68&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;
td&gt;14&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x70&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;td&gt;0x78&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;If no registers are used, the required size is zero, of course. Rows 8 through 15 are not available in 16 and 32 bit mode, because these modes do not recognise the extended 64 bit register set. Eventually, you might need to save some MMX or XMM registers. Just add 8 for each MMX and 16 for each XMM register. It is recommended to avoid using MMX registers, because they internally are mirrored on the original FPU registers ST(0) through ST(7). Older software libraries and operating systems do not know anything about MMX registers, so you might encounter weird problems if you do not write an explicit emms or femms before you call external functions. Well, emms and femms generally destroy the contents of all MMX registers, so it is a good idea to use XMM instead of MMX registers.  &lt;br /&gt;&lt;h3&gt;Local Variables&lt;/h3&gt;Between saved registers at the top of our stack and the area where parameters for called functions are stored, we have to reserve the space required for local data (variables, structures, strings). If the size is not a multiple of the standard data size, it is recommended to expand the required size to the next higher multiple of the standard data size. It's very important to keep the stack aligned, because odd addresses in ESP trigger penalty cycles for every unaligned memory access. If you do not take care and add a size in your epilogue that differs from the size subtracted in your prologue, the program definitely will crash, because EIP probably is set to a faulty return address.  If you want to store strings on the stack, it is recommended to define a sufficient size, so the function reading characters from the input device has no chance to overwrite parameters or register contents stored in the stack elements above.  
&lt;br /&gt;&lt;h3&gt;Parameters&lt;/h3&gt;Parameters are stored at the bottom of our stack frame. They are moved to the stack elements in ascending order, starting at 00[ESP]. Because we mov parameters to the corresponding stack elements rather than pushing them, we should reserve an area equal to the size required by the function with the largest number of parameters. If we have three functions awaiting three parameters and one function awaiting five parameters, we need a size of 20 bytes (32 bit code) to pass five parameters - the three parameters for the other three functions automatically fit into the reserved 20 bytes. The required size can be taken from the above table for registers. If more than 7 (15) parameters must be passed, the size easily can be calculated with the formula parameters * datasize, where datasize is 2 (16 bit), 4 (32 bit) or 8 (64 bit). For example, the required size for the 13 parameters we have to pass to WinCreateWindow() is 13 * 4 = 52 bytes. The bottom of the corresponding stack frame looks like this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8OvoSXGSfI/AAAAAAAAAFw/anRsF0-0dWs/stack8.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Watch out: Parameter 1 in figure 08 is identical with the leftmost parameter we are writing within the parentheses of a C or C++ function. In general, parameters are moved onto the stack in ascending order: Parameter 1 @ 00[ESP], parameter 2 @ 04[ESP] and so on, until the last parameter was moved. If any parameter does not change between two or more consecutive call instructions, it just has to be moved the very first time it is used. For following calls, the parameter already is stored at the right place, so we do not need to store it at the same location, again - it is stored there until we overwrite it!  
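&lt;br /&gt;&lt;br /&gt;A minimal sketch of this reuse in 32 bit code, assuming a hypothetical function _Example awaiting three parameters and a stack frame that was reserved in the prologue already (the function name and the parameter values are made up for illustration):&lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;    movl $0x01,0x00(%esp)    # parameter 1 (leftmost)&lt;br /&gt;    movl $0x02,0x04(%esp)    # parameter 2&lt;br /&gt;    movl $0x03,0x08(%esp)    # parameter 3&lt;br /&gt;    call _Example            # hypothetical function&lt;br /&gt;    movl $0x04,0x08(%esp)    # only parameter 3 changed&lt;br /&gt;    call _Example            # parameters 1 and 2 still are in place&lt;br /&gt;    ...&lt;br /&gt;&lt;/pre&gt;This sketch only holds if the called function does not write to the stack elements holding its parameters. Conventional calling conventions allow a called function to overwrite them, so unchanged parameters can be skipped safely only for functions known to leave them untouched.  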
&lt;br /&gt;&lt;h2&gt;Adding Sizes&lt;/h2&gt;After we determined the required sizes of all three parts (registers, local area, parameters to pass), we calculate their sum, then round the result up to a size ending with 3C, 7C, BC or FC (32 bit mode), respectively 38, 78, B8 or F8 (64 bit mode). Applying this trick, our stack frame automatically is aligned to the beginning of the next cache line. Finally, we create our new stack frame by subtracting the rounded up sum from rSP. As described in detail, this subtraction protects the created stack frame from being overwritten by other functions. Writing data to stack locations outside the current stack frame violates the rules of conventional programming as well as the rules defined for Intelligent Design.  Setting the first stack element to the beginning of a cache line, we benefit from a bunch of advantages. First, no time consuming workarounds are required to align stack locations to store or load XMM registers. It is just one side effect of the Intelligent Design prologue that rSP is naturally aligned by default without a single line of additional code. Secondly, Intelligent Design code is able to benefit from bandwidth and accelerating mechanisms provided by modern processors. At the latest, the first access to any stack element loads the stack frame into the L1 cache, moving all following accesses to the fastest area in the machine's memory hierarchy. Thirdly, writing data to contiguous memory locations in ascending order triggers a mechanism called write combining, where up to 16(!) doublewords are stored on the stack in one gulp. Many more advantages like avoiding repetitive stores for never changed parameters come along with this fresh and modern design. Some of them are shown later on, some of them are not revealed, yet...  
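&lt;br /&gt;&lt;br /&gt;As a worked example with assumed numbers: a 32 bit function saving four registers, using 0x20 byte for local variables and passing up to five parameters to called functions needs&lt;br /&gt;&lt;pre&gt;registers     4 * 4 = 0x10&lt;br /&gt;local area            0x20&lt;br /&gt;parameters    5 * 4 = 0x14&lt;br /&gt;sum                   0x44&lt;br /&gt;rounded up            0x7C&lt;br /&gt;&lt;/pre&gt;The sum 0x44 exceeds 0x3C, so the next valid size is 0x7C. Together with the 4 byte return address stored by the call instruction, the frame covers 0x80 = 128 byte, exactly two cache lines of 64 byte each.&lt;br /&gt;&lt;br /&gt;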
If all Intelligent Design rules were applied properly, the smallest possible stack frame automatically looks like this in 32 bit code&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8OwNcZhSnI/AAAAAAAAAFw/uZVRmcYsZ2E/stackC.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;or like this in 64 bit code&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh5.ggpht.com/_Z2WbH3F-E_Q/S8OwNX1oClI/AAAAAAAAAFw/G7AWDvWJ83M/stackD.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The real size can be varied by adding multiples of 64 to the smallest value of 0x3C (0x38 in 64 bit code) to match your individual requirements perfectly.  Due to the subtraction of a predefined size from rSP, the stack frames of all Intelligent Design functions automatically are set to the beginning of a cache line. In general, functions are called. Looking at the mechanism of the call instruction, the return address is pushed onto the stack before execution is continued with the first instruction of the called function. Hence, rSP points to an address ending with 3C, 7C, BC or FC in 32 bit code, respectively 38, 78, B8 or F8 in 64 bit code - exactly the size we have to subtract from rSP.  &lt;br /&gt;&lt;h2&gt;Destroying The Stack Frame&lt;/h2&gt;After our function's code was executed, all registers must be restored to the content they had when they were passed to us. The last step before returning to the calling function is to add the size we subtracted to create our stack frame to rSP. With the final ret, rIP is taken from the stack and execution is continued with the instruction following the call of our function. 
At this point, our stack looks like this, again:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ov6Yf-hAI/AAAAAAAAAFw/HPXbAsnf4AE/stackA.png" style="max-width: 800px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Because the area between stack bottom and the address currently loaded into rSP is 'no man's land', the next function is free to reserve a part of it to save used registers, to store its local variables and structures or to put some parameters for called functions onto the stack. In principle, the content of a stack element is undefined until something was written to it. Writing data into untouched stack elements is called 'initialisation' in C and C++ terminology.&lt;br /&gt;&lt;h2&gt;Side Notes About Alignment&lt;/h2&gt;Using XMM instructions, most 128 bit stores and loads explicitly expect addresses aligned to 16 byte boundaries, in other words: The last digit of the address must be zero. No existing operating system cares about aligned addresses at all. Using excessive push orgies to put the parameters for called functions onto the stack, the chance to enter that function with an aligned stack pointer is about 25 percent. Because 16 / 4 = 4, the chance ESP ends with 04, 08 or 0C is about 75 percent. It's left to the programmer or compiler to handle this problem properly, so you have to add redundant, time consuming code to align the stack pointer in every function working with XMM instructions.&lt;br /&gt;&lt;br /&gt;As shown, Intelligent Design provides some mechanisms to naturally align all stack frames to the beginning of a cache line, but - even the best mechanism is of no use if the operating system does not support it. In most cases, an ID compliant function is called by the message transporting system within the operating system. Whenever a message is sent or posted to our application's message loop, we do not know which instance of which function of the operating system called our message loop. 
Hence, we do not know how many things were pushed onto our stack nor can we rely on something like an aligned stack pointer. Even if we aligned the stack pointer at the beginning of our main() function, it surely is not aligned any longer if the code in our message loop is executed the very first time. Therefore, one of the most important improvements introduced with Intelligent Design is not available as long as we run our code on any existing platform. Because all existing operating systems put dirt into our gears, no machine ever will run at full speed.&lt;br /&gt;&lt;br /&gt;No application programmer is able to do anything against this counter-productive behaviour of existing operating systems. Either we bow our heads, obey and apply all those pointless work-arounds recommended by processor manufacturers, or we give up programming to get rid of them. Because it is a fact that work-arounds are not very practicable in many cases, there's a third way as an alternative for this 'take it or leave it' choice. Instead of searching for ways to bypass 'built in' restrictions, speed brakes and other obstacles of existing operating systems, it might be a much better idea to throw away these remains of old fashioned software design and create something new. Something supporting innovations and capabilities of modern processors instead of ignoring them. Something making use of the immense computational power and speed of recent quad core machines. Something called IDEOS (the Intelligent Design Easy2Use Operating System)...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Summary&lt;/h1&gt;I hope I could convince you of various advantages introduced with the development of Intelligent Design. Compared against conventional programming techniques (as discussed in this document), Intelligent Design wins in all disciplines - be it sheer speed, high code density, simplified development or efficiency. 
Due to the strict renunciation of slow &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; and &lt;i&gt;push&lt;/i&gt; instructions, coming along with many ill side-effects like an unpredictable stack pointer, these caveats of conventional software design easily are turned into some really welcome side-effects by replacing those slow instructions with simple &lt;i&gt;mov&lt;/i&gt;es.&lt;br /&gt;&lt;br /&gt;The most welcome side-effect surely is the fact that we get EBP back as a general purpose register. One additional register is quite a lot if only six of them are available, because EBP was abused as base pointer. Each additional register is an advantage, because we can pair reads to copy parameters from memory to registers and access these registers rather than reloading single registers with some frequently used parameters over and over again.&lt;br /&gt;&lt;br /&gt;Another advantage is that we mov parameters to the right location rather than pushing them onto the stack. While &lt;i&gt;mov&lt;/i&gt;e instructions can store changed parameters anywhere, a &lt;i&gt;push&lt;/i&gt; always addresses the next lower stack element (a quite limited range). If we call a function repeatedly, and only the last parameter changes from call to call, we would have to &lt;i&gt;push&lt;/i&gt; all parameters after each call to update this topmost parameter with the current data. Moreover, the correction of ESP is mandatory after each call, blocking the following &lt;i&gt;push&lt;/i&gt; for an entire clock cycle. Using &lt;i&gt;mov&lt;/i&gt;e instructions, all these problems vanish. We save many unnecessary instructions and keep the three execution pipes busy most of the time. Of course, there are a lot of cases where only one or two instructions are required between two calls, but this still is faster than a conventional push orgy.&lt;br /&gt;&lt;br /&gt;Yet another advantage is the fact we can &lt;i&gt;mov&lt;/i&gt;e parameters and other data onto the stack anywhere within our source file. 
Pairing multiple reads in groups, we can avoid direct dependencies completely. Pairing multiple writes, we trigger write combining. Keeping proper distances between reads and writes can eliminate most dependencies with few exceptions. Applying these tricks as often as possible speeds up code markedly.&lt;br /&gt;&lt;br /&gt;Putting it all together, Intelligent Design is &lt;i&gt;the&lt;/i&gt; up-to-date alternative to some old-fashioned conventional programming techniques. Conventional software design did not change too much in the last thirty years. We neither can create the simplest integrated circuit with stone age tools, nor are we able to create software making full use of the capabilities provided by modern processors with conventional code. While old fashioned code is spiced with mandatory brake pads and compilers generate tons of counter-productive dependencies, because the convention allows abusing the most valuable resources - our registers - as garbage pile to save four instructions at the cost of multiple reloads, Intelligent Design replaces these obstacles with straight and clean rules, providing full support for features and improvements of recent processors.&lt;br /&gt;&lt;br /&gt;Intelligent Design is an up-to-date concept for the next generation of operating systems and applications, introducing a new quality to software design. Moreover, it is much easier to handle a single stack pointer than to manage two registers, a separate stack and base pointer, simultaneously. Getting rid of the institution called base pointer, we finally can say good bye to positive and negative offsets and error prone corrections of the stack pointer after each of the obligatory push orgies. Another positive side-effect is the return of a valuable resource (EBP) to the very sparse pool of general purpose registers - the worst conceptual flaw of LETNi's x86 architecture.  
All in all, Intelligent Design points the way ahead to fastest code with high density, causing fewer headaches for stressed programmers. Best of all: This is not just a theoretical construction, floating around in the head of a weird, unworldly developer - it exists for real! ST-Open's libraries as well as DatTools and the SRE editor were ported to Intelligent Design, working as expected: Really fast and reliable.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Rules&lt;/h1&gt;Whenever we want to build a greater whole out of many independent parts created by individual contributors, we have to use a common set of rules. Anyone contributing a part to the whole has to obey these rules. If all contributors applied their own rules, we would end up with a lot of parts, but none of them would be able to work together with any other part. Using different interfaces, single parts either could not communicate at all or would misunderstand received messages completely. Therefore, we need a set of rules before we start to create parts of a greater whole. This set of rules, like a common language, defines a common interface, forcing any contributor to use standardised communication protocols, so other parts are able to understand all sent messages and do the right work. The big idea behind any set of rules lives or dies with strict obedience.&lt;br /&gt;&lt;h2&gt;Rule 1&lt;/h2&gt;&lt;span style="color: #000099;"&gt;The instructions &lt;i&gt;enter&lt;/i&gt;, &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; are archaic remains of the past era. They &lt;b&gt;must not&lt;/b&gt; be used in &lt;i&gt;Intelligent Design &lt;/i&gt;code.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Each of these instructions changes the content of the stack pointer automatically. All Intelligent Design functions depend on one principle: The stack pointer must not change its content for their entire runtime. 
We either can enjoy the advantages of a static stack pointer or we can waste precious time with obligatory corrections and repetitive writes of one and the same parameter. Using a 'flying' stack pointer and its urgently required companion called base pointer is mutually exclusive with Intelligent Design. Both concepts use the stack pointer in a very different way - there is no chance to use a mix of both concepts in one and the same function.  &lt;br /&gt;&lt;h2&gt;Rule 2&lt;/h2&gt;&lt;span style="color: #000099;"&gt;Stack frames are created by subtracting 0x3C in 32 bit functions or 0x38 in 64 bit functions from the stack pointer ESP/RSP. If larger stack frames are required, the basic value &lt;i&gt;0x3C&lt;/i&gt; or &lt;i&gt;0x38&lt;/i&gt; can be expanded in 64 byte steps (0x7C, 0xBC, 0xFC, ... in 32 bit functions, 0x78, 0xB8, 0xF8, ... in 64 bit functions).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This rule is valid for IDEOS, only. Any other operating system will not benefit from the advantages of a properly aligned stack. Because the stack generally is not aligned in old fashioned operating systems, you have to align it with some lines of redundant bloat if you use XMM registers.  If you write IDEOS code, you have to apply this rule. The only exception is the rare case that you do not use other registers than rAX and XMM0 through XMM3. If there are registers to preserve, you need a stack frame. The advantage of this rule is a properly aligned stack pointer, starting at the beginning of a cache line. IDEOS's concept guarantees that the stack pointer is naturally aligned to a multiple of 64 before any function is called. Because the call instruction stores EIP/RIP in the next stack element, subtracting the correct value automatically aligns the stack pointer to a multiple of 64 and the corresponding cache line is preloaded into L1 cache without a single line of additional code.  
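&lt;br /&gt;&lt;br /&gt;A minimal sketch of a 32 bit function following this rule, assuming a hypothetical function _Example that uses no registers other than EAX, EBX and ESI and returns no value (the function name and the register choice are made up for illustration):&lt;br /&gt;&lt;pre&gt;_Example:&lt;br /&gt;    subl $0x3C,%esp          # create stack frame&lt;br /&gt;    movl %esi,0x34(%esp)     # save used registers at the top&lt;br /&gt;    movl %ebx,0x38(%esp)     #  of the private area&lt;br /&gt;    movl 0x40(%esp),%ebx     # first parameter passed by the caller&lt;br /&gt;    ...&lt;br /&gt;    movl 0x34(%esp),%esi     # restore used registers&lt;br /&gt;    movl 0x38(%esp),%ebx&lt;br /&gt;    xorl %eax,%eax           # no return value, clear EAX&lt;br /&gt;    addl $0x3C,%esp          # destroy stack frame&lt;br /&gt;    ret&lt;br /&gt;&lt;/pre&gt;After the subtraction, the return address is found at 0x3C(%esp) and the first parameter passed by the caller at 0x40(%esp). The value added in the epilogue must be the same value that was subtracted in the prologue.  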
&lt;br /&gt;&lt;h2&gt;Rule 3&lt;/h2&gt;&lt;span style="color: #000099;"&gt;All registers used in a function must be saved before they are overwritten and must be restored before returning to the caller.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Any software running on modern processors is most efficient if it can access as many hardware resources as possible. The most valuable resources of an x86 processor are its registers. The more frequently used parameters are held in registers, the less time is required to perform a given task. If we reload parameters with slow memory reads rather than keeping them in registers, we slow down our code with superfluous operations and interrupt simultaneous execution completely. Preloading all required parameters into registers in one gulp reduces all dependencies, supporting parallel execution, keeping all three pipes busy all of the time.  Saving precious clock cycles with much faster register operations outweighs the cost of the few clock cycles needed to save and restore all used registers by far. Many functions work with a subset of all available registers. The fewer registers we use, the faster they are saved and restored. Another welcome side-effect is based on the fact that continuous writes to ascending addresses trigger write combining, where up to 16 doublewords are written to a cache line in one gulp. This surely is much faster than 16 single writes. Moreover, pairing of multiple reads or writes is a good opportunity to support simultaneous execution flow as long as possible.  &lt;br /&gt;&lt;h2&gt;Rule 4&lt;/h2&gt;&lt;span style="color: #000099;"&gt;The registers RAX, XMM0, XMM1, XMM2 and XMM3 are reserved for special purposes - they neither are saved nor restored by default.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This rule partially overrides rule 3:  By definition, register RAX is used to pass return values or error codes to the calling function. 
It does not make sense to save a register that is overwritten in all epilogues by default. If a function declaration defines its return value as VOID, rAX must be cleared with a simple xor %rAX,%rAX on exit.&lt;br /&gt;&lt;br /&gt;Registers XMM0 through XMM3 with their size of 4 * 128 bit = 64 byte cover an entire cache line. They are used to clear, move or manipulate large data areas in steps of 64 byte, preferably aligned to a multiple of 64. In general, these operations are executed in small loops where no other functions are called, so we can skip saving and restoring these registers. This saves a lot of clock cycles if your application makes heavy use of such operations. If your function calls others while manipulating large memory blocks, the called function might use XMM0 through XMM3 as temporary storage, overwriting whatever was stored in these registers. Hence, it is recommended to use XMM4 through XMM15 if preloaded values must not change while other functions are called. As stipulated by rule 3, you have to save and restore XMM4 through XMM15 if you use them.  &lt;br /&gt;&lt;h2&gt;Rule 5&lt;/h2&gt;&lt;span style="color: #000099;"&gt;Accessing registers is much faster than accessing memory locations. The most frequently used parameters in a function should be preloaded into registers as soon as these registers were saved on the stack.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Even if old-fashioned programmers are not used to working with multiple execution pipes: Obeying this rule can turn an ox cart into a rocket (see Rule 3).  &lt;br /&gt;&lt;h2&gt;Rule 6&lt;/h2&gt;&lt;span style="color: #000099;"&gt;Modern processors execute three (LETNi: 1.5) instructions simultaneously. Good code should make use of these capabilities rather than ignore them. Dependency chains interrupt simultaneous execution. While one instruction is executed in one pipe, the other two pipes are waiting for the result. 
Sending two of three pipes to sleep for some clocks wastes two thirds of the available resources.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This extraordinary product of a classical compiler &lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;    pushl    %esi&lt;br /&gt;    pushl    $0&lt;br /&gt;    pushl    $2&lt;br /&gt;    pushl    $0&lt;br /&gt;    pushl    $11&lt;br /&gt;    movl     _GVAR, %eax&lt;br /&gt;    movl     7252(%eax), %eax&lt;br /&gt;    pushl    %eax&lt;br /&gt;    call     _FDacc&lt;br /&gt;    movl     _GVAR, %eax&lt;br /&gt;    addl     $24, %esp&lt;br /&gt;    movl     472(%eax), %ebx&lt;br /&gt;    pushl    %ebx&lt;br /&gt;    pushl    $0&lt;br /&gt;    pushl    $2&lt;br /&gt;    pushl    $4&lt;br /&gt;    pushl    $11&lt;br /&gt;    movl     7252(%eax), %ecx&lt;br /&gt;    pushl    %ecx&lt;br /&gt;    call     _FDacc&lt;br /&gt;    addl     $20, %esp&lt;br /&gt;    movl     _GVAR, %eax&lt;br /&gt;    movl     $0, 10464(%eax)&lt;br /&gt;    ...&lt;br /&gt;&lt;/pre&gt;can be improved with a few useful changes &lt;br /&gt;&lt;pre&gt;...&lt;br /&gt;    movl _GVAR,%esi    # at least 3 clocks ahead&lt;br /&gt;    ...&lt;br /&gt;    ...&lt;br /&gt;    ...&lt;br /&gt;    movl 0x1C54(%esi),%eax&lt;br /&gt;    movl 0x01D8(%esi),%ecx&lt;br /&gt;    movl $0x00,0x28E0(%esi)&lt;br /&gt;    movl %eax,0x00(%esp)&lt;br /&gt;    movl $0x0B,0x04(%esp)&lt;br /&gt;    movl $0x00,0x08(%esp)&lt;br /&gt;    movl $0x02,0x0C(%esp)&lt;br /&gt;    movl $0x00,0x10(%esp)&lt;br /&gt;    movl %edi,0x14(%esp)&lt;br /&gt;    call _FDacc&lt;br /&gt;    movl $0x04,0x08(%esp)&lt;br /&gt;    movl %ecx,0x14(%esp)&lt;br /&gt;    call _FDacc&lt;br /&gt;    ...&lt;br /&gt;&lt;/pre&gt;to run at least twice as fast, now. Removing all superfluous instructions, code size could be reduced from 23 to 14 instructions. Placing all instructions in the proper order removes dependencies and keeps all pipes busy without delays.  
&lt;br /&gt;&lt;h2&gt;Rule 7&lt;/h2&gt;&lt;span style="color: #000099;"&gt;Jump tables belong to the data segment. AS is able to build jump tables in the data segment, so there's absolutely no reason to pollute the code segment with data of any kind.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;No comment is required for this rule. Just do it!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;Appendices&lt;/h1&gt;&lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html"&gt;AT&amp;amp;T-Syntax&lt;/a&gt;&lt;br /&gt;&lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/appendix-1.html"&gt;Appendix 1 - Examples&lt;/a&gt;&lt;br /&gt;&lt;a href="http://st-intelligentdesign.blogspot.com/2010/04/13-appendix-2.html"&gt;Appendix 2 - Optimisations&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-3458883301139700788?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/3458883301139700788/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/11/intelligent-design-in-one-piece.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/3458883301139700788'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/3458883301139700788'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/11/intelligent-design-in-one-piece.html' title='Intelligent Design in one piece'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8Ovd6SnBxI/AAAAAAAAAFw/BKBVYRLUuFk/s72-c/stack4.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-1793967656339837794</id><published>2010-04-14T00:58:00.001+02:00</published><updated>2010-04-14T11:33:55.912+02:00</updated><title type='text'>14 - AT&amp;T Syntax</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;ST-Open software is developed with GCC/2 (1994) and the GNU assembler AS. AS is a sophisticated assembler, so nothing is &lt;i&gt;ASSUME&lt;/i&gt;d and no hints like &lt;i&gt;SEGMENT&lt;/i&gt;, &lt;i&gt;BYTE PTR&lt;/i&gt; and companions are required. This saves a lot of typing work and markedly improves the readability of source files. A simple &lt;i&gt;.data&lt;/i&gt; on top of user data belonging to the DATA segment and a simple &lt;i&gt;.text&lt;/i&gt; on top of the code going to the CODE segment is all AS needs to know. On the other hand, AS wants to be fed with code written in AT&amp;amp;T syntax.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Register Set&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;All register names are written in small letters and a percent sign precedes the register name as a delimiter. 
The lists below enumerate the entire register set available for &lt;i&gt;LETNiums&lt;/i&gt; and &lt;i&gt;AMD's Athlon64&lt;/i&gt;:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt; 8 bit Registers      16 bit Registers&lt;br/&gt; LETNi     AT+T       LETNi     AT+T&lt;br/&gt;&lt;br/&gt; AL        %al        AX        %ax&lt;br/&gt; BL        %bl        BX        %bx&lt;br/&gt; CL        %cl        CX        %cx&lt;br/&gt; DL        %dl        DX        %dx&lt;br/&gt; DIL       %dil       DI        %di&lt;br/&gt; SIL       %sil       SI        %si&lt;br/&gt; BPL       %bpl       BP        %bp&lt;br/&gt; SPL       %spl       SP        %sp&lt;br/&gt; R8B       %r8b       R8W       %r8w&lt;br/&gt; R9B       %r9b       R9W       %r9w&lt;br/&gt; R10B      %r10b      R10W      %r10w&lt;br/&gt; R11B      %r11b      R11W      %r11w&lt;br/&gt; R12B      %r12b      R12W      %r12w&lt;br/&gt; R13B      %r13b      R13W      %r13w&lt;br/&gt; R14B      %r14b      R14W      %r14w&lt;br/&gt; R15B      %r15b      R15W      %r15w&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;32 bit Registers      64 bit Registers&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;LETNi     AT+T        LETNi     AT+T&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;EAX       %eax        RAX       %rax&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;EBX       %ebx        RBX       %rbx&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ECX       %ecx        RCX       %rcx&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;EDX       %edx        RDX       %rdx&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;EDI     
  %edi        RDI       %rdi&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ESI       %esi        RSI       %rsi&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;EBP       %ebp        RBP       %rbp&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ESP       %esp        RSP       %rsp&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R8D       %r8d        R8        %r8&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R9D       %r9d        R9        %r9&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R10D      %r10d       R10       %r10&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R11D      %r11d       R11       %r11&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R12D      %r12d       R12       %r12&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R13D      %r13d       R13       %r13&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R14D      %r14d       R14       %r14&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;R15D      %r15d       R15       %r15&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;FP / MMX              SSE / 3Dnow!&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;LETNi     AT+T        LETNi     AT+T&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST0       %st(0)      XMM0      %xmm0&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST1       %st(1)      XMM1      %xmm1&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font 
face='arial'&gt;&lt;font face='monospace'&gt;ST2       %st(2)      XMM2      %xmm2&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST3       %st(3)      XMM3      %xmm3&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST4       %st(4)      XMM4      %xmm4&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST5       %st(5)      XMM5      %xmm5&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST6       %st(6)      XMM6      %xmm6&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;ST7       %st(7)      XMM7      %xmm7&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM0       %mm0        XMM8      %xmm8&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM1       %mm1        XMM9      %xmm9&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM2       %mm2        XMM10     %xmm10&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM3       %mm3        XMM11     %xmm11&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM4       %mm4        XMM12     %xmm12&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM5       %mm5        XMM13     %xmm13&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM6       %mm6        XMM14     %xmm14&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;MM7       %mm7        XMM15     %xmm15&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt; Special              Debug&lt;br/&gt; LETNi    AT+T        LETNi 
    AT+T&lt;br/&gt;&lt;br/&gt; CS       %cs         DB0       %db0&lt;br/&gt; DS       %ds         DB1       %db1&lt;br/&gt; ES       %es         DB2       %db2&lt;br/&gt; FS       %fs         DB3       %db3&lt;br/&gt; GS       %gs         -          -&lt;br/&gt; SS       %ss         -          -&lt;br/&gt;                      DB6       %db6&lt;br/&gt;                      DB7       %db7&lt;br/&gt;CR0       %cr0        DB8       %db8&lt;br/&gt;CR1       %cr1        DB9       %db9&lt;br/&gt;CR2       %cr2        DB10      %db10&lt;br/&gt;                      DB11      %db11&lt;br/&gt;TR6       %tr6        DB12      %db12&lt;br/&gt;TR7       %tr7        DB13      %db13&lt;br/&gt;                      DB14      %db14&lt;br/&gt;                      DB15      %db15&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Appendices&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The data sizes of an instruction's operands are specified by the suffixes "b" (byte), "w" (word), "l" (doub&lt;i&gt;&lt;b&gt;l&lt;/b&gt;&lt;/i&gt;eword in integer instructions), "d" (doubleword in MMX or XMM instructions) and "q" (quadword). 
They replace the &lt;i&gt;hints&lt;/i&gt; "byte ptr", "word ptr", "dword ptr" and "qword ptr" used in LETNi syntax:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movb $0x01,%al          # load byte        01 into AL&lt;br/&gt;movw $0x01,%ax          # load word      0001 into AX&lt;br/&gt;movl $0x01,%eax         # load dword 00000001 into EAX&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;...but:&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;movsbl %cl,%eax&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;(sign extend the byte in CL into EAX; the source must be a register or a memory location, not an immediate value. If CL holds 81, &lt;/font&gt;&lt;font face='arial'&gt;EAX holds FFFFFF81&lt;/font&gt; now)&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;br/&gt;movzbl %cl,%eax&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;(zero extend the byte in CL into EAX&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;; if CL holds 81, EAX holds 00000081 now)&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Numbers And Addresses&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;Numbers (immediate values) are preceded by a dollar sign '$', addresses are written as plain numbers:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl $0x01,%eax&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;(copy 00000001 to EAX)&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;movl 0x01,%eax&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;(copy the doubleword found at address 00000001 to EAX; this causes some penalty cycles for accessing an address not divisible by four, then crashes because we try to access protected memory)&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font 
face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Indirect Addressing&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The register holding the address is enclosed in parentheses. The offset, called "displacement" in LETNi vocabulary, is written in front of the opening parenthesis:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movw 0x04(%esi),%ax&lt;br/&gt;&lt;/font&gt;(copy the word found at address [ESI + 0x04] to AX)&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Indexed Addressing&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The index register follows the register holding the address. The multiplier (1, 2, 4 or 8), called "scale factor" in LETNi vocabulary, follows the index register. All three are separated by commas:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movb 0x00(%esi, %edx, 1),%al&lt;/font&gt;&lt;br/&gt;(copy the byte at memory location [ESI + 0x00 + (EDX * 1)] to AL)&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl 0x00(, %edx, 4),%eax&lt;/font&gt;&lt;br/&gt;(copy the doubleword at memory location [0x00 + (EDX * 4)] to EAX)&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Global Variables And Functions&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;To make a function globally visible, we have to add a &lt;i&gt;.globl&lt;/i&gt; declaration in front of the function declaration. To make a variable globally visible, we add a &lt;i&gt;.comm&lt;/i&gt; declaration in each source file where this variable is required. 
The names of all global functions and variables must be prefixed with an underscore "_".&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;To access the address of a function or variable, its name must be preceded by a dollar sign "$". To access the content of a variable (read, write, increment, decrement, compare against, etc.), we write its name "as is":&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.align 2,0x90&lt;br/&gt;&lt;/font&gt;(only in front of your functions!)&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _MyFunction&lt;br/&gt;&lt;/font&gt;(make MyFunction globally visible)&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;_MyFunction:   # declaration&lt;br/&gt;     ...       # function body&lt;br/&gt;     ret       # finished, return to caller&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.comm _BNR,4&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;(reserves 4 bytes in the data segment for the global variable _BNR)&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl _BNR,%eax&lt;br/&gt;&lt;/font&gt;(copy the content of _BNR to EAX)&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl $_BNR,%eax&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;(copy the address where _BNR is stored to EAX)&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl $_AllMine,%eax&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;(copy the address where function _AllMine starts to 
EAX)&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;call *%eax&lt;br/&gt;&lt;/font&gt;(execute _AllMine)&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;The instruction &lt;b&gt;call *%eax&lt;/b&gt; is equivalent to &lt;b&gt;call _AllMine&lt;/b&gt;. However, it wastes one clock cycle loading the address of _AllMine into EAX. On the other hand, loading a return address into a register can save six clock cycles if we use simple JMP instructions instead of the CALL/RET mechanism. This, of course, is limited to a few local helper functions - the usual CALL/RET is more flexible, because we don't need to know where the called function is stored.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Calls And Jumps&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Calls and jumps can use either (global) labels or registers as operands. If a register is used, its name must be preceded by an asterisk &lt;b&gt;*&lt;/b&gt;. 
While the previous example showed us how to use a register together with a CALL instruction, the following example shows us how to create a jump table.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;    .data&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;    .align 2,0x00&lt;br/&gt;L99:.long L00            # jump table&lt;br/&gt;    .long L01&lt;br/&gt;    .long L02&lt;br/&gt;    .long L03&lt;br/&gt;    .long L04&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;br/&gt;    .text&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;    ...                  # prologue&lt;br/&gt;    movl $0x04,%ebx      # highest valid index&lt;br/&gt;    cmpl $0x04,%eax      # index above 4?&lt;br/&gt;    cmova %ebx,%eax      # yes: clamp to 4&lt;br/&gt;    jmp *L99(, %eax, 4)  # indexed jump&lt;br/&gt;L00:nop                  # target&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt; proc&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;    jmp L05&lt;br/&gt;L01:nop&lt;br/&gt;    jmp L05&lt;br/&gt;L02:nop&lt;br/&gt;    jmp L05&lt;br/&gt;L03:nop&lt;br/&gt;    jmp L05&lt;br/&gt;L04:nop&lt;br/&gt;    jmp L05&lt;br/&gt;L05:nop                  # epilogue&lt;br/&gt;    ...&lt;br/&gt;    ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;This is a C switch{} statement coded in assembler. 
Using &lt;i&gt;cmova&lt;/i&gt;, we save one conditional jump and avoid the ten penalty cycles for a wrong "guess" of the branch prediction logic.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;Please note that I put the jump table into the &lt;i&gt;&lt;b&gt;.data&lt;/b&gt;&lt;/i&gt;, not the &lt;i&gt;&lt;b&gt;.text&lt;/b&gt;&lt;/i&gt; segment. As LETNi and AMD clearly state, this is where data belongs. Unfortunately, GCC creates all jump tables in the &lt;i&gt;&lt;b&gt;.text&lt;/b&gt;&lt;/i&gt; segment. To optimise your code, you should move them to the top of the file and put them into the &lt;i&gt;&lt;b&gt;.data&lt;/b&gt;&lt;/i&gt; segment as shown above.&lt;br/&gt;&lt;br/&gt;Keep in mind that 32 bit jump tables should be aligned to a multiple of 4, while 64 bit jump tables should be aligned to a multiple of 8. This is done by putting an appropriate &lt;i&gt;&lt;b&gt;.align&lt;/b&gt;&lt;/i&gt; statement in front of the (first) jump table.&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;.align&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;GCC spices source files with tons of &lt;i&gt;&lt;b&gt;.align&lt;/b&gt;&lt;/i&gt; statements spread all over the text segment. &lt;/font&gt;If an &lt;i&gt;&lt;b&gt;.align&lt;/b&gt;&lt;/i&gt; precedes a function, you should leave it alone - do not remove it! Because modern processors work with quite small instruction caches (32 bytes on Athlon64 machines), it might be necessary to insert an&lt;i&gt;&lt;b&gt; .align&lt;/b&gt;&lt;/i&gt; &lt;b&gt;&lt;i&gt;4,,15&lt;/i&gt;&lt;/b&gt; in front of a branch target to support the processor's prefetch mechanisms. 
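&lt;br/&gt;&lt;br/&gt;As a sketch only (the loop, its registers and the local labels are hypothetical, not taken from ST-Open's sources), an aligned branch target could be set up like this. It assumes an assembler target where the first operand of &lt;i&gt;&lt;b&gt;.align&lt;/b&gt;&lt;/i&gt; is a power of two, so that 4 requests a 16 byte boundary; &lt;i&gt;&lt;b&gt;.p2align 4,,15&lt;/b&gt;&lt;/i&gt; states the same request unambiguously:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    xorl %eax,%eax            # sum = 0&lt;br/&gt;    xorl %ecx,%ecx            # index = 0&lt;br/&gt;    jmp 1f                    # enter at the loop test, padding is skipped&lt;br/&gt;    .align 4,,15              # pad to the next 16 byte boundary&lt;br/&gt;  0:addl (%esi,%ecx,4),%eax   # branch target: add one doubleword&lt;br/&gt;    incl %ecx                 # next index&lt;br/&gt;  1:cmpl %edx,%ecx            # EDX = number of doublewords&lt;br/&gt;    jb 0b                     # loop while index is below count&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;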
However, you should avoid inserting &lt;i&gt;&lt;b&gt;.align &lt;/b&gt;&lt;/i&gt;statements at places where they might be executed. Each &lt;i&gt;&lt;b&gt;.align&lt;/b&gt;&lt;/i&gt; statement inserts an appropriate number of &lt;i&gt;&lt;b&gt;nop&lt;/b&gt;s &lt;/i&gt;to move the&lt;font face='arial'&gt; instruction pointer to the next multiple of the cache line's size, so the next instruction "sits" at the beginning of a new cache line. This is important if the next instruction is the target of a branch. Because the processor speculatively prefetches the code of branch targets, execution continues at the beginning of a cache line if the branch was taken. Execution is sped up if the processor doesn't have to load the instructions of the branch target before it can continue to execute them.&lt;br/&gt;&lt;i&gt;&lt;br/&gt;&lt;br/&gt;&lt;/i&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;nop&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;The &lt;b&gt;&lt;i&gt;nop&lt;/i&gt;&lt;/b&gt; instruction puts the next free execution pipeline into idle mode for one clock cycle. If you insert it at the proper places, it &lt;i&gt;can&lt;/i&gt; improve performance and speed up execution. However, the benefits can only be determined experimentally. You have to test the runtime of several variants of your code with exceptional care. The &lt;b&gt;&lt;i&gt;rdtsc&lt;/i&gt;&lt;/b&gt; instruction is a good tool to measure the runtime of test functions with acceptable accuracy. 
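&lt;br/&gt;&lt;br/&gt;A minimal measurement sketch might look like this. The function name _TestFunction is hypothetical, and the sketch assumes that the tested function preserves ESI and EDI:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    rdtsc                  # EDX:EAX = time stamp counter (start)&lt;br/&gt;    movl %eax,%esi         # keep lower doubleword of start value&lt;br/&gt;    movl %edx,%edi         # keep upper doubleword of start value&lt;br/&gt;    call _TestFunction     # hypothetical function under test&lt;br/&gt;    rdtsc                  # EDX:EAX = time stamp counter (end)&lt;br/&gt;    subl %esi,%eax         # 64 bit subtraction: lower doubleword&lt;br/&gt;    sbbl %edi,%edx         # upper doubleword minus borrow&lt;br/&gt;                           # EDX:EAX = elapsed ticks&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;The result includes the overhead of the CALL/RET pair, so only differences between variants measured the same way are meaningful.&lt;br/&gt;&lt;br/&gt;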
If you write the output to a file, the gathered data might be sufficient to find out which variant is the fastest.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/1793967656339837794/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/1793967656339837794'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/1793967656339837794'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html' title='14 - AT&amp;amp;T Syntax'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-926888530363036844</id><published>2010-04-13T23:49:00.001+02:00</published><updated>2010-04-13T23:56:24.444+02:00</updated><title type='text'>13 - Appendix 2</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The second appendix provides a collection of recommendations, how existing code can be optimised. This collection will be extended whenever I stumble upon another piece of badly designed code...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Optimisation 01&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;To access compound data stored in doublewords, GCC uses one of these constructs by default:&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Lower Word&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;Version 1:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;movl %edx,%eax     # DP 1&lt;br/&gt;andl $65535,%eax   # DP 1 + wait for EAX&lt;br/&gt;cmpl $4623,%eax    # DP 1 + wait for EAX&lt;br/&gt;...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Version 2:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;movl %edx,%eax     # DP 1&lt;br/&gt;cmpw $4623,%ax     # DP 1 + wait for EAX&lt;br/&gt;...                #      + wait for AX&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The real task is the comparison of the lower word in a compound datatype - in our case a message parameter - against a given number. The first version copies the message parameter stored in EDX to EAX, clears the upper word via AND and finally compares the result against the ID. 
It takes three clock cycles to execute this construct, because the processor has to wait for the updated content of EAX after each of the first two instructions. The second version copies the content of EDX to EAX, then compares AX against the ID. That's worse than the first version. Loading the 32 bit register EAX and accessing the lower portion AX in the next instruction is penalised with some extra clock cycles, because some internal processing known as &lt;i&gt;register merging&lt;/i&gt; is required to make AX accessible for the comparison.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Actually, EDX already contains the entire message parameter, so there is no reason to copy it anywhere else. We can also assume that the message parameter was loaded into EDX several clock cycles before, so no &lt;i&gt;register merging&lt;/i&gt; is required. Therefore, the straight way&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;cmpw $0x120F,%dx   # DP 1&lt;br/&gt;...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;is the only proper solution, executed in one clock cycle, saving 9 (2) bytes.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Upper Word&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;movl %esi,%eax     # DP 1&lt;br/&gt;shrl $16,%eax      # DP 1 + wait for EAX&lt;br/&gt;andl $65535,%eax   # DP 1 + wait for EAX&lt;br/&gt;cmpl $523,%eax     # DP 1 + wait for EAX&lt;br/&gt;...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;The message parameter stored in ESI is copied to EAX, then EAX is shifted right by 16 bits. 
I do not know the reason why the upper word is cleared a second time, but I do know this: It is a superfluous operation. Finally, the separated upper word is compared against an immediate value, in our case a notification ID sent by one of the controls.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Extracting the upper word from a doubleword is more complex, because there are no instructions to access it directly. For the following examples, it is assumed the message parameter is stored at location 0x8C[ESP]. The first possible solution is to preload the upper word like this:&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;movzwl 0x8E(%esp),%edx  # DP 4&lt;br/&gt;...&lt;br/&gt;...                     # 4 clocks distance&lt;br/&gt;...&lt;br/&gt;...&lt;br/&gt;cmpl $0x010B,%edx       # DP 1&lt;br/&gt;...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;It is recommended to place the MOVZWL at top of your function, after all registers are saved. If you cannot keep the minimum distance of four clock cycles, a better solution might be&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;cmpw $0x010B,0x8E(%esp)     # DP 4&lt;br/&gt;...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;This one costs four clock cycles and should not be used if a lot of notifications are evaluated. 
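&lt;br/&gt;&lt;br/&gt;For several notifications, the preloaded variant might be sketched like this. The IDs other than 0x010B and the local labels are hypothetical:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;br/&gt;movzwl 0x8E(%esp),%edx  # DP 4, preload upper word once&lt;br/&gt;...                     # 4 clocks distance&lt;br/&gt;cmpl $0x010B,%edx       # DP 1&lt;br/&gt;je 0f                   # DP 1&lt;br/&gt;cmpl $0x010C,%edx       # DP 1, hypothetical ID&lt;br/&gt;je 1f                   # DP 1&lt;br/&gt;cmpl $0x010D,%edx       # DP 1, hypothetical ID&lt;br/&gt;je 2f                   # DP 1&lt;br/&gt;...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;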
If more than two comparisons are required, preloading saves a lot of clock cycles, even if it takes four clock cycles to preload that register and access its content with the next instruction. All following comparisons are done in one clock cycle, so we save time rather than spend it.&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Optimisation 02&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;GCC has some bad manners, like the one shown in &lt;i&gt;Conventions&lt;/i&gt;:&lt;br/&gt;&lt;br/&gt;Wrong:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;       movl %eax,%esi   # DP 1&lt;br/&gt;       testl %esi,%esi  # DP 1 + wait for ESI&lt;br/&gt;       je L8            # DP 1&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;Okay:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;       movl %eax,%esi   # DP 1&lt;br/&gt;       testl %eax,%eax  # DP 1&lt;br/&gt;       je L8            # DP 1&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Even though both snippets look like identical twins, the first one is one clock cycle slower. Both versions copy the content of EAX to ESI, because we need it after calling some other functions. The decisive difference between both versions is the register we use to test if its content is equal to zero. While the first version uses ESI and thus has to wait until EAX has been copied to ESI, the second version uses EAX for testing. Without dependencies, the first two instructions of the second version are executed simultaneously, saving one clock cycle. 
If this code snippet sits within a loop running 32,768 times, we save 32,768 clock cycles with a simple optimisation.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Optimisation 03&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;In most cases, the &lt;i&gt;direct path&lt;/i&gt; is the shortest way to move from A to B. Here is a snippet taken from the message procedure of ST-Open's&lt;/font&gt;&lt;font face='arial'&gt;&lt;i&gt; V700 skeleton&lt;/i&gt;, translated into assembly language by &lt;i&gt;GCC 3.3.5.&lt;/i&gt;:&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    ...&lt;br/&gt;    movl 12(%ebp),%eax   # DP 3&lt;br/&gt;    movl %eax,-52(%ebp)  # DP 3 wait&lt;br/&gt;    cmpl $35,-52(%ebp)   # DP 4 wait&lt;br/&gt;    je L25               # DP 1&lt;br/&gt;    cmpl $35,-52(%ebp)   # DP 4 !&lt;br/&gt;    ja L56               # DP 1&lt;br/&gt;    cmpl $7,-52(%ebp)    # DP 4 !&lt;br/&gt;    je L14               # DP 1&lt;br/&gt;    cmpl $7,-52(%ebp)    # DP 4 !&lt;br/&gt;    ja L57               # DP 1&lt;br/&gt;    cmpl $1,-52(%ebp)    # DP 4 !&lt;br/&gt;    je L15               # DP 1&lt;br/&gt;    jmp L54              # DP 1&lt;br/&gt;L57:cmpl $32,-52(%ebp)   # DP 4 !&lt;br/&gt;    je L27               # DP 1&lt;br/&gt;    jmp L54              # DP 1&lt;br/&gt;L56:cmpl $41,-52(%ebp)   # DP 4 !&lt;br/&gt;    je L17               # DP 1&lt;br/&gt;    cmpl $41,-52(%ebp)   # DP 4 !&lt;br/&gt;    ja L58               # DP 1&lt;br/&gt;    cmpl $36,-52(%ebp)   # DP 4 !&lt;br/&gt;    je L51               # DP 1&lt;br/&gt;    jmp L54              # DP 1&lt;br/&gt;L58:cmpl $79,-52(%ebp)   # DP 4 !&lt;br/&gt;    je L24               # DP 1&lt;br/&gt;  
  jmp L54              # DP 1&lt;br/&gt;    ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;The message is copied from 12[EBP] to EAX, then EAX is copied to -52[EBP] to perform 10 comparisons of -52[EBP] against several message numbers. It is quite stupid to compare message numbers against a memory location, if the message already was loaded into a register, but copying the content of one stack element to another to compare that copy against message numbers is the unbeaten top of stupidity. It seems, these performance brakes were not wasting enough superfluous clock cycles, so GCC 3.3.5. decided to spice the distributor with pointless conditional jumps to force a lot of 'misses' of the branch prediction logic. Each 'miss' is rewarded with at least ten penalty cycles...&lt;br/&gt;&lt;br/&gt;Applying some logic, the message distributor could look like this:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    ...&lt;br/&gt;    movl 0x0C(%ebp),%eax     # DP 3&lt;br/&gt;    movl 0x08(%ebp),%edi     # DP 3&lt;br/&gt;    movzwl 0x10(%ebp),%ecx   # DP 4&lt;br/&gt;    cmpl $0x4F,%eax          # DP 1&lt;br/&gt;    je L24                   # DP 1&lt;br/&gt;    cmpl $0x23,%eax          # DP 1&lt;br/&gt;    je L25                   # DP 1&lt;br/&gt;    cmpl $0x07,%eax          # DP 1&lt;br/&gt;    je L14                   # DP 1&lt;br/&gt;    cmpl $0x24,%eax          # DP 1&lt;br/&gt;    je L51                   # DP 1&lt;br/&gt;    cmpl $0x20,%eax          # DP 1&lt;br/&gt;    je L27                   # DP 1&lt;br/&gt;    cmpl $0x29,%eax          # DP 1&lt;br/&gt;    je L17                   # DP 1&lt;br/&gt;    cmpl $0x01,%eax          # DP 1&lt;br/&gt;    jne L54                  # DP 1&lt;br/&gt;    ...                      # WM_CREATE&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Code size is reduced to about one third and it takes less than a fourth of the time to execute it. 
Please note that preloading three registers in the second snippet is done in the same time it takes to execute the first line in the first snippet.&lt;br/&gt;&lt;br/&gt;Because some hundred message numbers are defined, a jump table would occupy some KB, so we cannot avoid a distributor with conditional jumps. A disadvantage of conditional jumps is the possibility of 'misses' of the branch prediction logic. As mentioned, a 'miss' causes at least ten penalty cycles (the processor has to flush its pipeline and fetch the proper instructions). We cannot avoid the occasional 'miss', so you should place the most often processed messages at the top of the distributor. It doesn't matter if a WM_CREATE or WM_CLOSE procedure needs 20 clock cycles more or less, but the user will certainly notice if required updates of the main window (repainting) are delayed.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Optimisation 04&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;Another snippet taken from ST-Open's &lt;i&gt;V700 skeleton&lt;/i&gt;, translated into assembly language by GCC 3.3.5. Here, some control variables for the Loader are set to initial values before LDinit() is called. If the initialisation of the Loader fails, a message box with the error code is displayed. ST-Open's entire programming system depends on the Loader. It manages memory allocation, loading of fields for the database engine and other files. Without a running Loader, it is impossible to create a main window, because its position, size and control flags are stored in a file called &lt;i&gt;SystemNumerics&lt;/i&gt;. In addition, ST-Open's multilingual menu and dialog texts are stored in fields managed by ST-Open's database engine. No fields are available if the Loader is missing. 
Therefore, if the Loader is not initialised, the program is terminated with the mentioned message box, because it cannot access its data.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;   LC3:.ascii "RC LDinit()\0" # CODE segment!&lt;br/&gt;&lt;br/&gt;.globl _StInit&lt;br/&gt;_StInit:&lt;br/&gt;       pushl %ebp             # DP 3&lt;br/&gt;       movl %esp,%ebp         # DP 1&lt;br/&gt;       subl $8,%esp           # DP 1&lt;br/&gt;       movl $1,_DEBUG         # DP 3&lt;br/&gt;       movl $0,_DUMPLINE      # DP 3&lt;br/&gt;       movl $0,_USE_LDF       # DP 3&lt;br/&gt;       call _LDinit&lt;br/&gt;       movl %eax,-4(%ebp)     # DP 3   why?&lt;br/&gt;       cmpl $0,-4(%ebp)       # DP 4   why?&lt;br/&gt;       je L56                 # DP 1&lt;br/&gt;       subl $8,%esp           # DP 1   why?&lt;br/&gt;       pushl $LC3             # DP 3&lt;br/&gt;       pushl -4(%ebp)         # DP 3   why?&lt;br/&gt;       call _debug&lt;br/&gt;       addl $16,%esp          # DP 1   great!&lt;br/&gt;       movl -4(%ebp),%eax     # DP 3   why?&lt;br/&gt;       movl %eax,-8(%ebp)     # DP 3   why?&lt;br/&gt;       jmp L55                # DP 1&lt;br/&gt;   L56:movl $1,_OLH_MODE      # DP 3&lt;br/&gt;       movl $0,-8(%ebp)       # DP 3   why?&lt;br/&gt;   L55:movl -8(%ebp),%eax     # DP 3   why?&lt;br/&gt;       leave                  # VP 3&lt;br/&gt;       ret&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt; &lt;font face='arial'&gt;Without error 24, else 35 clock cycles&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;.&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;This version of GCC has a pathological tendency to waste as much time as possible. 
It might be a good therapy for stressed managers to relax, have a drink and smoke a box of cigarettes or two while the machine is counting from one to ten...&lt;br/&gt;&lt;br/&gt;The smart version:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;       .data                  # DATA segment!&lt;br/&gt;   LC3:.ascii "RC LDinit()\0"&lt;br/&gt;       .text                  # CODE segment!&lt;br/&gt;.globl _StInit&lt;br/&gt;_StInit:&lt;br/&gt;       movl $0x00,_DUMPLINE   # DP 3&lt;br/&gt;       movl $0x00,_USE_LDF    # DP 3&lt;br/&gt;       movl $0x01,_DEBUG      # DP 3&lt;br/&gt;       call _LDinit&lt;br/&gt;       testl %eax,%eax        # DP 1&lt;br/&gt;       je 0f                  # DP 1&lt;br/&gt;       subl $0x08,%esp        # DP 1&lt;br/&gt;       movl %eax,0x00(%esp)   # DP 3&lt;br/&gt;       movl $LC3,0x04(%esp)   # DP 3&lt;br/&gt;       call _debug&lt;br/&gt;       addl $0x08,%esp        # DP 1&lt;br/&gt;     0:movl $0x01,_OLH_MODE   # DP 3&lt;br/&gt;       ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;font face='arial'&gt;Without error 8, else 12 clock cycles.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;Only one third of the time is required to execute the smart version. Code written like this is smaller and executes much faster. Faster programs make users happy. Happy people are much better than stressed and angry people. Hence, &lt;i&gt;Intelligent Design&lt;/i&gt; makes all people very happy, because it is the fastest software design they have ever seen... 
;)&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Optimisation 05&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;This is the &lt;b&gt;About...&lt;/b&gt; box of many recent ST-Open programs:&lt;br/&gt;&lt;font face='monospace'&gt;&lt;br/&gt;.globl DlgAbout&lt;br/&gt;DlgAbout:&lt;br/&gt;       pushl      %ebp              # DP 3&lt;br/&gt;       movl       %esp, %ebp        # DP 1&lt;br/&gt;       pushl      %ebx              # DP 3&lt;br/&gt;       pushl      %eax              # DP 3 ?&lt;br/&gt;       movl       12(%ebp), %eax    # DP 3&lt;br/&gt;       cmpl       $32, %eax         # DP 1&lt;br/&gt;       movl       8(%ebp), %ebx     # DP 3&lt;br/&gt;       movl       16(%ebp), %edx    # DP 3&lt;br/&gt;       movl       20(%ebp), %ecx    # DP 3&lt;br/&gt;       je         L4&lt;br/&gt;       cmpl       $59, %eax         # DP 1&lt;br/&gt;       je         L13&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;These four instructions are redundant &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;font face='arial'&gt;and violate the rule "Never write to stack elements above EBP!":&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;&lt;br/&gt;       movl       %ecx, 20(%ebp)    # DP 3&lt;br/&gt;       movl       %edx, 16(%ebp)    # DP 3&lt;br/&gt;       movl       %eax, 12(%ebp)    # DP 3&lt;br/&gt;   L12:movl       %ebx, 8(%ebp)     # DP 3 &lt;br/&gt;       movl       -4(%ebp), %ebx    # DP 3&lt;br/&gt;       leave                        # VP 3&lt;br/&gt;       jmp        DefDP&lt;br/&gt;&lt;br/&gt;       .p2align 2,,3&lt;br/&gt;   L13:pushl      $155              # DP 3&lt;br/&gt;       pushl      $151              # DP 3&lt;br/&gt;       pushl      $150              # DP 3&lt;br/&gt;       pushl      %ebx              # DP 3&lt;br/&gt;       call       DLGtxt&lt;br/&gt;       movl       %ebx, (%esp)      # DP 3&lt;br/&gt;       call       
CtrWn&lt;br/&gt;   L11:addl       $16, %esp         # DP 1&lt;br/&gt;       xorl       %eax, %eax        # DP 1&lt;br/&gt;       movl       -4(%ebp), %ebx    # DP 3&lt;br/&gt;       leave                        # VP 3&lt;br/&gt;       ret&lt;br/&gt;&lt;br/&gt;       .p2align 2,,3&lt;br/&gt;    L4:cmpw       $4623, %dx        # DP 1&lt;br/&gt;       je         L14&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Here, GCC 3.3.5 does it again:&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;       movl       %ecx, 20(%ebp)    # DP 3&lt;br/&gt;       movl       %edx, 16(%ebp)    # DP 3&lt;br/&gt;       movl       $32, 12(%ebp)     # DP 3&lt;br/&gt;       jmp        L12&lt;br/&gt;   L14:subl       $12, %esp         # DP 1&lt;br/&gt;       pushl      %ebx              # DP 3&lt;br/&gt;       call       WinDD&lt;br/&gt;       jmp        L11&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;This isn't only the worst code a compiler can generate. It also does very mean and dangerous things like overwriting passed parameters. Code like this might be usual in the evil world of virus or malware programmers, but definitely does not belong to the world of proper and sane software.&lt;br/&gt;&lt;br/&gt;The smart version:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _DlgAbout&lt;br/&gt;_DlgAbout:&lt;br/&gt;       movl 0x08(%esp),%eax         # DP 3&lt;br/&gt;       cmpl $0x3B,%eax              # DP 1&lt;br/&gt;       je 0f&lt;br/&gt;       cmpl $0x20,%eax              # DP 1&lt;br/&gt;       jne _DefDP&lt;br/&gt;       cmpw $0x120F,0x0C(%esp)      # DP 4&lt;br/&gt;       jne _DefDP&lt;br/&gt;       pushl 0x04(%esp)             # DP 4&lt;br/&gt;       call _WinDD&lt;br/&gt;       addl $0x04,%esp              # DP 1&lt;br/&gt;       ret&lt;br/&gt;     0:pushl $0x9B                  # DP 3&lt;br/&gt;       pushl $0x97                  # DP 3&lt;br/&gt;       pushl $0x96                  # DP 3&lt;br/&gt;       pushl 0x10(%esp)             # DP 4&lt;br/&gt;       call _DLGtxt&lt;br/&gt;       call 
_CtrWn&lt;br/&gt;       addl $0x10,%esp              # DP 1&lt;br/&gt;       ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;This is the smart version of &lt;b&gt;DlgAbout()&lt;/b&gt;. It's not the fastest, but definitely the most compact form. The fallback to five PUSH instructions is one of those few exceptions where PUSH is superior to MOV, because we can PUSH, but not MOV, the content of a memory location to another memory location with a single instruction. If we want to do it with MOV, we have to use a register where the data is stored temporarily. We execute WM_INITDLG only once, and six clock cycles (3 nanoseconds) more or less will not be noticed by the user.&lt;br/&gt;&lt;br/&gt;Exceptions, of course, should not become the default behaviour. Most functions in a program are more complex than popping up a simple dialog with only one task: "Destroy yourself after the user pushed that OK button."&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Optimisation 06&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;Finally, the optimised version of the function we discussed while talking about some &lt;i&gt;Caveats&lt;/i&gt; of conventional programming techniques and possible&lt;i&gt; Improvements&lt;/i&gt;. You have seen a small snippet of the optimised version while I introduced the &lt;i&gt;Intelligent Design&lt;/i&gt; &lt;i&gt;Rules&lt;/i&gt; to you. 
Now, here is the entire code, spiced with some comments on why each step has to be done exactly at this point and nowhere else.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        .data&lt;br/&gt;&lt;br/&gt;        .p2align 4,0x00&lt;br/&gt;    jt0:.long  L02&lt;br/&gt;        .long  L03&lt;br/&gt;        .long  L04&lt;br/&gt;        .long  L05&lt;br/&gt;        .long  L06&lt;br/&gt;        .long  L07&lt;br/&gt;        .long  L08&lt;br/&gt;        .long  L09&lt;br/&gt;        .long  L10&lt;br/&gt;        .long  L11&lt;br/&gt;        .long  L12&lt;br/&gt;        .long  L13&lt;br/&gt;        .long  L14&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L08&lt;br/&gt;        .long  L09&lt;br/&gt;        .long  L10&lt;br/&gt;        .long  L11&lt;br/&gt;        .long  L12&lt;br/&gt;        .long  L13&lt;br/&gt;        .long  L14&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Jump tables should never be placed anywhere in the code segment - they belong to the data segment! All 'dead' jump targets were removed after updating the resource IDs in the resource definition file, reducing the table size by 40 bytes (about thirty percent).&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        .text&lt;br/&gt;&lt;br/&gt;        .align 2,0x90&lt;br/&gt;.globl MoveDlg&lt;br/&gt;MoveDlg:subl   $0x3C,%esp&lt;br/&gt;        nop&lt;br/&gt;        nop&lt;br/&gt;        movl   %ecx,0x30(%esp)&lt;br/&gt;        movl   %edi,0x34(%esp)&lt;br/&gt;        movl   %esi,0x38(%esp)&lt;br/&gt;        movl   0x40(%esp),%edi&lt;br/&gt;        movl   0x44(%esp),%eax&lt;br/&gt;        movzwl 0x48(%esp),%ecx&lt;br/&gt;        movl   _BNR,%esi&lt;br/&gt;        cmpl   $0x30,%eax&lt;br/&gt;        je     L01&lt;br/&gt;        cmpl   $0x20,%eax&lt;br/&gt;        je     L00&lt;br/&gt;        cmpl   $0x3B,%eax&lt;br/&gt;        jne    L15&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Our distributor was taken from the improved C version; the frequently required base address of the global variables (SystemNumerics) is preloaded 
into ESI. Preloading parameters speeds up execution of subfunctions, because we can access the contents of the corresponding registers without delay whenever they are needed. Placing our loads like listed above keeps all pipes busy. While the first three instructions read data from L1 cache, it is not very likely that the address of _BNR is loaded in L1 cache, yet. We can assume that the first three loads have a latency of three clock cycles, only the last load has to access main memory. Therefore, the message number surely is present in EAX before the first comparison is executed.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        movl   0x1C54(%esi),%eax&lt;br/&gt;        movl   0x01D8(%esi),%ecx&lt;br/&gt;        movl   $0x00,0x28E0(%esi)&lt;br/&gt;        movl   %eax,0x00(%esp)&lt;br/&gt;        movl   $0x0B,0x04(%esp)&lt;br/&gt;        movl   $0x00,0x08(%esp)&lt;br/&gt;        movl   $0x02,0x0C(%esp)&lt;br/&gt;        movl   $0x00,0x10(%esp)&lt;br/&gt;        movl   %edi,0x14(%esp)&lt;br/&gt;        call   _FDacc&lt;br/&gt;        movl   $0x04,0x08(%esp)&lt;br/&gt;        movl   %ecx,0x14(%esp)&lt;br/&gt;        call   _FDacc&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Here, the order of instructions was rearranged to keep all execution pipes busy. In our case, there is only one critical instruction. The remaining Instructions don't cause dependencies, so we can place them anywhere in our preload sequence. Placing the line &lt;i&gt;movl 0x1C54(%esi),%eax&lt;/i&gt; on top, there still are two other preloads between loading and using EAX. All &lt;i&gt;mov&lt;/i&gt; instructions have a latency of three clock cycles, keeping a sufficient distance between load and store. 
In the worst case scenario, where 0x01D8[ESI] and 0x28E0[ESI] are present in L1 cache while 0x1C54[ESI] is not, execution is delayed by 15 clock cycles, because the 4th instruction blocks one pipe for 24 (27 for line one - 3 for lines two and three) clock cycles, while the following instructions are executed in the two remaining pipes in nine clocks (our stack definitely is present in L1 cache whenever MoveDlg() is called). Before we call FDacc() the second time, only the two changing parameters are updated, saving four superfluous write operations.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        movl   %edi,0x00(%esp)&lt;br/&gt;        movl   $0x1240,0x04(%esp)&lt;br/&gt;        movl   $0x0120,0x08(%esp)&lt;br/&gt;        movl   $0xFFFFFFFF,0x0C(%esp)&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        movl   $0x1244,0x04(%esp)&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        movl   $0x1246,0x04(%esp)&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        movl   $0x124E,0x04(%esp)&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        movl   $0xD2,0x04(%esp)&lt;br/&gt;        movl   $0xD3,0x08(%esp)&lt;br/&gt;        movl   $0xE9,0x0C(%esp)&lt;br/&gt;        call   _DLGtxt&lt;br/&gt;        call   _CtrWn&lt;br/&gt;        call   _DlgShow&lt;br/&gt;        jmp    L16&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Performing only really required write operations, the improved version listed above works with just ten write instructions - less than a third of GCC's draft with its 26&lt;i&gt; push&lt;/i&gt; instructions and six obligatory corrections of the stack pointer.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L00:subl   $0x1231,%ecx&lt;br/&gt;        je     0f&lt;br/&gt;        decl   %ecx&lt;br/&gt;        je     1f&lt;br/&gt;        decl   %ecx&lt;br/&gt;        jne    L15&lt;br/&gt;        movl   $0x11,0x00(%esp)&lt;br/&gt;        call   _Help&lt;br/&gt;        jmp    L16&lt;br/&gt;      0:movl   $0x00,0x28E0(%esi)&lt;br/&gt;        jmp    2f&lt;br/&gt;      1:orl    
$0x00040000,0x28E0(%esi)&lt;br/&gt;      2:movl   %edi,0x00(%esp)&lt;br/&gt;        call   _WinDD&lt;br/&gt;        jmp    L16&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;All compares with immediate values were replaced by a&lt;i&gt; sub&lt;/i&gt; and two &lt;i&gt;dec&lt;/i&gt; instructions. The one byte instruction &lt;i&gt;dec&lt;/i&gt; reduces code size and relieves the processor's prefetch mechanism.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L01:subl   $0x1240,%ecx&lt;br/&gt;        js     L15&lt;br/&gt;        cmpl   $0x14,%ecx&lt;br/&gt;        ja     L15&lt;br/&gt;        movl   0x28E0(%esi),%eax&lt;br/&gt;        jmp    *jt0(, %ecx, 4)&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Moving jump tables to the data segment is an obligatory exercise for all functions following &lt;i&gt;Intelligent Design&lt;/i&gt; standards. Separating code and data avoids a lot of caveats coming along with mixing them together in the code segment. There's surely a reason why AMD and LETNi advise programmers against storing data of any kind in the code segment. By the way: The GNU assembler AS supports jump tables in the data segment. 
There's no reason to store them in the code segment as practised by GCC!&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L02:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x1000,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L03:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x0800,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L04:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x0400,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L05:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x0200,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L06:andl   $0xFFFFFE7F,%eax&lt;br/&gt;        orl    $0x0100,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L07:andl   $0xFFFFFE7F,%eax&lt;br/&gt;        orl    $0x80,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L08:andl   $0xFFFFFF80,%eax&lt;br/&gt;        orl    $0x40,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L09:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x20,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L10:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x10,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L11:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x08,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L12:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x04,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L13:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x02,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L14:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x01,%eax&lt;br/&gt;      0:movl   %eax,0x28E0(%esi)&lt;br/&gt;        jmp    L16&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Sorry, but the only solution for this problem were to change the definition of all flags to support automated evaluation. 
Obviously, this is not the case and we are not allowed to replace existing definitions without asking, first...&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L15:movl   0x30(%esp),%ecx&lt;br/&gt;        movl   0x34(%esp),%edi&lt;br/&gt;        movl   0x38(%esp),%esi&lt;br/&gt;        addl   $0x3C,%esp&lt;br/&gt;        jmp    _DefDP&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;One version of this exit is enough. GCC's draft, originally split into two absolutely identical subfunctions, was reduced to the really necessary code, keeping five (27.78 percent) of eighteen instructions. Even though we removed thirteen (redundant) instructions, this subfunction still works fine.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L16:movl   0x30(%esp),%ecx&lt;br/&gt;        movl   0x34(%esp),%edi&lt;br/&gt;        movl   0x38(%esp),%esi&lt;br/&gt;        addl   $0x3C,%esp&lt;br/&gt;        xorl   %eax,%eax&lt;br/&gt;        ret&lt;br/&gt;&lt;br/&gt;.comm _GVAR,4&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;This is at least as fast as three &lt;i&gt;pop&lt;/i&gt; and one &lt;i&gt;leave&lt;/i&gt; instruction!&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/blockquote&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=1b8e8d3b-1fbe-88b4-bc3b-48d72f142e68' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-926888530363036844?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://st-intelligentdesign.blogspot.com/feeds/926888530363036844/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/13-appendix-2.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/926888530363036844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/926888530363036844'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/13-appendix-2.html' title='13 - Appendix 2'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-6648030666781775243</id><published>2010-04-13T18:26:00.001+02:00</published><updated>2010-04-13T22:44:12.051+02:00</updated><title type='text'>12 - Appendix 1</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The first appendix supplies you with a collection of examples, showing how to design new functions from scratch. Some of them might be used as templates - just C&amp;amp;P them to your source files. The remaining functions may serve as hints on how to solve some specific problems. Practice is the best teacher you can get. Read the examples, then start coding your own stuff. While debugging it, you will learn much more than a teacher ever could show you. 
Without errors, you never get a deeper insight how something really works.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Example 01&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;No registers, local variables or parameters are required. All three areas therefore have a size of zero and the stackpointer stays untouched:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _Brd&lt;br/&gt;  _Brd:movl 0x04(%esp),%eax   # block address&lt;br/&gt;       addl 0x08(%esp),%eax   # + offset&lt;br/&gt;       movzb 0x00(%eax),%eax  # Byte[block+offset]&lt;br/&gt;       ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Brd() &lt;strike&gt;is&lt;/strike&gt; was one of the few remaining crutches for C programmers provided by ST-Open's libraries. Even if the function is a convincing example for really bad code, this construct still is ways faster than any equivalent generated by a C compiler.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Assembly language programmers do not need external functions to access data from an offset to a base address, of course. Loading the block address at least three clock cycles before we start to access data relative to it is all we have to do:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;       ...&lt;br/&gt;       movl 0x04(%esp),%ecx    # block address&lt;br/&gt;       ...&lt;br/&gt;       ...                     
# 3 clocks distance!&lt;br/&gt;       ...&lt;br/&gt;       movzb 0x1234(%ecx),%eax # DB @ 0x1234[block]&lt;br/&gt;       ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Code like this prevents the other two execution pipes from taking a nap while the first pipe is busy with copying the block address to ECX.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;b&gt;Side Note:&lt;/b&gt; 'Three clock cycles away' &lt;i&gt;does not&lt;/i&gt; mean 'three instructions away'! Not all instructions have a latency of three clock cycles, so it might be necessary to fill the dotted lines with many more than just two instructions. For example, direct manipulations of registers without memory operands - like &lt;i&gt;xorl %eax,%eax&lt;/i&gt; - generally are executed within one clock cycle. We would have to insert six of them to feed the other two pipes for the three clock cycles the first pipe is busy with loading the block address into ECX. You have to calculate the latencies of all used instructions and keep them in the proper order to prevent interruption of simultaneous execution in all three pipes.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Example 02&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Here we store two registers and pass four parameters to the API. 
The size of our stack frame therefore is 8 + 16 = 24 bytes:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _WinPP&lt;br/&gt;_WinPP:subl $0x3C,%esp&lt;br/&gt;       nop&lt;br/&gt;       nop&lt;br/&gt;       movdqu 0x40(%esp),%xmm0&lt;br/&gt;       movl %edx,0x10(%esp)&lt;br/&gt;       movl %ecx,0x14(%esp)&lt;br/&gt;       movdqu %xmm0,0x00(%esp)&lt;br/&gt;       call _WinSetPresParam&lt;br/&gt;       movl 0x10(%esp),%edx&lt;br/&gt;       movl 0x14(%esp),%ecx&lt;br/&gt;       addl $0x3C,%esp&lt;br/&gt;       ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;WinPP() is one of many 'sandboxes', only called to save ECX and EDX and restore them after the API destroyed them. To speed up execution and save six MOV instructions with memory references, two MOVDQU instructions are used. XMM registers can hold four doublewords, so all four parameters can be copied in one gulp.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Example 03&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Finally, we store one register, use a 32 byte string and call two functions - the first has three parameters to pass, the second one. Hence, the size of our stack frame is 4 + 32 + 12 = 48 bytes. 
We add a 16 byte safety gap and subtract 64 from ESP:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _GetSz&lt;br/&gt;_GetSz:movl 0x04(%esp),%eax&lt;br/&gt;       subl $0x3C,%esp&lt;br/&gt;       nop&lt;br/&gt;       movl %ebx,0x38(%esp)&lt;br/&gt;       leal 0x0C(%esp),%ebx&lt;br/&gt;       movl %eax,0x00(%esp)&lt;br/&gt;       movl $0x1234,0x04(%esp)&lt;br/&gt;       movl %ebx,0x08(%esp)&lt;br/&gt;       call _QEf&lt;br/&gt;       movl %ebx,0x00(%esp)&lt;br/&gt;       call _SLen&lt;br/&gt;       movl 0x38(%esp),%ebx&lt;br/&gt;       addl $0x3C,%esp&lt;br/&gt;       ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;GetSz(h) is called with the dialog's window handle as its only parameter. The content of the entryfield specified by the dialog's window handle and a fixed resource ID (0x1234) is queried via the sandbox QEf(). Our temporary 32 byte string buffer occupies the area 0x0C[ESP] through 0x2B[ESP]. Entryfields are limited to 32 bytes by default (OS/2), so the buffer is large enough to prevent QEf() from overwriting other data. To determine the size of the returned string, SLen(), a function provided by ST-Open's main library, is called. Finally, the string size returned in EAX is passed back to the caller.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;We need the window handle for the first call only, so it is copied to EAX before the stack frame is created. This saves the extra code and clock cycles needed to save and restore an additional register. SLen() is a standard function provided by ST-Open's main library. 
It could be replaced by this alternative:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;       ...&lt;br/&gt;       call _QEf&lt;br/&gt;       xorl %eax,%eax&lt;br/&gt;     0:cmpb $0x00,0x00(%ebx)&lt;br/&gt;       je 1f&lt;br/&gt;       incl %ebx&lt;br/&gt;       incl %eax&lt;br/&gt;       jmp 0b&lt;br/&gt;     1:movl 0x38(%esp),%ebx&lt;br/&gt;       addl $0x3C,%esp&lt;br/&gt;       ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Replacing SLen() with equivalent code saves 8 clock cycles for the CALL/RET sequence and for preloading registers in the called function. To reduce overhead, you might consider replacing less complex functions like SLen() with your own code to save external calls.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Example 04&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;This example shows how local variables can be aligned to 16 byte boundaries, as required for XMM instructions like MOVDQA and friends. Actually, it is impossible to align ESP directly, so we have to sacrifice a general purpose register. Because we do not know to which multiple of four ESP is currently aligned, we have to add a 16 byte safety gap to the value we subtract from ESP. In &lt;b&gt;lng.S&lt;/b&gt;, a part of ST-Open's libraries, we can find the following code. 
Please notice that this code does not fully comply to &lt;i&gt;Intelligent Design&lt;/i&gt; rules - it creates a stack frame not aligned to a multiple of 64!&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;       ...&lt;br/&gt;       .align 2,0x90&lt;br/&gt;.globl _MNUtxt&lt;br/&gt;_MNUtxt:&lt;br/&gt;       subl $0x50,%esp&lt;br/&gt;       nop&lt;br/&gt;       nop&lt;br/&gt;       movl %ebp,0x4C(%esp)&lt;br/&gt;       movl %esi,0x48(%esp)&lt;br/&gt;       movl %edi,0x44(%esp)&lt;br/&gt;       movl %ebx,0x40(%esp)&lt;br/&gt;       movl %ecx,0x3C(%esp)&lt;br/&gt;       movl %edx,0x38(%esp)&lt;br/&gt;       movl _BNR,%esi&lt;br/&gt;       leal 0x10(%esp),%edi&lt;br/&gt;       movl 0x58(%esp),%ebx&lt;br/&gt;       movl 0x5C(%esp),%ecx&lt;br/&gt;       movl 0x20(%esi),%edx&lt;br/&gt;       andl $0xFFFFFFF0,%edi&lt;br/&gt;       pxor %xmm0,%xmm0&lt;br/&gt;       subl %ebx,%ecx&lt;br/&gt;       jns 0f&lt;br/&gt;       movl $0x0A,%eax&lt;br/&gt;       jmp L00&lt;br/&gt;       /*&lt;br/&gt;         load field FFFFFF12&lt;br/&gt;       */&lt;br/&gt;     0:andl $0x0F,%edx&lt;br/&gt;       movq %xmm0,0x00(%edi)&lt;br/&gt;       movl $0xFFFFFF12,0x08(%edi)&lt;br/&gt;       movl $0x00000003,0x0C(%edi)&lt;br/&gt;       movdqa %xmm0,0x10(%edi)&lt;br/&gt;       movq %xmm0,0x20(%edi)&lt;br/&gt;       movl %edx,0x20(%esi)&lt;br/&gt;       movl %edi,0x00(%esp)&lt;br/&gt;       call _LDreq&lt;br/&gt;       testl %eax,%eax&lt;br/&gt;       jne L00&lt;br/&gt;       ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;This function passes the address of a LD structure to LDreq(). The LD structure is only required for this call, so we create it in our stack frame on the fly. Because most of the parameters should be set to zero, we clear them with XMM instructions, saving five &lt;i&gt;mov&lt;/i&gt;s. To align the structure, we consider the following facts: EDI cannot end with any other number than 0, 4, 8 or C - the only possible multiples of four in hexadecimal notation. 
If it is 0, it already is aligned. If it is any other number x, we have to add the difference (16 - x) to the required offset, moving the offset to the beginning of the next 16 byte boundary. Calculations like this definitely take too much time, so we use a trick and add the largest possible number to EDI, then &lt;i&gt;and&lt;/i&gt; the new content of EDI with the pattern &lt;b&gt;0xFFFFFFF0&lt;/b&gt;. If ESP currently points to address 0x0003FEC4, the required structure starts at 0x04[ESP] =&amp;gt; 0x0003FEC8. We add 0x0C + 0x04 = 0x10 to move the offset to a safe region. Now we do a &lt;b&gt;leal 0x10(%esp),%edi&lt;/b&gt;. This loads 0x0003FED4 into EDI. The final &lt;b&gt;andl $0xFFFFFFF0,%edi&lt;/b&gt; clears the lowest 4 bits of the address, leaving 0x0003FED0 - a properly aligned address with sufficient safety distance from the parameter at the bottom of our stack frame.&lt;br/&gt;&lt;br/&gt;MNUtxt() is a part of ST-Open's language support. Depending on the current language, entries of the corresponding subfield are copied to the menu items specified by their ID. 
Up to 16 languages can be stored in fields FFFFFF12 (user) and FFFFFF13 (system) and the user can switch between them in the running program.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;/div&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=584960b4-8eff-85b6-9920-912f26f1ec52' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-6648030666781775243?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/6648030666781775243/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/appendix-1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/6648030666781775243'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/6648030666781775243'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/appendix-1.html' title='12 - Appendix 1'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-7836739504569122704</id><published>2010-04-13T17:47:00.001+02:00</published><updated>2010-04-13T18:34:07.224+02:00</updated><title type='text'>11 - Rules</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div 
align='justify'&gt;&lt;font face='arial'&gt;Whenever we want to build a greater whole out of many independent parts created by individual contributors, we have to use a common set of rules. Anyone contributing a part to the whole has to obey these rules. If all contributors applied their own rules, we would end up with a lot of parts, none of which could work together with any other part. Using different interfaces, single parts could either not communicate at all or would misunderstand received messages completely. Therefore, we need a set of rules before we start to create parts of a greater whole. This set of rules, like a common language, defines a common interface, forcing any contributor to use standardised communication protocols, so other parts are able to understand all sent messages and do the right work. The big idea behind any set of rules lives or dies with strict obedience.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 1&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#006600'&gt;&lt;font color='#000099'&gt;The instructions &lt;i&gt;enter&lt;/i&gt;, &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; are archaic remains of a past era. They &lt;b&gt;must not&lt;/b&gt; be used in &lt;i&gt;Intelligent Design &lt;/i&gt;code&lt;/font&gt;.&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Each of these instructions changes the content of the stack pointer automatically. All &lt;i&gt;Intelligent Design&lt;/i&gt; functions depend on one principle: The stack pointer must not change its content for their entire runtime. We can either enjoy the advantages of a static stack pointer or waste precious time with obligatory corrections and repetitive writes of one and the same parameter. 
Using a 'flying' stack pointer and its urgently required companion, the base pointer, is &lt;b&gt;mutually exclusive&lt;/b&gt; with &lt;i&gt;Intelligent Design&lt;/i&gt;. Both concepts use the stack pointer in very different ways - there is no chance to use a mix of both concepts in one and the same function.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 2&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#000099'&gt;Stack frames are created by subtracting 0x3C in 32 bit functions or 0x38 in 64 bit functions from the stack pointer ESP/RSP. If larger stack frames are required, the basic value &lt;i&gt;0x3C&lt;/i&gt; or &lt;i&gt;0x38&lt;/i&gt; can be expanded in 64 byte steps (0x7C, 0xBC, 0xFC, ... in 32 bit functions, 0x78, 0xB8, 0xF8, ... in 64 bit functions).&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;This rule is valid for &lt;b&gt;IDEOS&lt;/b&gt; only. Any other operating system will not benefit from the advantages of a properly aligned stack. Because the stack generally isn't aligned in old-fashioned operating systems, you have to align it with some lines of redundant bloat if you use XMM registers.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;If you write &lt;b&gt;IDEOS&lt;/b&gt; code, you have to apply this rule. The only exception is the rare case in which you use no registers other than rAX and XMM0 through XMM3. If there are registers to preserve, you need a stack frame. The advantage of this rule is a properly aligned stack pointer, starting at the beginning of a cache line. &lt;b&gt;IDEOS&lt;/b&gt;'s concept guarantees that the stack pointer is naturally aligned to a multiple of 64 before any function is called. 
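&lt;br/&gt;&lt;br/&gt;As a minimal sketch of the arithmetic behind these values, assume that ESP is a multiple of 64 immediately before the &lt;i&gt;call&lt;/i&gt;, as &lt;b&gt;IDEOS&lt;/b&gt; is described to guarantee. The return address occupies 4 byte, so subtracting 0x3C moves ESP down by 0x40 in total (in 64 bit code: 0x08 + 0x38 = 0x40). The label &lt;i&gt;_Example&lt;/i&gt;, the choice of saved registers and their offsets are hypothetical and only serve as an illustration:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;_Example:&lt;br/&gt;        subl $0x3C,%esp          # 0x04 (return address) + 0x3C = 0x40&lt;br/&gt;        movl %esi,0x30(%esp)     # save used registers in ascending order&lt;br/&gt;        movl %edi,0x34(%esp)&lt;br/&gt;        movl %ebx,0x38(%esp)     # 0x3C(%esp) holds the return address&lt;br/&gt;        ...&lt;br/&gt;        movl 0x30(%esp),%esi     # restore the saved registers&lt;br/&gt;        movl 0x34(%esp),%edi&lt;br/&gt;        movl 0x38(%esp),%ebx&lt;br/&gt;        addl $0x3C,%esp          # ESP points to the return address again&lt;br/&gt;        ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;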
Because the &lt;i&gt;call&lt;/i&gt; instruction stores EIP/RIP (4 or 8 byte) in the next stack element, subtracting 0x3C or 0x38 automatically aligns the stack pointer to a multiple of 64, and the corresponding cache line is preloaded into L1 cache without a single line of additional code.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 3&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#000099'&gt;All registers used in a function must be saved before they are overwritten and must be restored before returning to the caller.&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Any software running on modern processors is most efficient if it can access as many hardware resources as possible. The most valuable resources of an x86 processor are its registers. The more frequently used parameters are held in registers, the less time is required to perform a given task. If we reload parameters with slow memory reads rather than keeping them in registers, we slow down our code with superfluous operations and interrupt simultaneous execution completely. Preloading all required parameters into registers in one gulp reduces dependencies, supporting parallel execution and keeping all three pipes busy all of the time.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Saving precious clock cycles with much faster register operations outweighs the cost of the few clock cycles needed to save and restore all used registers by far. Many functions work with a subset of all available registers. The fewer registers we use, the faster they are saved and restored. Another welcome side-effect is based on the fact that consecutive writes to ascending addresses trigger &lt;i&gt;write combining&lt;/i&gt;, where up to 16 doublewords are written to a cache line in one gulp. 
This surely is much faster than 16 single writes. Moreover, pairing of multiple reads or writes is a good opportunity to support simultaneous execution flow as long as possible.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 4&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#000099'&gt;The registers RAX, XMM0, XMM1, XMM2 and XMM3 are reserved for special purposes - they are neither saved nor restored by default.&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;This rule partially overrides rule 3:&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;i&gt;Per definitionem&lt;/i&gt;, register RAX is used to pass return values or error codes to the calling function. It does not make sense to save a register that is overwritten in all epilogues by default. If a function declaration defines its return value as VOID, rAX must be cleared with a simple &lt;i&gt;xor %rAX,%rAX&lt;/i&gt; on exit.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Registers XMM0 through XMM3 with their size of 4 * 128 bit = 64 byte cover an entire cache line. They are used to clear, move or manipulate large data areas in steps of 64 byte, preferably aligned to a multiple of 64. In general, these operations are executed in small loops where no other functions are called, so we can skip saving and restoring these registers altogether. This saves a lot of clock cycles if your application makes heavy use of such operations. If your function calls others while manipulating large memory blocks, the called function might use XMM0 through XMM3 as temporary storage, overwriting whatever was stored in these registers. Hence, it is recommended to use XMM4 through XMM15 if preloaded values must not change while other functions are called. 
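&lt;br/&gt;&lt;br/&gt;A possible shape of such a function in 64 bit code is sketched below. It assumes a stack pointer aligned as described under Rule 2, so that 0x60(%rsp) is a multiple of 16 and &lt;i&gt;movdqa&lt;/i&gt; may be used. The labels &lt;i&gt;_Example&lt;/i&gt; and &lt;i&gt;_Other&lt;/i&gt;, the frame size and all offsets are hypothetical:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;_Example:&lt;br/&gt;        subq   $0x78,%rsp          # 0x08 (return address) + 0x78 = 0x80&lt;br/&gt;        movdqa %xmm4,0x60(%rsp)    # XMM4 is saved before it is overwritten&lt;br/&gt;        movq   %rbx,0x70(%rsp)     # 0x78(%rsp) holds the return address&lt;br/&gt;        ...&lt;br/&gt;        call   _Other              # may change RAX and XMM0...XMM3, only&lt;br/&gt;        ...&lt;br/&gt;        movdqa 0x60(%rsp),%xmm4    # restore the saved registers&lt;br/&gt;        movq   0x70(%rsp),%rbx&lt;br/&gt;        addq   $0x78,%rsp&lt;br/&gt;        ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;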
As stipulated by rule 3, you have to save and restore&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt; XMM4 through XMM15 if you use them.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 5&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#000099'&gt;Accessing registers is much faster than accessing memory locations. The most frequently used parameters in a function should be preloaded into registers as soon as these registers have been saved on the stack.&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;Even if old-fashioned programmers aren't used to working with multiple execution pipes: Obeying this rule can turn an ox cart into a rocket (see &lt;i&gt;Rule 3&lt;/i&gt;).&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 6&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#000099'&gt;Modern processors execute three (LETNi: 1.5) instructions simultaneously. Good code should make use of these capabilities rather than ignoring them. Dependency chains interrupt simultaneous execution. While one instruction is executed in one pipe, the other two pipes are waiting for the result. 
Sending two of three pipes to sleep for some clocks wastes two thirds of the available resources.&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;This extraordinary product of a classical compiler&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;    pushl    %esi&lt;br/&gt;    pushl    $0&lt;br/&gt;    pushl    $2&lt;br/&gt;    pushl    $0&lt;br/&gt;    pushl    $11&lt;br/&gt;    movl     _GVAR, %eax&lt;br/&gt;    movl     7252(%eax), %eax&lt;br/&gt;    pushl    %eax&lt;br/&gt;    call     _FDacc&lt;br/&gt;    movl     _GVAR, %eax&lt;br/&gt;    addl     $24, %esp&lt;br/&gt;    movl     472(%eax), %ebx&lt;br/&gt;    pushl    %ebx&lt;br/&gt;    pushl    $0&lt;br/&gt;    pushl    $2&lt;br/&gt;    pushl    $4&lt;br/&gt;    pushl    $11&lt;br/&gt;    movl     7252(%eax), %ecx&lt;br/&gt;    pushl    %ecx&lt;br/&gt;    call     _FDacc&lt;br/&gt;    addl     $20, %esp&lt;br/&gt;    movl     _GVAR, %eax&lt;br/&gt;    movl     $0, 10464(%eax)&lt;br/&gt;        
...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;can be improved with a few useful changes&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movl _GVAR,%esi    # at least 3 clocks ahead&lt;br/&gt;        ...&lt;br/&gt;        ...&lt;br/&gt;        ...&lt;br/&gt;        movl 0x1C54(%esi),%eax&lt;br/&gt;        movl 0x01D8(%esi),%ecx&lt;br/&gt;        movl $0x00,0x28E0(%esi)&lt;br/&gt;        movl 
%eax,0x00(%esp)&lt;br/&gt;        movl $0x0B,0x04(%esp)&lt;br/&gt;        movl $0x00,0x08(%esp)&lt;br/&gt;        movl $0x02,0x0C(%esp)&lt;br/&gt;        movl $0x00,0x10(%esp)&lt;br/&gt;        movl %edi,0x14(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl $0x04,0x08(%esp)&lt;br/&gt;        movl %ecx,0x14(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;to run at least twice as fast, now. Removing all superfluous instructions, code size could be reduced from 23 to 14 instructions. Placing all instructions in the proper order removes dependencies and keeps all pipes busy without delays.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Rule 7&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font color='#000099'&gt;Jump tables belong to the data segment. 
AS is able to build jump tables in the data segment, so there's absolutely no reason to pollute the code segment with data of any kind.&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;No comment is required for this rule. Just do it!&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=61aa9a6f-1720-8445-ab8c-1292112ae3e2' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-7836739504569122704?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://st-intelligentdesign.blogspot.com/feeds/7836739504569122704/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/rules.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/7836739504569122704'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/7836739504569122704'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/rules.html' title='11 - Rules'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-8588682649137719163</id><published>2010-04-13T17:27:00.001+02:00</published><updated>2010-04-14T15:07:24.316+02:00</updated><title type='text'>10 - Summary</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;I hope I could convince you of various advantages introduced with the development of&lt;i&gt; Intelligent Design&lt;/i&gt;. Compared against conventional programming techniques (as discussed in this document), &lt;i&gt;Intelligent Design&lt;/i&gt; wins in all disciplines - be it sheer speed, high code density, simplified development or efficiency. 
Strictly renouncing the slow &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; and &lt;i&gt;push&lt;/i&gt; instructions, which come along with many ill side-effects like an unpredictable stack pointer, and replacing them with simple &lt;i&gt;mov&lt;/i&gt;s easily turns these drawbacks of conventional software design into some really welcome&lt;/font&gt;&lt;font face='arial'&gt; side-effects. The most welcome side-effect surely is the fact that we get EBP back as a general purpose register. One additional register is quite a lot if only six of them are available because EBP is abused as base pointer. Each additional register is an advantage, because we can pair reads to copy parameters from memory to registers and access these registers rather than reloading single registers with some frequently used parameters over and over again.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Another advantage is that we &lt;i&gt;mov&lt;/i&gt; parameters to the right location rather than &lt;i&gt;push&lt;/i&gt;ing them onto the stack. While &lt;i&gt;mov&lt;/i&gt; instructions can store changed parameters anywhere, a &lt;i&gt;push&lt;/i&gt; always addresses the next lower stack element (a quite limited range). If we call a function repeatedly, and only the last parameter changes from call to call, conventional code has to &lt;i&gt;push&lt;/i&gt; all parameters again after each call just to update this one parameter with the current data. Moreover, the correction of ESP is mandatory after each call, blocking the following &lt;i&gt;push&lt;/i&gt; for an entire&lt;/font&gt;&lt;font face='arial'&gt; clock cycle. Using &lt;i&gt;mov&lt;/i&gt; instructions, all these problems vanish. We save many unnecessary instructions and keep the three execution pipes busy most of the time. 
Of course, there are a lot of cases where only one or two instructions are required between two calls, but this still is faster than a conventional &lt;i&gt;push&lt;/i&gt; orgy.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;One more advantage is the fact that we can &lt;i&gt;mov&lt;/i&gt; parameters and other data onto the stack anywhere within our source file. Pairing multiple reads in groups, we can avoid direct dependencies completely. Pairing multiple writes, we trigger &lt;i&gt;write combining&lt;/i&gt;. Keeping proper distances between reads and writes can eliminate most dependencies with few exceptions. Applying these tricks as often as possible speeds up code markedly.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Putting it all together, &lt;i&gt;Intelligent Design&lt;/i&gt; is an up-to-date alternative to old-fashioned conventional programming techniques. Conventional software design has not changed much in the last thirty years. We can neither create the simplest integrated circuit with stone age tools, nor are we able to create software making full use of the capabilities provided by modern processors with conventional code. Old-fashioned code is spiced with mandatory brake pads, and compilers generate tons of counter-productive dependencies, because the convention allows abusing the most valuable resources - our registers - as garbage pile to save four instructions at the cost of multiple reloads. &lt;i&gt;Intelligent Design&lt;/i&gt; replaces these obstacles with straight and clean rules, providing full support for features and improvements of recent processors. &lt;i&gt;Intelligent Design&lt;/i&gt; is an up-to-date concept for the next generation of operating systems and applications, introducing a new quality to software design. Moreover, it is much easier to handle a single stack pointer than to manage two registers, a separate stack and base pointer, simultaneously. 
Getting rid of the institution called base pointer, we finally can say goodbye to positive and negative offsets and error-prone corrections of the stack pointer after each of the obligatory &lt;i&gt;push&lt;/i&gt; orgies. Another positive side-effect is the return of a valuable resource (EBP) to the very sparse pool of general purpose registers - the worst conceptual flaw of LETNi's x86 architecture.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;All in all, &lt;i&gt;Intelligent Design&lt;/i&gt; points the way ahead to fastest code with high density, causing fewer headaches for stressed programmers. Best of all: This is not just a theoretical construction, floating around in the head of a weird, unworldly developer - it exists for real! ST-Open's libraries as well as DatTools and the SRE editor were ported to &lt;i&gt;Intelligent Design&lt;/i&gt;, working as expected: Really fast and reliable.&lt;br/&gt;&lt;br/&gt;Go to the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/rules.html'&gt;next post&lt;/a&gt; (11 - Rules)&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=7e8df377-80e0-8114-80f5-32f9a56d8709' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-8588682649137719163?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/8588682649137719163/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/summary.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8588682649137719163'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8588682649137719163'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/summary.html' title='10 - Summary'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-3555869552552960183</id><published>2010-04-13T16:33:00.001+02:00</published><updated>2010-04-14T15:03:30.844+02:00</updated><title type='text'>09 - First Steps</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;While the previous pages introduced the basic concepts of &lt;i&gt;Intelligent Design&lt;/i&gt; to you, it is time to fill this abstract theoretical building with life and show how the described methods and techniques are applied to real life applications. To get in touch with &lt;i&gt;Intelligent Design&lt;/i&gt;, an example function is developed step by step.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Two things should be stressed: All statements and comments assume that this function is run on a recent operating system with support for recent programming techniques like &lt;i&gt;Intelligent Design&lt;/i&gt;, e.g. &lt;b&gt;IDEOS&lt;/b&gt;. If you run the example on an old-fashioned platform like Linux, OS/2 or Windows, there is a 75 percent chance of encountering a nice crash while the first XMM instruction is executed. Furthermore, &lt;i&gt;Intelligent Design&lt;/i&gt; is optimised for the most mature processor design, AMD's Athlon64. 
It is the only machine with three execution pipes for integer and another three execution pipes for floating point instructions. This makes any Athlon64 superior to LETNi's &lt;i&gt;I-am-so-slowium&lt;/i&gt; processors with their one and a half execution pipes for all (integer and FP) instructions. As a matter of fact, &lt;i&gt;Intelligent Design&lt;/i&gt; code runs faster on a slower Athlon64 than on a faster &lt;i&gt;I-am-so-slowium&lt;/i&gt;.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;(Sidenote: Unfortunately, I had to remove all comments in the code snippets because of the limited formatting abilities of the blog engine.)&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Problem&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Eight entry fields in a dialog are to be filled with four hexadecimal (subfields 00...03, IDs 0x1200...0x1203) and two decimal (subfields 04/05, IDs 0x1204/0x1205) numbers as well as two strings (subfields 06/07, IDs 0x1206/0x1207). All data are stored in the given subfields of datafield 0000001E. Entries are selected via keyboard and evaluated, and the selected entry number is stored in a global variable with the symbolic name&lt;/font&gt;&lt;font face='arial'&gt; CURSEL (address 0x0240[BNR]). Then our function is called to display all data of the chosen entry in a dialog. The datafield is loaded permanently, its MemoryHandle is stored in a runtime variable with the symbolic name MH_SEL (address 0x2028[BNR]). 
It has 1,024 entries with 6 subfields (00-05) of type 03 (doubleword) and 2 subfields (06, 07) of type 07 (dynamic strings with garbage collection).&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Solution&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Even if this is a trivial problem, there are many ways to solve it. The preferred solution in typical C code is a loop executed eight times where some logic switches between three possible output types (hex, decimal, string). It is not the best way to solve the given problem - comparisons and repeated writes are quite slow.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;To avoid repetitive work and speed up execution markedly, it's a much better idea to unroll the suggested loop completely. Unfortunately, it is impossible to apply real improvements if we use high level languages like C. Well, we could add some lines of inline code, but that's like using tape to fix a broken screw. The only way to get anywhere is writing our code in assembler. No existing compiler can see connections&lt;/font&gt;&lt;font face='arial'&gt; between called functions, nor is it able to detect how many dependencies are created with the code it generates. It would be possible to write a compiler with capabilities like that, but the result would be bloated and, first of all, extremely slow. 
A human can solve really complex problems much better, especially if the code is written in pure assembler.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Prologue&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;The usual conventional prologue&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='monospace'&gt;.globl _MyFunction&lt;br/&gt;_MyFunction:&lt;br/&gt;        pushl %ebp&lt;br/&gt;        movl %esp,%ebp&lt;br/&gt;        subl $0xC0,%esp&lt;br/&gt;        pushl %ebx&lt;br/&gt;        pushl %edi&lt;br/&gt;        pushl %esi&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;looks like this in &lt;i&gt;Intelligent Design&lt;/i&gt;&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _MyFunction&lt;br/&gt;_MyFunction:&lt;br/&gt;        subl $0xFC,%esp&lt;br/&gt;        nop&lt;br/&gt;        nop&lt;br/&gt;        movl %ebp,0xE4(%esp)&lt;br/&gt;        movl %esi,0xE8(%esp)&lt;br/&gt;        movl %edi,0xEC(%esp)&lt;br/&gt;        movl %edx,0xF0(%esp)&lt;br/&gt;        movl %ecx,0xF4(%esp)&lt;br/&gt;        movl %ebx,0xF8(%esp)&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;The difference between both versions is obvious at first sight. What is going on 'under the hood' cannot be seen at all, but it can be shown in graphical form. 
Figure 0E shows the snapshot of a conventional stack frame&lt;br/&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8OwNiRhmoI/AAAAAAAAAFw/ex0YaiEbD3A/stackE.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='arial'&gt; figure 0F (right) a snapshot of an &lt;i&gt;Intelligent Design&lt;/i&gt; stack frame soon after the above prologues were executed&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;img src='http://lh5.ggpht.com/_Z2WbH3F-E_Q/S8OwNr_z64I/AAAAAAAAAFw/4qJP2LImGTA/stackF.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;Because conventional code is based on a flying stack pointer with random content, it is impossible to determine the current alignment of the stack pointer. Conventional code also uses old-fashioned &lt;i&gt;push&lt;/i&gt; - &lt;i&gt;call&lt;/i&gt; - &lt;i&gt;addl $x,%esp&lt;/i&gt; sequences, so ESP/RSP frequently changes its current content. These disadvantages are the reason why we need a second register to address stack elements with reliable accuracy. The question marks in figure 0E emphasize the undetermined content of the stack pointer - an artificial feature of conventional code. Obviously, the visible difference between conventional and &lt;i&gt;Intelligent Design&lt;/i&gt; stack frames is the naturally aligned order of the latter, one of many built-in and cost-free (in terms of additional clock cycles)&lt;i&gt; Intelligent Design&lt;/i&gt; features.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;The second feature of &lt;i&gt;Intelligent Design&lt;/i&gt; code is its prologue. It generally starts with the subtraction of the stack frame's size from ESP/RSP. 
The &lt;i&gt;nop&lt;/i&gt;s between the &lt;i&gt;sub&lt;/i&gt; and &lt;i&gt;mov&lt;/i&gt; instructions are required to feed the other two execution pipes while the content of ESP/RSP is set to the bottom of our stack frame. It is important to keep the remaining pipes busy for exactly this one clock cycle, because we need ESP/RSP to address the stack elements where our registers are stored. Inserting two &lt;i&gt;nop&lt;/i&gt;s keeps all pipes busy. The simultaneous flow in all execution pipes is not interrupted, so up to three instructions are executed at any time. Hint: Both &lt;i&gt;nop&lt;/i&gt;s should be seen as placeholders for more useful&lt;/font&gt;&lt;font face='arial'&gt; instructions. For example, we could replace them with &lt;i&gt;prefetch n&lt;/i&gt; instructions to preload memory areas into the L1 cache, so we can access these areas immediately after saving &lt;i&gt;all&lt;/i&gt; used registers.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;Step two: Every &lt;i&gt;Intelligent Design&lt;/i&gt; prologue saves &lt;i&gt;all&lt;/i&gt; used registers. These are EBX, ECX, EDX, EDI, ESI, EBP and XMM4 through XMM15. The return address to the calling function is written to the topmost element in our stack frame while the &lt;i&gt;call _MyFunction&lt;/i&gt; instruction in our caller's code is executed. 
Storing the content of EIP/RIP in this element automatically loads the topmost cache line of the new stack frame into the L1 cache, so this cache line should be present at the time we start to save our registers there. Writing to the L1 cache is the fastest way to store data. Writing data in ascending order to contiguous locations speeds up execution further, because a mechanism called &lt;i&gt;write combining&lt;/i&gt; is triggered, which allows an entire cache line to be written in one gulp. Therefore, we store registers in ascending order to speed up our function. As shown on the next page, we try to fill the topmost stack elements up to the return address without gaps to force the processor to use &lt;i&gt;write combining&lt;/i&gt; rather than single &lt;i&gt;mov&lt;/i&gt;s. With all these improvements applied, the conventional prologue gains no huge advantage&lt;/font&gt;&lt;font face='arial'&gt; over the &lt;i&gt;Intelligent Design&lt;/i&gt; prologue. Even if we only &lt;i&gt;push&lt;/i&gt; one half of the registers, the conventional prologue is executed a mere two or three clock cycles faster than the &lt;i&gt;Intelligent Design&lt;/i&gt; prologue, which &lt;i&gt;mov&lt;/i&gt;es twice as many registers onto the stack. This small advantage disappears completely if the first parameter must be reloaded, because ECX and EDX were destroyed by a called function. If parameters must be reloaded more than once, the advantage turns into a handicap, losing some clocks with every reload. 
Conventional techniques do not speed up functions - they slow them down.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Preparation&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;Let's start with some real work. The standard prologue was modified slightly to suit our needs. It looks like this, now:&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.globl _ShowUs&lt;br/&gt;_ShowUs:subl $0xFC,%esp&lt;br/&gt;        movl _BNR,%eax&lt;br/&gt;        nop&lt;br/&gt;        movl %ebp,0xE4(%esp)&lt;br/&gt;        movl %esi,0xE8(%esp)&lt;br/&gt;        movl %edi,0xEC(%esp)&lt;br/&gt;        movl %edx,0xF0(%esp)&lt;br/&gt;        movl %ecx,0xF4(%esp)&lt;br/&gt;        movl %ebx,0xF8(%esp)&lt;br/&gt;        movl MH_SEL(%eax),%esi&lt;br/&gt;        movl CURSEL(%eax),%ebp&lt;br/&gt;        movdqa %xmm6,0xC0(%esp)&lt;br/&gt;        movdqa %xmm7,0xD0(%esp)&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;img src='http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ov5x0FbyI/AAAAAAAAAFw/Q_GutjuTlpM/stack10.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;We need eight registers to preload all required parameters for later use. 
The set of general purpose registers is limited, so two XMM registers are used to fill the gap (nice to have them). We just need the lowest 32 bits of XMM6 and XMM7, but the &lt;i&gt;movd&lt;/i&gt; instruction used to load them clears all bits above (32 through 127) as well. Whenever the contents of XMM4 through XMM15 change, these registers must be saved initially and restored on exit (like &lt;i&gt;all&lt;/i&gt; general purpose registers except EAX). Any of them might hold preloaded parameters of another function, so it would be a bad idea to overwrite them with garbage, because this might cause fatal malfunctions or crashes whenever we return to that function. Keep in mind that FP instructions are executed in a separate unit, so both &lt;i&gt;movdqa&lt;/i&gt; instructions are executed simultaneously with the six &lt;i&gt;mov&lt;/i&gt; instructions. As a matter of fact - they do not cost any additional clock cycle!&lt;br/&gt;&lt;br/&gt;Working with ST-Open's libraries, the base address of the global variables (_BNR or _GVAR) is required in most cases. This datafield is automatically loaded by _LDinit and is available for the entire runtime of the program. When the program is shut down, _LDexit is called during termination and the permanent portion of the datafield (4,096 bytes = 1,024 variables) is stored before the datafield is released. In ST-Open slang, this datafield is called 'SystemNumerics'. It allows us to bypass the stack by storing frequently used parameters at defined locations, so we save the time otherwise spent passing those parameters on the stack.&lt;br/&gt;&lt;br/&gt;To load the eight parameters, some tricks requiring basic knowledge about ST-Open's database engine are used. 
A less informed programmer (or a compiler) would have to preload parameter by parameter with repeated &lt;i&gt;call&lt;/i&gt;s:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movl %esi,0x00(%esp)&lt;br/&gt;        movl %ebp,0x04(%esp)&lt;br/&gt;        movl $0x07,0x08(%esp)&lt;br/&gt;        movl $0x07,0x0C(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,%esi&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,%edi&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        movl $0x01,0x0C(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,0xBC(%esp)&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,0xB8(%esp)&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,%edx&lt;br/&gt;        movd 0xB8(%esp),%xmm6&lt;br/&gt;        movd 0xBC(%esp),%xmm7&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,%ecx&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        movl %eax,%ebx&lt;br/&gt;        leal 0x20(%esp),%ebp&lt;br/&gt;        decl 0x08(%esp)&lt;br/&gt;        call _FDacc&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Compared to conventional code, where all 6 parameters for _FDacc() are repeatedly &lt;i&gt;push&lt;/i&gt;ed onto the stack for each &lt;i&gt;call&lt;/i&gt;, the improvements introduced by &lt;i&gt;Intelligent Design&lt;/i&gt; still keep the above code dense and fast, but this is not the most elegant solution.&lt;br/&gt;&lt;br/&gt;Recapitulating some knowledge about ST-Open's database engine and ST-Open's Loader, there is a much faster way to load parameters stored in a datafield. Okay, we start with a delay because we load an address into ESI which is required for the next and all following load operations. 
But this still is faster than a single &lt;i&gt;call&lt;/i&gt;, let alone the eight &lt;i&gt;call&lt;/i&gt;s performed by the above code. The advantage of the code below is obvious - its only caveat is the distance of 4,096 bytes between any two of the accessed memory locations. It is not very likely that all these areas are present in the L1 cache - we should assume long delays, because most of the data probably have to be fetched from main memory. But that is true for all other ways to read these data as well - the code below is the fastest possible way to load our parameters under the given premises. It would be possible to organise the field in blocks rather than in sequential subfields, so all data of a dataset could be kept in one cache line - but this is beyond the scope of our exercise.&lt;br/&gt;&lt;br/&gt;A field's MemoryHandle is a pointer into the Loader table _BNR, where all parameters of a loaded datafield are stored. The first parameter at 00[MemHandle] is the real address (EA) of the allocated memory block. Having loaded this address into ESI, we can access all required parameters by using the appropriate offset to ESI. Because we do not know in advance which entry we have to read, we access them using the indexed, indirect addressing mode provided by the instruction set of any x86 processor. This is a simple exercise, because all subfields have a distance of exactly 0x1000 from each other. Oh, there are two more complex actions, of course. String addresses are stored as offsets to the field's base address, so we have to preload two registers with this address prior to adding the offset of the string to them. By definition, empty strings are stored with an offset of zero. 
Because the first 32 bits of each datafield are reset by default, 00[EA] always holds an empty string...&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movl 0x00(%esi),%esi&lt;br/&gt;        movl %esi,%edi&lt;br/&gt;        movl 0x0100(%esi, %ebp, 4),%eax&lt;br/&gt;        movl 0x1100(%esi, %ebp, 4),%ebx&lt;br/&gt;        movl 0x2100(%esi, %ebp, 4),%ecx&lt;br/&gt;        movl 0x3100(%esi, %ebp, 4),%edx&lt;br/&gt;        movd 0x4100(%esi, %ebp, 4),%xmm6&lt;br/&gt;        movd 0x5100(%esi, %ebp, 4),%xmm7&lt;br/&gt;        addl 0x6100(%esi, %ebp, 4),%edi&lt;br/&gt;        addl 0x7100(%esi, %ebp, 4),%esi&lt;br/&gt;        leal 0x20(%esp),%ebp&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Collecting similar tasks into groups and executing them one after the other is a good way to speed up a function markedly. It often saves a lot of redundant operations, because we can &lt;i&gt;mov&lt;/i&gt; a parameter onto the stack once, then perform multiple &lt;i&gt;call&lt;/i&gt;s to one and the same function without having to pass this parameter over and over again. As shown in the first code snippet (the less elegant attempt to solve our problem), only two out of six parameters are updated between seven out of eight _FDacc() &lt;i&gt;call&lt;/i&gt;s. Three out of six parameters are static, staying unchanged for all eight &lt;i&gt;call&lt;/i&gt;s - it is a waste of time to &lt;i&gt;push&lt;/i&gt; all six parameters onto the stack eight times as practised in conventional code.&lt;br/&gt;&lt;br/&gt;Back to the last line of the improved source code. While EBP was used as index for eight &lt;i&gt;mov&lt;/i&gt; instructions, this index is not required any longer after the last parameter was loaded to ESI. It is recommended to preload all required parameters as soon as possible to keep an appropriate distance from the instructions accessing them. 
Therefore, we preload the address 20[ESP] into EBP as soon as its previous content is not required any longer. EBP now holds the address of the buffer where the converted numeric values of our parameters are stored as strings. This speeds up our function a little bit - simultaneous execution flow in all three pipes is supported if our source code provides intelligently distributed instructions.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Hexadecimal Numbers&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;ST-Open's libraries provide some functions to convert numeric data into hexadecimal strings. D2str() is the right choice to convert doublewords into formatted strings; the format is 'nnnn nnnn'. D2str() expects two parameters: the number to convert and the address of the buffer where the output string shall be stored.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movl %eax,0x00(%esp)&lt;br/&gt;        movl %ebp,0x04(%esp)&lt;br/&gt;        call _D2str&lt;br/&gt;        movl %ebx,0x00(%esp)&lt;br/&gt;        addl $0x10,0x04(%esp)&lt;br/&gt;        call _D2str&lt;br/&gt;        movl %ecx,0x00(%esp)&lt;br/&gt;        addl $0x10,0x04(%esp)&lt;br/&gt;        call _D2str&lt;br/&gt;        movl %edx,0x00(%esp)&lt;br/&gt;        addl $0x10,0x04(%esp)&lt;br/&gt;        call _D2str&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;After converting the four numbers into hexadecimal strings, the stack now looks like this:&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font 
face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ov6Dw_SqI/AAAAAAAAAFw/a99Kzk1tgTM/stack11.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;This is trivial code, so there is no need for lengthy descriptions. Only one thing deserves some words: Have a look at the three &lt;i&gt;addl $0x10,0x04(%esp)&lt;/i&gt; lines in the above code snippet. These instructions have a latency of four clocks each. We could replace them with sequences like &lt;i&gt;addl $0x10,%ebp&lt;/i&gt; / &lt;i&gt;movl %ebp,0x04(%esp)&lt;/i&gt;. The replacements have a latency of four clock cycles as well, but the &lt;i&gt;mov&lt;/i&gt; instruction has to wait until the content of EBP is updated. The alternative method therefore interrupts simultaneous execution and adds some bloat to our source code. Another negative side effect is the changed content of EBP. We need the address of the first string later on, so we would have to subtract all added values again before we could use EBP to address the output strings. A general rule of thumb: If something can be done with one instruction, we should not split it up into multiple instructions doing the same job.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Decimal Numbers&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;To convert a doubleword into a formatted decimal string, we use the function D2dec() provided by ST-Open's libraries. This function expects a doubleword to convert, the address of a buffer where the output shall be stored and a doubleword defining the output format. Only the lower two bytes of this doubleword are recognised. The format code is &lt;i&gt;0000ffii&lt;/i&gt;, where &lt;i&gt;ii&lt;/i&gt; is the number of integer digits and &lt;i&gt;ff&lt;/i&gt; the number of pseudo floating point digits. 
Beware - this is just a char inserted somewhere in a string, not a real floating point! We do not need a floating point, so the second byte is set to zero. The integer digits should cover the full range between 0 and 4 294 967 295, so the appropriate value is ten. Because D2dec() treats a formatting dword with the value of zero as 0x0000000A, we either may pass zero or ten as third parameter.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movd %xmm6,0x00(%esp)&lt;br/&gt;        addl $0x10,0x04(%esp)&lt;br/&gt;        movl $0x0A,0x08(%esp)&lt;br/&gt;        call _D2dec&lt;br/&gt;        movl 0x0100(%esp),%edx&lt;br/&gt;        movd %xmm7,0x00(%esp)&lt;br/&gt;        addl $0x10,0x04(%esp)&lt;br/&gt;        call _D2dec&lt;br/&gt;        ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;This is trivial code, so we skip lengthy explanations. EDX is loaded between both calls, because we need the WindowHandle (HWND) later on to write our strings to the corresponding entry fields. Preloading registers some instructions before they are used supports parallel execution in all three pipes and we should place these loads soon after the content of one of our registers is not required any longer. The more practised, the faster the resulting code.&lt;br/&gt;&lt;br/&gt;Well, our stack looks like this after all conversions are done:&lt;br/&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ov6bY-9rI/AAAAAAAAAFw/p10hX-HQDe0/stack12.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Strings&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;To avoid redundant copy operations, the addresses of both strings were stored in EDI and ESI while we preloaded our eight parameters. It does not make sense to copy a string from memory to the stack, first, if we have to copy it from the stack to another memory location later on. 
Passing strings always means to pass the address where the first byte of the string can be found - it is a waste of time to copy a string to a buffer instead of just passing the address where this string actually is stored.&lt;br/&gt;&lt;br/&gt;At this point, all conversions are done and we can start to fill those entry fields with some data. Because our strings are passed via their address, the stack remains untouched.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;b&gt;Setting The Entry Fields&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;All data are converted and accessible, now. It's time to fill our eight entry fields with the corresponding strings:&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movl %edx,0x00(%esp)&lt;br/&gt;        movl $0x1200,0x04(%esp)&lt;br/&gt;        movl %ebp,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        addl $0x10,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        addl $0x10,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        addl $0x10,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        addl $0x10,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        addl $0x10,0x08(%esp)&lt;br/&gt;        call 
_SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        movl %edi,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        incl 0x04(%esp)&lt;br/&gt;        movl %esi,0x08(%esp)&lt;br/&gt;        call _SEf&lt;br/&gt;        ...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;This is the most dense code. It would be possible to replace all &lt;i&gt;incl 0x04(%esp)&lt;/i&gt; with &lt;i&gt;movl $0x120*,0x04(%esp)&lt;/i&gt; instructions, where the asterisk '*' stands for the current ID. This alternative does not speed up execution, but blows up our code by 32 bytes. Why is that? Well, each &lt;i&gt;addl $0x10,0x08(%esp)&lt;/i&gt; has a latency of four clock cycles. Pairing an instruction with three clocks latency with one with four clocks latency does not result in higher speed. All three pipes are engaged for four clock cycles and the last pipe is idling all of the time. If we replace the first instruction with the alternative instruction, the first pipe - together with the last pipe - is idling for one clock cycle. Hence, we gain nothing but 32 bytes of additional code.&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;With this step, all tasks of our exercise are done; only the epilogue is missing.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Epilogue&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;Before we return to the calling function, we restore &lt;i&gt;all&lt;/i&gt; used registers. This includes the garbage pile of conventional code (ECX and EDX) as well as XMM4-XMM15. 
When all registers contain what they contained before our prologue was executed, we might want to set EAX to a specific return value, if the function declaration says so. Whether we set EAX or not, the next step is to add the same value we subtracted with the very first instruction in our function's prologue. ESP should now point to the stack element holding the return address, that is, the address of the instruction following &lt;i&gt;call _ShowUs&lt;/i&gt;. After executing the final &lt;i&gt;ret&lt;/i&gt;, EIP should contain that address and the processor should continue to execute the code of the function calling ShowUs().&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        ...&lt;br/&gt;        movdqa 0xC0(%esp),%xmm6&lt;br/&gt;        movdqa 0xD0(%esp),%xmm7&lt;br/&gt;        movl 0xE4(%esp),%ebp&lt;br/&gt;        movl 0xE8(%esp),%esi&lt;br/&gt;        movl 0xEC(%esp),%edi&lt;br/&gt;        movl 0xF0(%esp),%edx&lt;br/&gt;        movl 0xF4(%esp),%ecx&lt;br/&gt;        movl 0xF8(%esp),%ebx&lt;br/&gt;        addl $0xFC,%esp&lt;br/&gt;        xorl %eax,%eax&lt;br/&gt;        ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;img src='http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8Ov6WnBpCI/AAAAAAAAAFw/flhWpK7xROk/stack13.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;With the correction of the stack pointer ESP, all stack elements below the address stored in ESP are 'no-man's-land' again. The next function is free to reserve some bytes for its private stack frame as required.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Notes&lt;br/&gt;&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;It still might not be clear why grouping similar tasks is much better than doing everything step by step, as we are taught by some gurus preaching the old-fashioned methods and techniques. The advantages of &lt;i&gt;Intelligent Design&lt;/i&gt; are based on a lot of small improvements. Each of them supports the others. 
In the end, all of them sum up to a new quality, making as much use as possible of the full power of modern processors.&lt;br/&gt;&lt;br/&gt;As a matter of fact, most parameters never change in repetitive calls to one and the same function. Hence, we can save a lot of redundant writes if we just &lt;i&gt;mov&lt;/i&gt; changing parameters to their proper position rather than &lt;i&gt;push&lt;/i&gt; all parameters onto the stack over and over again. In our example, parameter 1 is a MemoryHandle for the first call, the address of a buffer for the second call and a WindowHandle for the third call. Hence, conventional code would have to &lt;i&gt;push&lt;/i&gt; these changing parameters onto the stack eight times. Parameters 2 and 3 also change with every call. In total, we would have to perform 6*8 + 4*2 + 2*3 + 3*8 = 86 &lt;i&gt;push&lt;/i&gt; instructions to solve the given problem. If we count all &lt;i&gt;mov&lt;/i&gt; instructions in the listed example code, we get a result of 38 - less than one half of the instructions required for conventional code. But: That's just counting instructions. Having a closer look at them, there is much more than the pure count. Analysing both versions, conventional code is not as efficient as &lt;i&gt;Intelligent Design&lt;/i&gt;. Conventional techniques not only prevent a processor from executing multiple instructions simultaneously, because they create many avoidable dependencies. They also create obstacles by default, because their rules tell us to omit important things like preserving ECX and EDX. 
Putting it all together, conventional programming techniques are as outdated as ox carts, while &lt;i&gt;Intelligent Design&lt;/i&gt; is the programming technique of the 21st century.&lt;br/&gt;&lt;br/&gt;Go to the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/summary.html'&gt;next post&lt;/a&gt; (10 - Summary)&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/div&gt;&lt;br/&gt;&lt;br/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/3555869552552960183/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/first-steps.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/3555869552552960183'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/3555869552552960183'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/first-steps.html' title='09 - First Steps'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8OwNiRhmoI/AAAAAAAAAFw/ex0YaiEbD3A/s72-c/stackE.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-8521931833047585910</id><published>2010-04-13T06:21:00.001+02:00</published><updated>2010-04-14T15:00:41.233+02:00</updated><title type='text'>08 - Intelligent Design</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;If we put all discussed aspects together, we can state that conventional programming techniques completely ignore the improvements introduced with modern processor design. The conclusions of our analysis suggested replacing all &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; and &lt;i&gt;push&lt;/i&gt; with &lt;i&gt;mov&lt;/i&gt; instructions, because three of them can be executed simultaneously. The advantages of these replacements are a static stack pointer, an additional general purpose register (EBP), a naturally aligned stack (if supported by the operating system), read and write access to any stack element via ESP, and much more. Based on the facts we worked out until now, a concept was developed and tested for usability in everyday applications.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Because we no longer use a base pointer, the stack is addressed via ESP only. As shown before, we can reserve a stack area for our private use by subtracting the required size from the stack pointer ESP. The reserved area is safe from being used by called functions (except those compiled by GCC 3.3.5). 
Let us recapitulate what the stack looks like after a function is called:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ovn3yQcGI/AAAAAAAAAFw/GUB2WM54Ezg/stack5.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The current stack element holds the return address - the address of the instruction following the &lt;i&gt;call&lt;/i&gt; in the calling function. If the calling function has to pass parameters to our function, they are stored above the current stack element in ascending order. Because we want to use &lt;i&gt;mov&lt;/i&gt; instructions only, a sufficient stack area must be reserved where we can save used registers, store local variables and pass parameters to called functions. Working with a static stack pointer, we just have to add the required sizes of these three parts (registers, local area and parameters), then subtract the sum from ESP:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ovol5tSyI/AAAAAAAAAFw/lA16c-081QA/stack9.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;After this subtraction, the content of ESP &lt;i&gt;does not&lt;/i&gt; change until our function is terminated. Therefore, ESP can be used to address any element within our current stack frame. 
Elements outside our private area are &lt;i&gt;taboo&lt;/i&gt; - you may read them whenever you want, but you &lt;i&gt;must not&lt;/i&gt; write to them under any circumstances!&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Calculating The Required Size&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;The size of our stack frame depends on the amount of registers we have to save, the size for our local variables and the amount of parameters we have to pass.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Registers&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;The contents of &lt;i&gt;all&lt;/i&gt; used registers are stored at the top of our private area. All &lt;i&gt;Intelligent Design&lt;/i&gt; functions &lt;i&gt;do&lt;/i&gt; save ECX and EDX, as well! The following table shows the required size for 16, 32 and 64 bit functions:&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;table width='500' height='412'&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th width='25%'&gt;&lt;font face='arial'&gt;&lt;b&gt;Register&lt;/b&gt;&lt;/font&gt;&lt;/th&gt;&lt;th width='25%'&gt;&lt;font face='arial'&gt;&lt;b&gt;16 Bit&lt;/b&gt;&lt;/font&gt;&lt;/th&gt;&lt;th width='25%'&gt;&lt;font face='arial'&gt;&lt;b&gt;32 Bit&lt;/b&gt;&lt;/font&gt;&lt;/th&gt;&lt;th width='25%'&gt;&lt;font face='arial'&gt;&lt;b&gt;64 Bit&lt;/b&gt;&lt;/font&gt;&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 1&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x02&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x04&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x08&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 2&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x04&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x08&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font 
face='arial'&gt;0x10&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 3&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x06&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x0C&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x18&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 4&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x08&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x10&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x20&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 5&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x0A&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x14&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x28&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 6&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x0C&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x18&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x30&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 7&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x38&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 8&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x40&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt; 9&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x48&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font 
face='arial'&gt;10&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x50&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt;11&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x58&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt;12&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x60&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt;13&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x68&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt;14&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x70&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;font face='arial'&gt;15&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;-&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&lt;font face='arial'&gt;0x78&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;If no registers are used, the required size is zero, of course. Rows 7 through 15 are not available in 16 and 32 bit mode, because these modes do not recognise the extended 64 bit register set. Eventually, you might need to save some MMX or XMM registers. 
Just add 8 for each MMX and 16 for each XMM register. It is recommended to avoid using MMX registers, because they are internally aliased onto the original FPU registers ST(0) through ST(7). Older software libraries and operating systems do not know anything about MMX registers, so you might encounter weird problems if you do not write an explicit &lt;i&gt;emms&lt;/i&gt; or &lt;i&gt;femms&lt;/i&gt; before you call external functions. However, &lt;i&gt;emms&lt;/i&gt; and &lt;i&gt;femms&lt;/i&gt; generally destroy the contents of all MMX registers, so it is a good idea to use XMM instead of MMX registers.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Local Variables&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Between the saved registers at the top of our stack and the area where parameters for called functions are stored, we have to reserve the space required for local data (variables, structures, strings). If the size is not a multiple of the standard data size, it is recommended to round the required size up to the next higher multiple of the standard data size. It is very important to keep the stack aligned, because odd addresses in ESP trigger penalty cycles for every unaligned memory access. 
If you are not careful and the size you add in your epilogue differs from the size you subtracted in the prologue (for example, an odd size subtracted and an even size added), the program will crash, because EIP is loaded from the wrong stack element instead of the return address.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;If you want to store strings on the stack, it is recommended to define a sufficient size, so the function reading characters from the input device has no chance to overwrite parameters or register contents stored in the stack elements above.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Parameters&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Parameters are stored at the bottom of our stack frame. They are moved to the stack elements in ascending order, starting at 00[ESP]. Because we &lt;i&gt;mov&lt;/i&gt; parameters to the corresponding stack elements rather than &lt;i&gt;push&lt;/i&gt; them, we should reserve an area equal to the size required by the function with the largest number of parameters. If we have three functions awaiting three parameters and one function awaiting five parameters, we need a size of 20 bytes (32 bit code) to pass five parameters - the three parameters for the other three functions automatically fit into the reserved 20 bytes. The required size can be taken from the above table for registers. If more than 6 (15) parameters must be passed, the size can easily be calculated with the formula &lt;b&gt;parameters * datasize&lt;/b&gt;, where data size is 2 (16 bit), 4 (32 bit) or 8 (64 bit). For example, the required size for the 13 parameters we have to pass to WinCreateWindow() is &lt;b&gt;13 * 4&lt;/b&gt; = 52 bytes. 
The bottom of the corresponding stack frame looks like this:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8OvoSXGSfI/AAAAAAAAAFw/anRsF0-0dWs/stack8.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Watch out: Parameter 1 in figure 08 is identical with the leftmost parameter we write within the parentheses of a C or C++ function. In general, parameters are &lt;i&gt;mov&lt;/i&gt;ed onto the stack in ascending order: Parameter 1 @ 00[ESP], parameter 2 @ 04[ESP] and so on, until the last parameter has been &lt;i&gt;mov&lt;/i&gt;ed. If any parameter does not change between two or more consecutive &lt;i&gt;call&lt;/i&gt; instructions, it just has to be &lt;i&gt;mov&lt;/i&gt;ed the very first time it is used. For following &lt;i&gt;call&lt;/i&gt;s, the parameter already is stored at the right place, so we do not need to store it at the same location again - it &lt;i&gt;is&lt;/i&gt; stored there until we overwrite it!&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;/div&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Adding Sizes&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;After we have determined the required sizes of all three parts (registers, local area, parameters to pass), we calculate their sum, then round the result up to a size ending with &lt;i&gt;3C&lt;/i&gt;, &lt;i&gt;7C&lt;/i&gt;, &lt;i&gt;BC&lt;/i&gt; or &lt;i&gt;FC&lt;/i&gt; (32 bit mode), respectively &lt;i&gt;38&lt;/i&gt;, &lt;i&gt;78&lt;/i&gt;, &lt;i&gt;B8&lt;/i&gt; or &lt;i&gt;F8&lt;/i&gt; (64 bit mode). Applying this trick, our stack frame automatically is aligned to the beginning of the next cache line. Finally, we create our new stack frame by subtracting the rounded up sum from rSP. As described in detail, this subtraction protects the created stack frame from being overwritten by other functions. 
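&lt;br/&gt;&lt;br/&gt;As a minimal sketch of this calculation, assume a hypothetical 32 bit function (the label _example and all sizes are illustrative assumptions, not taken from existing sources) that saves three registers (0x0C), needs 0x10 bytes for local variables and passes up to five parameters (0x14). The sum is 0x30, which is rounded up to 0x3C. In the AT&amp;T syntax used by GCC, the frame might be created and released like this:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    _example:&lt;br/&gt;    subl    $0x3C, %esp          # reserve the frame: 0x0C + 0x10 + 0x14 = 0x30, rounded up to 0x3C&lt;br/&gt;    movl    %edx, 0x30(%esp)     # saved registers occupy the top of the frame&lt;br/&gt;    movl    %ecx, 0x34(%esp)&lt;br/&gt;    movl    %ebx, 0x38(%esp)&lt;br/&gt;    ...                          # function body, ESP stays unchanged&lt;br/&gt;    movl    0x30(%esp), %edx     # restore the saved registers&lt;br/&gt;    movl    0x34(%esp), %ecx&lt;br/&gt;    movl    0x38(%esp), %ebx&lt;br/&gt;    addl    $0x3C, %esp          # release the frame with the same size&lt;br/&gt;    ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;In this sketch, the return address sits at 0x3C(%esp) while the function is running, so the first parameter passed by the caller would be found at 0x40(%esp).&lt;br/&gt;&lt;br/&gt;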
Writing data to stack locations outside the current stack frame violates the rules of conventional programming as well as the rules defined for &lt;i&gt;Intelligent Design&lt;/i&gt;.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Setting the first stack element to the beginning of a cache line, we benefit from a bunch of advantages. First, no time consuming workarounds are required to align stack locations to store or load XMM registers. It is just one side effect of the&lt;/font&gt;&lt;font face='arial'&gt;&lt;i&gt; Intelligent Design&lt;/i&gt; prologue that rSP is naturally aligned &lt;i&gt;by default&lt;/i&gt; without a single line of additional code. Secondly, &lt;i&gt;Intelligent Design&lt;/i&gt; code is able to benefit from the bandwidth and accelerating mechanisms provided by modern processors. At the latest, the first access to any stack element loads the stack frame into the L1 cache, moving all following accesses to the fastest area in the machine's memory hierarchy. Thirdly, writing data to contiguous memory locations in ascending order triggers a mechanism called &lt;i&gt;write combining&lt;/i&gt;, where up to 16(!) doublewords are stored on the stack in one gulp. Many more advantages, like avoiding repetitive stores for parameters that never change, come along with this fresh and modern design. 
Some of them are shown later on, some of them are not revealed, yet...&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;If all &lt;i&gt;Intelligent Design&lt;/i&gt; rules were applied properly, the smallest possible stack frame automatically looks like this&lt;/font&gt; in 32 bit code&lt;br/&gt;&lt;br/&gt;&lt;img src='http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8OwNcZhSnI/AAAAAAAAAFw/uZVRmcYsZ2E/stackC.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;or like this in 64 bit code&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;img src='http://lh5.ggpht.com/_Z2WbH3F-E_Q/S8OwNX1oClI/AAAAAAAAAFw/G7AWDvWJ83M/stackD.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;The real size can be varied by adding multiples of 64 to the smallest value of 0x3C (0x38 in 64 bit code) to match your individual requirements perfectly.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;Due to the subtraction of a predefined size from rSP, the stack frames of all &lt;i&gt;Intelligent Design&lt;/i&gt; functions automatically are set to the beginning of a cache line. In general, functions are &lt;i&gt;call&lt;/i&gt;ed. Looking at the mechanism of the &lt;i&gt;call&lt;/i&gt; instruction, the return address is &lt;i&gt;push&lt;/i&gt;ed onto the stack before execution is continued with the first instruction of the called function. 
Hence, rSP points to an address ending with &lt;i&gt;3C&lt;/i&gt;, &lt;i&gt;7C&lt;/i&gt;, &lt;i&gt;BC&lt;/i&gt; or &lt;i&gt;FC&lt;/i&gt; in 32 bit code, respectively &lt;i&gt;38&lt;/i&gt;, &lt;i&gt;78&lt;/i&gt;, &lt;i&gt;B8&lt;/i&gt; or &lt;i&gt;F8&lt;/i&gt; in 64 bit code - exactly the size we have to subtract from rSP.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Destroying The Stack Frame&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;After our function's code has been executed, &lt;i&gt;all&lt;/i&gt; registers must be restored to the content they had when they were passed to us. The last step before returning to the calling function is to add the size we subtracted to create our stack frame to rSP. With the final &lt;i&gt;ret&lt;/i&gt;, rIP is taken from the stack and execution is continued with the instruction following the &lt;i&gt;call&lt;/i&gt; of our function. At this point, our stack looks like this again:&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;img src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ov6Yf-hAI/AAAAAAAAAFw/HPXbAsnf4AE/stackA.png' style='max-width: 800px;'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;Because the area between stack bottom and the address currently loaded into rSP is 'no man's land', the next function is free to reserve a part of it to save used registers, to store its local variables and structures or to put some parameters for called functions onto the stack. In principle, the content of a stack element is undefined until something has been written to it. 
Writing data into untouched stack elements is called 'initialisation' in C and C++ terminology.&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Side Notes About Alignment&lt;br/&gt;&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;font face='arial'&gt;Using XMM instructions, most 128 bit stores and loads explicitly expect addresses aligned to 16 byte boundaries, in other words: The last digit of the address must be zero. No existing operating system cares about aligned addresses at all. Using excessive &lt;i&gt;push&lt;/i&gt; orgies to put the parameters for called functions onto the stack, the chance to enter that function with an aligned stack pointer is about 25 percent. Because 16 / 4 = 4, the chance that ESP ends with 04, 08 or 0C is about 75 percent. It is left to the programmer or compiler to handle this problem properly, so you have to add redundant, time consuming code to align the stack pointer in every function working with XMM instructions.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;As shown, &lt;i&gt;Intelligent Design&lt;/i&gt; provides some mechanisms to naturally align all stack frames to the beginning of a cache line. But - even the best mechanism is of no use if the operating system does not support it. In most cases, an ID compliant function is called by the message transporting system within the operating system. Whenever a message is sent or posted to our application's message loop, we do not know which instance of which function of the operating system called our message loop. Hence, we do not know how many things were &lt;i&gt;push&lt;/i&gt;ed onto our stack, nor can we rely on something like an aligned stack pointer. 
Even if we aligned the stack pointer at the beginning of our &lt;i&gt;main() &lt;/i&gt;function, it surely is not aligned any longer when the code in our message loop is executed for the very first time. Therefore, one of the most important improvements introduced with &lt;i&gt;Intelligent Design&lt;/i&gt; is not available as long as we run our code on any existing platform. Because all existing operating systems put dirt into our gears, no machine ever will run at full speed.&lt;/font&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;font face='arial'&gt;No application programmer is able to do anything against this counter-productive behaviour of existing operating systems. Either we bow our heads, obey and apply all those pointless work-arounds recommended by processor manufacturers, or we give up programming to get rid of them. Because work-arounds are not very practicable in many cases, there is a third way as an alternative to this 'take it or leave it' choice. Instead of searching for ways to bypass 'built in' restrictions, speed brakes and other obstacles of existing operating systems, it might be a much better idea to throw away these remains of old fashioned software design and create something new. Something supporting innovations and capabilities of modern processors instead of ignoring them. Something making use of the immense computational power and speed of recent quad core machines. 
Something called &lt;i&gt;IDEOS&lt;/i&gt; (the &lt;i&gt;Intelligent Design&lt;/i&gt; Easy2Use Operating System)...&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;Go to the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/first-steps.html'&gt;next post&lt;/a&gt; (09 - First Steps)&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=2787cc89-4bda-8557-b7ba-a2d28f0abc2d' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-8521931833047585910?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/8521931833047585910/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/intelligent-design.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8521931833047585910'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8521931833047585910'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/intelligent-design.html' title='08 - Intelligent Design'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8Ovn3yQcGI/AAAAAAAAAFw/GUB2WM54Ezg/s72-c/stack5.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-8525251017175312611</id><published>2010-04-13T05:05:00.001+02:00</published><updated>2010-04-14T14:57:56.414+02:00</updated><title type='text'>07 - Considerations</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;As discussed in great detail, any conventional stack frame generally looks like this:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ovd2D9lgI/AAAAAAAAAFw/l56NA8zCofU/stack2.png'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Programmers and compilers are free to write any kind of data to stack locations between the stack's bottom and the stack element below ESP (-0x04[ESP]). To reserve a part of the stack for the private use of the current function, we have to subtract the needed size from ESP. Moving ESP towards stack bottom, we reserve the corresponding area for our private use. No function (except it is compiled with &lt;i&gt;GCC 3.3.5.&lt;/i&gt;) ever writes data to stack locations above -04[ESP]. It is very important to understand the connection between the subtraction of the required stack frame size from ESP and &lt;i&gt;exclusive&lt;/i&gt; ownership of the reserved area. My alternative method introduced in this paper as well as conventional programming techniques are based on this principle. 
You break an &lt;i&gt;absolute taboo&lt;/i&gt; if you write to stack locations above -04[ESP]!&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Because conventional programming methods &lt;i&gt;push&lt;/i&gt; data onto the stack or &lt;i&gt;pop&lt;/i&gt; them from the stack, the content of ESP is changing continuously. Due to its randomly changing content, it is impossible to use ESP as base for addressing specific stack elements directly. An additional register, the base pointer EBP, is required for this task. It points to the current top of the stack all of the time, so we safely can address all stack elements regardless of the current content of ESP.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;If we analyse the described process and have a closer look at its details, one thing is quite obvious: We could save all time consuming contortions with additional registers if we replaced the usual &lt;i&gt;push&lt;/i&gt; - &lt;i&gt;call&lt;/i&gt; - &lt;i&gt;add n,%esp&lt;/i&gt; sequences with something else. A short look into the data sheets of modern processors reveals some additional caveats. Let us recapitulate what we know about &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; instructions and compare them against a bunch of simple &lt;i&gt;mov&lt;/i&gt; instructions:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;PUSH&lt;/b&gt;&lt;/big&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh3.ggpht.com/_Z2WbH3F-E_Q/S8OvoAXjwKI/AAAAAAAAAFw/ubRrR_lKbF8/stack6.png'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Depending on the standard data size, either two, four or eight is subtracted from the stack pointer rSP, setting it to the next lower stack element. 
After updating rSP, the content of a register, a memory location or an immediate value is copied to the current stack element (00[rSP]). A &lt;i&gt;push reg&lt;/i&gt; or &lt;i&gt;push imm&lt;/i&gt; instruction is executed within three clock cycles. ESP is blocked for the time it needs to update its content and deliver the stack location where the given data shall be stored. This requires two clock cycles, so consecutive &lt;i&gt;push&lt;/i&gt; instructions have a latency of at least two clock cycles.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;As an alternative, we might replace one &lt;i&gt;push&lt;/i&gt; with three &lt;i&gt;mov&lt;/i&gt; instructions. If none of them depends on the result of one of the other two &lt;i&gt;mov&lt;/i&gt;s, all of them are executed simultaneously in three clock cycles, as well. If we write to contiguous memory locations, all writes are done in one gulp, because a mechanism called write combining is triggered.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;big&gt;&lt;b&gt;&lt;font face='arial'&gt;POP&lt;/font&gt;&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8OvoMWeG_I/AAAAAAAAAFw/Gax8w-zxoME/stack7.png'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The content of the current stack element is copied to the register or memory location specified by the &lt;i&gt;pop&lt;/i&gt; instruction, then two, four or eight is added to the stack pointer rSP. The current stack element was 'taken from the stack' and the stack pointer now points to one stack element above (our new stack bottom). Unlike the direct path &lt;i&gt;push&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; is a vector path instruction. These instructions always are fed to execution pipes 0 and 1, while execution pipe 2 is blocked while the instruction is executed. 
The latency of each &lt;i&gt;pop&lt;/i&gt; instruction is 4 clock cycles.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;As an alternative, we might replace one &lt;i&gt;pop&lt;/i&gt; with four &lt;i&gt;mov&lt;/i&gt; instructions. Three of them are executed simultaneously in three clock cycles, the fourth &lt;i&gt;mov&lt;/i&gt; starts execution in the fourth clock cycle.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Conclusions&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;As the analysis of both instructions shows, the only rational consequence is to replace all &lt;i&gt;pop&lt;/i&gt; and &lt;i&gt;push&lt;/i&gt; with &lt;i&gt;mov&lt;/i&gt; instructions. Doing so not only speeds&lt;/font&gt;&lt;font face='arial'&gt; up passing of parameters, it also turns the flying stack pointer into a &lt;i&gt;static&lt;/i&gt; one. As a positive side-effect of the frozen stack pointer, we do not need EBP as base pointer any longer, freeing one valuable register for general purposes. Using no base pointer, we get rid of that mess with positive and negative offsets to EBP - a source of errors, making source code less readable.&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Addendum&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;With Athlon64, family 10h (November 2008), the latency for &lt;i&gt;pop&lt;/i&gt; instructions was reduced to three clock cycles (direct path single). Nonetheless, replacing &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; with &lt;i&gt;mov&lt;/i&gt; instructions is a much better improvement, because of the expected positive side-effects coming along with the replacement automatically. 
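&lt;br/&gt;&lt;br/&gt;To illustrate the replacement, here is a minimal sketch in AT&amp;T syntax, assuming a hypothetical function _callee that takes two parameters and a stack frame that was already reserved in the prologue (the name and the chosen registers are assumptions for illustration only). The conventional sequence reads:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    pushl    %ebx                 # parameter 2, ESP changes&lt;br/&gt;    pushl    %eax                 # parameter 1, ESP changes again&lt;br/&gt;    call    _callee&lt;br/&gt;    addl    $8, %esp             # remove both parameters from the stack&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;With a static stack pointer, the same call might be written as:&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    movl    %eax, 0x00(%esp)     # parameter 1&lt;br/&gt;    movl    %ebx, 0x04(%esp)     # parameter 2&lt;br/&gt;    call    _callee              # no stack correction required afterwards&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Both &lt;i&gt;mov&lt;/i&gt; instructions are independent of each other, and ESP holds the same value before and after the call.&lt;br/&gt;&lt;br/&gt;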
Even if &lt;i&gt;Intelligent Design&lt;/i&gt; functions will win any race around clock cycles with ease, its concept offers much more improvements than just counting clocks.&lt;/font&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='arial'&gt;Go to the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/intelligent-design.html'&gt;next post&lt;/a&gt; &lt;/font&gt;(08 - Intelligent Design)&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=52067082-c7b9-87ba-b590-09576f033035' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-8525251017175312611?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/8525251017175312611/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/considerations.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8525251017175312611'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8525251017175312611'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/considerations.html' title='07 - Considerations'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ovd2D9lgI/AAAAAAAAAFw/l56NA8zCofU/s72-c/stack2.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-8581436573519430256</id><published>2010-04-13T04:38:00.001+02:00</published><updated>2010-04-14T14:55:18.253+02:00</updated><title type='text'>06 - Analysis</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The source code generated by&lt;i&gt; GCC 3.3.5&lt;/i&gt; perfectly reveals the greatest weaknesses of the C conventions. The first step to slow down C code markably is the abuse of EBP as &lt;i&gt;base pointer&lt;/i&gt;, reducing our set of available registers to 6: EAX, EBX, ECX, EDX, EDI and ESI. One of these remaining registers, EAX, is used to pass results or error codes from a called function to the calling function, reducing our register set to five. By default, ECX and EDX neither are saved nor restored, so every time we call another function their content is sent to the great formatter, or, in other words: If we stored frequently used parameters in ECX or EDX prior to the call, they're probably overwritten by the called function. Due to counterproductive conventions, only three registers are left to store our frequently used parameters. If four or more parameters are required throughout a function, parameters 4 and up must be reloaded whenever we call another function, because the content of EAX, ECX and EDX is changed by the called function. Reloading parameters from the slower memory subsystem instead of preloading them in registers to perform much faster operations is quite time consuming. It is not about those five additional clock cycles we waste with reloading a parameter over and over again. The most negative side-effect is the immediate interruption of parallel execution, forcing two execution pipes to idle until the required parameter was reloaded, again. 
It is one of the reasons why C and C++ applications are that slow.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;A typical example is the following code snippet taken from &lt;i&gt;GCC's output&lt;/i&gt;. It demonstrates how strict implementation of the C conventions slows down the execution of standard C and C++ code:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    ...&lt;br/&gt;    movl    _GVAR, %eax&lt;br/&gt;    movl    7252(%eax), %eax&lt;br/&gt;    pushl    %eax&lt;br/&gt;    ...&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;This sequence turns any parallel execution off. Instead of executing up to 3 instructions in one of the 3 available execution pipes, the code listed above is executed sequentially, or, in other words: Two execution pipes have to wait until the previous instruction is executed and its result is available. If all data are present in L1 cache, it takes (approximately) nine clock cycles to execute the three lines listed above. Two execution pipes are idling while _GVAR (_GVAR is BNR defined as an array of dwords to satisfy GCC) is loaded into EAX. Next, two pipes are idling while 1C54[BNR] is loaded into EAX. Finally, EAX is pushed onto the stack, blocking ESP for two clock cycles. If some data are not yet loaded into the L1 cache, execution time is extended markedly. Reading data from the L2 cache costs about six clock cycles, loads from main memory consume about 27 clock cycles. In the end, our primary intention, saving 14 clock cycles to &lt;i&gt;push &lt;/i&gt;and&lt;i&gt; &lt;/i&gt;&lt;i&gt;pop&lt;/i&gt; ECX and EDX in every function's prologue and epilogue, turns out to be a bad idea. In the best case, the assumed advantage is eaten up after the second reload of a frequently used parameter. In general, there is &lt;i&gt;no&lt;/i&gt; advantage at all - the saved 14 clock cycles are wasted with the first reload from main memory. 
This is the reason why conventional software creeps through the execution pipes of John Doe's hypermodern Gigantium processor like thick dough. Perhaps it would be a good idea to purify the sap called software to give John Doe the chance to partially enjoy the overwhelming computational powers of his expensive Gigantium machine?&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Another obstacle to unleashing the powers of recent processors is the flying stack pointer ESP. Modern software design should use the capabilities of modern processors instead of blocking them with outdated mechanisms and ancient techniques. A pointlessly floating stack pointer with random content cannot seriously be used to address stack elements. How to solve these problems elegantly is the topic of the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/considerations.html'&gt;following post&lt;/a&gt;.&lt;/font&gt;&lt;/div&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/8581436573519430256/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/analysis.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8581436573519430256'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8581436573519430256'/><link 
rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/analysis.html' title='06 - Analysis'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-8202457287731474765</id><published>2010-04-13T04:27:00.001+02:00</published><updated>2010-04-14T14:53:10.018+02:00</updated><title type='text'>05 - Improvements</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Obviously, the code generated by GCC 3.3.5 is anything but optimised. Even if you do not know anything about reading source code, you surely are able to grasp what those comments say. This document is not an introduction to programming, so you have to rely on my words, but you can be sure: I definitely know what I am talking about.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Re-arranging some parts and reducing code sequences to the instructions that are really required shrinks GCC's draft markedly. 
Applying some human brain, the remaining (optimised) code should run about 30 percent faster now.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        .data&lt;br/&gt;&lt;br/&gt;        .p2align 4,0x00&lt;br/&gt;    jt0:.long  L02&lt;br/&gt;        .long  L03&lt;br/&gt;        .long  L04&lt;br/&gt;        .long  L05&lt;br/&gt;        .long  L16&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L06&lt;br/&gt;        .long  L07&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L08&lt;br/&gt;        .long  L09&lt;br/&gt;        .long  L10&lt;br/&gt;        .long  L11&lt;br/&gt;        .long  L12&lt;br/&gt;        .long  L13&lt;br/&gt;        .long  L14&lt;br/&gt;        .long  L15&lt;br/&gt;        .long  L08&lt;br/&gt;        .long  L09&lt;br/&gt;        .long  L10&lt;br/&gt;        .long  L11&lt;br/&gt;        .long  L12&lt;br/&gt;        .long  L13&lt;br/&gt;        .long  L14&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;Following recommendations of AMD and LETNi, the jump table is moved to the &lt;i&gt;.data&lt;/i&gt; segment. 
This is much better than mixing code and data in the &lt;i&gt;.code&lt;/i&gt; segment.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        .text&lt;br/&gt;&lt;br/&gt;        .align 2,0x90&lt;br/&gt;.globl MoveDlg&lt;br/&gt;MoveDlg:pushl  %ebp&lt;br/&gt;        movl   %esp,%ebp&lt;br/&gt;        pushl  %edi&lt;br/&gt;        pushl  %esi&lt;br/&gt;        movl   0x08(%ebp),%edi&lt;br/&gt;        movl   0x0C(%ebp),%eax&lt;br/&gt;        movzwl 0x10(%ebp),%ecx&lt;br/&gt;        movl   _GVAR,%esi&lt;br/&gt;        cmpl   $0x30,%eax&lt;br/&gt;        je     L01&lt;br/&gt;        cmpl   $0x20,%eax&lt;br/&gt;        je     L00&lt;br/&gt;        cmpl   $0x3B,%eax&lt;br/&gt;        jne    L15&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The distributor was optimised for the branch prediction logic. WM_CONTROL was put on top of the distributor, because most sent messages are WM_CONTROL messages. WM_COMMAND is only sent if the user pushes a button. No user is able to recognise delays of about 5 ns, so we can live with a ten cycle penalty if a branch target is mispredicted. WM_INITDLG is sent only once. While the 1st comparison is 'guessed' as not taken, the branch does not trigger a penalty. 
The 2nd and all following comparisons are assumed to be taken, so the branch to the default routine (DefDP()) does not trigger penalties, as well.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        pushl  $0xE9&lt;br/&gt;        pushl  $0xD3&lt;br/&gt;        pushl  $0xD2&lt;br/&gt;        pushl  %edi&lt;br/&gt;        call   _DLGtxt&lt;br/&gt;        pushl  $0x00&lt;br/&gt;        pushl  $-0x01&lt;br/&gt;        pushl  $0x0120&lt;br/&gt;        pushl  $0x1240&lt;br/&gt;        pushl  %edi&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        addl   $0x08,%esp&lt;br/&gt;        pushl  $0x1248&lt;br/&gt;        pushl  %edi&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        addl   $0x08,%esp&lt;br/&gt;        pushl  $0x1250&lt;br/&gt;        pushl  %edi&lt;br/&gt;        call   _SnDIM&lt;br/&gt;        addl   $0x08,%esp&lt;br/&gt;        pushl  $0x1258&lt;br/&gt;        pushl  %edi&lt;br/&gt;        call   _SnDIM&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The three parameters on top are &lt;i&gt;push&lt;/i&gt;ed for the first call, only. This saves nine redundant instructions.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        addl   $0x14,%esp&lt;br/&gt;        movl   0x1C54(%esi),%ecx&lt;br/&gt;        movl   0x01D8(%esi),%edx&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Both parameters can be preloaded at this point, because FDacc() is a function taken from ST-Open's library. Functions in my libraries restore all registers (including ECX and EDX) by default - they are 'clean'. But - watch out: MoveDlg() is a function following the C conventions. 
ECX and EDX are neither saved nor restored - MoveDlg() is a 'dirty' function.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;        pushl  %edi&lt;br/&gt;        pushl  $0x00&lt;br/&gt;        pushl  $0x02&lt;br/&gt;        pushl  $0x00&lt;br/&gt;        pushl  $0x0B&lt;br/&gt;        pushl  %ecx&lt;br/&gt;        call   _FDacc&lt;br/&gt;        pushl  %edx&lt;br/&gt;        pushl  $0x00&lt;br/&gt;        pushl  $0x02&lt;br/&gt;        pushl  $0x04&lt;br/&gt;        pushl  $0x0B&lt;br/&gt;        pushl  %ecx&lt;br/&gt;        call   _FDacc&lt;br/&gt;        addl   $0x40,%esp&lt;br/&gt;        movl   $0x00,0x28E0(%esi)&lt;br/&gt;        pushl  %edi&lt;br/&gt;        call   _CtrWn&lt;br/&gt;        call   _DlgShow&lt;br/&gt;        jmp   3f&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The next distributor was optimised for code reduction. Because none of the three buttons is pushed more than once (in general), we can live with a ten cycle penalty for one or two mispredicted branch(es). 
The delay added by the penalty is at least six orders of magnitude shorter than anything human senses could perceive.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L00:subl  $0x1231,%ecx&lt;br/&gt;        je    0f&lt;br/&gt;        decl  %ecx&lt;br/&gt;        je    1f&lt;br/&gt;        decl  %ecx&lt;br/&gt;        jne   L15&lt;br/&gt;        pushl $0x11&lt;br/&gt;        call  _Help&lt;br/&gt;        jmp   3f&lt;br/&gt;      0:movl  $0x00,0x28E0(%esi)&lt;br/&gt;        jmp   2f&lt;br/&gt;      1:orl   $0x00040000,0x28E0(%esi)&lt;br/&gt;      2:pushl %edi&lt;br/&gt;        call _WinDD&lt;br/&gt;      3:addl  $0x04,%esp&lt;br/&gt;        jmp   L16&lt;br/&gt;&lt;br/&gt;    L01:subl   $0x1240,%ecx&lt;br/&gt;        js     L15&lt;br/&gt;        cmpl   $0x1E,%ecx&lt;br/&gt;        ja     L15&lt;br/&gt;        movl   0x28E0(%esi),%eax&lt;br/&gt;        jmp    *jt0(, %ecx, 4)&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The jump table was moved to the &lt;i&gt;.data&lt;/i&gt; segment. To keep an overview, your symbols for jump tables generally should be marked with special names. 
ST-Open uses the symbol 'jtX', where X is the number of the current jump table.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L02:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x1000,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L03:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x0800,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L04:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x0400,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L05:andl   $0xFFFFE1FF,%eax&lt;br/&gt;        orl    $0x0200,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L06:andl   $0xFFFFFE7F,%eax&lt;br/&gt;        orl    $0x0100,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L07:andl   $0xFFFFFE7F,%eax&lt;br/&gt;        orl    $0x80,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L08:andl   $0xFFFFFF80,%eax&lt;br/&gt;        orl    $0x40,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L09:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x20,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L10:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x10,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L11:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x08,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L12:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x04,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L13:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x02,%eax&lt;br/&gt;        jmp    0f&lt;br/&gt;    L14:andl   $0xFFFFFE80,%eax&lt;br/&gt;        orl    $0x01,%eax&lt;br/&gt;      0:movl   %eax,0x28E0(%esi)&lt;br/&gt;        jmp    L16&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;This part surely could be reduced further if I could remember what all those flags are good for...&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    L15:popl  %esi&lt;br/&gt;        popl  %edi&lt;br/&gt;        popl  %ebp&lt;br/&gt;        jmp  _DefDP&lt;br/&gt;&lt;br/&gt;    
L16:xorl  %eax, %eax&lt;br/&gt;        popl  %esi&lt;br/&gt;        popl  %edi&lt;br/&gt;        popl  %ebp&lt;br/&gt;        ret&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;'Exits' belong to the bottom of a function. First, the processor does not have to jump back and forth to random locations within the instruction chain. Second, human senses perceive structured (sorted) input much faster than random patterns spread all over the screen.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.comm _GVAR,4&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;One (used) out of all (unused). The size of the global variable was reset to the proper size of 32 bit, so four - rather than one - variable(s) will fit into one paragraph (16 bytes) again.&lt;/font&gt;&lt;/div&gt;&lt;br/&gt;Go to the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/analysis.html'&gt;next post&lt;/a&gt; (06 - Analysis).&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/8202457287731474765/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/improvements.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8202457287731474765'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4412535206273251260/posts/default/8202457287731474765'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/improvements.html' title='05 - Improvements'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-6078931053018077570</id><published>2010-04-13T04:03:00.001+02:00</published><updated>2010-04-14T14:47:58.424+02:00</updated><title type='text'>04 - Caveats</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;As mentioned before, conventions, methods and techniques introduced with the programming language C have a lot of advantages over the monolithic code we used to write for DOS and other ancient operating systems. All functions are easily portable to other operating systems or processor architectures and can be used multiple times, leading to versatile applications for multiple platforms. Unfortunately, after those C conventions were established, they were never updated or revised to keep pace with the development of processor architectures. If we compare an 8086 against a recent quad-core Athlon, a picturesque comparison would be a cart dragged by a tired ox versus an Airbus A380. While no sane person would ever waste a thought on equipping an A380 with a harness and motivating it to lift off with reins, whip and loud 'hee' and 'ho' shouts, software designers all over the world practise such weird things every day. 
They still create 'flying' stack frames, use slow &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; and &lt;i&gt;push&lt;/i&gt; instructions with the obligatory update of the stack pointer instead of the faster &lt;i&gt;mov&lt;/i&gt; and continue to abuse valuable resources by using EBP as base pointer. This reduces the too small register set of the x86 architecture by 1/7th, forcing the programmer to use slow memory reads and writes instead of much faster register operations.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;To demonstrate the caveats, we analyse a dialog procedure taken from an existing ST-Open program. The dialog consists of four groups of radiobuttons, the three buttons 'Abort', 'Move', 'Help' and some static texts. &lt;i&gt;DLGtxt()&lt;/i&gt; sets all texts in this dialog to strings taken from a subfield with the current language (multi-lingual dialog and menu texts are an integral part of ST-Open's libraries). The dialog is used to move datasets within a datafield, the target is selected with the radiobuttons.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;/font&gt;&lt;font face='monospace'&gt;    .text&lt;br/&gt;    .p2align 4,,15&lt;br/&gt;&lt;br/&gt;.globl MoveDlg&lt;br/&gt;MoveDlg:&lt;br/&gt;    pushl   %ebp&lt;br/&gt;    movl    %esp, %ebp&lt;br/&gt;    pushl   %esi&lt;br/&gt;    pushl   %ebx&lt;br/&gt;    movl    12(%ebp), %edx&lt;br/&gt;    movl    8(%ebp), %esi&lt;br/&gt;    movl    16(%ebp), %ecx&lt;br/&gt;    movl    20(%ebp), %ebx&lt;br/&gt;    cmpl    $48, %edx&lt;br/&gt;    je    L12&lt;br/&gt;    ja    L39&lt;br/&gt;    cmpl    $32, %edx&lt;br/&gt;    je    L4&lt;br/&gt;&lt;br/&gt;L37:movl    %ebx, 20(%ebp)&lt;br/&gt;    movl    %ecx, 16(%ebp)&lt;br/&gt;    movl    %edx, 12(%ebp)&lt;br/&gt;    movl    %esi, 8(%ebp)&lt;br/&gt;    leal    -8(%ebp), %esp&lt;br/&gt;    popl    %ebx&lt;br/&gt;    popl    %esi&lt;br/&gt;    popl    %ebp&lt;br/&gt;    jmp    
_DefDP&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Compared against GCC 2.6.1 (1993), GCC 3.3.5 (2006) generates much worse code, spiced with a lot of counterproductive, superfluous instructions. Writing back parameters to the stack violates some basic rules. First, there's absolutely no need to write data back to the locations we took them from some instructions before. Second, writes to stack locations above ESP violate all rules of proper programming. If any function starts to write data to the stack frames of other functions, it is just a question of time until we encounter disastrous malfunctions.&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    .p2align 4,,7&lt;br/&gt;&lt;br/&gt; L4:movl    %ecx, %eax&lt;br/&gt;    andl    $65535, %eax&lt;br/&gt;    cmpl    $4658, %eax&lt;br/&gt;    je    L7&lt;br/&gt;    jg    L11&lt;br/&gt;    cmpl    $4657, %eax&lt;br/&gt;    je    L6&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Because we only need the low word stored in message parameter 1, it would be a better idea to extract this low word with the instruction &lt;i&gt;movzwl 0x10(%ebp),%ecx&lt;/i&gt; rather than to waste valuable clock cycles with the code sequence shown above. Recent processors have three execution pipes, not just one. The chosen way sends two execution pipes to sleep while one pipe is busy extracting data from a register. This is repeated two times. 
We could switch off two execution pipes while this code is executed, because we created two avoidable dependencies.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt; L9:movl    %ebx, 20(%ebp)&lt;br/&gt;    movl    %ecx, 16(%ebp)&lt;br/&gt;    movl    %edx, 12(%ebp)&lt;br/&gt;    movl    %esi, 8(%ebp)&lt;br/&gt;    leal    -8(%ebp), %esp&lt;br/&gt;    popl    %ebx&lt;br/&gt;    popl    %esi&lt;br/&gt;    popl    %ebp&lt;br/&gt;    jmp    _DefDP&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Obviously, L37 and L9 provide identical code. Using our brain, these eighteen redundant (partially pointless) lines can be reduced to five really necessary instructions.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt; L6:movl    _GVAR, %eax        # [1]&lt;br/&gt;    subl    $12, %esp&lt;br/&gt;    movl    $0, 10464(%eax)&lt;br/&gt;&lt;br/&gt;L42:pushl    %esi&lt;br/&gt;    call    _WinDD&lt;br/&gt;&lt;br/&gt;L41:addl    $16, %esp&lt;br/&gt;&lt;br/&gt;    .p2align 4,,7&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Up to seven &lt;i&gt;nop&lt;/i&gt;s are executed every time we branch to this part of code. 
It is a good way to slow down execution flow as much as possible.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt; L2:leal    -8(%ebp), %esp&lt;br/&gt;    xorl    %eax, %eax&lt;br/&gt;    popl    %ebx&lt;br/&gt;    popl    %esi&lt;br/&gt;    popl    %ebp&lt;br/&gt;    ret&lt;br/&gt;&lt;br/&gt;L11:cmpl    $4659, %eax&lt;br/&gt;    jne    L9&lt;br/&gt;    subl    $12, %esp&lt;br/&gt;    pushl    $17&lt;br/&gt;    call    _Help&lt;br/&gt;    jmp    L41&lt;br/&gt;&lt;br/&gt; L7:pushl    %edx&lt;br/&gt;    pushl    %edx&lt;br/&gt;    pushl    $18&lt;br/&gt;    movl    _GVAR, %eax        # [1]&lt;br/&gt;    addl    $10464, %eax&lt;br/&gt;    pushl    %eax&lt;br/&gt;    call    _FlgS&lt;br/&gt;    popl    %eax&lt;br/&gt;    jmp    L42&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The branch prediction logic assumes every first branch as &lt;i&gt;false&lt;/i&gt; if there is no entry in its internal table. Almost all of the above branches trigger the obligatory ten penalty cycles, because the wrong branch instructions were chosen.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    .p2align 4,,7&lt;br/&gt;&lt;br/&gt;L39:cmpl    $59, %edx&lt;br/&gt;    jne    L37&lt;br/&gt;    pushl   $233&lt;br/&gt;    pushl   $211&lt;br/&gt;    pushl   $210&lt;br/&gt;    pushl   %esi&lt;br/&gt;    call    _DLGtxt&lt;br/&gt;    movl    $0, (%esp)&lt;br/&gt;    pushl   $-1&lt;br/&gt;    pushl   $288&lt;br/&gt;    pushl   $4672&lt;br/&gt;    pushl   %esi&lt;br/&gt;    call    _SnDIM&lt;br/&gt;    addl    $20, %esp&lt;br/&gt;    pushl   $0&lt;br/&gt;    pushl   $-1&lt;br/&gt;    pushl   $288&lt;br/&gt;    pushl   $4680&lt;br/&gt;    pushl   %esi&lt;br/&gt;    call    _SnDIM&lt;br/&gt;    addl    $20, %esp&lt;br/&gt;    pushl   $0&lt;br/&gt;    pushl   $-1&lt;br/&gt;    pushl   $288&lt;br/&gt;    pushl   $4688&lt;br/&gt;    pushl   %esi&lt;br/&gt;    call    _SnDIM&lt;br/&gt;    addl    $20, 
%esp&lt;br/&gt;    pushl   $0&lt;br/&gt;    pushl   $-1&lt;br/&gt;    pushl   $288&lt;br/&gt;    pushl   $4696&lt;br/&gt;    pushl   %esi&lt;br/&gt;    call    _SnDIM&lt;br/&gt;    addl    $24, %esp&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Only the second parameter changes for the four consecutive calls of SnDIM(). Twelve of these twenty &lt;i&gt;push&lt;/i&gt; instructions (60 percent) are redundant.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    pushl    %esi&lt;br/&gt;    pushl    $0&lt;br/&gt;    pushl    $2&lt;br/&gt;    pushl    $0&lt;br/&gt;    pushl    $11&lt;br/&gt;    movl     _GVAR, %eax        # [1]&lt;br/&gt;    movl     7252(%eax), %eax&lt;br/&gt;    pushl    %eax&lt;br/&gt;    call     _FDacc&lt;br/&gt;    movl     _GVAR, %eax        # [1]&lt;br/&gt;    addl     $24, %esp&lt;br/&gt;    movl     472(%eax), %ebx&lt;br/&gt;    pushl    %ebx&lt;br/&gt;    pushl    $0&lt;br/&gt;    pushl    $2&lt;br/&gt;    pushl    $4&lt;br/&gt;    pushl    $11&lt;br/&gt;    movl     7252(%eax), %ecx&lt;br/&gt;    pushl    %ecx&lt;br/&gt;    call    _FDacc&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Only parameters 3 and 6 change for both calls of FDacc(). As long as we are bound to &lt;i&gt;push&lt;/i&gt; instructions, there is no way to change just these parameters. 
We have to push all six parameters again, because the parameter on top must be changed.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    addl    $20, %esp&lt;br/&gt;    movl    _GVAR, %eax        # [1]&lt;br/&gt;    movl    $0, 10464(%eax)&lt;br/&gt;    pushl   %esi&lt;br/&gt;    call    _CtrWn&lt;br/&gt;    movl    %esi, (%esp)&lt;br/&gt;    call    _DlgShow&lt;br/&gt;    jmp    L41&lt;br/&gt;&lt;br/&gt;    .p2align 4,,7&lt;br/&gt;&lt;br/&gt;L12:movl    %ecx, %eax&lt;br/&gt;    andl    $65535, %eax&lt;br/&gt;    subl    $4672, %eax&lt;br/&gt;    cmpl    $30, %eax&lt;br/&gt;    ja    L37&lt;br/&gt;    jmp    *L36(,%eax,4)&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Again, it would be better to extract the low word via one &lt;i&gt;movzwl 0x10(%ebp),%ecx&lt;/i&gt; rather than to use the chosen way. Probably, parallel execution was considered to be too fast?&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;    .p2align 2&lt;br/&gt;    .align 2,0xcc&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;I don't know what the second &lt;i&gt;.align&lt;/i&gt; is good for. Any hints? My jump table is too large, because C programmers do not think about side effects like blowing up code while assigning resource IDs to 'straight' numbers. 
Due to the gaps between those IDs, there are a lot of superfluous entries in this jump table.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;L36:.long    L14&lt;br/&gt;    .long    L15&lt;br/&gt;    .long    L16&lt;br/&gt;    .long    L17&lt;br/&gt;    .long    L2&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L18&lt;br/&gt;    .long    L19&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L21&lt;br/&gt;    .long    L23&lt;br/&gt;    .long    L25&lt;br/&gt;    .long    L27&lt;br/&gt;    .long    L29&lt;br/&gt;    .long    L31&lt;br/&gt;    .long    L33&lt;br/&gt;    .long    L37&lt;br/&gt;    .long    L21&lt;br/&gt;    .long    L23&lt;br/&gt;    .long    L25&lt;br/&gt;    .long    L27&lt;br/&gt;    .long    L29&lt;br/&gt;    .long    L31&lt;br/&gt;    .long    L33&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;By the way: Jump tables belong to the &lt;i&gt;.data&lt;/i&gt;, not to the &lt;i&gt;.code&lt;/i&gt; segment. 
AS supports jump tables in the &lt;i&gt;.data&lt;/i&gt; segment, so there's no need to violate the recommendations of AMD and LETNi as all versions of GCC do...&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;L14:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andb    $225, %ah&lt;br/&gt;    orb     $16, %ah&lt;br/&gt;&lt;br/&gt;    .p2align 4,,7&lt;br/&gt;&lt;br/&gt;L40:movl    %eax, 10464(%edx)&lt;br/&gt;    jmp    L2&lt;br/&gt;&lt;br/&gt;L15:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andb    $225, %ah&lt;br/&gt;    orb     $8, %ah&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L16:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andb    $225, %ah&lt;br/&gt;    orb     $4, %ah&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L17:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andb    $225, %ah&lt;br/&gt;    orb     $2, %ah&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L18:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-385, %eax&lt;br/&gt;    orb     $1, %ah&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L19:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-385, %eax&lt;br/&gt;    orb     $-128, %al&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L21:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-128, %eax&lt;br/&gt;    orl     $64, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L23:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-384, %eax&lt;br/&gt;    orl     $32, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L25:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-384, %eax&lt;br/&gt;    orl     
$16, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L27:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-384, %eax&lt;br/&gt;    orl     $8, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L29:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-384, %eax&lt;br/&gt;    orl     $4, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L31:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-384, %eax&lt;br/&gt;    orl     $2, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;br/&gt;L33:movl    _GVAR, %edx        # [1], [2]&lt;br/&gt;    movl    10464(%edx), %eax&lt;br/&gt;    andl    $-384, %eax&lt;br/&gt;    orl     $1, %eax&lt;br/&gt;    jmp    L40&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Many redundant instructions unnecessarily blow up code size. The first two lines of all jump targets could be reduced to a common read before the distributor branches to the selected table entry.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;.comm _hab,16&lt;br/&gt;.comm _DEBUG,16&lt;br/&gt;.comm _USE_LDF,16&lt;br/&gt;.comm _LDR_AVAIL,16&lt;br/&gt;.comm _MSGLD,16&lt;br/&gt;.comm _BMM,16&lt;br/&gt;.comm _BNR,16&lt;br/&gt;.comm _GVAR,16&lt;br/&gt;.comm _BST,16&lt;br/&gt;.comm _BBF,16&lt;br/&gt;.comm _TST,16&lt;br/&gt;.comm _MHSTR,16&lt;br/&gt;.comm _LDF,16&lt;br/&gt;.comm _DUMPLINE,16&lt;br/&gt;.comm _DUMPCNT,16&lt;br/&gt;.comm _OLH_MODE,16&lt;br/&gt;.comm _SEC,16&lt;br/&gt;.comm _XXX,16&lt;br/&gt;.comm _FLD_XXX,16&lt;br/&gt;.comm _FLD_SEC,16&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;We only use _GVAR in this file - enumerating all global variables is quite stupid. 
All globals are defined as doublewords (32 bit), so we might ask why GCC expands all these variables to a size of 128 bit, filling up the &lt;i&gt;.data&lt;/i&gt; segment with garbage.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Footnotes&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;[1] This is the typical side effect of the C convention saying &lt;i&gt;'Thou shalt not save ECX and EDX'&lt;/i&gt;. Using two of six registers as a sanitary fill for temporary data, we reduce the set of safe registers to three, forcing us to reload frequently required&lt;/font&gt;&lt;font face='arial'&gt; parameters from memory over and over again. Reloading parameters more than two times eats up the advantage of omitting the two &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; instructions to save and restore the content of ECX and EDX. After the third reload operation, we are on the losing side. Adding unnecessary loads of parameters to our code slows down our functions and does not speed them up.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;[2] Loading _GVAR into EDX and the flags MVP_FLAGS into EAX &lt;i&gt;could&lt;/i&gt; be done in L12. 
Applying some brain didn't just save 24 lines of code, it also sped up the 13 target functions, because loading and evaluating MVP_FLAGS were separated and the dependency chain was reduced to two single rather than three consecutive dependencies.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='arial'&gt;Go to the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/improvements.html'&gt;next post&lt;/a&gt; (05 - Improvements).&lt;/font&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/6078931053018077570/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/caveats.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/6078931053018077570'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/6078931053018077570'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/caveats.html' title='04 - Caveats'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-6759777282416260506</id><published>2010-04-13T03:07:00.001+02:00</published><updated>2010-04-14T14:39:59.143+02:00</updated><title type='text'>03 - Stack Frames</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;When the session manager passes control to our main() function, we surely create a message queue and a frame window, but we also might play some music or those mega-cool animations which are a must for all recent applications swimming along the main stream.&lt;br/&gt;&lt;br/&gt;Whatever programs might do when they get control over the computer, the first lines of any function, including &lt;i&gt;main()&lt;/i&gt;, always follow the same routine. First: A stack frame is built. Second: All used registers are saved on the stack. Actually, ECX and EDX are never saved because the C-conventions say so (see &lt;i&gt;Conventions&lt;/i&gt;). Third: The code found in the function body performs all tasks the function is coded for. Fourth: All saved registers are restored. Fifth: The stack frame is destroyed (released). Sixth: The final &lt;i&gt;ret&lt;/i&gt; is executed and the function returns to its caller.&lt;br/&gt;&lt;br/&gt;One thing should be mentioned explicitly: We only need a stack frame if we have to store local variables or other temporary data structures on the stack. Functions without local variables do not need a stack frame at all. Building a stack frame takes at least 6 clock cycles and occupies 10 bytes of code. To save superfluous activities, you should switch on the &lt;b&gt;-fomit-frame-pointer&lt;/b&gt; option (&lt;i&gt;GCC&lt;/i&gt;) by default. It skips building stack frames where they are not required. 
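&lt;br/&gt;&lt;br/&gt;As a minimal sketch of such a frameless function, consider the following code. The name &lt;i&gt;_add2&lt;/i&gt; is made up for illustration and does not appear elsewhere in this paper. The function adds its two parameters and returns the sum in EAX. Because no base pointer is set up, the parameters are addressed relative to ESP: on entry, the return address sits at 0x00(%esp), parameter 1 at 0x04(%esp) and parameter 2 at 0x08(%esp). Since only EAX is overwritten, no register has to be saved.&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;_add2:                   # hypothetical function, no stack frame&lt;br/&gt;movl 0x04(%esp),%eax     # EAX = parameter 1&lt;br/&gt;addl 0x08(%esp),%eax     # EAX = parameter 1 + parameter 2&lt;br/&gt;ret                      # return to caller, result in EAX&lt;/font&gt;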
&lt;br/&gt;&lt;br/&gt;Grasping how a base pointer is used and how it works is the key to understand conventional programming techniques. Because this is a very important issue, all explanations are very detailed. Some may find them too lengthy, but we should give others the chance to get in touch with all aspects, so (hopefully) anyone is able to gather the knowledge required to apply the learned stuff in real life.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Example&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;The following example shows how conventional stack frames are created. It is trivial code used in every program. Even huge monster applications like operating systems, browsers, audio studios, et cetera use the same pattern. They only differ in the size of their stack frames.&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;font face='monospace'&gt;pushl %ebp               # save old base pointer&lt;br/&gt;movl %esp,%ebp           # load new base pointer&lt;br/&gt;subl $0x08,%esp          # reserve 2 local variables&lt;br/&gt;pushl %ebx               # save EBX&lt;br/&gt;movl 0x08(%ebp),%eax     # EAX = argument count&lt;br/&gt;movl 0x0C(%ebp),%ebx     # EBX = argument vector&lt;br/&gt;movl $0x00,-0x04(%ebp)   # local variable 1 =  0&lt;br/&gt;movl $0x20,-0x08(%ebp)   # local variable 2 = 32&lt;br/&gt;pushl %eax               # copy EAX to -0x10(%ebp)&lt;br/&gt;pushl %ebx               # copy EBX to -0x14(%ebp)&lt;/font&gt;&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;&lt;b&gt;&lt;big&gt;&lt;big&gt;The Stack&lt;/big&gt;&lt;/big&gt;&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;After execution of the above instructions, the stack looks like this:&lt;br/&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ovd2D9lgI/AAAAAAAAAFw/l56NA8zCofU/stack2.png'/&gt;&lt;br/&gt;&lt;br/&gt;ESP always points to the current stack element. 
In our case, this is the address where the content of EBX was copied to. Depending on the instructions in the function body, ESP moves down towards stack bottom or back towards stack top. In properly designed functions, the content of ESP can never be greater than the content of EBP. This tells us that the base pointer is not only used to address stack elements. It also marks the border between the current stack frame and the stack frame of the calling function.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;pushl %ebp&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;The first step to create a new stack frame is saving the content of the old base pointer of the calling function on the stack. The content of the old base pointer &lt;i&gt;must not&lt;/i&gt; be changed under any circumstances! Otherwise, the calling function uses an invalid basis to address its local variables. It will crash if it tries to read or write data to an address stored in a local variable, because that address, formerly stored at -0x0C[EBP], now might be found at 0x3C[EBP]. At the latest, the calling function commits suicide while trying to return to its caller. It fills the base pointer with random data, then loads other random data into the instruction pointer. The attempt to execute that 'code' raises one of those exceptions and an 'access violation' is reported on the user's screen.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;movl %esp,%ebp&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;In the second step, we copy the address of the current stack element to EBP. The new base pointer now contains the address where the content of the old base pointer is stored. The next stack element below the new base pointer is the topmost stack element we are allowed to write to. All stack elements above EBP - including 0x00[EBP]! - should only be read from, but never written to! From here on, we use the base pointer to address local variables and parameters passed to our function. 
In other words, these data are addressed with offsets to the 'frozen' base pointer.&lt;br/&gt;&lt;br/&gt;As shown in &lt;i&gt;figure 03&lt;/i&gt;, the old base pointer is stored at offset zero to the current base pointer. We write this as 0x00[EBP], where the &lt;b&gt;0x&lt;/b&gt; tells us this is a hexadecimal number. Everything related to programming should be written in hexadecimal notation. It is much easier to read, numbers are shorter and always formatted correctly. For example, 0xF100 tells us at first sight: 'It's the second page in memory block 0xF000.' Its decimal equivalent 61696 does not tell us anything. We have to start a calculator to translate it into 0xF100, wasting precious time.&lt;br/&gt;&lt;br/&gt;Back to the base pointer. Passed parameters and the return address to the calling function are stored in stack elements above the one the base pointer points to. Therefore, they are addressed with positive offsets. Our return address is stored at 0x04[EBP], followed by one or more optional parameters, where parameter 1 always is stored at 0x08[EBP], parameter 2 at 0x0C[EBP], and so on. The 'top down' order is caused by a C-convention, saying that all parameters are &lt;i&gt;push&lt;/i&gt;ed onto the stack from back to front, starting with the last parameter the compiler finds inside those round brackets following the name of the called function.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;subl $0x08,%esp&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;The real creation of a stack frame is done with the third and last step. First, we calculate the required size for all variables and structures, then we subtract the result from ESP. Of course, the result must be rounded up to the next multiple of the standard datasize before we subtract it from the stack pointer. If we subtracted an 'odd' number, the stack would be corrupted and no longer valid. Everything addressed via ESP, e.g. by 
&lt;i&gt;call&lt;/i&gt; or &lt;i&gt;push&lt;/i&gt;, would be misaligned, causing a lot of penalty cycles. Suppose ESP were accidentally set to 0xFFE2 instead of 0xFFE4 in our sample code (e.g. by subtracting 0x0A rather than the proper value 0x08). If we then stored the two local variables 0x89ABCDEF and 0x01234567 relative to ESP, at 0x04[ESP] and 0x00[ESP], and copied variable 2 from -0x08[EBP] to EBX, EBX would finally contain the invalid number 0xCDEF0123 - the doubleword stored at address 0xFFE4, which straddles both variables.&lt;br/&gt;&lt;br/&gt;With this subtraction, we reserve the subtracted number of bytes on the stack. This reserved area is safe from being overwritten by any following &lt;i&gt;call&lt;/i&gt; or &lt;i&gt;push&lt;/i&gt; instruction, because ESP always is set to the next lower stack element before a write operation. Because all local variables are lying below the stack element where the old base pointer is stored, they are addressed with negative offsets. The first is stored at address -0x04[EBP], the 2nd at -0x08[EBP], and so on. It matters neither how you name your local variables nor whether you count them from top to bottom or vice versa. The only important thing is to remember where which of them is stored.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;pushl %ebx&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;Saves the content of EBX on the stack. Following the C-conventions, the content of EBX, EDI and ESI must be saved before overwriting them and must be restored before returning to the calling function. In other words: The content of these registers must be preserved by all functions (see &lt;i&gt;Conventions&lt;/i&gt;).&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;Urgent Needs&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;In conventional code, the content of EBP must never be changed. If you cannot avoid using EBP for general purposes, because you urgently need an extra register, you have to save it before using it. You should restore EBP immediately after passing the bottleneck. 
Not worth mentioning, but, nevertheless: EBP cannot be used to address local variables while it holds your private data...&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;The Function Body&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;ESP always points to the address of the current stack element. Looking at our example, it is quite obvious that we are going to call another function. Our local variable 1 might be a counter which is incremented each time the called function returns TRUE. Local variable 2 might be a loop counter, so the function might be called 32 times, counting how many times a specific condition was met.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;Stack Frame Destruction&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;After all instructions in the function body are executed, we first have to restore the saved registers. Next, we have to destroy the stack frame to release the area we reserved for our private use. Finally, we return to the calling function. There are two possible ways to perform these tasks:&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;font face='monospace'&gt;addl $0x08,%esp   # correction after 2 PUSH instructions&lt;br/&gt;popl %ebx         # restore EBX&lt;br/&gt;leave             # destroy stack frame (VP 3, 1 byte)&lt;br/&gt;ret               # return to caller&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;or&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;addl $0x08,%esp   # correction after 2 PUSH instructions&lt;br/&gt;popl %ebx         # restore EBX&lt;br/&gt;movl %ebp,%esp    # restore ESP (DP 1, 2 byte)&lt;br/&gt;popl %ebp         # restore EBP (VP 4, 1 byte)&lt;br/&gt;ret               # return to caller&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Both variants restore ESP and EBP. However, &lt;i&gt;leave&lt;/i&gt; is two clock cycles faster and two bytes smaller. The correction of ESP is required to restore the preserved content of EBX. 
Because we &lt;i&gt;push&lt;/i&gt;ed two parameters onto the stack after we &lt;i&gt;push&lt;/i&gt;ed EBX, ESP is eight bytes below the address where the content of EBX is stored. To POP the proper content into EBX, we have to add these eight bytes to ESP.&lt;br/&gt;&lt;br/&gt;If no registers were saved on the stack, no correction of ESP is required and &lt;i&gt;leave&lt;/i&gt; and &lt;i&gt;ret&lt;/i&gt; are the only instructions we need to restore ESP and EBP before we return to the caller.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;b&gt;Return To Caller&lt;/b&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;In our example, &lt;i&gt;ret&lt;/i&gt; copies the return address to the session manager of the operating system to EIP and continues execution of its code. The session manager closes our program, frees still allocated resources and passes the content we stored in EAX to the command interpreter.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;big&gt;&lt;big&gt;&lt;b&gt;What Did We Learn?&lt;/b&gt;&lt;/big&gt;&lt;/big&gt;&lt;br/&gt;&lt;br/&gt;All introduced mechanisms are, except the size of the stack frame and the number of preserved registers, identical for all functions following the C-conventions. As far as I know, all of the known operating systems follow these conventions. The advantages are obvious. Functions, once coded, are portable from one version to the next or from one operating system to another. In the latter case, minor changes must be applied if functions of the target OS have other names or expect parameters in a different order. Putting it all together, conventions simplify re-use and porting of existing code.&lt;br/&gt;&lt;br/&gt;But - every object in our universe has at least two (or more) sides. The disadvantages of the introduced methods, mechanisms and conventions are not as obvious as the advantages, because they are hidden in the deepest and darkest parts of the machine room. 
Only experienced, well trained technicians are able to find out why the machine chokes and does not run as fast as expected. In the &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/caveats.html'&gt;next post&lt;/a&gt;, we will analyse what C-conventions do on assembly language level.&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;/blockquote&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=146928a8-b5c2-88e8-bf1b-ab54a715b4f9' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-6759777282416260506?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/6759777282416260506/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/stack-frames.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/6759777282416260506'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/6759777282416260506'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/stack-frames.html' title='03 - Stack Frames'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8Ovd2D9lgI/AAAAAAAAAFw/l56NA8zCofU/s72-c/stack2.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-3064278949553777249</id><published>2010-04-13T01:29:00.001+02:00</published><updated>2010-04-14T14:36:51.071+02:00</updated><title type='text'>02 - Basics</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;The assembly language dialect used in this paper is not the one commonly used in the world of Windows, also known as &lt;i&gt;LETNi&lt;/i&gt; syntax. It is &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html'&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax, known in the world of Linux and Unix. When I began to write code for the x86 platform in 1993, &lt;i&gt;GCC&lt;/i&gt; was the only free development tool one could get, so I had to use &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html'&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax. If you only know &lt;i&gt;LETNi&lt;/i&gt; syntax, there is a short introduction to &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html'&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax in one of the next posts to learn the difference between the both. If you worked with &lt;i&gt;AS&lt;/i&gt; for a short time, you don't want to return to the complicated and perversed (or was that reversed?) &lt;i&gt;LETNi &lt;/i&gt;syntax anymore. The programming techniques introduced in this paper do not rely on a specific syntax. However, knowing &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/14-at-syntax.html'&gt;&lt;i&gt;AT&amp;amp;T&lt;/i&gt;&lt;/a&gt; syntax might help you to understand the sample code. Reaching the goal is what really counts. 
How we get there is another, slightly different problem.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;&lt;big&gt;&lt;big&gt;The Stack&lt;/big&gt;&lt;/big&gt;&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;Every &lt;i&gt;x86&lt;/i&gt; processor makes heavy use of a stack. Parameters we have to pass to called functions, return addresses to calling functions, local variables and structures are put onto or read from the stack. Unfortunately, most of today's programmers are only used to clicking together some of those prefabricated code fragments coming along with the daily 20 GB version upgrade for their favourite VisualXYZ(plus-minus-dotnet) development suite. If you ask any of them 'What is a stack?', they probably tell you something about hay or - of course! - money. Sad, but true. Exceptionally sad, because the mechanisms of a software stack are so simple that one is tempted to call the big idea behind them nothing less than 'brilliant'.&lt;br/&gt;&lt;br/&gt;Whenever you compile a program, you pass a definition file with the extension &lt;i&gt;.def&lt;/i&gt; to the linker (LINK.EXE or similar). The definition file holds some important information about the program for the session manager of your operating system. When the compiled program is started, the operating system reserves three independent memory blocks (code, data, stack) for the new process and the segment registers CS, DS+ES and SS are set to the address of one of these blocks. Whatever you defined as &lt;b&gt;STACKSIZE&lt;/b&gt; in the definition file, exactly this size is allocated for your stack segment. After allocating the required memory blocks for those three segments, the program code is copied to the code segment, all defined global variables are written to the data segment - they are 'initialised' - and rSP is set to the top of the stack segment. 
Finally, the address of the array with the command line parameters and the argument count are pushed onto the virgin stack before the session manager calls the &lt;i&gt;main()&lt;/i&gt; function of our program. Entering &lt;i&gt;main()&lt;/i&gt;, the processor starts to execute the code found there, until it stumbles upon the final &lt;i&gt;ret&lt;/i&gt; instruction and passes control back to the session manager.&lt;br/&gt;&lt;br/&gt;On entry to our program's main() function, the stack looks like this:&lt;br/&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8Ovd6SnBxI/AAAAAAAAAFw/BKBVYRLUuFk/stack4.png'/&gt;&lt;br/&gt;&lt;br/&gt;The top stack element holds the &lt;i&gt;argument vector&lt;/i&gt; argv[] (parameter 2), the next lower stack element holds the &lt;i&gt;argument count&lt;/i&gt; argc (parameter 1). The current stack element, the one ESP points to, holds the return address to the session manager. Whenever the program is terminated, the instruction pointer is loaded with this address and the terminating sequence of the session manager is executed. It first frees those resources our program may have reserved for itself, e.g. allocated memory blocks, open files, open devices, and so on. Finally, it releases the allocated segments and cleans up all structures holding control data of our program.&lt;br/&gt;&lt;br/&gt;In the running program, the content of the stack pointer is decreased with every &lt;i&gt;push&lt;/i&gt; instruction, the creation of a stack frame or the call to another function. It is increased whenever we &lt;i&gt;pop&lt;/i&gt; data from the stack, release a stack frame or &lt;i&gt;ret&lt;/i&gt;urn to a calling function. In 32 bit functions, four bytes are subtracted from ESP with every &lt;i&gt;push&lt;/i&gt; or &lt;i&gt;call&lt;/i&gt;, while four bytes are added to ESP with every &lt;i&gt;pop&lt;/i&gt; or &lt;i&gt;ret&lt;/i&gt;. 
All subtractions or additions are done by the processor automatically, because they are an integral part of the mentioned instructions. Using a picturesque language, we might state: The stack grows with every &lt;i&gt;push&lt;/i&gt; or &lt;i&gt;call&lt;/i&gt; and shrinks with every &lt;i&gt;pop&lt;/i&gt; or &lt;i&gt;ret&lt;/i&gt;.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;&lt;big&gt;Using The Stack&lt;/big&gt;&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;To make sensible use of the stack, any &lt;i&gt;x86&lt;/i&gt; processor provides the instructions &lt;i&gt;call&lt;/i&gt;, &lt;i&gt;enter&lt;/i&gt;, &lt;i&gt;leave&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt;, &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;ret&lt;/i&gt;. All these instructions update the content of ESP automatically. Besides these special instructions, ESP can be used like any other register, as well. We are free to add or subtract immediate values or the content of another register to/from the stack pointer and do other funny things with ESP. However, the most utilised action probably is the subtraction of an immediate value from ESP to create a stack frame and the addition of exactly the same value to release that stack frame if we do not need it any longer. A detailed description follows later  on, see &lt;i&gt;About Stackframes&lt;/i&gt;. For now, we focus our attention on some more basic things like the instructions mentioned above. 
To understand the concept of conventional programming methods, it is very important to know how these instructions work and how they manipulate the stack and ESP.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;CALL&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;If the processor encounters a &lt;i&gt;call&lt;/i&gt; instruction, it first subtracts two, four or eight (depending on the standard datasize) from ESP, then stores the address of the instruction following the &lt;i&gt;call&lt;/i&gt; on the stack. Next, the address passed as a part of the &lt;i&gt;call&lt;/i&gt; instruction is loaded into the instruction pointer. Execution now is continued with the instructions found at the new location, until the processor stumbles upon a &lt;i&gt;ret&lt;/i&gt;.&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;div align='left'&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;call DoNothing  # call function DoNothing&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;xorl %eax,%eax  #&amp;lt;- the address of this instruction&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...             # is stored on the stack and loaded&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...             # into EIP with the RET instruction&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;DoNothing:      # local function DoNothing&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;ret             # return to caller...&lt;/font&gt;&lt;br/&gt;&lt;/div&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;ENTER&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Please do not use this vector path instruction - it blocks the processor for an entire 13 clock cycles. 
The usual replacement&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;pushl %ebp       # save EBP&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl %esp,%ebp   # save ESP&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;subl $0x10,%esp  # create stack frame&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;is executed in four clock cycles. Saving nine of 13 clock cycles makes the replacement 3.25 times as fast. You should prefer the replacement code over &lt;i&gt;enter&lt;/i&gt; under any circumstances! The 16 bytes are just an arbitrarily chosen example to keep the sample code valid. The real size you have to subtract from ESP depends on the number of local variables and other temporary data your function has to store on the stack.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;LEAVE&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;The &lt;i&gt;leave&lt;/i&gt; instruction is equivalent to the following two instructions:&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;movl %ebp,%esp   # restore ESP (DP 1, 2 byte)&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;popl %ebp        # restore EBP (VP 4, 1 byte)&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;Like &lt;i&gt;pop&lt;/i&gt;, &lt;i&gt;leave&lt;/i&gt; is a vector path instruction. Because &lt;i&gt;pop&lt;/i&gt; blocks the processor for four clock cycles and also has to wait for the valid result of the preceding &lt;i&gt;mov&lt;/i&gt; instruction, &lt;i&gt;leave&lt;/i&gt; actually is two clock cycles faster. Moreover, the one byte &lt;i&gt;leave&lt;/i&gt; is shorter than its three byte replacement. 
Hence, you should prefer &lt;i&gt;leave&lt;/i&gt; over the alternative method.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;POP&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh4.ggpht.com/_Z2WbH3F-E_Q/S8OvdQoL8LI/AAAAAAAAAFw/fi9qhf5hGrg/stack1.png'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;i&gt;Pop&lt;/i&gt; copies the content of the current stack element to a register or memory location, then adds, depending on the processor mode and an optional prefix, two, four or eight to the content of the stack pointer. Unlike the direct path &lt;i&gt;push&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt; is a vector path instruction. It is executed in pipes 0 and 1, while pipe 2 is blocked for the time the instruction is processed. Every &lt;i&gt;pop&lt;/i&gt; instruction lasts four clock cycles. &lt;i&gt;Pop&lt;/i&gt; is a special vector path &lt;i&gt;mov&lt;/i&gt; instruction, where ESP automatically is updated &lt;i&gt;after&lt;/i&gt; it was used to address the source of a copy operation.&lt;br/&gt;&lt;br/&gt;In general, &lt;i&gt;pop&lt;/i&gt; is used to restore the content of a register. You should keep track of the stack pointer, because it is quite difficult to find errors caused by asymmetrically executed &lt;i&gt;push&lt;/i&gt; and &lt;i&gt;pop&lt;/i&gt; instructions. 
Especially if some of them are inside a loop while others are not...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;popl %ebx        # restore EBX [=&amp;gt; ESP + 4(!)]&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;b&gt;PUSH&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;img style='max-width: 800px;' src='http://lh5.ggpht.com/_Z2WbH3F-E_Q/S8OvdGDhGyI/AAAAAAAAAFw/vjfwM4Eb5lo/stack0.png'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;i&gt;Push&lt;/i&gt; first subtracts, depending on the processor mode and an optional prefix, two, four or eight from the stack pointer, then copies an immediate value, the content of a register or the content of a memory location into the stack element the updated ESP points to. &lt;i&gt;Push&lt;/i&gt; is a special direct path &lt;i&gt;mov&lt;/i&gt; instruction, where ESP automatically is updated &lt;i&gt;before&lt;/i&gt; it is used to address the target of a copy operation. While the entire execution time of each &lt;i&gt;push&lt;/i&gt; instruction is 3 clock cycles, ESP is available one clock earlier (after 2 cycles) for the following instructions.&lt;br/&gt;&lt;br/&gt;In general, &lt;i&gt;push&lt;/i&gt; is used to put parameters or register contents onto the stack. It is a good idea to remove 'used' parameters from the stack as soon as possible, because they decrease the available stack size, cause avoidable stack pointer arithmetic, and so on. To remove them, we don't have to &lt;i&gt;pop&lt;/i&gt; them - beware! - from the stack. We just have to add the appropriate number of bytes to the stack pointer. If we &lt;i&gt;push&lt;/i&gt;ed two parameters in a 32 bit function as shown in the below example, we add 8 bytes to ESP. 
If it were eight parameters, we would have to add 8 * 4 = 32 bytes to ESP, and so on.&lt;br/&gt;&lt;br/&gt;Regardless of the lazy behavior of HLL (high level language) compilers, it is a good idea to correct the content of ESP after each &lt;i&gt;call&lt;/i&gt; instruction if you passed parameters to the called function. Humans have some trouble keeping track of the current content of the stack pointer, especially if the function body exceeds the size of their display. Compilers can do that much better. However, their output is less optimised than the code written by a human... ;)&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;pushl %eax       # put parameter 2 onto the stack&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;pushl %ebx       # put parameter 1 onto the stack&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;call _helpling   # call another function&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;addl $0x08,%esp  # correct ESP directly after CALL&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;b&gt;RET&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;Whenever the processor stumbles upon a &lt;i&gt;ret&lt;/i&gt; instruction, it copies the address stored in the current stack element into the instruction pointer EIP, then adds the standard datasize (2, 4 or 8 bytes) to the stack pointer. 
After ESP was updated, execution continues with the instruction found at the address EIP now points to.&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;ret              # return to caller&lt;/font&gt;&lt;br/&gt;&lt;font face='monospace'&gt;...&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;font face='arial'&gt;&lt;b&gt;&lt;big&gt;Stack Management&lt;/big&gt;&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;As mentioned in the descriptions of the single instructions, keeping track of the current content of the stack pointer is a &lt;i&gt;must&lt;/i&gt; with highest priority. Because passing of parameters generally is done with &lt;i&gt;push&lt;/i&gt; instructions, while parameters are taken from the stack by adding the appropriate amount of byte to the stack pointer ESP, it is very important to handle these operations with exceptional care. Especially the necessary corrections of the stack pointer bear some potential to be erroneous. Unfortunately, such errors cause unexpected crashes and malfunctions, and it is quite hard to track them down until the culprit causing the mess is found. This reaches a new dimension with the creation of a stack frame. What is this already mentioned, sinister stack frame? 
The &lt;a href='http://st-intelligentdesign.blogspot.com/2010/04/stack-frames.html'&gt;next post&lt;/a&gt; offers a detailed explanation.&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;br/&gt;&lt;br/&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=a929ff83-02a7-8b39-9783-0470320a2daa' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-3064278949553777249?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/3064278949553777249/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/basics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/3064278949553777249'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/3064278949553777249'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/basics.html' title='02 - Basics'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh6.ggpht.com/_Z2WbH3F-E_Q/S8Ovd6SnBxI/AAAAAAAAAFw/BKBVYRLUuFk/s72-c/stack4.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4412535206273251260.post-777887454296968076</id><published>2010-04-13T00:41:00.001+02:00</published><updated>2010-04-14T14:28:08.108+02:00</updated><title type='text'>01 - Introduction</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;&lt;div align='justify'&gt;&lt;font face='arial'&gt;Most people probably associate the term &lt;i&gt;Intelligent Design&lt;/i&gt; with the &lt;i&gt;Creationism&lt;/i&gt; movement rather than with a new, revolutionary programming technique. The appropriation of this term is an intentional sideswipe. Whatever invented gods and goddesses were able to do - a smart programmer can do it much better. This paper is an introduction to the next generation of programming, superior to old fashioned conventions and programming techniques.&lt;br/&gt;&lt;br/&gt;Compared with conventional programming techniques, &lt;i&gt;Intelligent Design&lt;/i&gt; represents a leap in quality. However, some knowledge about the creation and management of a conventional stack is required to understand the important difference between old fashioned programming techniques and &lt;i&gt;Intelligent Design&lt;/i&gt;. To convey this knowledge about conventional methods, the next pages offer a detailed introduction to stacks, stack frames and how they are managed. Without this knowledge, it probably is impossible to understand the alternative methods and techniques introduced with &lt;i&gt;Intelligent Design&lt;/i&gt;. Old fashioned programming techniques never kept pace with recent processors - the standards of so called &lt;i&gt;high level languages&lt;/i&gt; are designed to work with the first generation of microprocessors as well as the most recent quad-core machines. Unfortunately, the computational power of processors grew by several powers of ten, while software standards never followed this technical evolution. 
Today, we have mature high speed processors driven by software toddlers that never grew up. It is quite counter-productive to slow down high speed devices because their 'drivers' cannot handle most of the controls.&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;small&gt;&lt;b&gt;&lt;big&gt;&lt;big&gt;Copyright Note&lt;/big&gt;&lt;/big&gt;&lt;/b&gt;&lt;/small&gt;&lt;br/&gt;&lt;br/&gt;The programming techniques introduced in this paper are the intellectual property of &lt;b&gt;Bernhard Schornak&lt;/b&gt;. They are protected by international copyrights and published under the terms of the &lt;b&gt;&lt;a href='http://ft4fp.blogspot.com/2010/04/ft4fp-license.html'&gt;FT4FP&lt;/a&gt;-License&lt;/b&gt;. Any commercial use, trade or other form of exploitation for profit is strictly prohibited. Knowledge is common property and should be freely available to every human. It must not be abused as a proprietary commodity, only available to those who can afford to feed a few greedy individuals with money.&lt;br/&gt;&lt;br/&gt;This document was written for the ST-Open homepage in 2006. 
It was slightly modified for this blog.&lt;br/&gt;&lt;br/&gt;(C) 2006-2010: Bernhard Schornak&lt;br/&gt;&lt;/font&gt;&lt;/div&gt;&lt;font face='arial'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=6366dbac-67a8-8ada-a706-4d479a215f23' alt='' class='zemanta-pixie-img'/&gt;&lt;/font&gt;&lt;img class='zemanta-pixie-img' alt='' src='http://img.zemanta.com/pixy.gif?x-id=0ca00918-42ad-8cf9-9fb2-6c39a59f8cbf'/&gt;&lt;br/&gt;&lt;br/&gt;&lt;div class='zemanta-pixie'&gt;&lt;img src='http://img.zemanta.com/pixy.gif?x-id=4f20c347-c657-8506-ae21-21f7ff2af9c5' alt='' class='zemanta-pixie-img'/&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4412535206273251260-777887454296968076?l=st-intelligentdesign.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://st-intelligentdesign.blogspot.com/feeds/777887454296968076/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/introduction.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/777887454296968076'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4412535206273251260/posts/default/777887454296968076'/><link rel='alternate' type='text/html' href='http://st-intelligentdesign.blogspot.com/2010/04/introduction.html' title='01 - Introduction'/><author><name>Bernhard Schornak</name><uri>http://www.blogger.com/profile/07864510983569379361</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_Z2WbH3F-E_Q/S4KZO1ynfyI/AAAAAAAAABQ/jgXB2qAV2Xc/S220/BS.png'/></author><thr:total>0</thr:total></entry></feed>
