Intelligent Design: 07 - Considerations

As discussed in great detail, any conventional stack frame generally looks like this:

Programmers and compilers are free to write any kind of data to stack locations between the stack's bottom and the stack element below ESP (-0x04[ESP]). To reserve a part of the stack for the private use of the current function, we have to subtract the needed size from ESP. Moving ESP towards stack bottom, we reserve the corresponding area for our private use. No function (except it is compiled with GCC 3.3.5.) ever writes data to stack locations above -04[ESP]. It is very important to understand the connection between the subtraction of the required stack frame size from ESP and exclusive ownership of the reserved area. My alternative method introduced in this paper as well as conventional programming techniques are based on this principle. You just break an absolute taboo if you write to stack locations above -04[ESP]!

Because conventional programming methods push data onto the stack, respectively pop them from the stack, the content of ESP is changing continuously. Due to its randomly changing content, it is impossible to use ESP as base for adressing specific stack elements directly. An additional register, the base pointer EBP, is required for this task. It points to the current top of the stack all of the time, so we safely can adress all stack elements regardless of the current content of ESP.

If we analyse the described process and have a closer look at its details, one thing is quite obvious: We could save all time consuming contortions with additional registers if we replaced the usual push - call - add n,%esp sequenzes with something else. A short look into the data sheets of modern processors reveals us some additional caveats. Let us recapitulate what we know about push and pop instructions and compare them against a bunch of simple mov instructions:

PUSH

Depending on the standard data size, either two, four or eight is subtracted from the stack pointer rSP, setting it to the next lower stack element. After updating rSP, the content of a register, a memory location or an immediate value is copied to the current stack element (00[rSP]). A push reg or push imm instruction is executed within three clock cycles. ESP is blocked for the time it needs to update its content and deliver the stack location where the given data shall be stored. This requires two clock cycles, so consecutive push instructions have a latency of at least two clock cycles.

As an alternative, we might replace one push with three mov instructions. If none of them depends on the result of one of the other two movs, all of them are executed simultaneously in three clock cycles, as well. If we write to continuous memory locations, all writes are done in one gulp, because a mechanism called write combining is triggered.

POP

The content of the current stack element is copied to the register or memory location specified by the pop instruction, then two, four or eight is added to the stack pointer rSP. The current stack element was 'taken from the stack' and the stack pointer now points to one stack element above (our new stack bottom). Other than the direct path push, pop is a vector path instruction. These instructions always are fed to execution pipes 0 and 1, while execution pipe 2 is blocked while the instruction is executed. The latency of each pop instruction is 4 clock cycles.

As an alternative, we might replace one pop with four mov instructions. Three of them are executed simultanuously in three clock cycles, the fourth mov starts execution in the fourth clock cycle.

Conclusions

As the analysis of both instructions shows, the only rational consequence is to replace all pop and push with mov instructions. Doing so not only speeds up passing of parameters, it also turns the flying stack pointer into a static one. As a positive side-effect of the frozen stack pointer, we do not need EBP as base pointer any longer, freeing one valuable register for general purposes. Using no base pointer, we get rid of that mess with positive and negative offsets to EBP - a source of errors, making source code less readable.

Addendum

With Athlon64, family 10h (November 2008), the latency for pop instructions was reduced to three clock cycles (direct path single). Nonetheless, replacing push and pop with mov instructions is a much better improvement, because of the expected positive side-effects coming along with the replacement automatically. Even if Intelligent Design functions will win any race around clock cycles with ease, its concept offers much more improvements than just counting clocks.

Go to the next post (08 - Intelligent Design)

Tuesday, April 13, 2010

07 - Considerations

No comments:

Post a Comment