BASM for beginners, lesson 2
This is the second chapter of the introduction to
BASM programming with Delphi. The first chapter was a short introduction
to integer code and this second one is about floating point code. Our
example functions can evaluate a second order polynomial. The
parameters A, B and C that defines the
polynomial is coded as local constants. Input to the function is the
variable X of type double and the result is also of type double. The
function looks like this.
function
SecondOrderPolynomial1(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
begin
Result := A*X*X + B*X + C;
end;
Copying the assembler code from the
cpu view gives us
this.
function
SecondOrderPolynomial2(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
begin
{
push ebp
mov
ebp,esp
add esp,-$08
}
Result := A*X*X + B*X + C;
{
fld
qword ptr [A]
fmul
qword ptr
[ebp+$08]
fmul
qword ptr
[ebp+$08]
fld
qword ptr [B]
fmul
qword ptr
[ebp+$08]
faddp
st(1)
fadd
qword ptr
[C]
fstp
qword ptr
[ebp-$08]
wait
fld
qword ptr
[ebp-$08]
}
{
pop ecx
pop
ecx
pop
ebp
}
end;
Let us explain the asm
code line by line. The begin results in this code,
begin
{
push ebp
mov
ebp,esp
add esp,-$08
}
which sets up a
stackframe for the function. A stack frame is
just a piece of memory that is reserved for the stack. A stack frame is
accessed through two pointers, the basepointer
and the stack pointer. The basepointer is in
ebp and the stack pointer is in
esp. These two registers are reserved for use
by these pointers only. The line push ebp
backs up the basepointer,
The line mov ebp,
esp sets up a new
basepointer which is pointing to the top of the stack. The
line add esp, -$08
moves the stack pointer 8 bytes down. As a curiosity the stacks grows
downward and the last line could more intuitively have been sub
esp, 8. The new
stack frame that was created by these three lines is standing on top of,
or actually hanging under, the last stackframe,
which was probably allocated by the function that called our
SecondOrderPolynomial function.
The next line of Pascal was compiled into no less
than 9 lines of ASM.
Result := A*X*X + B*X
+ C;
{
fld
qword ptr [A]
fmul
qword ptr
[ebp+$08]
fmul
qword ptr
[ebp+$08]
fld
qword ptr [B]
fmul
qword ptr
[ebp+$08]
faddp
st(1)
fadd
qword ptr
[C]
fstp
qword ptr
[ebp-$08]
wait
fld
qword ptr
[ebp-$08]
For those that is used to HP calculators floating
point code is very easy to understand. The first line,
fld qword
ptr [A], loads the constant A onto the
floating point register stack. The line,
fmul
qword ptr
[ebp+$08], multiplies A with X. This makes sense by watching the Pascal
code, but what means "qword
ptr [ebp+$08]".
qword ptr
says "pointer to a quadword, which is the size
of a double. (64 bit). The valueof the pointer
is between the brackets in [ebp+$08].
ebp is the
basepointer and $08 is - well just 8. Because the stack grows down
the memory location 8 bytes above the basepointer
is in the previous stack frame. Here X was placed by the function which
called our function. This placement of a double variable is decided by the
register calling convention. A double variable does not fit into the 32
bit integer registers, but it fits perfectly onto the floating point
registers. Borland decided to pass double variables via the stack, but
passing them in floating point registers would have been more efficient.
The next 3 lines need no further explanation, but the line,
faddp
st(1), needs
some. All floating point instructions start with an f. add is addition.
st(1)
is floating point register 1, which is the second because
st(0) is the first! The floating point
registers are combined into a stack and instructions implicitly works on
the top of the stack which is st(0).
faddp
st(1) is the same as
faddp st(0), st(1)
and it adds register st(0) to register
st(1) and place the result in
st(1). The p in faddp
means pop st(0)
of the stack. This way the result ends up in
st(0). The
line fadd
qword ptr [C]
completes the calculations and the only thing left is to place the result
in st(0). It is actually already there and the
two lines
fstp
qword ptr
[ebp-$08]
fld
qword ptr
[ebp-$08]
are redundant. They
just copy the result into the stack frame and loads it back in again.
Such a waste of precious time and energy :-).
The line wait makes sure that any exceptions that
migth have been raised by one of the floating point instructions is
checked. See Intel SW Developers Manual Volume 2 page 822 for the full
explanation.
Then there is only
three lines of asm back to explain.
{
pop ecx
pop
ecx
pop
ebp
}
end;
These are removing the stack frame, by restoring
the values of esp
and ebp back to the values they had when the
function was entered. This code is much more intuitive and does the same
thing
add
esp, 4
pop ebp
it is also more
effective and I do not know why the compiler is incrementing the stack
pointer in this cumbersome way. Remember that ecx
can be used for free and assigning values to it is just like pouring them
into a waste bucket.
Now we only need to investigate what is hiding
behind the [A] in the line fld
qword ptr [A]. We
know that A must be a pointer to the place where A is placed in memory.
The address of A is coded in the instruction. This is the full line from
the cpu view.
00451E40 DD05803C4500
fld qword ptr [B]
00451E40 is the address of this instruction in the
exe file. DD05803C4500 is the machine code for the line and
fld qword
ptr [B] is the more human readable format of
it. By consulting the Intel SW Developers Manual Volume 2 on page 280 we
see that the opcode for
fld is D9, DD, DB or D9C0 depending on the type of data it should
load. We recognize DD which is the opcode for
fld double. What is left is 05803C4500. 05 is
(Somebody help me ! ). 803C4500 is the 32 bit
address of A.
Let us convert the function into a pure BASM
function now that we have finished analyzing it.
function
SecondOrderPolynomial3(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
push ebp
mov
ebp,esp
add esp,-$08
//Result := A*X*X + B*X + C;
fld
qword ptr [A]
fmul
qword ptr
[ebp+$08]
fmul
qword ptr
[ebp+$08]
fld
qword ptr [B]
fmul
qword ptr
[ebp+$08]
faddp //st(1)
fadd
qword ptr
[C]
fstp
qword ptr
[ebp-$08]
wait
fld
qword ptr
[ebp-$08]
pop ecx
pop
ecx
pop
ebp
end;
Now comes a few surprises. First the function will
not compile. faddp
st(1) is not recognized as a valid combination
of opcode and operands. By again consulting
the Intel manual we learn that faddp comes in
one version only. It operates on
st(0), st(1)
and it is not necessary to write faddp
st(0), st(1) and
the short form faddp is the only valid one. We
comment out st(1)
and it compiles now.
Second surprise.
Calling the function with X = 2 yields the calculation Y = 2^2+2*2+3 = 11.
SecondOrderPolynomial3 returns 3! We must open the FPU view as well as the
CPU view and trace through the code and watch what is happening. It is
seen that A=1 is correctly loaded into
st(0) by the
fourth line, but the 5. line that should
multiply A by X, 1 by 2, is resulting in st(0)
being a very small number, in effect 0. This tells us that X is near zero
instead of 2. Two things can be wrong. The calling code is transferring a
wrong value of X or we are addressing X incorrectly. By comparing the
calling code when calling function SecondOrderPolynomial3 and
SecondOrderPolynomial1 we see that it is the same and this is not the bug.
It would also be quite surprising if Delphi was suddenly getting this
wrong! Try to step through the calling code while watching the memory pane
in the cpu view.
The little green arrow is the position of the
stackpointer. The calling code looks like this.
push
dword ptr
[ebp-$0c]
push dword
ptr [ebp-$10]
call SecondOrderPolynomial1
Two pointers are pushed onto the stack. One of
them is a pointer to X. I do not what the other one is. By watching the
memory pane we see that the first one is the pointer to X and the second
one is a nil pointer. When we trace into the function we see that the
first two lines are repeated. The compiler automatically inserted the push
ebp and mov
ebp,
esp lines. Because push decrements the stack pointer by 4,
the reference to X went wrong. These two first lines are removed and
everything is ok again.
Now we have finished analyzing the code and know
what it does, we can begin optimizing it.
Let us first change the two
fstp/fld lines that we already have seen is redundant.
function
SecondOrderPolynomial4(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//push ebp
//mov
ebp,esp
add esp,-$08
//Result := A*X*X + B*X + C;
fld
qword ptr [A]
fmul
qword ptr
[ebp+$08]
fmul
qword ptr
[ebp+$08]
fld
qword ptr [B]
fmul
qword ptr
[ebp+$08]
faddp //st(1)
fadd
qword ptr
[C]
//fstp
qword ptr
[ebp-$08]
wait
//fld
qword ptr [ebp-$08]
pop ecx
pop
ecx
pop
ebp
end;
This was the only references to the stack frame
which is not needed now.
function
SecondOrderPolynomial5(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//push ebp
//mov
ebp,esp
//add esp,-$08
//Result := A*X*X + B*X + C;
fld
qword ptr [A]
fmul
qword ptr
[ebp+$08]
fmul
qword ptr
[ebp+$08]
fld
qword ptr [B]
fmul
qword ptr
[ebp+$08]
faddp //st(1)
fadd
qword ptr
[C]
wait
//pop ecx
//pop ecx
//pop ebp
end;
That removed another 6 lines and reduces the
function to this.
function
SecondOrderPolynomial6(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld
qword ptr [A]
fmul
qword ptr
[ebp+$08]
fmul
qword ptr
[ebp+$08]
fld
qword ptr [B]
fmul
qword ptr
[ebp+$08]
faddp
fadd
qword ptr
[C]
wait
end;
X is loaded from memory into the FPU 3 times. It
would be more effective to load it once and then reuse it.
function
SecondOrderPolynomial7(X : Double) : Double;
const
A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld
qword ptr
[ebp+$08]
fld
qword ptr [A]
fmul
st(0), st(1)
fmul
st(0), st(1)
fld
qword ptr [B]
fmul
st(0), st(2)
ffree
st(2)
faddp
fadd
qword ptr
[C]
wait
end;
I magically came up with this code. The first line
loads X. The second line loads A. The third line multiplies A with X.
The 4. line
multiplies a*X know in st(0) with X. Then we
have calculated the first term. Calculating the second term is done by
loading B and multiply it with X. This was the
last time we needed X in we free's the
register, st(2),
holding it. Now we add term 1 and 2 and popping term 2 of the stack. The
only thing left to do is adding C. The result is now in
st(0)
and the other registers are empty. Then we check for exceptions with wait
and is done. It is seen that no redundant work is done and this
implementation is near optimal.
There exits seven instructions for loading often
used constants into the FPU. One of these constants is 1, which can be
loaded with the instruction fld1. Using it saves one load from memory,
which can be costly in terms of clock cycles if data are not properly
aligned.
function
SecondOrderPolynomial8(X : Double) : Double;
const
//A : Double = 1;
B : Double = 2;
C : Double = 3;
asm
//Result := A*X*X + B*X + C;
fld
qword ptr
[ebp+$08]
//fld qword
ptr [A]
fld1
fmul
st(0), st(1)
fmul
st(0), st(1)
fld
qword ptr [B]
fmul
st(0), st(2)
ffree
st(2)
faddp
fadd
qword ptr
[C]
wait
end;
This ended the second lesson. Stay tuned for more.
Regards
Dennis