BASM for beginners, lesson 4
In this
lesson we will learn about branching, illustrated by an if-else construct.
Conditional floating point move will also be introduced.
The
example function of this lesson is the Min function from the Delphi Math
unit.
function
Min1(const A, B: Single) : Single;
begin
if A < B
then
Result
:= A
else
Result
:= B;
end;
The
compiler generated asm for this function looks like this
function
Min2(const A, B: Single) : Single;
begin
{
00452458
55 push ebp
00452459
8BEC mov ebp,esp
0045245B
51 push ecx
}
if A < B
then
{
0045245C
D9450C fld dword ptr [ebp+$0c]
0045245F
D85D08 fcomp dword ptr [ebp+$08]
00452462
DFE0 fstsw ax
00452464 9E sahf
00452465 7308 jnb +$08
}
Result
:= A
{
00452467
8B450C mov eax,[ebp+$0c]
0045246A
8945FC mov [ebp-$04],eax
0045246D EB06 jmp +$06
}
else
Result := B;
{
0045246F
8B4508 mov eax,[ebp+$08]
00452472
8945FC mov [ebp-$04],eax
}
{
00452475
D945FC fld dword ptr [ebp-$04]
00452478 59 pop ecx
00452479 5D pop ebp
}
end;
This time
I included the address and opcode columns, because we need them later.
Let’s analyze it line by line, like we always do.
function
Min3(const A, B: Single) : Single;
begin
{
push
ebp // Save ebp on stack
mov
ebp,esp // New basepointer is the old stackpointer
push
ecx // subtract 4 from esp
}
if A < B
then
{
fld dword
ptr [ebp+$0c] // Load A on FP stack
fcomp
dword ptr [ebp+$08] // FP compare A to B and pop A from stack
fstsw
ax // Store FP statusword in ax
sahf // Store ah into EFlags register
jnb
+$08 // If not below jump 8 bytes forward
}
Result
:= A
{
mov
eax,[ebp+$0c] // Copy A into eax
mov
[ebp-$04],eax // Copy A into stackframe
jmp
+$06 // Jmp 6 bytes forward
}
else
Result
:= B;
{
mov
eax,[ebp+$08] // Copy B into eax
mov
[ebp-$04],eax // Copy B into stackframe
}
{
fld
dword ptr [ebp-$04] // Load A or B from stackframe onto FP stack
pop
ecx // Add 4 to esp
pop ebp // Restore ebp
}
end;
This time
I commented every line of BASM. Read the details there. The first new
instruction introduced by this example is fcomp. F says floating point
instruction as always. Com says compare and p says pop FP stack. Fcom
compares two floating point values and sets the condition code flags of
the floating point unit, named C0, C1, C2 and C3. These flags are the
equivalents of the EFlags register of the CPU. The flags are checked by
conditional jump instructions, which jump or not depending on the type of
jump and the status of the flags. Conditional jump instructions check the
CPU flags and not the FPU flags and therefore it is necessary to copy the
flags from the FPU to the CPU. This is done by the next two instructions.
fstsw stores the FP flags into the ax register and sahf copy the 8 bits
from ah to the EFlags register. This is a long way for the flags to travel
before they can be used by the jnb instruction. Jnb is short for jump not
below. In the Intel SW Developers Manual Vol. 2 at page 394 there is a
table with all the different jump instructions and a description of the
flags they test. Here it is seen that jnb jumps if CF=1 and ZF=1. Try
tracing through the code with the FPU view and the CPU view open. Watch
how the FPU flags are set, their values are copied into ah and then copied
to the CPU EFlags register.
If the jnb
jump is not taken execution continues in the line after it. This is the
if-part of the if-else construct. If the jump is taken execution is
continued at the line 8 bytes ahead. This is at the start of the else
part. The if- and the else-part is very much the same. As seen in the
Pascal code A or B is copied to the Result variable depending on the
if-condition. Instead of copying A or B directly on the top of the FP
stack, which is the place of a FP Result from a function due to the
register calling convention, the Delphi compiler chose to use the stack as
a temporary location. The fld dword ptr [ebp-$04] instruction copies the
result in the right place.
Observe
that an unconditional jump is needed at the end of the if-block such that
execution does not continue in the else block. This way there is a jump no
matter with branch is taken.
I will
write short word on branch prediction. Branch prediction comes in two
flavors, static and dynamic. The first time a branch is executed the CPU
can have no knowledge about the probability it will be taken or not. In
this situation it uses the static predictor, which says that forward jumps
are not taken and backward branches are taken. Our example jump no 1 is a
forward branch and it will be predicted not taken the first time through.
If we have knowledge about the values A and B, we can use it to code the
if-else such that the if-part is the most often executed and static
prediction is optimized. An unconditional jump needs no prediction - it is
always taken ;-) A backward jump is often part of a loop and most loops
will run more than once (why else have a loop?). This way it makes sense
to statically predict backward jumps taken. Dynamic branch prediction
tries to accumulate knowledge about the probability that a certain jump is
taken or not and then predict it correctly as often as possible.
Now it is
time to transform the function into a pure asm one.
function
Min4(const A, B: Single) : Single;
asm
//push
ebp
//mov
ebp,esp
push ecx
//if A <
B then
fld
dword ptr [ebp+$0c]
fcomp dword ptr [ebp+$08]
fstsw ax
sahf
jnb
@ElseBegin
//Result
:= A
mov
eax,[ebp+$0c]
mov
[ebp-$04],eax
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
pop ecx
//pop ebp
end;
The new
thing this time is that we need some labels. Our analysis of the function
made it clear where we jumped to when, I hope. In general when placing
labels it is a good thing to use the knowledge we have about the structure
of the code. You can open the FPU view and just trace through the code and
see where you go when jumps are taken. If you want to place labels without
having to think, then use some math. Example follows
We have
this jump
00452465
7308 jnb +$08
This is
the line after it
00452467
8B450C mov eax,[ebp+$0c]
This is
the line 8 byte after that line
0045246F
8B4508 mov eax,[ebp+$08]
Take the
address of the line after the jump and add the jump offset to it and you
have the jump target address. This is the math: 00452467 + 8 = 0045246F.
Why do we
add the jump offset to the address of the line after the jump and not to
the address of the line with the jump?
It is time
to optimize.
function
Min5(const A, B: Single) : Single;
asm
push ecx
//if A <
B then
fld
dword ptr [ebp+$0c]
fcomp dword ptr [ebp+$08]
fstsw ax
sahf
jnb
@ElseBegin
//Result
:= A
mov
eax,[ebp+$0c]
mov
[ebp-$04],eax
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
pop ecx
end;
This is
just a cleaned up version of the function. Let’s change the push ecx, pop
ecx instructions with instructions that manipulate the esp register
directly and do not move data between ecx and the stack.
function
Min6(const A, B: Single) : Single;
asm
//push
ecx
sub
esp, 4
//if A <
B then
fld
dword ptr [ebp+$0c]
fcomp
dword ptr [ebp+$08]
fstsw ax
sahf
jnb
@ElseBegin
//Result
:= A
mov
eax,[ebp+$0c]
mov
[ebp-$04],eax
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
mov eax,[ebp+$08]
mov [ebp-$04],eax
@ElseEnd :
fld dword ptr [ebp-$04]
//pop ecx
add esp,
4
end;
When the
code was analyzed we saw that the flags were traveling a long way and 2
instructions were needed for this. What about a floating point compare
instruction that sets the EFlags directly? This is the fcomi instruction
which was introduced in the P6 architecture. Lets use it and abandon those
CPU's older than the Pro. It would be fair to believe that these lines
fcomp
dword ptr [ebp+$08]
fstsw ax
sahf
could be
substituted by this
fcomip
dword ptr [ebp+$08]
The fcomi
instruction does however not accept a memory reference as operand.
Therefore it is necessary to load data before issuing it.
fld dword ptr [ebp+$0c]
fcomip
st(0), st(1)
Then
because we loaded data we have to remove them again with the ffree
instruction. A fcomipp instruction would have been nice to have.
fld dword ptr [ebp+$0c]
fcomip
st(0), st(1)
ffree
st(0)
This is a
hell of an optimization, substituting three lines with 3 other. It is
necessary to time the two versions to know whether it was an optimization.
Now the function looks like this.
function
Min7(const A, B: Single) : Single;
asm
sub
esp, 4
//if A
< B then
fld
dword ptr [ebp+$08]
fld
dword ptr [ebp+$0c]
fcomip
st(0), st(1)
ffree
st(0)
//fstsw
ax
//sahf
jnb
@ElseBegin
//Result := A
mov
eax,[ebp+$0c]
mov
[ebp-$04],eax
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
mov
eax,[ebp+$08]
mov
[ebp-$04],eax
@ElseEnd
:
fld
dword ptr [ebp-$04]
add
esp, 4
end;
Now we
apply some logic thinking. Why copy the result around? Both A & B is on
the stack for use by the comparison by fcom and the result should be
delivered on the stack. The only thing needed is to delete either A or B
and leave the smallest one of them on the stack
function
Min8(const A, B: Single) : Single;
asm
sub
esp, 4
//if A
< B then
fld
dword ptr [ebp+$08]
fld
dword ptr [ebp+$0c]
//fcomip st(0), st(1)
fcomi
st(0), st(1)
//ffree st(0)
jnb
@ElseBegin
//Result := A
//mov eax,[ebp+$0c]
//mov [ebp-$04],eax
ffree
st(1)
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
//mov eax,[ebp+$08]
//mov [ebp-$04],eax
fxch
ffree
st(1)
@ElseEnd
:
//fld dword ptr [ebp-$04]
add esp, 4
end;
Fcomip is
replaced with fcomi because we do not want to remove B from the stack at
this point. Ffree is deleted because it deleted A. Then all the lines that
copied the result around are removed. In the if-block A is the result and
B should be deleted. B is in st(1) and ffree st(1) does the job. In the
else block we want to delete A and leave B in st(0). This is done by
swapping A and B with the fxch instruction and then delete A in st(1) with
ffree. Fxch is nearly free (takes 0 cycles), because if works by register
renaming instead of actually copying data.
function
Min9(const A, B: Single) : Single;
asm
//sub esp, 4
//if A
< B then
fld
dword ptr [ebp+$08]
fld
dword ptr [ebp+$0c]
fcomi
st(0), st(1)
jnb
@ElseBegin
//Result := A
ffree
st(1)
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
fxch
ffree
st(1)
@ElseEnd
:
//add esp, 4
end;
Now the
stackframe is not needed and we delete the code that set it up.
function
Min10(const A, B: Single) : Single;
asm
//if A
< B then
fld
dword ptr [ebp+$08]
fld
dword ptr [ebp+$0c]
fcomi
st(0), st(1)
jnb
@ElseBegin
//Result := A
ffree
st(1)
jmp
@ElseEnd
//else
//Result := B;
@ElseBegin :
fxch
ffree
st(1)
@ElseEnd
:
end;
This is a
pretty nice function, but somebody in the newsgroup has told us about
conditional moves. fcmovnb is such a thing - floating point conditional
move not below. It copies data from st(1)-st(7) to st(0) if the condition
is fulfilled. The Eflags are tested for the condition. Fcmov was
introduced in the P6 architecture as well fcomi was.
function
Min11(const A, B: Single) : Single;
asm
fld
dword ptr [ebp+$08]
fld dword ptr [ebp+$0c]
fcomi st(0), st(1)
fcmovnb
st(0), st(1)
ffree
st(1)
end;
Instead of
all the jumping around we copy A to the top of the stack where B is, but
only if A is the smaller of the two. Delete B and done.
This is a
nice clean function with only one piece of redundancy left. The compiler
still pushes and pops ebp even though it is not modified in the function.