BASM for beginners, lesson 1
This series of lessons is dedicated to teaching
BASM to beginners.
Instead of writing a book with chapters on
instructions, calling conventions, code optimization and so on, we will
discuss the issues as examples bring them into focus.
The first little example gets us started. It is a
simple function in Pascal which multiplies an integer by the constant 2.
function MulInt2(I : Integer) : Integer;
begin
  Result := I * 2;
end;
Let us steal the BASM from the CPU view. I compiled
with optimizations turned on.
function MulInt2_BASM(I : Integer) : Integer;
begin
  Result := I * 2;
  {
  add eax,eax
  ret
  }
end;
From this we see that I is
passed to the function in eax and that
the result is passed back to the caller in eax
too. This follows the register calling convention, which is
the default in Delphi. The actual code is very simple. The multiplication
by 2 is obtained by adding I to itself, I+I = 2I. The ret
instruction returns execution to the instruction after the one which
called the function.
Let us code the function as a pure
asm function.
function MulInt2_BASM2(I : Integer) : Integer;
asm
  //Result := I * 2;
  add eax,eax
  //ret
end;
Observe that the ret instruction is supplied by the
inline assembler.
Let us take a look at the calling code.
This is the Pascal code
procedure TForm1.Button1Click(Sender: TObject);
var
  I, J : Integer;
begin
  I := StrToInt(IEdit.Text);
  J := MulInt2_BASM2(I);
  JEdit.Text := IntToStr(J);
end;
The important line is
J := MulInt2_BASM2(I);
In the CPU view it appears, together with the call from the line before it, as
call StrToInt
call MulInt2_BASM2
mov esi,eax
StrToInt is called on the line before the one which calls our
function, and after that call I is in eax (StrToInt
also follows the register calling convention). Because our function
expects I in eax, the two calls follow each other directly. MulInt2_BASM2
is called and returns the result in eax, which is
copied to esi in the next line.
Optimization issues: Multiplication by 2 can be done
in two more ways: using the mul instruction or
shifting left by one. mul is described in the IA32 Intel Architecture
Software Developer’s Manual Volume 2, Instruction Set Reference, page 536.
It multiplies the value in
eax by the value held in another register and the result is
returned in the register pair edx:eax, with the upper 32 bits in edx
and the lower 32 bits in eax.
A register pair is needed because a multiplication of two 32 bit numbers
can give a 64 bit result, just like 9*9=81, where two one digit numbers
give a two digit result.
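The split of a product over a register pair can be mimicked in plain Pascal. The following is only a sketch, not code from the lesson: the helper names HighHalf and LowHalf are hypothetical, and Int64 arithmetic stands in for the edx:eax pair.

```pascal
program WideProduct;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-}{$R-} // checks off, so very large products wrap instead of raising errors

// Hypothetical helper: upper 32 bits of A*B, the part mul leaves in edx.
function HighHalf(A, B: Cardinal): Cardinal;
begin
  Result := Cardinal(((Int64(A) * Int64(B)) shr 32) and $FFFFFFFF);
end;

// Hypothetical helper: lower 32 bits of A*B, the part mul leaves in eax.
function LowHalf(A, B: Cardinal): Cardinal;
begin
  Result := Cardinal((Int64(A) * Int64(B)) and $FFFFFFFF);
end;

begin
  // A small product fits in the lower half; the upper half stays zero.
  WriteLn('3 * 4: high = ', HighHalf(3, 4), ', low = ', LowHalf(3, 4));
  // Doubling the largest 32 bit value spills one bit into the upper half.
  WriteLn('$FFFFFFFF * 2: high = ', HighHalf($FFFFFFFF, 2),
          ', low = ', LowHalf($FFFFFFFF, 2));
end.
```

A function that returns a 32 bit Integer only hands back the lower half, so anything that spills into the upper half is lost to the caller.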
Because edx is now being used,
the question arises of which registers a function must preserve and which
it can use freely. This is explained in the Delphi help.
"An asm statement must
preserve the EDI, ESI, ESP, EBP, and EBX registers, but can freely modify
the EAX, ECX, and EDX registers."
We can conclude that it is no problem that
edx is modified by the
mul instruction, and our function can also be implemented like this.
function MulInt2_BASM3(I : Integer) : Integer;
asm
  //Result := I * 2;
  mov ecx, 2
  mul ecx
end;
ecx
is also used, but this is fine too. As long as the result fits within the
range of Integer, it is returned correctly in eax.
If I is larger than half of the maximum Integer value, the doubled value
no longer fits, overflow occurs and the result is incorrect.
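The overflow limit can be tried out in plain Pascal. This is a minimal sketch under stated assumptions: overflow and range checking are switched off so that the 32 bit result wraps silently, and the names DoubleWrapped, DoubleWide and Show are hypothetical helpers, not part of the lesson.

```pascal
program DoubleOverflow;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-}{$R-} // overflow and range checks off, so results wrap silently

// Hypothetical helper: doubles I and keeps only 32 bits of the result.
function DoubleWrapped(I: Integer): Integer;
begin
  Result := I + I;
end;

// Hypothetical helper: doubles I in 64 bits, where the result always fits.
function DoubleWide(I: Integer): Int64;
begin
  Result := Int64(I) * 2;
end;

// Prints the 32 bit result and flags it when it differs from the 64 bit one.
procedure Show(I: Integer);
begin
  Write(I, ' * 2 = ', DoubleWrapped(I));
  if DoubleWrapped(I) = DoubleWide(I) then
    WriteLn(' (correct)')
  else
    WriteLn(' (overflow, expected ', DoubleWide(I), ')');
end;

begin
  Show(1000000000); // below half of MaxInt, so the result fits
  Show(1500000000); // above half of MaxInt, so the result wraps
end.
```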
Implementation with shift:
function MulInt2_BASM4(I : Integer) : Integer;
asm
  //Result := I * 2;
  shl eax,1
end;
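That shifting left by one bit gives the same result as adding a value to itself can be checked from Pascal, which has its own shl operator. The sketch below is an illustration only: DoubleByShift and DoubleByAdd are hypothetical names, and checks are switched off so neither version raises an error.

```pascal
program ShiftDouble;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-}{$R-} // no overflow or range checks, matching the asm versions

// Hypothetical helper: doubles I by shifting all bits one place left.
function DoubleByShift(I: Integer): Integer;
begin
  Result := I shl 1;
end;

// Hypothetical helper: doubles I by adding it to itself.
function DoubleByAdd(I: Integer): Integer;
begin
  Result := I + I;
end;

var
  I: Integer;
begin
  // Print both results side by side for a few small values.
  for I := -3 to 3 do
    WriteLn(I, ': shl gives ', DoubleByShift(I),
            ', add gives ', DoubleByAdd(I));
end.
```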
Timing can reveal which implementation is fastest. We
can also consult the Intel and AMD documents with latency and throughput
tables. On the P4, add and mov have a latency and throughput of 0.5 clock
cycles, mul has a latency of 14-18 cycles and a throughput of 5 cycles,
and shl has a latency of 4 clock cycles and a throughput of 1 cycle.
The version chosen by Delphi
is the most efficient on the P4, and this is probably also the case on
the Athlon and P3.
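A very rough timing harness could look like the sketch below. It makes several assumptions that are not from the lesson: the names TMulFunc, TimeIt, DoubleByAdd and DoubleByShift are hypothetical, the loop count is arbitrary, and the plain Pascal functions are only stand-ins for the BASM versions, which can be passed to TimeIt in the same way. Now has a coarse resolution, so only large differences will show, and the loop and call overhead is included in every measurement.

```pascal
program TimeDoubling;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-}{$R-} // let the checksum wrap instead of raising an error

uses
  SysUtils, DateUtils;

type
  // Any function with the same signature as the MulInt2 variants.
  TMulFunc = function(I: Integer): Integer;

const
  Iterations = 100000000; // arbitrary loop count, adjust to the machine

// Stand-in for the add version.
function DoubleByAdd(I: Integer): Integer;
begin
  Result := I + I;
end;

// Stand-in for the shift version.
function DoubleByShift(I: Integer): Integer;
begin
  Result := I shl 1;
end;

// Calls F in a loop and prints the elapsed wall-clock time.
procedure TimeIt(const Title: string; F: TMulFunc);
var
  StartTime: TDateTime;
  N, Sum: Integer;
begin
  Sum := 0;
  StartTime := Now;
  for N := 1 to Iterations do
    Sum := Sum + F(N); // the checksum keeps the calls from being dropped
  WriteLn(Title, ': ', MilliSecondsBetween(Now, StartTime),
          ' ms, checksum ', Sum);
end;

begin
  TimeIt('add', DoubleByAdd);
  TimeIt('shl', DoubleByShift);
end.
```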
Issues not covered: mul
versus imul and range checking, other calling
conventions, benchmarking, clock count on other processors, clock count
for call + ret, location of return address for ret etc. Later lessons will
introduce these subjects.