BASM for beginners, lesson 5
Welcome to lesson number 5. Today’s topic is loops. We will take a look
on how the compiler implements for loops, and which optimizations it
applies on them. We will evaluate the efficiency of these optimizations.
function ForLoop(Start, Stop : Integer) : Integer;
var
I : Integer;
begin
Result := 0;
for I := Start to Stop do
begin
Result := Result + I;
end;
end;
This example function does nothing useful except giving us an example for
loop to examine. Let us see what the compiler translates the function
into. In this lesson we try something new and compile with optimizations
off.
function ForLoopNonOpt(Start, Stop : Integer) : Integer;
var
I : Integer;
begin
{
push ebp
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
}
Result := 0;
{
xor eax,eax
mov [ebp-$0c],eax
}
for I := Start to Stop do
{
mov eax,[ebp-$04]
mov edx,[ebp-$08]
sub edx,eax
jl +$15
inc edx
mov [ebp-$14],edx
mov [ebp-$10],eax
}
begin
Result := Result + I;
{
mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]
}
end;
{
dec dword ptr [ebp-$14]
jnz -$0e
mov eax,[ebp-$0c]
}
{
mov esp,ebp
pop ebp
ret
}
end;
It is seen that the compiler generates a lot of code when no or few
optimizations are applied. As usual the first 3 lines set up a stackframe.
This time it is 20 byte big (16 hex). The two next lines copy the Start
and Stop variables unto the stack. Start is transferred in eax and Stop is
transferred in edx to the function due to the register calling convention.
The next two lines creates a zero in eax and copy it to the stackframe at
[ebp-$0c], which is the place of the result variable. Then we are ready to
enter the for-loop. Before entering a for-loop it is necessary to test
whether the loop will iterate at all. If Stop is bigger than Start this is
the case. Start and Stop is fetched from their stack locations into eax &
edx. We compute Stop-Start and if this is negative the loop should not be
entered and execution is transferred past the end of the loop by the jl
(jump low) instruction. The next line increments Stop and then it is
copied to location [ebp-$14]. We have no name for the local variable at
this location. The purpose of it also needs some further explanations. It
is a variable introduced by the compiler as an optimization and this is
weird because we compiled with optimizations off. There is only one more
line accessing the NoName variable and this is the dec dword ptr [ebp-$14]
line. This line decrement NoName by one at the end of each loop-iteration
and the loop closing test is a test that it has not reached zero. The dec
instruction sets the flags and the jnz jumps back to the top of the loop
if NoName <> 0. We would expect that I was used as the loop counter and
that it was running from Start to Stop. It is in fact doing so, but it is
not used for control of the loop. This is the sole purpose of NoName. The
advantage of this is that it saves an instruction for comparing I to Stop.
There is also a cost and this is the dec NoName instruction. On P4 the
latency/throughput of cmp is 0.5/0.5 clock cycles and for dec they are
1/0.5. Therefore we would expect this optimization to be a
"deoptimization". The latency and throughput numbers for P4 is found in
the "Intel Pentium 4 and Xeon Processor Optimization" manual from Intel.
Back to the mov [ebp-$10],eax line. It copies I to the stack. The loop
body consist of only one line of Pascal Result := Result + I;. This is
translated into 3 lines of ASM. The first two lines are loading I into eax
and then adding it to Result at the stack at [ebp-$0c]. The third line
increment I. This ended the explanation of the loop code and only two
things are left. The Result must be copied to eax which is the register to
use for returning the result from a register calling convention function.
The last three lines remove the stackframe and return execution to the
line after the line that called the function.
As an exercise let us change the ASM code such that it follows the Pascal
code and our understanding of a for-loop. We start by turning the function
into a pure BASM one. This is done by “outcommenting” the Pascal code and
"incommenting" the ASM code. Defining the two labels LoopEnd and LoopStart
is also necessary. The two jumps are edited such that they jump to the
labels.
function ForLoopBASM1(Start, Stop : Integer) : Integer;
asm
push ebp
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
mov edx,[ebp-$08]
sub edx,eax
jl @LoopEnd
inc edx
mov [ebp-$14],edx
mov [ebp-$10],eax
//begin
@LoopStart :
//Result := Result + I;
mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]
//end;
dec dword ptr [ebp-$14]
jnz @LoopStart
@LoopEnd :
mov eax,[ebp-$0c]
mov esp,ebp
pop ebp
//ret
end;
The first thing we do is removing the NoName variable.
function ForLoopBASM2(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx //New
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
mov edx,[ebp-$08]
sub edx,eax
jl @LoopEnd
//inc edx //NoName intialize
//mov [ebp-$14],edx //NoName intialize
mov [ebp-$10],eax
//begin
@LoopStart :
//Result := Result + I;
mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]
//end;
//dec dword ptr [ebp-$14] //NoName decrement
mov ebx, [ebp-$10] //New
mov ecx, [ebp-$08] //New
cmp ebx, ecx //New
//jnz @LoopStart
jbe @LoopStart //New
@LoopEnd :
mov eax,[ebp-$0c]
mov esp,ebp
pop ebx //New
pop ebp
//ret
end;
The lines marked "New" are introduced to make I the loop control variable.
The mov ebx, [ebp-$10] line copies I into ebx The next line copies Stop
into ecx. Then the line cmp ebx, ecx compare them and jbe @LoopStart
transfer execution to the start of the loop if I is below Stop or equal to
it. Because we use ebx and it is not free for use we remember to push and
pop it.
We expect a for-loop to evaluate the loop condition at the top. This test
is split in two by the compiler implementation. Before entering the loop
it is tested that it will execute at least once and the actual loop
closing test is done at the bottom. This is an optimization technique
called loop inversion. Now we change the loop such that this optimization
is removed. Then we see what the benefit of the optimization was.
function ForLoopBASM4(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
mov [ebp-$08],edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
mov edx,[ebp-$08]
//sub edx,eax
//jl @LoopEnd
mov [ebp-$10],eax
//begin
@LoopStart :
mov ebx, [ebp-$10]
mov ecx, [ebp-$08]
cmp ebx, ecx
ja @LoopEnd
//Result := Result + I;
mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]
//end;
//mov ebx, [ebp-$10]
//mov ecx, [ebp-$08]
//cmp ebx, ecx
//jbe @LoopStart
jmp @LoopStart
@LoopEnd :
mov eax,[ebp-$0c]
mov esp,ebp
pop ebx
pop ebp
end;
The loop closing test has been moved to the top and the test has been
inverted. At the place of the test there now is an unconditional jump to
the top. This jump is what the loop inversion optimization is all about.
In the nonoptimized for loop there are two jumps and only one in the
optimized one. The test at the top that tested whether Start was above
Stop is now redundant and is removed. Before making any timing to evaluate
the effect of the two optimizations it is a good idea to optimize away
some or all if possible, of the stack to register and register to stack
moves. This process is called register allocation and is one of the most
important optimizations on all architectures, but it is even more
important on the Intel architecture because of the low number of available
registers. If there is not a register available to all variables it is
crucial which variables get a register. The mov instructions inside the
loop body are the most important ones to get rid of. They are executed as
many times as the number of loop iterations. The instructions outside of
the loop are only executed once. The variables used inside the loop should
be allocated to registers first. This is I, Stop and Result. At this point
we could be smart and take a look at the use of registers as temporaries.
If a variable is always copied into the same temp register it would be
smart to allocate this register for the variable. Stop is in the edx
register when we enter the function and this register is also used as
temporary register for it in all but two lines. This is the two lines of
the loop test that we added. Let us change
mov ecx, [ebp-$08]
cmp ebx, ecx
to
mov edx, [ebp-$08]
cmp ebx, edx
Eax is used by Start in the top of the function and by Result in the rest
of the function. If there is no overlap in usage we can allocate eax for
Result as soon as Start has finished using it. After Start is assigned to
I (mov [ebp-$10],eax) it is not used any more and eax is free to use by
Result, if it was not for those lines where eax is used as temp for I.
mov eax,[ebp-$10]
add [ebp-$0c],eax
inc dword ptr [ebp-$10]
After ecx got out of use by the last change we can use that as temp for I
instead of eax.
mov ecx,[ebp-$10]
add [ebp-$0c],ecx
inc dword ptr [ebp-$10]
The summary of the first part of the register allocation is: Result in
eax, I in ecx and Stop in edx.
Let us first change the lines with Stop. [ebp-$08] is replaced by edx
everywhere.
function ForLoopBASM6(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
//mov [ebp-$08],edx
mov edx,edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
//mov edx,[ebp-$08]
mov edx,edx
mov [ebp-$10],eax
//begin
@LoopStart :
mov ebx, [ebp-$10]
//mov edx, [ebp-$08]
mov edx, edx
cmp ebx, edx
ja @LoopEnd
//Result := Result + I;
mov ecx,[ebp-$10]
add [ebp-$0c],ecx
inc dword ptr [ebp-$10]
//end;
jmp @LoopStart
@LoopEnd :
mov eax,[ebp-$0c]
mov esp,ebp
pop ebx
pop ebp
end;
Then allocate ecx for I by replacing [ebp-$10] by ecx.
function ForLoopBASM7(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
mov edx,edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
mov edx,edx
//mov [ebp-$10],eax
mov ecx,eax
//begin
@LoopStart :
//mov ebx, [ebp-$10]
mov ebx, ecx
mov edx, edx
cmp ebx, edx
ja @LoopEnd
//Result := Result + I;
//mov ecx,[ebp-$10]
mov ecx,ecx
add [ebp-$0c],ecx
//inc dword ptr [ebp-$10]
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
mov eax,[ebp-$0c]
mov esp,ebp
pop ebx
pop ebp
end;
And last eax is allocated for use by Result. Because eax is used in the
top of the function by Start and as temp in the lines that initializes
Result to zero, we need to add an extra line to copy Result into eax after
eax is not in use anymore for these other purposes.
function ForLoopBASM8(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
mov edx,edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
mov edx,edx
mov ecx,eax
mov eax, [ebp-$0c] //New
//begin
@LoopStart :
mov ebx, ecx
mov edx, edx
cmp ebx, edx
ja @LoopEnd
//Result := Result + I;
mov ecx,ecx
//add [ebp-$0c],ecx
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
//mov eax,[ebp-$0c]
mov eax,eax
mov esp,ebp
pop ebx
pop ebp
end;
Because we were pretty smart when we chose the registers there is a lot of
lines like mov eax,eax. It is easy to see how redundant they are ;-). Let
us remove them.
function ForLoopBASM9(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
mov ebp,esp
add esp,-$14
//mov edx,edx
mov [ebp-$04],eax
//Result := 0;
xor eax,eax
mov [ebp-$0c],eax
//for I := Start to Stop do
mov eax,[ebp-$04]
//mov edx,edx
mov ecx,eax
mov eax, [ebp-$0c]
//begin
@LoopStart :
mov ebx, ecx
//mov edx, edx
cmp ebx, edx
ja @LoopEnd
//Result := Result + I;
//mov ecx,ecx
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
//mov eax,eax
mov esp,ebp
pop ebx
pop ebp
end;
When optimizing ASM code there is generally two lines of thinking we can
choose to follow. We can think as the human beings we are and try to be
smart, use whatever information we can get from the code. We did some of
this when we selected the registers here. Another line of thought is
trying to do it systematically as an optimizer/compiler has to. This way
we develop algorithms that can be coded in a tool. This tool can later
take over the most boring of optimizations, these we do over and over
again. Removal of the most obviously redundant line of code of all, the
mov eax,eax, was an example of dead code removal, which is basic to any
optimizer.
In the top of the function we still have some references to the stack. To
get rid of these we allocate registers for those variables too. This time
we just pick edi and esi, which are not used elsewhere. Allocate esi for
[ebp-$04] and edi for [ebp-$0c]. Because esi and edi must be preserved by
the function we must push and pop them.
function ForLoopBASM10(Start, Stop : Integer) : Integer;
asm
push ebp
push ebx
push esi
push edi
mov ebp,esp
add esp,-$14
//mov [ebp-$04],eax
mov esi,eax
//Result := 0;
xor eax,eax
//mov [ebp-$0c],eax
mov edi,eax
//for I := Start to Stop do
//mov eax,[ebp-$04]
mov eax,esi
mov ecx,eax
//mov eax, [ebp-$0c]
mov eax, edi
//begin
@LoopStart :
mov ebx, ecx
cmp ebx, edx
ja @LoopEnd
//Result := Result + I;
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
mov esp,ebp
pop edi
pop esi
pop ebx
pop ebp
end;
The stack frame is not used anymore and there is no need to set it up.
This removes 4 instructions. Then we observe that the two lines
mov eax,esi
mov ecx,eax
This can be replaced by one line.
mov ecx, esi
This is an example of a simple copy propagation followed by dead code
removal. The value in eax is not used by any other lines than the next
line that copies it back to ecx. It is in fact immediately overwritten by
the line mov eax, edi. Therefore we can replace the second line by mov
ecx, esi and remove the first one, which becomes dead.
function ForLoopBASM11(Start, Stop : Integer) : Integer;
asm
//push ebp
push ebx
push esi
push edi
//mov ebp,esp
//add esp,-$14
mov esi,eax
//Result := 0;
xor eax,eax
mov edi,eax
//for I := Start to Stop do
//mov eax,esi
//mov ecx,eax
mov ecx, esi
mov eax, edi
//begin
@LoopStart :
//mov ebx, ecx
//cmp ebx, edx
cmp ecx, edx
ja @LoopEnd
//Result := Result + I;
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
//mov esp,ebp
pop edi
pop esi
pop ebx
//pop ebp
end;
The line xor eax,eax that initializes Result to zero can together with the
line rigth after it be moved some lines down closer to the place where eax
is used for the first time. It should not be moved into the loop that
would change the logic of the function, but just before the line before
loopStart. This removes the need for copying eax into edi and back into
eax again in the line just before the comment line //for I := Start to
Stop do, and in the line before the outcommented begin.
function ForLoopBASM12(Start, Stop : Integer) : Integer;
asm
push ebx
push esi
push edi
mov esi,eax
//for I := Start to Stop do
mov ecx, esi
//Result := 0;
xor eax,eax
//mov edi,eax
//mov eax, edi
//begin
@LoopStart :
cmp ecx, edx
ja @LoopEnd
//Result := Result + I;
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
pop edi
pop esi
pop ebx
end;
After having cleaned up we spot two lines of mov which together copy eax
to ecx via esi. This leaves a copy of eax in esi which is not used.
Therefore these two lines can be replaced by one that moves eax directly
into ecx. This is also copy propagation + dead code removal.
function ForLoopBASM13(Start, Stop : Integer) : Integer;
asm
push ebx
//push esi
push edi
//mov esi,eax
//for I := Start to Stop do
//mov ecx, esi
mov ecx, eax
//Result := 0;
xor eax,eax
//begin
@LoopStart :
cmp ecx, edx
ja @LoopEnd
//Result := Result + I;
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
pop edi
//pop esi
pop ebx
end;
After having removed the only use of esi there is no need to push and pop
it.
function ForLoopBASM14(Start, Stop : Integer) : Integer;
asm
//push ebx
//push edi
//for I := Start to Stop do
mov ecx, eax
//Result := 0;
xor eax,eax
//begin
@LoopStart :
cmp ecx, edx
ja @LoopEnd
//Result := Result + I;
add eax,ecx
inc ecx
//end;
jmp @LoopStart
@LoopEnd :
//pop edi
//pop ebx
end;
We also, a little late perhaps, observe that ebx and edi is not used
either. After having cleaned up and relocated the comments a little, there
is a nice clean function as a result.
function ForLoopBASM15(Start, Stop : Integer) : Integer;
asm
mov ecx, eax
//Result := 0;
xor eax,eax
//for I := Start to Stop do
@LoopStart :
cmp ecx, edx
ja @LoopEnd
//Result := Result + I;
add eax,ecx
inc ecx
jmp @LoopStart
@LoopEnd :
end;
It took a long time and a lot of optimizations to get here because we
started with the non optimized output from the compiler. This long process
illustrated the amount of work the compiler leaves to the optimizer.
Sometimes we did not use the algorithmic approach to optimization, but we
could have achieved the same result by doing it.
Instead of taking the same long road again with the function with loop
inversion present, we can cheat and compile the Pascal function with
optimizations on. The compiler might do all the optimization we did.
function ForLoopOpt(Start, Stop : Integer) : Integer;
var
I : Integer;
begin
{
}
Result := 0;
{
xor ecx,ecx
}
for I := Start to Stop do
{
sub edx,eax
jl +$08
inc edx
xchg eax,edx
}
begin
Result := Result + I;
{
add ecx,edx
inc edx
}
end;
{
dec eax
jnz -$06
}
{
mov eax,ecx
}
end;
This time Delphi did a really nice job. Only two lines jump into our eyes
as possibly redundant. xchg eax,edx simply exchange the values in eax and
edx and mov eax,ecx copy the Result into eax. Both lines are outside the
loop and make little harm. We now have two functions - one with no loop
optimizations and one with two. To make things complete we need two more
functions, one with loop inversion only and one with the NoName variable
optimization only. In the begining of the lesson we saw how to remove the
two optimizations and this is what I have done to get to the last 2
functions. In the Delphi optimized function above, I optimized away the
xchg instruction by swapping the use of the two registers it exchanged.
Because we want to se the maximum effect of the loop optimizations I have
removed the loop body code doing the operation Result := Result + I;
Here are the four final functions
function ForLoopNoLoopInverNoNoName(Start, Stop : Integer) : Integer;
asm
mov ecx, eax
//Result := 0;
xor eax,eax
//for I := Start to Stop do
@LoopStart :
cmp ecx, edx
ja @LoopEnd
inc ecx
jmp @LoopStart
@LoopEnd :
end;
The loop consists of 4 instructions. 1 cmp, 1 ja, 1 inc and 1 jmp. Latency
and throughput for these instructions on P4 are: cmp 0.5/0.5, ja X/0.5,
inc 0.5/1 and jmp X/0.5 The X means "latency is not applicable to this
instruction". Adding latencies we get: 0.5 + X + 0.5 + X = ? clock cycles
function ForLoopLoopInverNoNoName(Start, Stop : Integer) : Integer;
asm
mov ecx, eax
//Result := 0;
xor eax,eax
//for I := Start to Stop do
cmp ecx, edx
ja @LoopEnd
@LoopStart :
inc ecx
cmp ecx, edx
jbe @LoopStart
@LoopEnd :
end;
This loop consists of 3 instructions also with unknown sum of latency.
function ForLoopNoLoopInverNoName(Start, Stop : Integer) : Integer;
asm
//Result := 0;
xor ecx,ecx
//for I := Start to Stop do
sub edx,eax
cmp edx, 0
@LoopStart :
jz @LoopEnd
inc eax
dec edx
jmp @LoopStart
@LoopEnd :
mov eax,ecx
end;
This loop consists of 4 instructions also with unknown sum of latency. We
observe that the two inc/dec instructions are able to execute in parallel.
Because the dec NoName instruction is not followed by the conditional jmp
it would look like we throw away the benefit of not needing a cmp or test
instruction to set the flags, but the jmp instruction does not change the
flags and they are valid when we reach the jz instruction at the top of
the loop. Only at the first iteration a cmp edx,0 instruction is needed.
function ForLoopLoopInverNoName(Start, Stop : Integer) : Integer;
asm
//Result := 0;
xor ecx,ecx
//for I := Start to Stop do
sub edx,eax
jl @LoopEnd
inc edx
@LoopStart :
inc eax
dec edx
jnz @LoopStart
@LoopEnd :
mov eax,ecx
end;
This loop consists of 3 instructions also with unknown sum of latency.
Here there also is an independent inc/dec pair.
This is the simple benchmark I have used to find the performance of the 4
functions
var
Starttime, Endtime, Runtime : TDateTime;
I, LoopResult : Integer;
RunTimeSec, NoOfLoopsPerSec, NoOfLoops, ClockCount, LoopEnd2Float,
LoopEndFloat, LoopStartFloat : Double;
begin
Starttime := Time;
for I := 1 to LOOPEND2 do
begin
LoopResult := ForLoopNoLoopInverNoName(LOOPSTART, LOOPEND);
end;
Endtime := Time;
Runtime := Endtime - Starttime;
CEdit.Text := IntToStr(LoopResult);
RuntimeEdit4.Text := TimeToStr(Runtime);
RunTimeSec := RunTime*24*60*60;
LoopEnd2Float := LOOPEND2;
LoopEndFloat := LOOPEND;
LoopStartFloat := LOOPSTART;
NoOfLoops := LoopEnd2Float * (LoopEndFloat - LoopStartFloat);
NoOfLoopsPerSec := NoOfLoops / RunTimeSec;
ClockCount := CLOCK / NoOfLoopsPerSec;
ClockCountEdit4.Text := FloatToStrf(ClockCount, ffFixed, 9, 1);
end;
Results on P4 1920 are
No Loop Inversion and No NoName variable 00:00:55 (2.7 Clock cycles)
Loop Inversion but No NoName variable 00:00:39 (1.9 Clock cycles)
No Loop Inversion but NoName variable 00:01:02 (3.0 Clock cycles)
Loop Inversion + NoName 00:00:41 (2.0 Clock cycles)
Results on P3 1400 are
No Loop Inversion and No NoName variable 00:01:26 (3.0 Clock cycles)
Loop Inversion but No NoName variable 00:01:26 (3.0 Clock cycles)
No Loop Inversion but NoName variable 00:01:55 (4.0 Clock cycles)
Loop Inversion + NoName 00:01:26 (3.0 Clock cycles)
Of course the clock count numbers should be integers. On P4 half cycles
are possible due to the double clocked alu's. Our timings are not as good
as we could wish, but for comparing the performance of the loops they are
OK.
The conclusion on P4 is; apply loop inversion only or loop inversion with
NoName variable optimization.
The conclusion on P3 is; Do not apply the NoName variable optimization
alone.
The conclusion on both targets is; apply both optimizations as Delphi
does.
Also observe how efficient P4 is on this code.
Regards
Dennis