BASM for beginners, lesson 7
Welcome to lesson number 7. Today’s subject is floating point BASM.
This was also the subject of an earlier lesson, but this lesson will add
new information. We will look at how to code scalar SSE2 code and how
instructions are scheduled in the FP pipelines. Today's example function
evaluates a third order polynomial.
function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
Result := A*X*X*X + B*X*X + C*X + D;
end;
Instead of just analyzing and optimizing this function we will see an
actual use of it. A third order polynomial can approximate the ArcSin
function on its entire interval, [-1;1], with a maximal absolute error of
0.086. This is not impressive but what we develop in this lesson will
extend to higher order polynomials in a straightforward manner, yielding
higher precision.
The parameters A, B, C and D define the shape of the curve for the
function, and we must find the values that fit the ArcSin approximation
with minimal error. For this purpose we develop an optimizer, which is
also used as the benchmark. Because ArcSin(0) = 0 we immediately see that
D = 0, so D can be left out of the optimization. We also know that ArcSin
is an odd function, and therefore the second order term B*X*X is of no use
in the approximation. This is because a second order term is even and has
symmetry around the Y-axis. Odd functions have anti-symmetry around the
Y-axis, with f(X) = -f(-X). All this means that our function could be
reduced to
Result := A*X*X*X + C*X;
We do not do that, however, because it is more illustrative to use the
full function. ArcSin is a special case and we want to be as general as
possible.
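For completeness, here is how the reduced version would look written out
as a Delphi function; it is shown for reference only and is not used
further in this lesson.
function ArcSinApproxOdd(X, A, C : Double) : Double;
begin
  //Only the odd order terms survive for an odd function like ArcSin
  Result := A*X*X*X + C*X;
end;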
Function number 1a has 6 multiplications and 3 additions. Writing it in
Horner form
Result := ((A*X + B)*X + C)*X + D;
reduces this to 3 multiplications and 3 additions.
Another form is
Result := (A*X + B)*(X*X)+(C*X + D);
which has 4 multiplications and 3 additions.
On modern processors it is as important how much parallelism can be
extracted from a formula as it is how many multiplications and additions
it has. Modern processors such as the AMD Athlon and the Intel P4 and P3
have pipelines. Pipelines are a necessity on processors running at high
frequencies, because the amount of work for an addition, subtraction,
multiplication or division cannot be done in a single clock cycle. On P4
there is a pipeline called FP_ADD that handles addition and subtraction.
This pipeline has 5 stages, which means that the process of performing an
addition or subtraction is broken down into 5 subtasks; therefore an
addition or subtraction takes 5 clock cycles. The advantage of having a
pipeline is that even though it takes 5 cycles to complete an addition, a
new one can be started in every clock cycle. This is because the first add
in a series leaves the first stage of the pipeline in the second clock
cycle, and this stage can then accept add number 2. In such a series of
adds the first one will leave the pipeline in cycle 5, number 2 will leave
in cycle 6, and so on. Throughput is one add per cycle. Parallelism is
achieved by having up to 5 additions/subtractions in flight in the
pipeline at the same time. The drawback is that if the second of two
consecutive additions needs the output from the first, it has to wait for
the first one to complete. We say that there is a data dependency between
the two instructions, and we see the full latency of the additions, which
is 2 times 5 cycles.
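The difference between dependent and independent additions can be
sketched with a couple of made-up BASM fragments (these are for
illustration only and are not part of the example function):
//Dependent chain: every fadd needs the result of the previous one,
//so the full 5 cycle latency is paid for each instruction
fadd st(0),st(1)
fadd st(0),st(1)
fadd st(0),st(1)
//Independent additions: they only read st(0) and write different
//registers, so a new add can enter the F_ADD pipeline every cycle
fadd st(1),st(0)
fadd st(2),st(0)
fadd st(3),st(0)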
Let’s use our function as an example to see how the pipelines work.
Result := A*X*X*X + B*X*X + C*X + D;
It is seen that the 4 terms can be evaluated in parallel and then added as
the final act. Of course term number 4 is not "evaluated". A*X is the
first operation ready to be scheduled to run in the F_MUL pipeline. The
latency of fmul on P4 is 7 cycles, so A*X will be ready in cycle 7. Fmul
has a throughput of 2 cycles. From this we see that fmul is not fully
pipelined: the pipeline will accept a new instruction in cycle 3, not in
cycle 2. B*X is the second instruction ready to execute and it will start
in cycle 3. In cycle 5 the pipeline is again ready to receive a new
instruction, and this is C*X. In cycle 7 A*X is complete and (A*X)*X can
start in cycle 8. In cycle 10 B*X is finished and (B*X)*X will start. In
cycle 12 C*X is ready to go to the F_ADD pipeline to be added to D. In
cycle 15 (A*X)*X is finished and (A*X*X)*X can start. In cycle 17
(B*X)*X and (C*X)+D are complete and their sum can start in the F_ADD
pipeline. This addition completes in cycle 21, where (A*X*X)*X is also
ready. Then the last addition can start in cycle 22. Now there is only one
operation in flight and we must lean back and wait for the full latency of
fadd, which is 5 cycles. In cycle 27 the last addition is finished and the
job is done.
The table below gives the details. The F_MUL column symbolizes the F_MUL
pipeline with 7 stages and the F_ADD column symbolizes the F_ADD pipeline
with 5 stages; each row lists the operations in flight in each pipeline
during that cycle.
Cycle   F_MUL                          F_ADD
1       A*X
2       A*X
3       B*X, A*X
4       B*X, A*X
5       C*X, B*X, A*X
6       C*X, B*X, A*X
7       C*X, B*X, A*X
8       (A*X)*X, C*X, B*X
9       (A*X)*X, C*X, B*X
10      (B*X)*X, (A*X)*X, C*X
11      (B*X)*X, (A*X)*X, C*X
12      (B*X)*X, (A*X)*X               (C*X)+D
13      (B*X)*X, (A*X)*X               (C*X)+D
14      (B*X)*X, (A*X)*X               (C*X)+D
15      (A*X*X)*X, (B*X)*X             (C*X)+D
16      (A*X*X)*X, (B*X)*X             (C*X)+D
17      (A*X*X)*X                      (B*X*X)+(C*X+D)
18      (A*X*X)*X                      (B*X*X)+(C*X+D)
19      (A*X*X)*X                      (B*X*X)+(C*X+D)
20      (A*X*X)*X                      (B*X*X)+(C*X+D)
21      (A*X*X)*X                      (B*X*X)+(C*X+D)
22                                     (A*X*X*X)+(B*X*X+C*X+D)
23                                     (A*X*X*X)+(B*X*X+C*X+D)
24                                     (A*X*X*X)+(B*X*X+C*X+D)
25                                     (A*X*X*X)+(B*X*X+C*X+D)
26                                     (A*X*X*X)+(B*X*X+C*X+D)
27      Finished
An out of order core schedules instructions to run as soon as data and
resources are ready. Resources are registers and execution pipelines. I do
not know for sure, but I think that instructions are scheduled to run in
program order except when an instruction stalls. In that situation the
next ready instruction in program order is scheduled instead. The stalled
instruction will be started as soon as the reason for the stall is
removed; it can stall because of missing resources or because its data is
not ready.
After having seen how instructions are scheduled to run in the pipelines
of a P4, we establish the benchmark. The benchmark is an optimizer that
searches for the best possible fit of our polynomial to ArcSin. It is
based on the simplest of all optimization algorithms, the exhaustive
search. We simply try a lot of combinations of parameter values and
remember the set of parameters that gives the best fit. A and C are run
through the intervals [StartA; EndA] and [StartC; EndC], with the step
sizes AStepSize and CStepSize. This is done in two nested while loops.
StartA := 0;
StartC := -1;
EndA := 1;
EndC := 1;
AStepSize := 1E-2;
CStepSize := 1E-3;
OptA := 9999;
OptC := 9999;
A := StartA;
while A <= EndA do
begin
C := StartC;
while C <= EndC do
begin
Inc(NoOfIterations);
MaxAbsError := CalculateMaxAbsError(A,C, ArcSinArray);
if MaxAbsError <= MinMaxAbsError then
begin
MinMaxAbsError := MaxAbsError;
OptA := A;
OptC := C;
end;
C := C + CStepSize;
end;
A := A + AStepSize;
end;
The CalculateMaxAbsError function calculates a number of points on the X
interval [-1;1], which is the definition interval of the ArcSin function.
function TMainForm.CalculateMaxAbsError(A, C : Double; ArcSinArray : TArcSinArray)
: Double;
var
X, Y, D, B, Yref, Error, AbsError, MaxAbsError : Double;
begin
B := 0;
D := 0;
MaxAbsError := 0;
X := -1;
repeat
Yref := ArcSin (X);
Y := ArcSinApproxFunction(X, A, B, C, D);
Error := Yref-Y;
AbsError := Abs(Error);
MaxAbsError := Max(MaxAbsError, AbsError);
X := X + XSTEPSIZE;
until(X > 1);
Result := MaxAbsError;
end;
At each point we calculate the error by subtracting the Y value from our
approximation function from the reference Y value obtained from a call to
the Delphi RTL ArcSin function. The error can be positive or negative, but
we are interested in the absolute value. We track the biggest absolute
error by taking the maximum of the two values MaxAbsError and AbsError and
assigning it to MaxAbsError. MaxAbsError is initialized to zero and in the
first evaluation it will get the value of the first error (if it is bigger
than zero). MaxAbsError is returned as the result of the function after a
full sweep has been completed. In the optimizer function the two values of
A and C that gave the smallest maximum error are remembered together with
the actual MinMaxAbsError.
All that matters in an optimizer of this kind is to be able to evaluate as
many parameter combinations as possible. For this purpose we must optimize
the optimizer ;-) and the functions we evaluate. In this lesson the
purpose is slightly different, because all we want is some valid
benchmarks for the functions we optimize. The means are however the same:
the code of the optimizer must take as few cycles as possible, so that the
cycles used by the functions make up the biggest part of the total number
of cycles. The first optimizer optimization is to realize that there is no
need to evaluate the reference function over and over again. It returns,
of course, the same values no matter which values A and C have. So we
sweep the reference function once and store the Yref values in an array.
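A minimal sketch of that one-time sweep (not the lesson's actual code),
assuming TArcSinArray is a dynamic array of Double and reusing the
XSTEPSIZE constant from CalculateMaxAbsError:
var
  I : Integer;
  X : Double;
begin
  SetLength(ArcSinArray, Round(2 / XSTEPSIZE) + 1);
  X := -1;
  for I := 0 to High(ArcSinArray) do
  begin
    ArcSinArray[I] := ArcSin(X); //reference value computed once per point
    X := X + XSTEPSIZE;
  end;
end;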
The next optimization is to compact the lines that evaluate the
MaxAbsError
Long version
Yref := ArcSinArray[I];
Error := Yref-Y;
AbsError := Abs(Error);
To the short version
AbsError := Abs(ArcSinArray[I]-Y);
This helps because Delphi creates a lot of redundant code when compiling
FP code.
The long version compiles into this
Yref := ArcSinArray[I];
mov eax,[ebp-$14]
mov edx,[eax+ebx*8]
mov [ebp-$48],edx
mov edx,[eax+ebx*8+$04]
mov [ebp-$44],edx
Error := Yref-Y;
fld qword ptr [ebp-$48]
fsub qword ptr [ebp-$30]
fstp qword ptr [ebp-$50]
wait
AbsError := Abs(Error);
fld qword ptr [ebp-$50]
fabs
fstp qword ptr [ebp-$10]
wait
There are a lot of redundancies in this code and we must conclude that
Delphi is doing a bad job of optimizing floating point code. Let us add
some explanations to the code. The first line of Pascal is an assignment
of one double variable to another. This is done by two pairs of mov, one
for the lower 4 bytes of the variables and one for the upper 4 bytes. The
first line of asm loads the address of the array into eax, which is used
as the base for addressing into the array. Ebx is I and it is multiplied
by 8 because each entry in the array is 8 bytes. The 4 byte offset in the
last two lines (in the last line it is hidden!) moves the reference into
the middle of element I, that is, to its upper 4 bytes.
Yref is located in the stack frame at [ebp-$48] and is loaded by the first
line of FP code. Y is located at [ebp-$30] and is subtracted from Yref by
fsub. The result is Error and this is stored at [ebp-$50].
The last line of Pascal compiles into four lines of asm, of which the
first reloads Error. Storing and then reloading Error is redundant and the
optimizer should have removed it. Fabs is the Abs function and is probably
one of the shortest function implementations ever seen ;-) The Delphi
compiler does not have the inlining optimization, but it applies “compiler
magic” to a few functions, one of which is Abs. The last line stores
AbsError on the stack.
The short version compiles into this
mov eax,[ebp-$14]
fld qword ptr [eax+ebx*8]
fsub qword ptr [ebp-$30]
fabs
fstp qword ptr [ebp-$10]
wait
Here there is no redundancy and the compiler should have emitted this code
as the result of the long Pascal version as well. All lines of code are
present in the long version, but all redundant lines have been removed.
The first line loads the base address of the array into eax. The second
line loads element I (I is in ebx) onto the FP stack. The third line
subtracts Y from Yref. The fourth line is the Abs function. The fifth line
stores the result in the AbsError variable.
There is a peculiarity with this benchmark for which I have no
explanation. The benchmark values are heavily influenced by the way it is
activated: if the keyboard is used to hit the button we get a different
score from the one we get by clicking the button with the mouse! Whoever
can explain this will probably get the Nobel prize in Delphi ;-)
Another irritating thing is that Delphi does not align double variables
properly. They should be 8 byte aligned, but Delphi only 4 byte aligns
them. The penalty we can get from this is a level 1 cache line split (and
L2 cache line splits as well). Loading a variable that splits over two
cache lines takes twice the time of loading one that does not. Because
double variables are 8 bytes and the L1 cache line of P4 is 64 bytes, at
most 1 of 8 variables can have a split. On P3, which has 32 byte L1 cache
lines, it is 1 of 4.
The ideal situation is that variables are aligned on a boundary matching
their size: at 4 bytes if they are 4 bytes big, at 8 bytes if they are 8
bytes big, etc. To make things simple, let us imagine that the first line
in the level one cache is where our variables are loaded. The first line
starts at address 0, that is, memory from address 0 is loaded into it. Our
first variable is aligned and occupies the first 8 bytes of line 1.
Variable number two occupies bytes 9-16, and so on; variable number 8
occupies bytes 57-64 and does not cross the line boundary. If variables
are only 4 byte aligned, the first one could start at byte 5 and number 8
could then start at byte 61. The first 4 bytes of it will be in line 1 and
the next 4 bytes will be in line 2. The processor loads it by first
loading the lower part and then loading the higher part, instead of
loading it all in one go.
Because of this misalignment of double variables in Delphi our benchmark
will not be as stable as we could wish. Alignment can change when we
recompile especially if code is changed. I have chosen (a bad choice) to
not include code to align variables in the benchmark, but will give an
example of it in a later lesson.
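As a small teaser of the idea (this is not the code used here, and all the
names are made up for the illustration), one common trick is to
over-allocate a buffer and round the address up to the next multiple of 8:
procedure AlignedDoubleSketch;
var
  Buffer : array[0..15] of Byte;   //8 bytes for the Double plus up to 7 bytes of slack
  AlignedX : ^Double;
begin
  //Round the buffer address up to the next multiple of 8
  AlignedX := Pointer((Cardinal(@Buffer) + 7) and not Cardinal(7));
  AlignedX^ := 0.5;                //this Double is now 8 byte aligned
end;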
Let us dive into the first function optimization. We start with the
function that uses the naive formula in textbook format.
function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
Result := A*X*X*X + B*X*X + C*X + D;
end;
This function implementation scores the benchmark 43243 on my P4 1600 MHz
clocked at 1920 MHz
Delphi compiled it like this
function ArcSinApprox1b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov ebp,esp
add esp,-$08
}
Result := A*X*X*X + B*X*X + C*X + D;
{
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fld qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
faddp st(1)
fld qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
faddp st(1)
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;
The code from the CPU view will not compile because of the instruction
faddp st(1), so we remove st(1). By default the faddp instruction operates
on st(1), st(0) and there is no need to write the operands out.
function ArcSinApprox1c(X, A, B, C, D : Double) : Double;
asm
//push ebp //Added by compiler
//mov ebp,esp //Added by compiler
add esp,-$08
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fld qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
faddp //st(1)
fld qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
faddp //st(1)
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
pop ecx
pop ecx
//pop ebp //Added by compiler
end;
First we observe that there is no need to set up a stack frame. The stack
is actually used for storing the result temporarily and reloading it again
in the lines
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
but the base pointer, not the stack pointer, is used for this. The lines
that use ebp plus a value access the parameters, which are located above
the base pointer, in the calling function's stack frame. The stack pointer
is not used at all in the function and changing its value is meaningless.
The mov ebp,esp instruction added by the compiler together with the line
add esp,-$08 creates an 8 byte stack frame. Because these lines change the
ebp register it is necessary to back it up by pushing it to the stack.
Unfortunately we can only remove the add esp,-$08 line and the two pop ecx
lines, whose only purpose is to add those 8 bytes back to the stack
pointer, esp.
function ArcSinApprox1d(X, A, B, C, D : Double) : Double;
asm
//add esp,-$08
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fld qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
faddp
fld qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
faddp
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
//pop ecx
//pop ecx
end;
This function implementation scores the benchmark 42391 and performance
actually dipped a little.
The compiler inserts the line mov ebp,esp and we can make it redundant by
using esp instead of ebp.
function ArcSinApprox1e(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
//fld qword ptr [ebp+$20]
fld qword ptr [esp+$20]
//fmul qword ptr [ebp+$28]
fmul qword ptr [esp+$28]
//fmul qword ptr [ebp+$28]
fmul qword ptr [esp+$28]
//fmul qword ptr [ebp+$28]
fmul qword ptr [esp+$28]
//fld qword ptr [ebp+$18]
fld qword ptr [esp+$18]
//fmul qword ptr [ebp+$28]
fmul qword ptr [esp+$28]
//fmul qword ptr [ebp+$28]
fmul qword ptr [esp+$28]
faddp
//fld qword ptr [ebp+$10]
fld qword ptr [esp+$10]
//fmul qword ptr [ebp+$28]
fmul qword ptr [esp+$28]
faddp
//fadd qword ptr [ebp+$08]
fadd qword ptr [esp+$08]
//fstp qword ptr [ebp-$08]
fstp qword ptr [esp-$08]
wait
//fld qword ptr [ebp-$08]
fld qword ptr [esp-$08]
end;
Unfortunately the compiler still inserts the mov instruction and we
performed a copy propagation that gave no optimization because it is not
followed by a dead code removal. Therefore performance is almost the same
– 43094.
Without investigating whether the result stored on the stack is used, we
can optimize the lines copying it there and reloading it. Their net effect
is that a copy of Result is left on the stack; they redundantly pop the
result off the FP stack and then reload it from memory. This single line
has the same effect, but the redundancy is removed.
fst qword ptr [ebp-$08]
This optimization is very often possible on Delphi generated code and is
important to remember.
function ArcSinApprox1f(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+$20]
fmul qword ptr [esp+$28]
fmul qword ptr [esp+$28]
fmul qword ptr [esp+$28]
fld qword ptr [esp+$18]
fmul qword ptr [esp+$28]
fmul qword ptr [esp+$28]
faddp
fld qword ptr [esp+$10]
fmul qword ptr [esp+$28]
faddp
fadd qword ptr [esp+$08]
//fstp qword ptr [esp-$08]
fst qword ptr [esp-$08]
wait
//fld qword ptr [esp-$08]
end;
This function implementation scores the benchmark 47939 and the
improvement is 11 %.
The next question to ask is: Is the copy of the Result on the stack ever
used? To answer it we must inspect the code at the location of the call to
the function.
Y := ArcSinApproxFunction(X, A, B, C, D);
call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp-$30]
wait
The first line after the call stores the result in Y and pops the FP
stack. Seeing this we assume that the result on the stack is not used, but
to be sure we would have to scan through the rest of the code too. If the
rule for the register calling convention is that FP results are
transferred on the FP stack, it is weird that a copy is also placed on the
regular stack. We conclude that it is redundant to copy the Result to the
stack and remove the line doing it.
function ArcSinApprox1g(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+$20]
fmul qword ptr [esp+$28]
fmul qword ptr [esp+$28]
fmul qword ptr [esp+$28]
fld qword ptr [esp+$18]
fmul qword ptr [esp+$28]
fmul qword ptr [esp+$28]
faddp
fld qword ptr [esp+$10]
fmul qword ptr [esp+$28]
faddp
fadd qword ptr [esp+$08]
//fst qword ptr [esp-$08]
wait
end;
This function implementation scores the benchmark 47405.
Instead of writing all the qword ptr [esp+$xx] lines we can write the
names of the variables and let the compiler translate them into addresses.
This actually makes the code more robust: if the location of the variables
should change, hard-coded addresses break while named parameters keep
working. This will however only happen if the calling convention is
changed, and that is not likely to happen very often.
function ArcSinApprox1g_2(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
//fld qword ptr [esp+$20]
fld A
//fmul qword ptr [esp+$28]
fmul X
//fmul qword ptr [esp+$28]
fmul X
//fmul qword ptr [esp+$28]
fmul X
//fld qword ptr [esp+$18]
fld B
//fmul qword ptr [esp+$28]
fmul X
//fmul qword ptr [esp+$28]
fmul X
faddp
//fld qword ptr [esp+$10]
fld C
//fmul qword ptr [esp+$28]
fmul X
faddp
//fadd qword ptr [esp+$08]
fadd D
wait
end;
Try having both types of lines enabled
fld qword ptr [esp+$20]
fld A
and see in the CPU view how the compiler generated exactly the same code
for both versions.
X is used in a lot of lines and it is referenced on the stack. Therefore
it is loaded from the stack into the internal FP registers every time. It
should be faster to load it once into the FP stack and let all uses
reference the FP stack.
function ArcSinApprox1h(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+$20]
fld qword ptr [esp+$28] //New
fxch
//fmul qword ptr [esp+$28]
fmul st(0),st(1)
//fmul qword ptr [esp+$28]
fmul st(0),st(1)
//fmul qword ptr [esp+$28]
fmul st(0),st(1)
fld qword ptr [esp+$18]
//fmul qword ptr [esp+$28]
fmul st(0),st(2)
//fmul qword ptr [esp+$28]
fmul st(0),st(2)
faddp
fld qword ptr [esp+$10]
//fmul qword ptr [esp+$28]
fmul st(0),st(2)
ffree st(2)
faddp
fadd qword ptr [esp+$08]
fst qword ptr [esp-$08]
wait
end;
The second line is the one we added, and it loads X once and for all.
Because it places X on the top of the stack in st(0), and this position is
needed as a temporary for further operations, we exchange st(0) with st(1)
with the fxch instruction. We could of course have swapped lines 1 and 2
and obtained the same effect. All the lines multiplying st(0) with X
fmul qword ptr [esp+$28]
are changed to
fmul st(0),st(1)
After the FP copy of X has been used for the last time we remove it with
the ffree instruction.
This function implementation scores the benchmark 46882 and performance
has decreased by 1 %. This was a surprise. Fxch is claimed by Intel to be
nearly free, because it works by renaming the internal registers. Let us
check that by removing it.
function ArcSinApprox1i(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+$28]
fld qword ptr [esp+$20]
//fld qword ptr [esp+$28]
//fxch
fmul st(0),st(1)
fmul st(0),st(1)
fmul st(0),st(1)
fld qword ptr [esp+$18]
fmul st(0),st(2)
fmul st(0),st(2)
faddp
fld qword ptr [esp+$10]
fmul st(0),st(2)
ffree st(2)
faddp
fadd qword ptr [esp+$08]
wait
end;
This function implementation scores the benchmark 45393 and performance is
decreased by 3 %. Fxch is surely not to blame because performance once
more went down. What is going on?
The wait instruction was discussed in an earlier lesson and this time we
just remove it.
function ArcSinApprox1j(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+$28]
fld qword ptr [esp+$20]
fmul st(0),st(1)
fmul st(0),st(1)
fmul st(0),st(1)
fld qword ptr [esp+$18]
fmul st(0),st(2)
fmul st(0),st(2)
faddp
fld qword ptr [esp+$10]
fmul st(0),st(2)
ffree st(2)
faddp
fadd qword ptr [esp+$08]
//wait
end;
Performance went down to 44140.
Let us cross check these surprising results by running the functions on a
P3.
ArcSinApprox1a 63613
ArcSinApprox1b 64412
ArcSinApprox1c 64433
ArcSinApprox1d 65062
ArcSinApprox1e 64830
ArcSinApprox1f 62598
ArcSinApprox1g 79586
ArcSinApprox1h 85361
ArcSinApprox1i 80515
ArcSinApprox1j 80192
First of all we see that ArcSinApprox1h is the fastest function on P3.
From this it is seen that loading data from the level 1 data cache is more
expensive on P3 than on P4, because changing the code such that X is
loaded only once improved performance on P3 but not on P4. On the other
hand we could also say that on a sound architecture it should always be
slower to get data from the cache than from internal registers, and of the
two only the P6 architecture behaves that way here. P4 has a fast L1 data
cache, which can be read in only 2 clock cycles, but an internal register
read should have a latency of only one cycle. It looks, however, as if
reads from the FP registers take 2 clock cycles on P4.
Then we see that a P3 at 1400 MHz is nearly 80 % faster than a P4 at 1920
MHz on this code. We know that latencies on P3 are shorter, but this is
not enough to explain the huge difference.
The latencies and throughputs of the instructions used are as follows.
On P3
Fadd latency is 3 clock cycles and throughput is 1
Fmul latency is 5 clock cycles and throughput is 1
On P4
Fadd latency is 5 clock cycles and throughput is 1
Fmul latency is 7 clock cycles and throughput is 2
I could not find numbers for fld
The explanation for the very bad performance of P4 on this code must be
the 2 cycle throughput of fmul together with the slow FP register access.
The P4 fmul pipeline only accepts a new instruction every second cycle,
whereas the P3 pipeline accepts one every cycle.
Scaling the results by clock frequency
47939 / 1920 ≈ 25
85361 / 1400 ≈ 61
reveals that, clock for clock, on the fastest function version for each
processor the P3 is nearly 2.5 times faster than the P4. This is truly
astonishing. If the P4 is to have even a slight chance against a P3 we
must remove some of those multiplications. This leads us to the Horner
version of our function.
function ArcSinApprox3a(X, A, B, C, D : Double) : Double;
begin
Result := ((A*X + B)*X + C)*X + D;
end;
Which is compiled into
function ArcSinApprox3b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov ebp,esp
add esp,-$08
}
Result := ((A*X + B)*X + C)*X + D;
{
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;
The first three versions of this function are identical and they
surprisingly score the same benchmark. Our benchmark is not perfect but it
was precise this time ;-)
ArcSinApprox3a 45076
ArcSinApprox3b 45076
ArcSinApprox3c 45076
Optimization follows the same pattern as for the first function. Here is
the first BASM version with no optimizations; the commented-out code is
supplied by the compiler.
function ArcSinApprox3c(X, A, B, C, D : Double) : Double;
asm
//push ebp
//mov ebp,esp
add esp,-$08
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
pop ecx
pop ecx
//pop ebp
end;
The first thing to do is remove the add esp,-$08 line and the two pop ecx
lines. They set up a stack frame and do nothing but manipulate the stack
pointer, which is not used at all.
function ArcSinApprox3d(X, A, B, C, D : Double) : Double;
asm
//add esp,-$08
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
//pop ecx
//pop ecx
end;
This function implementation scores the benchmark 43535.
Both of the redundant lines copying the result to the stack and back are
removed at the same time.
function ArcSinApprox3e(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
//fstp qword ptr [ebp-$08]
wait
//fld qword ptr [ebp-$08]
end;
This function implementation scores the benchmark 47237 and the
improvement is 8.5 %.
Then we change the code such that X is loaded only once.
function ArcSinApprox3f(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
fld qword ptr [ebp+$28]
fxch
//fmul qword ptr [ebp+$28]
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
//fmul qword ptr [ebp+$28]
fmul st(0),st(1)
fadd qword ptr [ebp+$10]
//fmul qword ptr [ebp+$28]
fmul st(0),st(1)
ffree st(1)
fadd qword ptr [ebp+$08]
wait
end;
This function implementation scores the benchmark 47226 and performance is
unchanged.
The ffree instruction can be removed by using fmulp instead of fmul, but
to do this we must interchange the two registers used. Only these two
registers are in use and A*B = B*A so there is no problem doing that. We
are not removing any redundancy by this and the two ways of coding the
same thing should perform identically.
function ArcSinApprox3g(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
fld qword ptr [ebp+$28]
fxch st(1)
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fmul st(0),st(1)
fadd qword ptr [ebp+$10]
//fmul st(0),st(1)
fmulp st(1),st(0)
//ffree st(1)
fadd qword ptr [ebp+$08]
wait
end;
This function implementation scores the benchmark 47416.
Then we remove the wait instruction.
function ArcSinApprox3h(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
fld qword ptr [ebp+$28]
fxch st(1)
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fmul st(0),st(1)
fadd qword ptr [ebp+$10]
fmulp st(1),st(0)
fadd qword ptr [ebp+$08]
//wait
end;
This function implementation scores the benchmark 47059.
The last thing to do is to interchange the lines that load X and A and
remove the fxch instruction.
function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$28]
fld qword ptr [ebp+$20]
//fld qword ptr [ebp+$28]
//fxch st(1)
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fmul st(0),st(1)
fadd qword ptr [ebp+$10]
fmulp st(1),st(0)
fadd qword ptr [ebp+$08]
end;
This function implementation scores the benchmark 46544 and performance
went down!
Let us compare the performance of the Horner style function with the naive
one by picking the fastest implementations of both on P4.
ArcSinApprox1f 47939
ArcSinApprox3g 47416
On P3
ArcSinApprox1h 85361
ArcSinApprox3h 87604
The difference is not big, but the naive approach is a little faster on
P4 and slower on P3. The naive approach has more calculations, but
parallelism makes up for it. The Horner form has very little parallelism
and latencies are fully exposed, which is especially bad on P4.
Having this in mind we continue to the third possible approach, which
looks like this.
function ArcSinApprox4b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov ebp,esp
add esp,-$08
}
Result := (A*X + B)*(X*X)+(C*X + D);
{
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fld qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmulp st(1)
fld qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
faddp st(1)
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;
Experienced as we are now, optimizing this function is going to be easy
and fast ;-)
This version is as Delphi made it:
function ArcSinApprox4c(X, A, B, C, D : Double) : Double;
asm
//push ebp
//mov ebp,esp
add esp,-$08
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fld qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmulp //st(1)
fld qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
faddp //st(1)
fstp qword ptr [ebp-$08]
wait
fld qword ptr [ebp-$08]
pop ecx
pop ecx
//pop ebp
end;
Removing the stack frame and the two lines that store the result in the
stack frame and reload it:
function ArcSinApprox4d(X, A, B, C, D : Double) : Double;
asm
//add esp,-$08
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$20]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$18]
fld qword ptr [ebp+$28]
fmul qword ptr [ebp+$28]
fmulp //st(1)
fld qword ptr [ebp+$10]
fmul qword ptr [ebp+$28]
fadd qword ptr [ebp+$08]
faddp //st(1)
//fstp qword ptr [ebp-$08]
wait
//fld qword ptr [ebp-$08]
//pop ecx
//pop ecx
end;
Load X once
function ArcSinApprox4e(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$20]
fld qword ptr [ebp+$28]
//fmul qword ptr [ebp+$28]
fxch
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
//fld qword ptr [ebp+$28]
fld st(1)
//fmul qword ptr [ebp+$28]
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+$10]
//fmul qword ptr [ebp+$28]
fmul st(0),st(2)
fadd qword ptr [ebp+$08]
faddp
ffree st(1)
wait
end;
Remove fxch and wait.
function ArcSinApprox4f(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$28]
fld qword ptr [ebp+$20]
//fxch
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+$10]
fmul st(0),st(2)
fadd qword ptr [ebp+$08]
faddp
ffree st(1)
//wait
end;
Reschedule ffree st(1)
function ArcSinApprox4g(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$28]
fld qword ptr [ebp+$20]
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+$10]
fmul st(0),st(2)
ffree st(2)
fadd qword ptr [ebp+$08]
faddp
//ffree st(1)
end;
Replace fmul/ffree by fmulp
function ArcSinApprox4h(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$28]
fld qword ptr [ebp+$20]
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+$10]
//fmul st(0),st(2)
fmulp st(2),st(0)
//ffree st(2)
fadd qword ptr [ebp+$08]
faddp
end;
Cleaning up and observing that the compiler still backs up ebp and
modifies esp redundantly.
function ArcSinApprox4i(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+$28]
fld qword ptr [ebp+$20]
fmul st(0),st(1)
fadd qword ptr [ebp+$18]
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+$10]
fmulp st(2),st(0)
fadd qword ptr [ebp+$08]
faddp
end;
The big question is now how well this function version performs.
ArcSinApprox4a 45228
ArcSinApprox4b 45239
ArcSinApprox4c 45228
ArcSinApprox4d 51813
ArcSinApprox4e 49044
ArcSinApprox4f 48674
ArcSinApprox4g 48852
ArcSinApprox4h 44914
ArcSinApprox4i 44914
We see that “optimizations” from function d to i are “deoptimizations” on
P4 except for g.
On P3
ArcSinApprox4a 68871
ArcSinApprox4b 68871
ArcSinApprox4c 68634
ArcSinApprox4d 86806
ArcSinApprox4e 85727
ArcSinApprox4f 83542
ArcSinApprox4g 80548
ArcSinApprox4h 88378
ArcSinApprox4i 85324
We see that optimizations d and h are very good and optimizations e, f, g
and i are bad. It is quite possible that the optimal implementation is
none of the ones we have made. We could pick version h and remove the bad
optimizations, or simply make some more variants, and this way get a
faster implementation.
Which function approach is the winner? To find out we pick the fastest
implementation of each approach:
On P4
ArcSinApprox1f 47939
ArcSinApprox3g 47416
ArcSinApprox4d 51813
The last version is the fastest. Parallelization is very important on a
modern processor and version 4 beats the others by 9 %.
On P3
ArcSinApprox1h 85361
ArcSinApprox3h 87604
ArcSinApprox4h 88378
Function version 4 is the winner on P3 as well, but with a smaller margin.
The P4 has the SSE2 instruction set, which contains instructions for
double precision floating point calculations. The main idea of this
instruction set is SIMD calculations. SIMD is an acronym for Single
Instruction Multiple Data. Multiple data here means two FP double
precision variables (64 bit each), and two sets of these data can be
added, subtracted, multiplied or divided with one instruction. SSE2 also
has some instructions for scalar calculations, which are calculations on
one pair of data, just like ordinary FP math in the FPU. The biggest
difference between ordinary FP math and SSE2 scalar math is that FP math
is performed in extended precision and results are rounded to double
precision when copied to double precision variables in RAM/cache, whereas
SSE2 math is double precision calculation in double precision registers.
The code examples of this lesson have relatively few calculations, so
precision on the FPU stays effectively double: if we load data, perform
all calculations and store the result, the result will be only a little
less than extended precision while still on the FPU stack, and will be
rounded to exact double precision when copied to a memory location. SSE2
calculations, on the other hand, are a little less than double precision
and the result in a register is a little less than double precision too.
If there is only one calculation the result will be in double precision,
but when performing additional calculations the error from each will add
up. Because the FPU does all calculations in extended precision and can
hold temporary results in registers, many calculations can be done before
precision declines below double.
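To make the packed/scalar distinction concrete, here is a small
stand-alone sketch (not part of the lesson's benchmark code, and assuming
the assembler accepts the packed SSE2 mnemonics); it adds two pairs of
doubles with one packed instruction and one pair with the scalar
instruction.
procedure PackedVersusScalarSketch;
var
  A, B : array[0..1] of Double;  //two pairs of doubles, not guaranteed 16 byte aligned
  X, Y : Double;
begin
  A[0] := 1; A[1] := 2;
  B[0] := 10; B[1] := 20;
  X := 3; Y := 30;
  asm
    lea    eax, A           //eax points to A[0]
    lea    edx, B           //edx points to B[0]
    movupd xmm0, [eax]      //load A[0] and A[1] (movupd tolerates unaligned data)
    movupd xmm1, [edx]      //load B[0] and B[1]
    addpd  xmm0, xmm1       //packed: both additions with one instruction
    movupd [eax], xmm0      //store both sums back into A
    movsd  xmm2, X          //scalar: load one double
    addsd  xmm2, Y          //one addition
    movsd  X, xmm2          //store one double
  end;
end;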
We have seen that the drawback of SSE2 is that precision is double or a
little less, versus the full double precision results we get from IA32 FP.
What is the advantage? There are two advantages: the registers are not
ordered as a stack, which makes it easier to arrange code, and
calculations in double precision are faster than calculations in extended
precision. We must therefore expect scalar SSE2 instructions to have
shorter latencies than their IA32 FP counterparts.
Fadd latency is 5
Fsub latency is 5
Fmul latency is 7
Fdiv latency is 38
Addsd latency is 4
Subsd latency is 4
Mulsd
Divsd latency is 35
The P4 optimizations reference manual has no latency and throughput
information for the Mulsd instruction!
We see that latencies are one cycle less for scalar SSE2 in general, but 3
cycles less for division.
Throughput is
Fadd throughput is 1
Fsub throughput is 1
Fmul throughput is 2
Fdiv throughput is 38
Addsd throughput is 2
Subsd throughput is 2
Mulsd
Divsd throughput is 35
We see that the throughput of addsd and subsd is surprisingly double that
of fadd and fsub.
All who think SSE2 has dedicated hardware and that SIMD is calculation on
two sets of data in parallel, raise your hands!
From the latency and throughput tables at page C-1 of the manual
“Optimizations for Intel P4 and Intel Xeon” it is seen that all SSE2 FP
instructions are executed in the same pipelines as the old FP
instructions. This means that a SIMD addition, as an example, generates
two microinstructions that execute in the F_ADD pipeline. In clock cycle
one the first one enters the pipeline; in the second cycle number two
enters the pipeline. Since the instruction's latency is 4 cycles, the
first microinstruction must leave the pipeline at clock cycle 3 and the
second one at cycle four. This leads us to expect that a scalar SSE2 add
should generate one microinstruction of the same type and have a latency
of 3 cycles and a throughput of 1. From the tables it is seen, however,
that the SIMD version of add, addpd, has the same latency and throughput
as the scalar version, addsd. Either there is an error in the tables, or
the scalar instruction also generates two microinstructions of which one
is “blind”, that is, has no effect. Come on Intel!
To verify the numbers from the table we create some dedicated code and
time the instructions.
procedure TMainForm.BenchmarkADDSDLatency;
var
RunNo, ClockFrequency : Cardinal;
StartTime, EndTime, RunTime : TDateTime;
NoOfClocksPerRun, RunTimeSec : Double;
const
ONE : Double = 1;
NOOFINSTRUCTIONS : Cardinal = 895;
begin
ADDSDThroughputEdit.Text := 'Running';
ADDSDThroughputEdit.Color := clBlue;
Update;
StartTime := Time;
for RunNo := 1 to MAXNOOFRUNS do
begin
asm
movsd xmm0, ONE
movsd xmm1, xmm0
movsd xmm2, xmm0
movsd xmm3, xmm0
movsd xmm4, xmm0
movsd xmm5, xmm0
movsd xmm6, xmm0
movsd xmm7, xmm0
addsd xmm0, xmm1
addsd xmm0, xmm1
addsd xmm0, xmm1
addsd xmm0, xmm1
addsd xmm0, xmm1
addsd xmm0, xmm1
addsd xmm0, xmm1
//Repeat the addsd block of code such that there are 128 blocks
end;
end;
EndTime := Time;
RunTime := EndTime - StartTime;
RunTimeSec := (24 * 60 *60 * RunTime);
ClockFrequency := StrToInt(ClockFrequencyEdit.Text);
NoOfClocksPerRun := (RunTimeSec / MaxNoOfRuns) * ClockFrequency * 1000000
/ NOOFINSTRUCTIONS;
ADDSDThroughputEdit.Text := FloatToStrF(NoOfClocksPerRun, ffFixed, 9, 1);
ADDSDThroughputEdit.Color := clLime;
Update;
end;
The addsd instructions all operate on the same two registers and therefore
they cannot execute in parallel. The second instruction has to wait for
the first to finish and the full latency of the instruction is exposed.
For measuring throughput insert this block 128 times
addsd xmm1, xmm0
addsd xmm2, xmm0
addsd xmm3, xmm0
addsd xmm4, xmm0
addsd xmm5, xmm0
addsd xmm6, xmm0
addsd xmm7, xmm0
Here there are no data dependencies between instructions and they can
execute in parallel. Xmm0 is used as source in every line but this does
not create a data dependency.
Results from running the code show us that the latency is 4 cycles and
the throughput is 2 cycles, which is consistent with the table numbers.
Let us code the three functions in scalar SSE2 and perform some
benchmarks. The 8 SSE2 registers are called xmm0-xmm7 and Delphi has no
register view for them, so we must create our own: we create a global (or
local) variable for each register, put a watch on them and add code in
the function to copy the register contents to the variables. It is
somewhat cumbersome to do all this and I am looking forward to Borland
creating an xmm register view. This code shows how I do it.
var
XMM0reg, XMM1reg, XMM2reg, XMM3reg, XMM4reg : Double;
function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+$20]
movsd xmm0,qword ptr [ebp+$20]
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fld qword ptr [ebp+$28]
movsd xmm1,qword ptr [ebp+$28]
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fxch st(1)
fmul st(0),st(1)
mulsd xmm0,xmm1
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fadd qword ptr [ebp+$18]
addsd xmm0,qword ptr [ebp+$18]
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fmul st(0),st(1)
mulsd xmm0,xmm1
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fadd qword ptr [ebp+$10]
addsd xmm0,qword ptr [ebp+$10]
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fmulp st(1),st(0)
mulsd xmm0,xmm1
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
fadd qword ptr [ebp+$08]
addsd xmm0,qword ptr [ebp+$08]
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
movsd [esp-8],xmm0
fld qword ptr [esp-8]
movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3
wait
end;
The code does not use xmm4-xmm7, so there was no need to create a register
view for them. Xmm view code is added after each line of SSE2 code. All
lines but the last two are the FP code with the SSE2 code added, such that
every operation is done in FP as well as in SSE2. This way it is possible
to trace through the code and check that the SSE2 version is doing the
same as the classic version. Open the FPU view, see how the FP stack is
updated, and check that the xmm registers are updated in the same way. I
developed the SSE2 code simply by adding an SSE2 instruction after each
line of FP code.
fld qword ptr [ebp+$20]
movsd xmm0,qword ptr [ebp+$20]
movsd copies one double precision variable from the memory location at
[ebp+$20] into an xmm register. “qword ptr” is not needed, but I kept it
to emphasize the pattern shared by the SSE2 code and the FP code.
A big difference between FP code and scalar SSE2 code is that the FP
registers are organized as a stack and the SSE2 registers are not. At
first, while writing the SSE2 code, I simply ignored this; then, after
having made all the SSE2 lines, I went back, traced through the lines one
by one and corrected them to work on the correct variable/register.
Activate the function with some variable values that are easy to follow
in the two views (e.g. X=2, A=3, B=4, C=5, D=6), and see that first “2” is
loaded, then “3”, then 2 is multiplied by “3” and “2” is overwritten by
“6”, etc. The scalar SSE2 counterpart of fmul is mulsd; the sd suffix
means Scalar Double.
fxch st(1)
fmul st(0),st(1)
mulsd xmm0,xmm1
The SSE2 counterpart for fadd is addsd.
fadd qword ptr [ebp+$18]
addsd xmm0,qword ptr [ebp+$18]
Continue this way line by line.
The FP code leaves the result in st(0), but the SSE2 code leaves the
result in an xmm register. Then the result has to be copied from xmm to
st(0) via a memory location on the stack.
movsd [esp-8],xmm0
fld qword ptr [esp-8]
These two lines do this. At esp-8, 8 bytes above the top of the stack,
there is some space we can use as a temporary location for the result.
The first line copies xmm0 to this temporary and the last line then loads
it onto the FP stack. These two lines are overhead that will make small
SSE2 functions less efficient than their FP cousins.
After having double checked the SSE2 code we can remove the
instrumentation code as well as the old FP code, leaving a nice scalar
SSE2 function with only a little extra overhead.
function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0,qword ptr [ebp+$20]
movsd xmm1,qword ptr [ebp+$28]
mulsd xmm0,xmm1
addsd xmm0,qword ptr [ebp+$18]
mulsd xmm0,xmm1
addsd xmm0,qword ptr [ebp+$10]
mulsd xmm0,xmm1
addsd xmm0,qword ptr [ebp+$08]
movsd [esp-8],xmm0
fld qword ptr [esp-8]
end;
It becomes even nicer if we remove the unneeded “qword ptr” text.
function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0, [ebp+$20]
movsd xmm1, [ebp+$28]
mulsd xmm0,xmm1
addsd xmm0, [ebp+$18]
mulsd xmm0,xmm1
addsd xmm0, [ebp+$10]
mulsd xmm0,xmm1
addsd xmm0, [ebp+$08]
movsd [esp-8],xmm0
fld qword ptr [esp-8]
end;
Finally, we replace the pointer expressions with the parameter names.
function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0, A
movsd xmm1, X
mulsd xmm0,xmm1
addsd xmm0, B
mulsd xmm0,xmm1
addsd xmm0, C
mulsd xmm0,xmm1
addsd xmm0, D
movsd [esp-8],xmm0
fld qword ptr [esp-8]
end;
Well, how does this version perform?
The benchmark is 45882.
This version is somewhat slower than the FP version, which scored 48292.
We need to investigate the reason for this. Is it the overhead of the last
two lines, or is it due to the 2 cycle throughput of addsd and mulsd?
The overhead can be removed by transferring the result as an out
parameter, or we can inline the function. It would be interesting to see
how big an advantage it is to inline this relatively small function;
after all, there is a good deal of overhead in copying 5 double precision
parameters, each 8 bytes big. Let us see how much code is actually needed
for this.
push dword ptr [ebp+$14]
push dword ptr [ebp+$10]
push dword ptr [ebp+$34]
push dword ptr [ebp+$30]
push dword ptr [ebp+$2c]
push dword ptr [ebp+$28]
push dword ptr [ebp+$24]
push dword ptr [ebp+$20]
push dword ptr [ebp+$1c]
push dword ptr [ebp+$18]
call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp+$08]
No less than 10 push instructions, each pushing a 4 byte half of a
parameter onto the stack. Imagine that the register calling convention
took its name seriously and transferred the parameters on the FP stack
instead. Then we would have 5 fld instructions, which would also remove
the need to load the parameters from the stack inside the function. That
is, the 5 fld in the function would be replaced by 5 fld at the call site,
and the 10 push instructions would go away. This would lead to a dramatic
increase in performance. Inlining the function also removes the overhead
of the call/ret pair, which is a lot less than the overhead of the mega
push, and this gives us a clue about the performance of such an updated
"register2" calling convention ;-).
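Purely to illustrate that thought experiment (no such calling convention
exists in Delphi), the call site might then look roughly like this; the
stack offsets for X, A, B, C, D and Y are invented for the illustration.
fld qword ptr [ebp-$08]   //D
fld qword ptr [ebp-$10]   //C
fld qword ptr [ebp-$18]   //B
fld qword ptr [ebp-$20]   //A
fld qword ptr [ebp-$28]   //X
call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp-$30]  //Y := result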
Inlined ArcSinApprox3i 156006
Inlined ArcSinApprox3j 160000
The improvement is an astonishing 400 %.
I truly wish Borland would introduce a true register calling convention
for floating point parameters in the near future.
The SSE2 version is only 3 % faster than the IA32 version; this could be
more on a decent SSE2 processor implementation.
Lesson 7 has hereby come to an end. You now know nearly all about floating
point programming ;-)
Regards
Dennis