BASM for beginners, lesson 3
This third lesson introduces MMX and SSE2 together with Int64 arithmetic.
This is the first time we will see processor-dependent optimizations.
The example looks like this:
function AddInt64_1(A, B : Int64) : Int64;
begin
Result := A + B;
end;
Let us jump straight into the asm code.
function AddInt64_2(A, B : Int64) : Int64;
begin
{
push ebp
mov ebp,esp
add esp,-$08
}
Result := A + B;
{
mov eax,[ebp+$10]
mov edx,[ebp+$14]
add eax,[ebp+$08]
adc edx,[ebp+$0c]
mov [ebp-$08],eax
mov [ebp-$04],edx
mov eax,[ebp-$08]
mov edx,[ebp-$04]
}
{
pop ecx
pop ecx
pop ebp
//ret
}
end;
The first three lines of code are recognized as
setting up a stackframe, like in the previous lessons. This time we know that
the compiler might add the first two for us. The last three lines are also a
well known pattern. Again, the compiler might add pop ebp for us. This brings
us to the meat, which is these 8 lines
Result := A + B;
{
mov eax,[ebp+$10]
mov edx,[ebp+$14]
add eax,[ebp+$08]
adc edx,[ebp+$0c]
mov [ebp-$08],eax
mov [ebp-$04],edx
mov eax,[ebp-$08]
mov edx,[ebp-$04]
}
They can be analyzed in pairs because they work
in tandem, doing 64 bit math by splitting the problem up into 32 bit
pieces. The first two lines load A into the register pair edx:eax (eax holds
the lower 32 bits, edx the upper 32 bits). They are
loading a contiguous 64 bit block of data from the previous stackframe,
showing us that A was transferred on the stack. The two load pointers are
separated by 4 bytes. One of them is pointing to the beginning of A and the
other one is pointing into the middle of A. Then come two add instructions.
The first is a normal add and the second one is add with carry. The pointers
in these two lines are pointing to B in the same fashion as the two previous
were pointing at A. The first add adds the lower 32 bits of B to the lower
32 bits of A. This might lead to a carry if the sum is too big to fit into 32
bits. This carry is included in the addition of the higher 32 bits.
To make things totally clear, let's do a simple example on decimal numbers.
We have the addition 1+2=3. Our imaginary datatype for this "in our brain
CPU" is two digits wide. This means that the addition actually looks like this:
01+02=03. There is no carry from the addition of the lower digits into the
higher ones, which are zero. Let us take decimal example number two: 13+38=?
First we add 3+8=11. This results in a carry and a 1 in the lower half of
the result. Then we add Carry+1+3=1+1+3=5. The result is 51. In the third
example we provoke an overflow: 50+51=101. 101 is too big to fit in two
digits and our brain CPU cannot perform the calculation. There was a carry
on the addition of the two higher digits.
We go back to the code. Two things can happen now. If we have compiled
without overflow checking the result wraps around. With overflow checking an
exception will be raised. We see that there is no overflow check code in our
listing, so wraparound will occur.
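The add/adc pair can also be modelled in plain Pascal. The following is only a sketch and not code from the compiler: the function name AddInt64ByHalves is made up for illustration, and Int64Rec from SysUtils is used to get at the lower and upper 32 bits. Each half is added in a 64 bit temporary so that the carry out of the lower half becomes visible and can be fed into the upper half.

```pascal
program AddWithCarryDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils; // Int64Rec exposes the Lo and Hi 32 bit halves of an Int64

// Hypothetical helper for illustration only, not part of the lesson's code.
function AddInt64ByHalves(A, B : Int64) : Int64;
var
  Sum : Int64;
  Carry : Cardinal;
  R : Int64Rec;
begin
  // Models "add eax,[ebp+$08]": add the lower halves. The sum is formed in
  // 64 bits so a carry out of bit 31 shows up in bit 32 of Sum.
  Sum := Int64(Int64Rec(A).Lo) + Int64(Int64Rec(B).Lo);
  R.Lo := Cardinal(Sum and High(Cardinal));
  Carry := Cardinal(Sum shr 32); // 0 or 1, plays the role of the carry flag

  // Models "adc edx,[ebp+$0c]": add the upper halves plus the carry.
  Sum := Int64(Int64Rec(A).Hi) + Int64(Int64Rec(B).Hi) + Carry;
  R.Hi := Cardinal(Sum and High(Cardinal)); // a carry out of bit 63 is dropped

  Result := Int64(R);
end;

begin
  // The lower half overflows, so the carry moves into the upper half.
  WriteLn(AddInt64ByHalves(4294967295, 1)); // prints 4294967296
  // No overflow check in this model: the result wraps around.
  if AddInt64ByHalves(High(Int64), 1) = Low(Int64) then
    WriteLn('wraparound');
end.
```

The first call is the binary version of the 13+38 example: the lower half produces a carry that ends up in the upper half.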
The next two lines save the result into the current
stackframe. The last two lines load the result from the stackframe into eax
and edx, where it already was. These 4 lines are redundant. They can be
removed, and this also removes the need for local stack space (the compiler
still sets up ebp for us because the parameters are on the stack). It is so
easy to be an optimizer ;-)
function AddInt64_6(A, B : Int64) : Int64;
asm
mov eax,[ebp+$10]
mov edx,[ebp+$14]
add eax,[ebp+$08]
adc edx,[ebp+$0c]
end;
This is a nice small function. The compiler generated code consisted of 16
lines and we came down to 4 with only little effort. Today Delphi was really
sleepy.
Now we think like this: If we had 64 bit registers the addition could be
done with two lines of code. But the MMX registers are 64 bit wide and this
might be worth taking advantage of. In the Intel SW Developers Manual
instructions are not marked as belonging to IA32, MMX, SSE or SSE2. This
information would be nice to have, but we have to look elsewhere for it. I
normally use three small programs from Intel, the so called computer based
tutorials on MMX, SSE & SSE2. I do not know where to find them on the Intel
website now, but mail me if you want them. They are simple and nice - very
illustrative. In these I find that a mov for 64 bit from memory into an MMX
register is movq. Q stands for quadword. The MMX registers are named mm0,
mm1, ..., mm7. They are not arranged as a stack, as the FP registers are, and
we can pick whichever one we like. Let us pick mm0. The first instruction
looks like this
movq mm0,[ebp+$10]
There are two ways to go now. We can load B into a register too. This makes
it easy to see what is going on by using the FPU window. The MMX registers
are aliased onto the FP registers and the FPU view can show both sets. Switch
between FP and MMX view by selecting "Display as words/Display as extendeds"
in the shortcut menu. The second way to go is to use the pattern from the
IA32 implementation and perform the addition with the memory location of B as
source. The two solutions are expected to perform identically, because the
CPU needs to load B into a register before doing the addition, and whether
that is done explicitly with movq or implicitly by the add instruction, the
number of micro instructions will be the same. We use the more illustrative
first way. The next line is then a movq again
movq mm1,[ebp+$08]
Then we have to go look for an add instruction, which would be something
like this - paddq. P for packed, add for addition and q for quadword. Now we
get disappointed because there is no such MMX instruction. What about SSE? It
is one more disappointment. Finally, SSE2 has it, and we are happy. Or are
we? If we use it, the code will be targeting P4 and will not run on P3 or
Athlon. Like the P4 lovers we are, we proceed anyway.
paddq mm0,mm1
This line is very intuitive. It adds mm1 to mm0.
The only thing left is to copy the result from mm0 into edx:eax (lower half
in eax, upper half in edx). To do this we need a double word mov instruction
that can take 32 bits from an MMX register as source and an IA32 register as
destination.
movd eax,mm0
This MMX instruction does the job. It copies the lower 32 bits of mm0 to
eax. Then we need to copy the upper 32 bits of the result to edx. I could not
find an instruction for that, and instead I shift the upper 32 bits down into
the lower 32 bits using a 64 bit MMX right shift instruction.
psrlq mm0,32
Then we copy
movd edx,mm0
Then we are done? Unfortunately not. We have to issue the emms instruction
because we have used MMX instructions. It marks the FP registers as empty and
leaves the FP stack in a well defined state. Emms burns 12 cycles on a P4.
Together with the shift, which is also inefficient on P4 (2 cycles throughput
and latency), our solution is not especially fast, and it will only run on P4
and this AMD thing nobody has yet :-(
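To see what the movd / psrlq / movd steps do to the 64 bit sum, the sequence can be modelled in plain Pascal. This is only a sketch under stated assumptions: the names SplitQuadword and AddInt64ViaQuadword are made up, an Int64 variable stands in for the mm0 register, and no MMX or SSE2 instruction is actually executed, so it says nothing about speed.

```pascal
program SplitQuadwordDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-} // no overflow checking: the 64 bit add wraps around, like paddq

// Hypothetical model for illustration only. Mm0 stands in for the 64 bit
// MMX register, EaxValue and EdxValue for the two IA32 registers.
procedure SplitQuadword(Mm0 : Int64; out EaxValue, EdxValue : Cardinal);
begin
  // movd eax,mm0 : copy the lower 32 bits
  EaxValue := Cardinal(Mm0 and High(Cardinal));
  // psrlq mm0,32 : shift the upper 32 bits down. psrlq shifts in zeros;
  // the mask below keeps the model correct however shr treats the sign bit.
  Mm0 := Mm0 shr 32;
  // movd edx,mm0 : copy what is now the lower 32 bits
  EdxValue := Cardinal(Mm0 and High(Cardinal));
end;

// Hypothetical model of the whole routine: one 64 bit add, then the split.
function AddInt64ViaQuadword(A, B : Int64) : Int64;
var
  Mm0 : Int64;
  EaxValue, EdxValue : Cardinal;
begin
  Mm0 := A + B; // paddq mm0,mm1
  SplitQuadword(Mm0, EaxValue, EdxValue);
  // The caller reads the Int64 result with eax as lower and edx as upper half.
  Result := (Int64(EdxValue) shl 32) or Int64(EaxValue);
end;

var
  Lo, Hi : Cardinal;
begin
  SplitQuadword(4294967298, Lo, Hi); // 4294967298 = 1 * 2^32 + 2
  WriteLn(Lo, ' ', Hi);              // prints 2 1
  if AddInt64ViaQuadword(-5, 3) = -2 then
    WriteLn('ok');
end.
```

The model shows why the shift is needed: movd only ever sees the lower 32 bits, so the upper half has to be moved down before it can be copied to edx.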
This ends the third lesson. We left the ball hanging in the air. Can we come
up with a more efficient solution? Moving data between MMX registers and IA32
registers is expensive. The calling convention is no good, because data was
transferred on the stack and not in registers. eax->mm0 is 2 cycles. The
other way is 5 cycles. emms is 12 cycles. Addition is only 2 cycles.
Overhead is plenty.
Regards
Dennis