Going the assembler way
Sometimes, when you absolutely have to squeeze every last bit of performance out of the code, there seems to be only one solution: rewrite it in assembler. My response to any such idea is always the same: don't do it! Rewriting code in assembler is almost always much more trouble than it is worth.
I do admit that there are legitimate reasons for writing assembler code. I looked around and quickly found five areas where assembler is still significantly present: memory managers, graphical code, cryptography routines (encryption, hashing), compression, and interfacing with hardware.
Even in these areas, the situation changes quickly. I tested some small assembler routines from the GraphicEx graphics library and was quite surprised to find that they are not significantly faster than the equivalent Delphi code.
The biggest gain from using assembler comes when you want to process a large buffer of data (such as a bitmap) and perform the same operation on every element. In such cases, you may be able to use SSE2 instructions, which run circles around the slow 386 instruction set that the Delphi compiler uses.
Assembler is not my game (I can read it, but I can't write good optimized assembler code), so my example is extremely simple. The code in the demo program, AsmCode, implements a four-dimensional vector (a record with four floating-point fields) and a function that multiplies two such vectors field by field:
type
  TVec4 = packed record
    X, Y, Z, W: Single;
  end;

function Multiply_PAS(const A, B: TVec4): TVec4;
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;
As it turns out, this is exactly the kind of operation that can be implemented with the SSE family of instructions (movups and mulps belong to the original SSE set, which every SSE2-capable processor also supports). In the code shown next, the first movups moves vector A into register xmm0. The second movups does the same for vector B and register xmm1. Then, the magical instruction mulps multiplies the four single-precision values in register xmm0 by the four single-precision values in register xmm1 and leaves the products in xmm0. At the end, movups copies the result of the multiplication into the function result:
function Multiply_ASM(const A, B: TVec4): TVec4;
asm
  movups xmm0, [A]
  movups xmm1, [B]
  mulps xmm0, xmm1
  movups [Result], xmm0
end;
Running the test shows a clear winner. While Multiply_PAS needs 53 ms to multiply 10 million vectors, Multiply_ASM does the same in less than half the time: 24 ms.
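A measurement like this one can be reproduced with a small console program. The following is only a minimal sketch, not the actual AsmCode demo: the TimeIt helper and the TMultiplyFunc type are hypothetical names introduced for illustration, and the test values are arbitrary. It assumes a Win32 or Win64 target and uses TStopwatch from System.Diagnostics. Absolute numbers will differ from machine to machine.

```pascal
program TimeMultiply;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

type
  TVec4 = packed record
    X, Y, Z, W: Single;
  end;

  // Hypothetical procedural type so that both versions can share one timing loop.
  TMultiplyFunc = function(const A, B: TVec4): TVec4;

function Multiply_PAS(const A, B: TVec4): TVec4;
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;

function Multiply_ASM(const A, B: TVec4): TVec4;
asm
  movups xmm0, [A]
  movups xmm1, [B]
  mulps xmm0, xmm1
  movups [Result], xmm0
end;

// Hypothetical helper: calls Func Count times and returns the elapsed
// time in milliseconds. The test values are arbitrary.
function TimeIt(Func: TMultiplyFunc; Count: Integer): Int64;
var
  A, B, R: TVec4;
  I: Integer;
  Timer: TStopwatch;
begin
  A.X := 1; A.Y := 2; A.Z := 3; A.W := 4;
  B.X := 5; B.Y := 6; B.Z := 7; B.W := 8;
  Timer := TStopwatch.StartNew;
  for I := 1 to Count do
    R := Func(A, B);
  Result := Timer.ElapsedMilliseconds;
end;

begin
  Writeln('Multiply_PAS: ', TimeIt(Multiply_PAS, 10000000), ' ms');
  Writeln('Multiply_ASM: ', TimeIt(Multiply_ASM, 10000000), ' ms');
end.
```

Both functions are called through the same function pointer, so the call overhead is identical for the two versions and the difference in the reported times comes from the function bodies.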
As you can see in the previous example, assembler instructions are introduced with the asm statement and ended with end. In the Win32 compiler, you can mix Pascal and assembler code inside one method. This is not allowed with the Win64 compiler. In 64-bit mode, a method can only be written in pure Pascal or in pure assembler.
The asm statement is only supported by the Windows and OS X compilers. In older sources, you'll also find the assembler directive, which is kept only for backward compatibility and does nothing.
I'll end this short excursion into the assembler world with some advice. Whenever you are implementing a part of your program in assembler, please also create a Pascal version. The best practice is to use a conditional symbol, PUREPASCAL, as a switch. With this approach, we could rewrite the multiplication code as follows:
function Multiply(const A, B: TVec4): TVec4;
{$IFDEF PUREPASCAL}
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;
{$ELSE}
asm
  movups xmm0, [A]
  movups xmm1, [B]
  mulps xmm0, xmm1
  movups [Result], xmm0
end;
{$ENDIF}
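Keeping a Pascal version around has a second benefit: it can serve as a reference against which the assembler version is checked. The following console program is a minimal sketch of that idea and not part of the original demo. The SameVec helper and the test values are made up for illustration. The block that defines PUREPASCAL automatically on non-Intel targets is likewise my assumption about how the switch might be set, based on the predefined CPUX86 and CPUX64 symbols.

```pascal
program CheckMultiply;

{$APPTYPE CONSOLE}

// Assumption: fall back to the Pascal version on anything that is not x86/x64.
{$IF not (Defined(CPUX86) or Defined(CPUX64))}
  {$DEFINE PUREPASCAL}
{$IFEND}

uses
  System.SysUtils, System.Math;

type
  TVec4 = packed record
    X, Y, Z, W: Single;
  end;

// Reference implementation, always compiled.
function Multiply_PAS(const A, B: TVec4): TVec4;
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;

// Switchable implementation, selected with PUREPASCAL.
function Multiply(const A, B: TVec4): TVec4;
{$IFDEF PUREPASCAL}
begin
  Result := Multiply_PAS(A, B);
end;
{$ELSE}
asm
  movups xmm0, [A]
  movups xmm1, [B]
  mulps xmm0, xmm1
  movups [Result], xmm0
end;
{$ENDIF}

// Hypothetical helper: compares two vectors field by field, using the
// default tolerance of SameValue.
function SameVec(const A, B: TVec4): Boolean;
begin
  Result := SameValue(A.X, B.X) and SameValue(A.Y, B.Y) and
            SameValue(A.Z, B.Z) and SameValue(A.W, B.W);
end;

var
  A, B: TVec4;
begin
  // Arbitrary test values.
  A.X := 1.5; A.Y := -2;  A.Z := 0.25; A.W := 1000;
  B.X := 2;   B.Y := 3;   B.Z := 4;    B.W := 0.001;

  if SameVec(Multiply(A, B), Multiply_PAS(A, B)) then
    Writeln('Implementations agree')
  else
    Writeln('MISMATCH between the assembler and Pascal versions');
end.
```

A check of this kind fits naturally into a unit test, so that the two versions cannot silently drift apart when one of them is changed.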