Memory allocation in a parallel world
We've seen how FastMM boosts the reallocation speed. Let's take a look at another optimization which helps a lot when you write a multithreaded code—as we will in the next three chapters.
The life of a memory manager is simple when there is only one thread of execution inside a program. When the memory manager is dealing out the memory, it can be perfectly safe in the knowledge that nothing can interrupt it in this work.
When we deal with parallel processing, however, multiple paths of execution simultaneously execute the same program and work on the same data. (We call them threads and I'll explain them in the next chapter.) Because of that, life from the memory manager's perspective suddenly becomes very dangerous.
For example, let's assume that one thread wants some memory. The memory manager finds a free memory block on a free list and prepares to return it. At that moment, however, another thread also needs some memory from the same allocator. This second execution thread (running in parallel with the first one) would also find a free memory block on the free list. If the first thread didn't yet update the free list, that may even be the same memory block! That can only result in one thing—complete confusion and crashing programs.
It is extremely hard to write a code that manipulates some data structures (such as a free list) in a manner that functions correctly in a multithreaded world. So hard that FastMM doesn't even try it. Instead of that, it regulates access to each allocator with a lock. Each of the 56 small block allocators get their own lock, as do medium and large block allocators.
When a program needs some memory from, say, a 16-byte allocator, FastMM will lock this allocator until the memory is returned to the program. If, during this time, another thread requests a memory from the same 16-byte allocator, it will have to wait until the first thread finishes.
This indeed fixes all problems but introduces a bottleneck—a part of the code where threads must wait to be processed in a serial fashion. If threads do lots of memory allocation, this serialization will completely negate the speed-up that we expected to get from the parallel approach. Such a memory manager would be useless in a parallel world.
To fix that, FastMM introduces memory allocation optimization which only affects small blocks.
When accessing a small block allocator, FastMM will try to lock it. If that fails, it will not wait for the allocator to become unlocked, but will try to lock the allocator for the next block size. If that succeeds, it will return memory from the second allocator. That will indeed waste more memory, but will help with the execution speed. If the second allocator also cannot be locked, FastMM will try to lock the allocator for yet the next block size. If the third allocator can be locked, you'll get back memory from it. Otherwise, FastMM will repeat the process from the beginning.
This process can be somehow described with the following pseudo-code:
allocIdx := find best allocator for the memory block
repeat
if can lock allocIdx then
break;
Inc(allocIdx);
if can lock allocIdx then
break;
Inc(allocIdx);
if can lock allocIdx then
break;
Dec(allocIdx, 2)
until false
allocate memory from allocIdx allocator
unlock allocIdx
A careful reader would notice that this code fails when the first line finds the last allocator in the table, or the one before that. Instead of adding some conditional code to work around the problem, FastMM rather repeats the last allocator in the list three times. The table of small allocators actually ends with the following sizes: 1,984; 2,176; 2,384; 2,608; 2,608; 2,608. When requesting a block size above 2,384 the first line in the pseudo-code above will always find the first 2,608 allocator, so there will always be two more after it.
This approach works great when memory is allocated but hides another problem. And how can I better explain a problem than with a demonstration ...?
An example of this problem can be found in the program, ParallelAllocations. If you run it and click the Run button, the code will compare the serial version of some algorithm with a parallel one. I'm aware that I did not explain parallel programming at all, but the code is so simple that even somebody without any understanding of the topic will guess what it does.
The core of a test runs a loop with the Execute method on all objects in a list. If a parallelTest flag is set, the loop is executed in parallel, otherwise it is executed serially. The only mystery part in the code, TParallel.For does exactly what it says—executes a for loop in parallel.
if parallelTest then
TParallel.For(0, fList.Count - 1,
procedure(i: integer)
begin
fList[i].Execute;
end)
else
for i := 0 to fList.Count - 1 do
fList[i].Execute;
If you'll be running the program, make sure that you execute it without the debugger (Ctrl + Shift + F9 will do that). Running with the debugger slows down parallel execution and can skew the measurements.
On my test machine I got the following results:
In essence, parallelizing the program made it almost 4 times faster. Great result!
Well, no. Not a great result. You see, the machine I was testing on has 12 cores. If all would be running in parallel, I would expect an almost 12x speed-up, not a mere 4-times improvement!
If you take a look at the code, you'll see that each Execute allocates a ton of objects. It is obvious (even more given the topic of the current chapter) that a problem lies in the memory manager. The question remains though, where exactly lies this problem and how can we find it?
I ran into exactly the same problem a few years ago. A highly parallel application which processes gigabytes and gigabytes of data was not running fast enough. There were no obvious problematic points and I suspected that the culprit was FastMM. I tried swapping the memory manager for a more multithreading-friendly one and, indeed, the problem was somehow reduced but I still wanted to know where the original sin lied in my code. I also wanted to continue using FastMM as it offers great debugging tools.
In the end, I found no other solution than to dig in the FastMM internals, find out how it works, and add some logging there. More specifically, I wanted to know when a thread is waiting for a memory manager to become unlocked. I also wanted to know at which locations in my program this happens the most.
To cut a (very) long story short, I extended FastMM with support for this kind of logging. This extension was later integrated into the main FastMM branch. As these changes are not included in Delphi, you have to take some steps to use this code.
Firstly, you have to download FastMM from the official repository at https://github.com/pleriche/FastMM4. Then you have to unpack it somewhere on the disk and add FastMM4 as a first unit in the project file (.dpr). For example, the ParallelAllocation program starts like this:
program ParallelAllocation;
uses
FastMM4 in 'FastMM\FastMM4.pas',
Vcl.Forms,
ParallelAllocationMain in 'ParallelAllocationMain.pas' {frmParallelAllocation};
When you have done that, you should firstly rebuild your program and test if everything is still working. (It should but you never know ...)
To enable the memory manager logging, you have to define a conditional symbol LogLockContention, rebuild (as FastMM4 has to be recompiled) and, of course, run the program without the debugger.
If you do that, you'll see that the program runs quite a bit slower than before. On my test machine, the parallel version was only 1.6x faster than the serial one. The logging takes its toll, but that is not important. The important part will appear when you close the program.
At that point, the logger will collect all results and sort them by frequency. The 10 most frequent sources of locking in the program will be saved to a file called <programname>_MemoryManager_EventLog.txt. You will find it in the folder with the <programname>.exe. The three most frequent sources of locking will also be displayed on the screen.
The following screenshot shows a cropped version of this log. Some important parts are marked out:
For starters, we can see that at this location the program waited 19,020 times for a memory manager to become unlocked. Next, we can see that the memory function that caused the problem was FreeMem. Furthermore, we can see that somebody tried to delete from a list (InternalDoDelete) and that this deletion was called from TSpeedTest.Execute, line 130. FreeMem was called because the list in question is actually a TObjectList and deleting elements from the list caused it to be destroyed.
The most important part here is the memory function causing the problem—FreeMem. Of course! Allocations are optimized. If an allocator is locked, the next one will be used and so on. Releasing memory, however, is not optimized! When we release a memory block, it must be returned to the same allocator that it came from. If two threads want to release memory to the same allocator at the same time, one will have to wait.
I had an idea on how to improve this situation by adding a small stack (called release stack) to each allocator. When FreeMem is called and it cannot lock the allocator, the address of the memory block that is to be released will be stored on that stack. FreeMem will then quickly exit.
When a FreeMem successfully locks an allocator, it firstly releases its own memory block. Then it checks if anything is waiting on the release stack and releases these memory blocks too (if there are any).
This change is also included in the main FastMM branch, but it is not activated by default as it increases the overall memory consumption of the program. However, in some situations it can do miracles and if you are developing multithreaded programs you certainly should test it out.
To enable release stacks, open the project settings for the program, remove the conditional define LogLockContention (as that slows the program down) and add the conditional define UseReleaseStack. Rebuild, as FastMM4.pas has to be recompiled.
On my test machine, I got much better results with this option enabled. Instead of a 3,9x speed-up, the parallel version was 6,3x faster than the serial one. The factor is not even close to 12x, as the threads do too much fighting for the memory, but the improvement is still significant:
That is as far as FastMM will take us. For a faster execution, we need a more multithreading-friendly memory manager.