Strings
Strings are probably the most used data type in Delphi. They also have a very interesting implementation, optimized for fast execution. To be exact, AnsiString (with all its variations) and UnicodeString (also known as string) are optimized. The WideString type is implemented in a different manner. As short strings (declarations like string[17]) are only used for backward compatibility, I won't discuss them in this book.
Let's deal with the more widely used AnsiString and UnicodeString first. A variable of such a type is represented by a pointer to a block allocated from the memory manager. If a string is empty, this pointer is nil (which is represented at the CPU level by the number zero); if a string is not empty, it points to the memory holding the string data.
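The nil representation of an empty string can be observed by casting the variable to a pointer. The following console program is a minimal sketch, not part of the DataTypes demo; the program name and the IsNilString helper are hypothetical, and the namespaced unit name assumes Delphi XE2 or newer.

```pascal
program EmptyStringDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True when the variable holds no pointer to string data.
function IsNilString(const s: string): Boolean;
begin
  Result := Pointer(s) = nil;
end;

var
  s: string;

begin
  // A string variable is initialized to nil before first use.
  Writeln('unassigned: ', IsNilString(s));
  s := 'Delphi';
  Writeln('assigned:   ', IsNilString(s));
  // Assigning an empty string resets the pointer back to nil.
  s := '';
  Writeln('cleared:    ', IsNilString(s));
end.
```

The first and the last line should report that the pointer is nil, while the middle one should not.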
Strings are managed types and as such are always initialized to the default value, nil. When you set a string to a value (when you assign a constant to it, for example, s := 'Delphi'), the code generated by the compiler allocates a memory block to store the string and copies a new value into that block.
Something similar happens when you extend a string (s := s + '!'). In such cases, the generated code will reallocate the memory holding the string (more on that in the next chapter), and copy new characters to that memory.
Initializing a string and appending to a string are therefore relatively expensive operations as both allocate memory. (The memory manager performs some interesting optimizations to make string modification much cheaper, in terms of CPU cycles, than one would expect. Read more about that in the next chapter.)
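To get a feeling for the cost of repeated appending, you can compare it with allocating the final length once and then only overwriting characters. The program below is a rough sketch under assumptions: the function names and the CCount constant are made up for this illustration, TStopwatch comes from System.Diagnostics (Delphi 2010 or newer), and the measured times depend heavily on the memory manager, compiler settings, and machine.

```pascal
program AppendDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

const
  CCount = 1000000; // arbitrary size chosen for this illustration

// Grows the string one character at a time; each step may reallocate.
function BuildByAppending(count: Integer): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to count do
    Result := Result + '*';
end;

// Allocates the final length once, then only overwrites characters.
function BuildPreallocated(count: Integer): string;
var
  i: Integer;
begin
  SetLength(Result, count);
  for i := 1 to count do
    Result[i] := '*';
end;

var
  sw: TStopwatch;
  s: string;

begin
  sw := TStopwatch.StartNew;
  s := BuildByAppending(CCount);
  Writeln('append:      ', sw.ElapsedMilliseconds, ' ms, length ', Length(s));

  sw := TStopwatch.StartNew;
  s := BuildPreallocated(CCount);
  Writeln('preallocate: ', sw.ElapsedMilliseconds, ' ms, length ', Length(s));
end.
```

Do not be surprised if the difference is smaller than expected; that is the effect of the memory manager optimizations mentioned above.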
Modifying part of the string without changing its length is, however, a pretty simple operation. Code just has to move the appropriate number of bytes from one place in memory to another and that's that.
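A quick way to check this is to compare the data pointer before and after overwriting a character. The sketch below is not taken from the DataTypes demo; the ModifiedInPlace helper is hypothetical, and the string is built with SetLength so that the variable is the only one using its memory block.

```pascal
program InPlaceDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: overwrites the first character of a non-empty string
// and reports whether the string data stayed at the same address.
function ModifiedInPlace(var s: string): Boolean;
var
  before: Pointer;
begin
  before := Pointer(s);
  s[1] := 'D';
  Result := Pointer(s) = before;
end;

procedure Demo;
var
  s: string;
  i: Integer;
begin
  // Allocate the string at run time so that s is its only user.
  SetLength(s, 6);
  for i := 1 to Length(s) do
    s[i] := 'x';
  Writeln('same memory block after change: ', ModifiedInPlace(s));
  Writeln(s);
end;

begin
  Demo;
end.
```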
Interesting things happen when you assign one non-empty string to another (s := '42'; s1 := s;). To make such operations fast and efficient, the compiler just points both variables to the same memory and increments an integer-sized counter (called a reference count) in that same memory. A reference count represents the number of variables that are currently sharing the same string data.
A reference count is initialized to 1 when a string is initialized (meaning that only one variable is using this string). After s1 := s, both s and s1 are pointing to the same memory and the reference count is set to 2.
If one of these two variables is no longer accessible (for example, it was a local variable in a method which is no longer executing), the compiler will generate life cycle management code which decrements the reference count. If the new reference count is zero, nobody is using the string data anymore and that memory block is released back to the memory manager.
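The reference count can be observed with the StringRefCount function from the System unit. The following program is a small sketch, assuming a Unicode-enabled Delphi (2009 or newer); the Demo procedure is made up for this illustration, and the string is allocated with SetLength so that it starts its life on the heap with a single user.

```pascal
program RefCountDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

procedure Demo;
var
  s, s1: string;
begin
  // Build '42' at run time; s is the only variable using this block.
  SetLength(s, 2);
  s[1] := '4';
  s[2] := '2';
  Writeln('after initialization: ', StringRefCount(s));

  // No copy is made; both variables now share one memory block.
  s1 := s;
  Writeln('after s1 := s:        ', StringRefCount(s));
  Writeln('same memory block:    ', Pointer(s) = Pointer(s1));

  // Clearing s1 gives up its reference to the shared block.
  s1 := '';
  Writeln('after s1 := '''':       ', StringRefCount(s));
end;

begin
  Demo;
end.
```

The three counts should follow the description above: one user after initialization, two after the assignment, and one again after s1 is cleared.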
After s1 := s we, therefore, have two strings pointing to the same memory. But what happens if one of these strings is modified? What happens, for example, if we do s1[1] := 'a'? It would be very bad if that also modified the original string, s.
Again, the compiler comes to the rescue. Before any modification of a string, the code will check whether the string's reference count is larger than one. If so, this string is shared with another variable. To prevent any mess, the code will at that point allocate new memory for the modified string, copy the current contents into this memory, and decrement the original reference count. After that, it will change the string so it will point to the new memory and modify the content of that memory. This mechanism is called copy-on-write.
You can also force this behavior without actually modifying the string by calling the UniqueString procedure. After s1 := s; UniqueString(s1); both variables will point to separate parts of memory and both will have a reference count of 1.
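The effect of UniqueString can be checked with a few lines of code. This is only a sketch, separate from the DataTypes demo; the Demo procedure is hypothetical, and StringRefCount assumes Delphi 2009 or newer.

```pascal
program UniqueStringDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

procedure Demo;
var
  s, s1: string;
begin
  // Build '42' at run time, then share it between two variables.
  SetLength(s, 2);
  s[1] := '4';
  s[2] := '2';
  s1 := s;
  Writeln('shared before UniqueString: ', Pointer(s) = Pointer(s1));

  // Force s1 to get its own copy without changing the contents.
  UniqueString(s1);
  Writeln('shared after UniqueString:  ', Pointer(s) = Pointer(s1));
  Writeln('contents still equal:       ', s = s1);
  Writeln('reference counts:           ', StringRefCount(s), ' and ',
    StringRefCount(s1));
end;

begin
  Demo;
end.
```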
The following code, taken from the DataTypes demo, demonstrates a part of that behavior. First, the code initializes s1 and s2 so that they point to the same string. It then logs two items for each string: a pointer to the string memory and the contents of the string. After that, the code modifies one of the strings and logs the same information again:
procedure TfrmDataTypes.btnCopyOnWriteClick(Sender: TObject);
var
  s1, s2: string;
begin
  s1 := 'Delphi';
  s2 := s1;
  ListBox1.Items.Add(Format('s1 = %p [%s], s2 = %p [%s]',
    [PPointer(@s1)^, s1, PPointer(@s2)^, s2]));
  s2[1] := 'd';
  ListBox1.Items.Add(Format('s1 = %p [%s], s2 = %p [%s]',
    [PPointer(@s1)^, s1, PPointer(@s2)^, s2]));
end;
If you run this program and click on the string copy-on-write button, you'll see that in the first log line both pointers are the same, and in the second they are different.
This wraps up the implementation behind Ansi and Unicode strings. The WideString type is, however, implemented completely differently. It was designed to be used in OLE applications where strings can be sent from one application to another. Because of that, all WideStrings are allocated with the Windows OLE memory allocator, not with Delphi's standard memory manager.
There is also no copy-on-write implemented for WideStrings. When you assign one WideString to another, new memory is allocated and data is copied. Because of that, WideStrings are slower than Ansi and Unicode strings, and as such, should not be used needlessly.
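The difference is easy to see if you compare data pointers after an assignment. The program below is a minimal sketch that assumes a Windows target, where WideString is backed by the OLE allocator; the Demo procedure and the variable names are made up for this illustration.

```pascal
program WideStringDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

procedure Demo;
var
  u1, u2: string;
  w1, w2: WideString;
begin
  u1 := 'Delphi';
  u2 := u1; // UnicodeString: only the pointer is copied
  w1 := 'Delphi';
  w2 := w1; // WideString: new memory is allocated and data is copied

  Writeln('UnicodeString variables share data: ', Pointer(u1) = Pointer(u2));
  Writeln('WideString variables share data:    ', Pointer(w1) = Pointer(w2));
  Writeln('WideString contents are equal:      ', w1 = w2);
end;

begin
  Demo;
end.
```

On Windows, the first line should report shared data and the second should not, even though both WideString variables hold the same text.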