上QQ阅读APP看书，第一时间看更新

Data structures in practice

Enough with the theory already! I know that you, like me, prefer to talk through the code. As one program explains more than a thousand words could, I have prepared a simple demo project: RandomWordSearch.

This program functions as a very convoluted random word generator. When started, it will load a list of 370,101 English words from a file. It will also prepare three internal data structures preloaded with these words:

Random word generator

The program shows three buttons to the user. All three run basically the same code. The only difference is the test function which is passed to the centralized word generator as a parameter:

procedure TfrmRandomWordSearch.FindGoodWord(const wordTest: TWordCheckDelegate);
var
  word: string;
  isWordOK: boolean;
  time: TStopwatch;
begin
  time := TStopwatch.StartNew;
  repeat
    word := GenerateWord;
    isWordOK := wordTest(word);
  until isWordOK or (time.ElapsedMilliseconds > 10000);
  if isWordOK then
    lbWords.ItemIndex := lbWords.Items.Add(Format('%s (%d ms)', [word, time.ElapsedMilliseconds]))
  else
    lbWords.ItemIndex := lbWords.Items.Add('timeout');
end;

The core of the FindGoodWord method can be easily described:

Generate a random word by calling GenerateWord.
Call the test function wordTest on that word. If this function returns False, repeat Step 1. Otherwise show the word.

The code is a bit more complicated because it also checks that the word generation part runs for at most 10 seconds and reports a timeout if no valid word was found in that time.

The random word generator GenerateWord is incredibly simple. It just appends together lowercase English letters until the specified length (settable in the user interface) is reached:

function TfrmRandomWordSearch.GenerateWord: string;
var
  pos: integer;
begin
  Result := '';
  for pos := 1 to inpWordLength.Value do
    Result := Result + Chr(Ord('a') + Random(Ord('z') - Ord('a') + 1));
end;

Let's now check the data preparation phase. The not very interesting (and not shown here) OnCreate handler loads data from a file into a TStringList and calls the LoadWords method:

procedure TfrmRandomWordSearch.LoadWords(wordList: TStringList);
var
  word: string;
begin
  FWordsUnsorted := TStringList.Create;
  FWordsUnsorted.Assign(wordList);

  FWordsSorted := TStringList.Create;
  FWordsSorted.Assign(wordList);
  FWordsSorted.Sorted := True;

  FWordsDictionary := TDictionary<string,boolean>.Create(wordList.Count);
  for word in wordList do
    FWordsDictionary.Add(word, True);
end;

The first data structure is an unsorted TStringList, FWordsUnsorted. Data is just copied from the list of all words by calling the Assign method.

The second data structure is a sorted TStringList, FWordsSorted. Data is firstly copied from the list of all words. The list is then sorted by setting FWordsSorted.Sorted := True.

The last data structure is a TDictionary, FWordsDictionary. A TDictionary always stores pairs of keys and values. In our case, we only need the keys part as there is no data associated with any specific word, but Delphi doesn't allow us to ignore the values part and so the code defines the value as a boolean and always sets it to True.

Although the dictionaries can grow, they work faster if we can initially set the number of elements that will be stored inside. In this case that is simple—we can just use the length of the wordList as a parameter to TDictionary.Create.

The only interesting part of the code left is OnClick handlers for all three buttons. All three call the FindGoodWord method, but each passes in a different test function.

When you click on the Unsorted list button, the test function checks whether the word can be found in the FWordsUnsorted list by calling the IndexOf function. As we will mostly be checking non-English words (remember, they are just random strings of letters), this IndexOf will typically have to compare all 307,101 words before returning -1:

procedure TfrmRandomWordSearch.btnUnsortedListClick(Sender: TObject);
begin
  FindGoodWord(
    function (const word: string): boolean
    begin
      Result := FWordsUnsorted.IndexOf(word) >= 0;
    end);
end;

When you click on the Sorted list button, the test function calls FWordsSorted.IndexOf. As this TStringList is sorted, IndexOf will use a binary search that will need at most log(307101) = 19 (rounded up) comparisons to find out that a word is not found in the list. As this is much less than 307.101, we can expect that finding words with this approach will be much faster:

procedure TfrmRandomWordSearch.btnSortedListClick(Sender: TObject);
begin
  FindGoodWord(
    function (const word: string): boolean
    begin
      Result := FWordsSorted.IndexOf(word) >= 0;
    end);
end;

A click on the last button, Dictionary, calls FWordsDictionary.ContainsKey to check if the word can be found in the dictionary, and that can usually be done in just one step. Admittedly, this is a bit of a slower operation than comparing two strings, but still the TDictionary approach should be faster than any of the TStringList methods:

procedure TfrmRandomWordSearch.btnDictionaryClick(Sender: TObject);
begin
  FindGoodWord(
    function (const word: string): boolean
    begin
      Result := FWordsDictionary.ContainsKey(word);
    end);
end;

If we use the terminology from the last section, we can say that the O(n) algorithm (unsorted list) will run much slower than the O(log n) algorithm (sorted list), and that the O(1) algorithm (dictionary) will be the fastest of them all. Let's check this in practice.

Start the program and click on the Unsorted list button a few times. You'll see that it typically needs few hundred milliseconds to a few seconds to generate a new word. As the process is random and dependent on CPU speed, your numbers may differ quite a lot from mine. If you are only getting timeout messages, you are running on a slow machine and you should decrease the Word length to 3.

If you increment the Word length to 5 and click the button again, you'll notice that the average calculation time will grow up to a few seconds. You may even get an occasional timeout. Increase it to 6 and you'll mostly be getting timeouts. We are clearly hitting the limits of this approach.

Prepare now to be dazed! Click on the Sorted list button (while keeping the Word length at 6) and words will again be calculated blazingly fast. On my computer, the code only needs 10 to 100 milliseconds to find a new word:

Testing with different word length and with first two algorithms

To better see the difference between a sorted list and a dictionary, we have to crank up the word length again. Setting it to 7 worked well for me. The sorted list needed from a few 100 milliseconds to a few seconds to find a new word while the dictionary approach mostly found a new word in under 100 milliseconds.

Increase the Word length to 8 and the sorted list will start to time out while the dictionary will still work. Our O(1) approach is indeed faster than the O(log n) code:

Comparing sorted list (first six words and the two after "timeout") with the dictionary approach (next five and last two).

While the knowledge of algorithms and their complexity is useful, it will not help in every occasion. Sometimes you already use the best possible algorithm but your program is still too slow. At such times, you'll want to find the slow parts of the program and speed them up with whatever trick possible. To do that, you have to find them first, which is the topic of the rest of this chapter.