Examples of compiler tricks
Let's start with something fun, to show you that compilers are really clever. You will often hear that compilers are dumb, but you shouldn't believe that. In ancient times, compilers indeed were dumb, and it made a difference whether you wrote ++x or x = x + 1, as they resulted in different PDP-11 instructions. But not anymore. Let me convince you with some examples of compiler cleverness.
We will start with a simple summation. But, in order to check what the compiler really makes of our code, we need yet another performance tool, one we didn't mention in Chapter 2, Profiling to Find Bottlenecks: the Compiler Explorer by Matt Godbolt. To use it, you just go to https://godbolt.org and paste your code into the left pane, as seen in the following screenshot.
The assembler output for a selected compiler will then immediately appear in the right pane:
In the following examples, I used the clang 6.0.0 and GCC 8.2 64-bit compilers with the -O3 optimization level. In this book's resources (https://github.com/PacktPublishing/Hands-On-High-performance-with-QT/tree/master/Chapter%203), there's a file with all of the examples. Load it into Compiler Explorer and play with it!
Let's go back to our summation. We can write a simple function adding elements of an array like this:
int sum()
{
    int a[] = { 1, 2, 3, 4, 5, 6 };
    int sum = 0;
    for (auto x : a)
    {
        sum += x;
    }
    return sum;
}
Then, both compilers are able to replace the entire logic with this simple code:
mov eax, 21 # set return register to 21
ret
Isn't that amazing? The compiler performed the computation at compile time on its own, because it understood what was going on in the code. Clever! Now, let's try something more difficult: we will sum up a number series with an unknown bound, as in the next example:
unsigned int sumSeriesTo(unsigned int x)
{
    unsigned int sum = 0;
    for (size_t i = 0; i < x; ++i)
    {
        sum += i;
    }
    return sum;
}
Now, when we call this function with a constant value, return sumSeriesTo(20), the compiler is again able to see through it and replace it with the result straight away: mov eax, 190! If we instead use a variable function parameter, return sumSeriesTo(arg), it simply inlines the whole sumSeriesTo() function on the spot. And now it gets interesting: GCC automatically vectorizes the summation in the inlined function using SSE2 instructions, but clang pulls off an even better trick; it replaces the whole loop with the closed summation formula, that is, sum = x*(x-1)/2 (note that the loop sums the values 0 through x - 1). Clever!
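We can convince ourselves that the closed-form replacement is valid with a quick check. The following is a sketch; sumSeriesClosed() is a hypothetical name for the formula clang effectively substitutes, and because the loop stops at x - 1, it comes out as x*(x-1)/2:

```cpp
#include <cassert>
#include <cstddef>

// The loop from above: sums 0 .. x-1.
unsigned int sumSeriesTo(unsigned int x)
{
    unsigned int sum = 0;
    for (size_t i = 0; i < x; ++i)
    {
        sum += i;
    }
    return sum;
}

// The closed formula: 0 + 1 + ... + (x-1) == x*(x-1)/2.
// This also works for x == 0, since the product is then 0.
unsigned int sumSeriesClosed(unsigned int x)
{
    return x * (x - 1) / 2;
}
```

A loop over a range of inputs confirms that both functions always agree, which is exactly why the compiler is allowed to make this substitution.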
The usual optimizations of replacing multiplications and divisions with shifts are applied automatically too, but with a little twist. For small factors, the shift is replaced by an even more efficient lea (load effective address) instruction, as shown here:
lea eax, [0 + rdi*8] # multiply arg 1 by 8, save in return register
ret
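For reference, a source function of roughly the following shape produces that lea; this is a hedged sketch (multiplyBy8 is a hypothetical name, not taken from the book's example file):

```cpp
// Multiplication by a small power of two: instead of a shift or an
// imul, the compiler can emit a single lea, letting the CPU's
// address-calculation hardware compute index*8 in one step.
unsigned long multiplyBy8(unsigned long x)
{
    return x * 8;
}
```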
You see, you don't have to sweat over small optimizations! The much more costly modulo operation is likewise replaced by a bitwise AND when the divisor is a power of two, like this:
x % y => x & (y - 1) // if y is a power of 2
And indeed, both compilers spotted the opportunity:
mov eax, edi # copy param into return register
and eax, 31 # x & (32 - 1)
ret
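The source producing that assembly could look like the following sketch (mod32 is a hypothetical name); an assertion makes the equivalence of the two forms explicit:

```cpp
#include <cassert>

// x % 32: because 32 is a power of two, the compiler replaces the
// expensive division with the bitwise AND x & 31.
unsigned int mod32(unsigned int x)
{
    return x % 32;
}
```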
Another simple optimization we already mentioned in Chapter 1, Understanding Performant Programs, namely the replacement of ternary expressions with bit manipulations for integer values, was applied to a simple expression such as int x = a < 0 ? 1111 : 2222;. However, another common branch-removing transformation, that is, a += (b ? 0 : c) => a += (b - 1) & c, wasn't detected by either of the compilers. We can see that compilers cannot do everything they theoretically could. If your code is really performance-critical, you can apply such optimizations yourself. Here, the usual caveats apply:
- Measure first to find the bottleneck, and measure again after the change was made. Revert the change if the trick didn't help.
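The branch-removing transformation mentioned above is easy to verify by hand. The following is a sketch (addIfNotB is a hypothetical name), and it assumes b holds only 0 or 1:

```cpp
#include <cassert>

// Branchless replacement for a += (b ? 0 : c).
// For b == 1, (b - 1) is 0, so the AND yields 0 and nothing is added.
// For b == 0, (b - 1) is all-ones in two's complement, so the AND
// passes c through unchanged.
int addIfNotB(int a, unsigned int b, int c)
{
    return a + (static_cast<int>(b - 1) & c);
}
```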
But let's look at another quite marvelous trick that compilers are proficient in: heap elision. Let's say I write some very dumb code like this:
int dumb()
{
    auto val = new int(44);
    return *val;
}
Then, both compilers are able to reduce it to a simple mov eax, 44! No needless allocation, no memory leak! Here is a more complicated version using smart pointers:
int smartDumb()
{
    auto val1 = std::make_unique<int>(22);
    auto val2 = std::make_unique<int>(22);
    return *val1 + *val2;
}
It gets simplified just as before and results in the same mov eax, 44! If we return to our first summation example and replace the a[] array with a std::vector<int>, clang is still able to just return the value without any calculation. The vector will, however, still be allocated, probably because of its internal dynamic memory, even though it is never used. With the latest experimental version of clang (as of the time of writing), even that allocation is gone. Doubly clever! Good boy!
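For completeness, the vector variant of the first summation could look like the following sketch (sumVector is a hypothetical name for the substitution described above):

```cpp
#include <vector>

// Same summation as the first example, but with dynamic storage.
// clang can still fold the arithmetic down to the constant 21,
// even if the vector's heap allocation itself remains.
int sumVector()
{
    std::vector<int> a = { 1, 2, 3, 4, 5, 6 };
    int sum = 0;
    for (auto x : a)
    {
        sum += x;
    }
    return sum;
}
```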
The last example is the classic popcount, that is, the count of bits set in an integer value. We use the classic C implementation, shown as follows:
unsigned int countBits(unsigned int x)
{
    unsigned int count = 0;
    while (x)
    {
        ++count;
        x &= (x - 1);
    }
    return count;
}
The x &= (x - 1) line clears the lowest set bit. Here, both compilers were at first unable to optimize the code, but after adding the -march=haswell option to enable the additional instruction sets, interesting things happen. The GCC compiler used the special blsr (reset lowest set bit) instruction for the x &= (x - 1) line, which is admittedly clever. But clang was able to up the ante: it replaced the whole loop with one single popcnt instruction! Nice one! We should probably do the same, using a compiler intrinsic, but more on that later.
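Doing the same by hand would mean calling the intrinsic directly; a minimal sketch using the __builtin_popcount built-in, which both GCC and clang provide, could look like this:

```cpp
// Count set bits via the compiler intrinsic; with -march=haswell (or
// -mpopcnt), this compiles down to a single popcnt instruction.
unsigned int countBitsIntrinsic(unsigned int x)
{
    return __builtin_popcount(x);
}
```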
You have seen that compilers can be pretty smart, and they are improving every day! This should teach you something, namely, that you shouldn't try to optimize small stuff. Of course, compilers aren't perfect; they can get confused and miss some optimizations, but we should optimize by hand only after we have identified the bottleneck. Otherwise, chances are that we end up with convoluted and unreadable code that, in turn, will inhibit other compiler optimizations. I know, I said it before, but this one is really important!