Wednesday 27 November 2013

Some STL functions are Simpler Than You Might Think

The STL functions generate and generate_n are quite useful, but when I first used them I thought I must be missing something. It turns out though that these functions, like many in the standard library, are really simple.

The first time I used generate_n was to initialise a vector. I looked at the documentation and quickly found how to achieve this, but subconsciously I assumed I was only scratching the surface of what it could do. I probably thought that way about all STL functions, and in many cases I would have been right. For an example try stepping into the line:

std::cout << "Hello, world\n";

If you understand everything that happens in the above line of code you're probably someone who gets paid to implement the standard library. The fact that operator<< comes from the std namespace should have you scratching your head until you've read about name lookup, more specifically argument-dependent lookup (also known as Koenig lookup).

However, with generate_n, things are much more straight forward. In fact, generate_n is most easily understood, and in my case demystified, by looking at the implementation. Here's the crux of the function from VC++ 2012:

template<class _OutIt,
  class _Diff,
  class _Fn0> inline
  _OutIt _Generate_n(_OutIt _Dest, _Diff _Count, _Fn0 _Func)
  {
  for (; 0 < _Count; --_Count, ++_Dest)
    *_Dest = _Func();
  return (_Dest);
  }


That's as difficult as it gets

The above code is probably as easy to understand as any documentation on generate_n. But, more importantly, when you get this, you understand the function completely. There's nothing else to it, there's no smoke nor mirrors, no __asm__, and it doesn't link to any binaries to do the real work. It's just a small bit of C++ code.

The generate function is similar except that it takes a pair of iterators marking a range, and the for loop looks like:

for (; _First != _Last; ++_First)

If you look at the standard itself this shouldn't come as a surprise. After the function declarations there are just seven lines detailing what both generate and generate_n must do. Many other useful STL functions have a similar level of complexity.

This won't come as a surprise to everyone, but for me it was encouraging to know that although I may never understand the language completely, there are many parts that I can get to the bottom of quite easily. It's also encouraged me to step into standard library functions as a first port of call, and shown me a less intimidating side of the C++ Standard itself. That said, I think I need some more experience before I'm ready to tackle:

std::cout << "Hello, world\n";

Tuesday 12 November 2013

Understanding Debug Iterators in VC++ 2012

In debug mode Visual C++ STL iterators can provide useful information at run-time via failed assertions, but there’s a performance cost and in certain scenarios they can seem to grind your program to a halt. This post shows why that happens, and how to fix it. Consider the following code:

vector<int> myInts;
myInts.push_back(1);
cout << myInts[4] << endl;

It will compile O.K., and in a release build the resultant program will happily reach into memory beyond the end of the myInts vector and send whatever garbage it finds there to cout. However, in debug mode stepping over the third line produces the popup “Debug Assertion Failed… vector subscript out of range”. That may not be a big deal in the above case, but if the index were calculated in a loop dependent on a number of variables, that information could be very useful.

For another example consider what’s wrong with now adding the following lines:

auto it = myInts.begin();
myInts.push_back(2);
cout << *it << endl;

Chances are this code doesn't do what you want it to, because pushing the 2 onto the vector probably requires it to grow, which requires it to move, which leaves the iterator it pointing at deallocated memory. Again, in release mode cout will get whatever it is now pointing at, but in debug mode we get “vector iterator not dereferencable”. While dereferencable may not be a real word, you'll soon figure out what it means, and it will probably save some debugging time.

However, these extra checks come at a cost. Pretty much any operation on an iterator involves a series of run-time checks. For example, dereferencing an iterator involves checking that the parent container isn't null, checking that the iterator's internal pointer isn't null, and checking that the iterator doesn't point to memory before the parent vector's first element or after its last. The parent container also has extra work to do when debug iterators are in use: all the iterators that point into one container form a singly linked list. This allows the parent container to communicate with those iterators, but walking the list when it needs to is slow.

Example of slow iterators in debug mode

As a simple example, consider a vector of 10,000 numbers from 1-100. I wish to find the number of pairs within that vector that sum to 100. Here's some code to achieve that:

auto noCombosSumTo100 = 0;
for (auto it1 = begin(myNums); it1 != end(myNums); it1++) {
  for (auto it2 = it1; it2 != end(myNums); it2++) {
    if (*it1 + *it2 == 100) {
      noCombosSumTo100++;
    }
  }
}

On my machine this takes about 34 milliseconds fully optimised in release mode, and about 75 seconds in debug. That's roughly a 2200x difference between debug and release!

Obviously this difference can't be entirely attributed to the iterators; there are lots of other opportunities for the compiler to optimise the above code in release mode. However, if I implement the same algorithm using vector indexing, like this:

for (auto index1 = 0; index1 < numTerms; index1++) {
  for (auto index2 = index1; index2 < numTerms; index2++) {
    if (myNums[index1] + myNums[index2] == 100) {
      noCombosSumTo100++;
    }
  }
}

it runs in about 4.5 seconds, so there's clearly a big cost to using debug iterators.

Ways to improve iterator performance

The improvements I'm about to show for the first code snippet could apply to any code, but it's worth understanding that any inefficiencies are exacerbated when debug iterators are involved. To improve the first code snippet while still getting the benefit of debug iterators I can:
  • call end(myNums) once before looping and reuse the result, rather than calling it on every iteration;
  • use pre-increment (++it1) rather than post-increment, which avoids copying the iterator;
  • dereference it1 once per outer iteration and cache the value, rather than dereferencing it in every inner comparison.

These three modifications bring the execution time down to 21 seconds. Interestingly, none of them make any difference in the second snippet, where built-in types are involved. Again, it's worth understanding that these improvements would apply to any user-defined type (in debug mode), though it probably wouldn't be as noticeable as when iterators are involved.

If you want a further speedup then you'll need to sacrifice some additional run-time checks by setting _ITERATOR_DEBUG_LEVEL=1 or 0 (this macro supersedes and combines the functionality of the older _SECURE_SCL and _HAS_ITERATOR_DEBUGGING macros). Doing so decreases the amount of debug help you get, but it brings the original example's execution time down to about 8.5 seconds.
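If you go that route, the macro must be defined before any standard header is included, and it has to agree across every translation unit and library you link (a sketch; the value shown is an example):

```cpp
// Either on the compiler command line:  cl /D_ITERATOR_DEBUG_LEVEL=1 ...
// or at the very top of the file, before any standard library include:
#define _ITERATOR_DEBUG_LEVEL 1  // 2 = full debug checks (the debug default),
                                 // 1 = basic checked iterators, 0 = no checks
#include <vector>
```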

At 8.5 seconds it's still about twice as slow as indexing the vectors, and this is because even with the reduced checking there are more indirections (extra function calls) associated with dereferencing a debug iterator.

In summary...

Remember, the example code used here takes about 34 milliseconds in release mode. The version using vector indexing also takes about 34 milliseconds in release, as does the same algorithm indexing a raw array of ints. This indicates that the code is being optimised to similar assembly in all three cases.

So, to summarise: STL iterators provide useful debug information during development, they let you access STL containers as if through a raw pointer, and they produce release code that is as fast as using raw pointers. However, there is a cost in debug mode, where they can slow your code down a lot. If these costs become too much of a hindrance, there are ways to reduce the effect without changing how you use the iterators.

Friday 1 November 2013

Fast Implementation of the Factorial function in C++

I’ve been investigating the fastest way to implement the factorial function in my MathUtils library. I had originally implemented it simply as:

template <typename T>
emu64 MathUtils::RecursiveFactorial(const T& x)
{
  return (x <= 1 ? 1 : x * RecursiveFactorial(x - 1));
}

although I suspected this wasn’t the most efficient implementation, and I wanted to measure it.
To do that I used the following benchmark test:

for (ulonglong i=1; i<10000000; i++) {
  for (ulonglong j=1; j<20; j++) {
    RecursiveFactorial(j);
  }
}

(I don't compute the factorial of anything greater than 20, as the result would be too large to represent with a 64-bit unsigned number.)

First I took a closer look at what the compiler was doing with the above factorial implementation. The assembly for a debug build was as expected: it implements recursion. With this implementation the above test takes about 14 seconds on my machine in debug mode.

I found that the assembly the compiler generates for this function doesn't change regardless of whether a compile-time constant is passed in. This would change in the future if I marked the function constexpr. VC++ 2012 doesn’t support this feature yet, but it would allow the compiler to evaluate the function at compile time, so you get the result for free at run time.

I then thought about inlining the function. If the compiler did this blindly then it might never finish, writing an inline function inside an inline function, etc… Compilers are smarter than this though, and looking at the assembly showed that using inline didn't make any difference to the output produced by the compiler. This is understandable, as the inline keyword is only a hint to the compiler, one it is free to ignore.

I found, however, that it is possible to make the compiler inline a recursive function using the inline_recursion pragma (for VC++; there are equivalents for other compilers). When this is used the recursive function will be unrolled and written inline up to a maximum of 16 times, which is also the default inline recursion depth. The depth can be changed using the inline_depth pragma. This Stack Overflow answer gives a good explanation of how a recursive function can be inlined to a depth of three.

To compile using these pragmas you need to turn off some debugging options, so after I'd seen from the assembly that the function was indeed being inlined I switched to release mode. Here are the results for the original RecursiveFactorial() function on the benchmark test:
  • Without inlining - 7.9 seconds
  • With inlining and a depth of five - 6.7 seconds.
I found that an inline recursion depth of five gave the best results.

As an interesting aside, the above test, shown here again:

for (ulonglong i=1; i<10000000; i++) {
  for (ulonglong j=1; j<20; j++) {
    RecursiveFactorial(j);
  }
}

actually runs instantaneously in release mode. This is because, in release mode, the compiler detects that the call to RecursiveFactorial() is dead code, so it's eliminated from the compiled assembly. To force it to execute I had to add the result to a running total, and even then the calls were skipped unless I used the total afterwards, so I printed it to the console.

Having spent some time tweaking the recursive factorial function I still had a suspicion that a simple for loop would be faster, so I tested the following:

template <typename T>
ulonglong ForLoopFactorial(const T& value)
{
  ulonglong returnValue = 1;
  for (ulonglong i = 2; i <= value; i++) {
    returnValue *= i;
  }
  return returnValue;
}

In release mode this takes 5.1 seconds to run on my machine. So, despite digging into the first option and learning a few things, the simpler for loop wins for speed, and that's what I've updated my MathUtils library to use. The unit tests still pass, so I haven't broken anything.

I've put the code for this spike on my GitHub.