Friday, 1 November 2013

Fast Implementation of the Factorial function in C++

I’ve been investigating the fastest way to implement the factorial function in my MathsUtils library. I had originally implemented it as simply:

template <typename T>
emu64 MathUtils::RecursiveFactorial(const T& x)
{
  return (x <= 1 ? 1 : x * Factorial(x - 1));
}

although I suspected this wasn’t the most efficient way, and I wanted to measure it.
To measure the efficiency of this implementation I used the following benchmark test:

for (ulonglong i=1; i<10000000; i++) {
  for (ulonglong j=1; j<20; j++) {
    RecursiveFactorial(j);
  }
}

(I don't find the factorial for anything greater than 20 as it would be too large to represent with a 64-bit unsigned number)

First I took a closer look at what the compiler was doing with the above factorial implementation, and the assembly for a debug build was as expected, it implements recursion. With this implementation the above test takes about 14 seconds on my machine in debug mode.

I found that the assembly the compiler writes for this function doesn't change regardless of if a const value is passed in. This would change in the future if I mark the function with constexpr. VC++ 2012 doesn’t support this feature yet but it would allow the compiler to evaluate a function's return value so you get it for free at run time.

I then thought about inlining the function. If the compiler did this blindly then it might hang while writing an inline function, inside an inline function, etc… Compilers are smarter than this though and looking at the assembly showed that using inline didn't make any difference to the output produced by the compiler. This is understandable as the inline keyword is only a hint to the compiler, one it is free to ignore.

I found, however, that it is possible to make the compiler inline a recursive function using the inline_recursing pragma (for VC++, there are equivalents for other compilers). When this is used the recursive function will be unrolled and written inline up to a maximum of 16 times, and this is also the default inline recursion depth. The depth can be changed using inline_depth.This Stack Overflow answer gives a good explanation of how a recursive function can be inlined to a depth of three.

To compile using these pragmas you need to turn off some debugging options, so after I'd seen from the assembly that the function was indeed being inlined I switched to release mode. Here are the results for the original RecursiveFactorial() function on the benchmark test:
  • Without inlining - 7.9 seconds
  • With inlining and a depth of five - 6.7 seconds.
I found that an inline recursion depth of five gave the best results.

As an interesting aside, the above test, shown here again:

for (ulonglong i=1; i<10000000; i++) {
  for (ulonglong j=1; j<20; j++) {
    RecursiveFactorial(j);
  }
}

actually runs instantaneously in release mode. This is because, in release mode, the compiler detects that the call to RecursiveFactorial() is dead code, so it's eliminated from the compiled assembly. To force it to execute I had to add the result to a running total, and even then the calls were skipped unless I used the total afterwards, so I printed it to the console.

Having spent some time tweaking the recursive factorial function I still had a suspicion that a simple for loop would be faster, so I tested the following:

template <typename T>
ulonglong ForLoopFactorial(const T& value)
{
  ulonglong returnValue = 1;
  for (ulonglong i = 2; i <= value; i++) {
    returnValue *= i;
  }
  return returnValue;
}

In release mode this takes 5.1 seconds to run on my machine. So, despite digging into the first option, and learning a few things, the simpler for loop wins for speed, so that's what I've updated my MathUtils library to use. Unit tests still pass so I haven't broken anything.

I've put the code for this spike on my GitHub.

No comments: