C++ Loop Optimization: Part 2

After the first article on loop optimization my cousin pointed out to me that in some cases, tail recursion can actually be faster then loops in C with the help of tail recursion elimination. It took several tries to come up with a version that the compiler was able to optimize. I knew I got the version correct when the application stopped segfaulting (I believe it was running out of stack space).

My experimentation shows that some optimized cases are faster (unoptimized versions will not complete execution) then loops. If anyone has any more experience with this, please attach a better optimized version and I'll give it a run on my machine.

struct Increment
{
  int m_value;
 
  Increment(int i)
    : m_value(i)
  {
 
  }
 
  int operator()()
  {
    return m_value++;
  }
};
 
template<typename InItr>
uint64_t sum(InItr begin, const InItr &end, uint64_t sumsofar)
{
  if (begin == end) 
  {
    return sumsofar;
  } else {
    ++begin;
    return sum(begin, end, sumsofar + *begin);
  }
}
 
int main()
{
  std::vector<uint32_t> vec;
 
  std::generate_n(back_inserter(vec), 10000000, Increment(0));
 
  for (int i = 0; i < 1000; ++i)
  {
    std::cout << sum(vec.begin(), vec.end(), 0) << std::endl;
  }
}

Tail Recursion Timings

-O2 44.113
-O3 48.278

Functional Timings (From previous article, timings reran)

-O2 42.146
-O3 45.087

Pre-increment Optimization Timings (From previous article, timings reran)

-O2 44.936
-O3 45.434

The tail recursion version is faster then the pre-increment optimized version for -02, but slower for -03. I don't know enough about the compiler optimization techniques to know why -03 would actually slow down all of the test cases.

If nothing else this version is nice and small, pretty easy to read and performs quite well.

Comments

Slight change on your tail

Slight change on your tail recursion alg. This should let you drop carrying along the third argument:

template<typename InItr>
uint64_t sum(InItr begin, const InItr &end)
{
  if (begin == end)
  {
    return 0;
  } else {
    return (*begin++) + sum(begin, end);
  }
}

Using gcc 4.3.2 on the latest Ubuntu (8.10 pre-release), 2.0ghz core 2 duo, I'm getting:

Tail recursion:
-O2 14.4s
-O3 14.3 - 14.5s
w/ slight revision:
-O2 14.1 - 14.2s
-O3 14.2 - 14.4s

Functional:
-O2 14.3 - 14.5s
-O3 14.2s

Pre-increment:
-O2 and -O3: 14.27s

Everything seems to be in the ~14.2-14.5s range. The -O2 for tail recursion with 2 parms might be an exception, but it's close enough to the others to not be able to say for sure.

Not All GCC's Are Create Equally

I had planned to write an entire new article about this, but decided this would suffice.

GCC 4.3+ is much better at optimizing tail recursion than 4.2 (which I used for my tests). GCC 4.2 was not able to optimize your version. On a side note, I'm not entirely sure that this operation is safe, in your version:

return (*begin++) + sum(begin, end);

Because I'm not sure if the order of evaluation is guaranteed by the standard. Maybe someone else can comment on that.

I think it's safe to say that the idea that "you cannot affect the performance of an optimized loop" is still alive and well. All of the numbers timings are so close to each other, that they are virtually indistinguishable.