I recently had a friend point out to me that your typical loop seen in every day code has minor inefficiencies in it which can add up to a good amount of time being wasted. I agreed with him, so I set out to write an article on how to optimize your C++ loops.
The results of the research for this article surprised me. I'll first cover the optimization techniques:
Typical "Unoptimized" C++ Loop
int main() { std::vector<uint32_t> vec; //Fill the vector with some values for(int i=0; i<10000000; i++) { vec.push_back(i); } //Sum up the values in the vector 1000 times for (int i = 0; i < 1000; i++) { uint64_t sum = 0; for (std::vector<uint32_t>::const_iterator itr = vec.begin(); itr != vec.end(); itr++) { sum += *itr; } std::cout << sum << std::endl; } }
| No Optimization | 551.97s |
| -O1 | 48.91s |
| -O2 | 48.74s |
Cached "end()" Iterator
This version of the loop is exactly the same as the unoptimized version, except we are now caching the value of "end()" so that a lookup does not occur on each loop iteration. Not every developer realizes that for (itr = vec.begin(); itr != vec.end(); itr++) results in a call to vec.end() for each loop. Presumably the compiler cannot cache .end() and it may be a very expensive lookup depending on how well or poorly your container is implemented.
int main() { std::vector<uint32_t> vec; for(int i=0; i<10000000; i++) { vec.push_back(i); } for (int i=0; i<1000; i++) { uint64_t sum = 0; //Cache vec.end() to avoid redundant lookups // (we know that it will not change during this loop but the compiler does not) std::vector<uint32_t>::const_iterator itr, end(vec.end()); for (itr = vec.begin(); itr != end; itr++) { sum += *itr; } std::cout << sum << std::endl; } }
| No Optimization | 524.81s (5% improvement) |
| -O1 | Not Tested |
| -O2 | 48.78s |
Pre-increment Instead of Post-Increment Iterators
We've now made a very simple change, we've gone from itr++ to ++itr. The reason why preincrement is faster than postincrement will be covered in a later article. In this particular case, it saved us almost 40% in our very simple loop!
int main() { std::vector<uint32_t> vec; //Preincrement instead of post increment for(int i=0; i<10000000; ++i) { vec.push_back(i); } for (int i=0; i<1000; ++i) { uint64_t sum = 0; std::vector<uint32_t>::const_iterator itr, end(vec.end()); //Preincrement instead of post increment for (itr = vec.begin(); itr != end; ++itr) { sum += *itr; } std::cout << sum << std::endl; } }
| No Optimization | 323.58s (38% Improvement) |
| -O1 | Not Tested |
| -O2 | 48.74s |
Use std::for_each
We've gone the functional route and put to use std::for_each which does the above two things for us, it caches .end() and uses pre- instead of post- increment operators for the iterators.
However, in this case, we now see a net loss in effeciency. Why? Because with optimization turned off the compiler is not allowed to inline the calls to Sum and Increment.
struct Sum { uint64_t m_sum; Sum() : m_sum(0) { } void operator()(uint32_t i) { m_sum += i; } }; struct Increment { int m_value; Increment(int i) : m_value(i) { } int operator()() { return m_value++; } }; int main() { std::vector<uint32_t> vec; //Nice and succinct, use generate_n to generate 10,000,000 //values with the Increment generator std::generate_n(back_inserter(vec), 10000000, Increment(0)); for (int i = 0; i < 1000; ++i) { //Sum and output the values of each vector. std::cout << std::for_each(vec.begin(), vec.end(), Sum()).m_sum << std::endl; } }
| No Optimization | 398.65s (23% Loss) |
| -O1 | 49.95s |
| -O2 | 48.59s |
Technical Conclusion
The preceeding examples and discussion focused on the unoptimized performance characteristics. The surprise? With a modern GCC (these tests were done with GCC 4.2.1) and optimization enabled you cannot affect the performance. At all. Really, anything with -01 or better was within a margin of error, all four examples took approximately 49s each to run, or 85% faster than our fastest un-optimized version.
I should qualify that this conclusion is based on the standard container template classes. It's possible you. someone on your team or a third party has written rather poor containers with expensive .end() lookups and ineffecient iterators such that the techniques above do become more important.
Personal Conclusion
The functional approach is my personal favorite. The code is a tad bit longer, and it is possible that it is slighly slower than the other versions; I did not run enough tests at enough different optimization levels to conclusively determine if it is any slower or faster. However, the code is more reusable: Sum and Increment are generic and could be put in a reusable library, they could even be extended to be templated types to allow you to choose the size of the int you want to support.
Also, the actual loops themselves are much more succicnt and self documenting:
The functional programming guys have known this for a long time, C++ is just recently discovering it.
Comments
Does it need to be in a for loop?
Jason,
You have:
for (int i = 0; i < 1000; ++i)
{
//Sum and output the values of each vector.
std::cout << std::for_each(vec.begin(), vec.end(), Sum()).m_sum << std::endl;
}
Doesn't this mean that you're running the loop 1000 * 1000 times?
Because for_each() will do call your functor for each element, you don't need to put it inside a loop.
Yes, 1000 times
Yes, each summation is run 1000 times, so there are actually 10,000,000 * 1,000 = 10,000,000,000 loop iterations in each version, the for_each goes over each element in the vector, just like the more obvious nested loop versions do.
This was necessary to get the operation to take enough time that I could accurately compare versions. I tried to keep the outer loop relatively short, and balanced the size of the vector to 10,000,000 to make sure that I was measuring loop time and NOT measuring virtual memory performance (ie, hitting swap space).
Algorithms
std::cout << std::for_each(vec.begin(), vec.end(), Sum()).m_sum << std::endl;This could be more precisely written with std::accumulate (AKA a fold). This way you don't even need the functor.
Good point
I was trying to demonstrate more generic loops than just an accumulate, but thats a good point.
I believe that accumulate version would be:
Have you tried to create a
Have you tried to create a vector with 10000000 elements first (vector::reserve) and then fill it with a for loop?
vector::resize could also be useful.
No gains using pre-increment with -O2
I'm getting results similar although not quite as impressive for the no optimization case. However my results are completely different for the -O2 build showing nearly identical results for post/pre-increment (115.88725s vs 115.80723s). This is on an Intel Xeon 3060 @ 2.4 and linux 2.6.28 with gcc 4.3.3. My best guess is this is due to either a CPU or a compiler difference. It would be interesting to get the specs on the target this was tested on for comparison.