POPCNT Instruction Adventure

I stumbled upon the getNumBitsOn64() function and friends in the source code. The function counts the number of 1 bits in a byte/word/dword/… by using table lookups.

A sensible way of doing it but I knew that modern x86 CPUs have a “popcnt” instruction. getNumBitsOn64() isn’t a bottleneck but using a lookup table pollutes the CPU L1 data cache. So using popcnt got to be faster, right? (and low-level optimization is fun).


There was no need for doing inline assembly because GCC has a __builtin_popcount(). I wrote a small test program to measure the speed of the original getNumBitsOn64() versus __builtin_popcount(). __builtin_popcount() was slower. By a factor of 3.


Puzzled, I started digging deeper. With -O3 optimization the original getNumBitsOn64() turned into approximately 72 instructions and 8 memory fetches. __builtin_popcount() eventually called __popcountdi2() which is a small 13-instruction function. But __popcountdi2() also does table lookups. And it is looping while shifting the input value 8 bits at a time, which the CPU apparently is not clever enough to pipeline. I did find that newer versions of GCC has improved that function (GCC bug 36041) but my test system doesn’t have that improvement.


Anyway, I also found that using the options "-O3 -march=corei7 -msse4.2" enables GCC to actually inline the __builtin_popcount() function by using the popcnt instruction. That improved things. Now I found that using __builtin_popcount() was 4 times faster than the original getNumBitsOn64().


But only 4 times? I felt like having a race between a snail and Trabant. The popcnt instruction was faster but I had expected much more than a mere speedup of 4. So I continued digging.


The general wisdom is that on modern CPU you shouldn’t do loop unrolling because it increases the code size (putting pressure on the L1 instruction cache) and the CPUs are clever enough do unrolling and register renaming behind the scenes. But not in this case. I’m not sure why. But the GCC option "-funroll-all-loops" helps there (it is not something I would generally recommend). It increased the speed by a factor of 3.


In the end I ended up with something that was approximately 11 times faster than the original. It hasn’t been committed yet, though.


  1. specifying "-march=corei7 -msse4.2" actually does something
  2. loop unrolling can still help.

And as always: measure performance instead of guessing. But guessing about the expected performance improvement/degradation of a change is fine and can lead you to the “that’s funny…” moments.