Bit manipulations using BMI2

Intel's Haswell design (expected in 2013) will include a new instruction set called BMI2 with various fun bit manipulation instructions. Particularly of note are the pext and pdep instructions which are essentially bit-level gather/scatter operations. Combining two pext operations results in the GRP instruction described in Efficient Permutation Instructions for Fast Software Cryptography, where the authors show how to implement bit level permutations using a variety of instructions. Perhaps not coincidentally at least one of the authors now works at Intel.

Here are the operations written in as C++ templates (Haswell will only support pdep and pext with 32 and 64 bit operands, though). As might be apparent, they are written for clarity rather than speed:

template<typename T>
T pext(T val, T mask)
   {
   T res = 0;
   size_t off = 0;

   for(size_t bit = 0; bit != sizeof(T)*8; ++bit)
      {
      const bool val_bit = (val >> bit) & 1;
      const bool mask_bit = (mask >> bit) & 1;

      if(mask_bit)
         {
         res |= static_cast<T>(val_bit) << off;
         ++off;
         }
      }

   return res;
   }

template<typename T>
T pdep(T val, T mask)
   {
   T res = 0;
   size_t off = 0;

   for(size_t bit = 0; bit != sizeof(T)*8; ++bit)
      {
      const bool val_bit = (val >> off) & 1;
      const bool mask_bit = (mask >> bit) & 1;

      if(mask_bit)
         {
         res |= static_cast<T>(val_bit) << bit;
         ++off;
         }
      }

   return res;
   }

These have been checked fairly heavily (though obviously not exhaustively), against what is produced by Intel's emulator.

The paper shows that any arbitrary 32-bit permutation can be done using 10 pexts, 5 ors and 5 shifts. For 64-bit permutations, it's 12, 6, and 6. A big unknown is what kind of latency and throughput these instructions will have in Haswell, but pclmulqdq, which seems at least a metaphorically similiar operation, takes 14 cycles with at most two instructions outstanding - assuming those rates for pext, a 32-bit permutation might take 80 cycles or so.

Here's the DES P permutation:

uint32_t des_p(uint32_t val)
   {
   val = (pext(val, 0x07137FE0) << 16) | pext(val, 0xF8EC801F);
   val = (pext(val, 0x75196E8C) << 16) | pext(val, 0x8AE69173);
   val = (pext(val, 0x56A3CCE4) << 16) | pext(val, 0xA95C331B);
   val = (pext(val, 0xAA539AC9) << 16) | pext(val, 0x55AC6536);
   val = (pext(val, 0x96665A69) << 16) | pext(val, 0x6999A596);
   return val;
   }

And here's DES's initial permutation:

uint64_t des_ip(uint64_t val)
   {
   val = (pext(val, 0x00FF00FF00FF00FF) << 32) | pext(val, 0xFF00FF00FF00FF00);
   val = (pext(val, 0x00FF00FF00FF00FF) << 32) | pext(val, 0xFF00FF00FF00FF00);
   val = (pext(val, 0x00FF00FF00FF00FF) << 32) | pext(val, 0xFF00FF00FF00FF00);
   val = (pext(val, 0xCCCCCCCCCCCCCCCC) << 32) | pext(val, 0x3333333333333333);
   val = (pext(val, 0xCCCCCCCCCCCCCCCC) << 32) | pext(val, 0x3333333333333333);
   val = (pext(val, 0x5555555555555555) << 32) | pext(val, 0xAAAAAAAAAAAAAAAA);
   return val;
   }

The regularity of the IP operation is pretty obvious, and suggests one could get by with a much less powerful operation (and indeed Richard Outerbridge and others have gotten IP down to 30 simple operations)

Finally, PRESENT's permutation, also a very regular operation:

uint64_t present_p(uint64_t val)
   {
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xAAAAAAAAAAAAAAAA) << 32) | pext(val, 0x5555555555555555);
   val = (pext(val, 0xAAAAAAAAAAAAAAAA) << 32) | pext(val, 0x5555555555555555);
   return val;
   }

Each pair of pexts in the above examples is actually a GRP:

(pext(v, mask) << hamming_weight(mask)) | pext(v, ~mask)

which has the effect of moving all the bits of v where mask is 1 to the leftmost part of the output, and all the other bits to the rightmost part.

To generate the sequence of constants, we actually use the same grouping operation. Your input is an array of words specifying where each bit should end up. For a 32 bit permutation, each target is 5 bits because it is in the range [0,25), so 160 bits sufficies to specify the permutation. We convert this to 5 32-bit values, with the lowest bits of each target index in p[0], the 2nd bit of each target index in p[1], and so on. For instance for DES's P, the values turn out to be:

p[0] = 0x07137FE0
p[1] = 0x6BD9232C
p[2] = 0xDD230F1C
p[3] = 0x63665639
p[4] = 0xA5A435AE

Then we can use 15 GRP operations to perform the permutation:

for i in range(0, 5):
    x = GRP(x, p[i])
    for j in range(i+1, 5):
       p[j] = GRP(p[j], p[i])

However you'll note that almost all of these operations do not depend on x, and we can precompute them:

for i in range(1, 5):
   for j in range(0, i):
      p[i] = GRP(p[i], p[j])

producing the values hardcoded into the function above.

This approach works for any arbitrary permutation, which is convenient, but many permutations can be done with fewer operations, which is pretty relevant since it is rare to need to perform arbitrary bit permutations compared to performing specific fixed ones.

There are likely many other useful cryptographic applications to the BMI2 instruction set. Areas that seem particularly ripe for this include GF(2^n) mathematics (including GCM's MAC calculation), the DES round function (likely possible in constant time), perhaps some LFSRs, and a few of the recent hardware-oriented primitives like PRESENT, as in hardware a bit permutation is often free. I will also be interested to see if in a few years, new designs start making use of cheap bit permutations, key-dependent permutations, or more elaborate bit operations like bit interleaving of two words, in the same way that some SHA-3 entries took advantage of the cheap AES round function AES-NI made available. There is always a tradeoff with this, though, as running such an algorithm on processors without BMI2 would mean using a loop such as the ones above, which would be much slower, and possibly even (as with my templates above) not always running in constant time, opening up a dangerous side channel.

On some processors and some compilers, if you replace the conditional in the templates above with:

res |= (static_cast<T>(val_bit & mask_bit)) << bit;
off += mask_bit;

it will run in constant time, but you should definitely verify that is the case by examining the assembly your compiler produces, as a compiler can sometimes surprise you with its choice of instructions, including a conditional jump where you do not expect it.