SIMD byte scans

published on 2019-03-26

Moore's law is dead, and has been for some time now. We can no longer rely on meaningful processor core upgrades every year to accelerate all our software. If we want to keep going faster, we'll have to play a different game. Instead of making a single thread go faster, the usual approach to increase throughput is to introduce multiple processes working in parallel, each working on their own slice of the problem. That works very well when your workload problem can be easily subdivided. Serving web documents, training a neural network or rasterizing 3d graphics are canonical examples. But what if the problem at hand is inherently serial, where every operation depends on the outcome the preceding operation, are we then forever condemned to the marginal single core upgrades?

Not necessarily. In this post, we'll examine a problem that is purely serial in nature and that can still benefit from a form of parallelism. Consider a long stream of bytes, perhaps a computer file or an internet post, and mark the first occurrence of a byte from a predefined set of marker bytes. Compilers do this to identify the start of blocks or functions, while markdown parsers like pulldown_cmark scan for special bytes indicating the start of a link, image or codeblock. Since we are concerned with the first occurrence of a marker byte, it only makes sense to inspect a byte once we know that all preceding bytes are not markers. This dependency makes the problem serial and therefore tricky to parallelize.

This is how we may scan for marker bytes in python:

def scan_markdown_markers(markdown_str):
  bytes = bytearray(markdown_str, 'utf-8')
  markers = [ord(c) for c in "*_~&[]<!|~`\n\r\\"]

  for (ix, c) in enumerate(bytes):
    if c in markers:
      return ix

  return -1

It interprets a given utf8 string as an array of bytes, and then inspects these bytes one-by-one until it finds one that is in the marker set. Indeed, this seems to work:

markdown = "❤️ Rome ![trevi](trip.jpg)"
ix = scan_markdown_markers(markdown)
print(ix, markdown[ix]) # 11 !

This is how many compilers, parsers and decoders have done it for years and decades. However, it seems kind of wasteful that we all have these beefy 64-bit processors, only to load and process a measely 8 bits at a time. We'd be much more effective if we'd be able to leverage the wide registers and operations of our modern processors. And with so called Single Instrucion Multiple Data (SIMD) operations, we can! So even though multithread parallelism does not make sense here, data-level parallelism can.

the algorithm

The idea behind SIMD scanning is that we can inspect many bytes simultaneously and then compute if there's a match and what its index is using a fixed number of instructions. From here on out, we'll assume that we're programming for a modern AMD64 system, since we'll be making extensive use of the PSHUFB instruction (short for packed shuffle bytes) which is exclusive to 64-bit x86 processors with the SSSE3 extension set. This operation works on 16 bytes at a time.

So what's the general strategy here? Rather than processing bytes at a time, we cut up the corpus into 16 byte chunks. Then, we process these chunks serially as follows. First, we determine for all bytes in the chunk whether they are a marker byte or not in parallel using SIMD instructions. This is equivalent to doing 16 table lookups. Then, if any of the bytes are a match, we return its index within the chunk plus sixteen times the chunk number.

vi](trip.jpg)e29da4efb88f526f6d6520215b7472650000000000000000000000ffff0000001111tre00000000000000![❤️ Rome

Let's start with the most interesting part: doing the byte lookups in parallel. Wojciech Muła has composed a great overview on byte lookups using SIMD. He describes a universal algorithm for testing a bytestream for an arbitrary predefined set of marker bytes, as well as a number of specialized algorithms which can be faster for some specific marker sets. Remark that this list of special case is by no means exhaustive. Finding the minimal set of instructions for a given marker set can be quite a fun puzzle.

For the markdown marker set of bytes given by *, _, ~, &, [, ], <, !, |, `, \n, \r and \, none of the special cases apply. Fortunately, we do not need the full generality of the universal algorithm since all our marker bytes are in the ASCII spectrum, which means that their most significant bit is never set. The table below provides an overview of our markers, row indexed by their four least significant, or lower nibble, and column indexed by their upper three bits (their upper nibble).

lo \ hi 0 1 2 3 4 5 6 7
0 `
1 !
2
3
4
5
6 &
7
8
9
a \n *
b [
c < \\|
d \r ]
e ~
f _

Let's run through our algorithm once with a simple example on a single byte. Suppose we want to check whether our marker set contains the byte 0b0010_1010, or 0x2a in hex notation. Its lower nibble is then a and its upper nibble is 2. To check if its in our marker set, we take the ath row and check bit 2. The ath row is given by 0b0000_0101, and bit 2 is the third bit from the right (since indexing starts at 0), which is set! Apparently 0x2a corresponds to *.

If we were to do this with scalar code in python, the procedure below would work.

# prepare the lookup table. this needs to be done only once
markers = [ord(c) for c in "*_~&[]<!|~`\n\r\\"]
bitmap = [0] * 16
for c in markers:
    lower_nibble = c & 0xf
    upper_nibble = c >> 4
    bitmap[lower_nibble] |= 1 << upper_nibble

def is_marker(b):
    lower_nibble = b & 0xf
    upper_nibble = b >> 4
    # check that most significant bit is not set, since all our markers
    # have are ASCII
    msb_set = b & (1 << 7) != 0
    return not msb_set and bitmap[lower_nibble] & (1 << upper_nibble) != 0

Now let's try to vectorize this routine and do it for 16 bytes in parallel. We can't simply use the bytes of one chunk to index into another chunk, since there are 28=2562^8 = 256 different byte values and only 16 bytes in a chunk. We can use its lower nibble to index into another chunk. This works because 24=162^4 = 16, and that's the size of our chunks!

We can perform the lower nibble lookup using PSHUFB, or the _mm_shuffle_epi8 instrinsic on 16 sequential bytes at once. Consider the following pseudocode.

input  = _mm_loadu_si128(chunk_ptr)
bitset = _mm_shuffle_epi8(bitmap, input)

Then, we'll need to get the high nibbles of our bytes. This is typically done by shifting bytes to the right by four, but this SIMD instruction does not exist for reasons that are not widely known. Alternatively, we can treat the chunk as eight 16-bit integers and rightshift those. That instruction does exist. However, this does mean that the low nibble of a preceding integer in a chunk can be shifted into the next integer.

byte 16 (high)byte 16 (low)15th low nibblebyte boundary

This can be problematic as PSFHUB will treat any index byte with its most significant bit set as a zero index. We can get rid of these high nibbles by ANDing those high bits out.

right_shifted = _mm_srli_epi16(input, 4)
hi_nibbles    = _mm_and_si128(right_shifted, mm_set1_epi8(0x0f))

Recall that we created a bitmask by shifting 1 to the left by the value of the high nibble. Again, we have to get creative as there is no SIMD instruction that lets us do such a thing; only fixed shifts are allowed. Instead, we can do another lookup to perform the shift.

bitmask_lookup = _mm_setr_epi8(
    1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4, 1 << 5, 1 << 6, 1 << 7,
    255, 255, 255, 255, 255, 255, 255, 255
)
bitmask        = _mm_shuffle_epi8(bitmask_lookup, hi_nibbles)

An astute reader may have noted that the most significant bit peculiarity of PSHUFB may have produced bad bitsets in the first shuffle since we didn't unset the most significant bits there. While that is true, it does not matter since we set the bitmask to 255 whenever the most significant bit is set. Since no row of our table is all ones, this guarantees that we'll never match a non-ASCII byte.

We're nearly done now; we have produced a bitset and a corresponding bit mask. We just need to AND them together and check that it's not zero. Except, there's a problem. You may have guessed it by now: we can't do that directly with SIMD operations. There is no instruction for inequality testing. As a substitute, we can use that a & (1 << b) != 0 if and only if a & (1 << b) == 1 << b. The two statements are logically equivalent. Moreover, we can express the latter using SIMD instructions!

tmp    = _mm_and_si128(bitset, bitmask)
result = _mm_cmpeq_epi8(tmp, bitmask)
mask   = _mm_movemask_epi8(result)

if mask == 0:
    return -1
else:
    return trailing_zeros(mask)

The _mm_movemask_epi8 intrinsic takes the most significant bit from every of the 16 bytes in our chunk and appends them to create a single 16 bit integer. Note that we typically think of strings and arrays as growing from left to right. Integers however, grow from right to left. So the left-most byte in the chunk corresponds to the right-most bit in the integer. Consequently, the index of the first matching byte is equal to the trailing number of zeroes in the integer. Most modern AMD64 CPUs can compute the number of trailing zeros with a single instruction.

results

Phew, that was a lot of work for a simple scan operation. Luckily, it looks like it may have been worth it. Recent experimentations with bringing this SIMD scanning technique to pulldown_cmark has more than doubled its raw scanning speed and delivers a 10% overall performance boost. A similar routine can also be used to escape HTML characters in the output. Since this involves a marker set which is much simpler and has some special properties (all its lower nibbles are distinct), this can be done using only two SIMD instructions per chunk. As a result, the raw performance increase is even greater there, but since less time is spent on HTML escapes, the overall performance boost seems to be again around 10%.

And that's not where the usefulness of SIMD instructions ends. While they were initially mostly used to add and multiply large vectors of floating point numbers, new applications are still found for them all the time. Only last year, Google engineers were able to employ SIMD instructions to create Swiss tables, a significantly optimized hashmap, one of the most best studied type of data structure out there. We are only bound by our imagination… and the instruction set.