Fast signature scanning with SIMD instructions
Introduction
Ever thought “my signature scanner takes too long!”? Well, you’re probably alone.
We’re going to write a vectorized algorithm for signature scanning; not really out of necessity, but more so for fun.
I’ve also seen some codebases use algorithms with the potential to be considered crimes against humanity, so I’d at least rather they copy-paste from here than copy-paste from UC or MPGH.
Signature Scanning
Signature scanning is a method of finding the address of a specific chunk of code in a binary. Let’s say you have a binary example.exe and you’ve found a function sub_1403a9ec8 that you want to call or hook. How would you refer to it?
The first thing you probably think of is to just hardcode the offset of the function. In a hypothetical scenario where sub_1403a9ec8 is at 0x3A92C8, you just hardcode the offset as example.exe+3A92C8. Simple, right?
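As a rough illustration (assuming a Windows target; the function pointer type and the GetModuleHandleA lookup are mine, not part of the original example), the hardcoded approach might look like this:
#include <windows.h>

// NOTE: hypothetical signature for sub_1403a9ec8; we only care about the address math
typedef void (*UpdateWorldTitle)(void*, void*, void*);

static UpdateWorldTitle find_update_world_title(void) {
    unsigned char* base = (unsigned char*) GetModuleHandleA("example.exe");
    if (base == NULL) {
        return NULL;
    }
    // NOTE: this offset only holds for this exact build of example.exe
    return (UpdateWorldTitle) (base + 0x3A92C8);
}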
Well, what happens then if example.exe receives an update? That offset will change because all the bytes in the executable have now moved around.
The most common approach to getting around this is “signature scanning”, where you find a unique signature for the chunk of code you want an offset for.
Let’s try to visualize it:
updateWorldTitle proc near
push rbx ; 40 53
push rsi ; 56
push rdi ; 57
sub rsp, 30h ; 48 83 EC 30
lea rcx, [r8+0B0h] ; 49 8D 88 B0 00 00 00
mov rdi, r8 ; 49 8B F8
mov rbx, rdx ; 48 8B DA
call cs:QString_clear ; FF 15 A5 0B 38 00
lea rcx, [rbx+30h] ; 48 8D 4B 30
call GetNetState ; E8 FC FD 1A 00
lea rcx, [rsp+20h] ; 48 8D 4C 24 20
movzx ebx, al ; 0F B6 D8
call cs:QString_ctor ; FF 15 26 13 38 00
mov ecx, ebx ; 8B CB
test bl, bl ; 84 DB
jz short loc_fail ; 74 23
sub ecx, 1 ; 83 E9 01
jz short loc_case_1 ; 74 0A
sub ecx, 1 ; 83 E9 01
jz short loc_case_2 ; 74 0C
cmp ecx, 1 ; 83 F9 01
jnz short loc_default ; 75 3E
...
updateWorldTitle endp
If we were to look for a unique signature for the prologue of this function, we’d find:
40 53 56 57 48 83 EC ? 49 8D 88
This is a function signature: a sequence of bytes that can uniquely identify the prologue of the function.
You might be asking yourself: what is the question mark there for? Let’s first determine the purpose of the instruction whose byte we’ve replaced with a question mark.
48 83 EC 30 sub rsp, 30h
This instruction sets up the stack frame, so the wildcarded operand byte is the size of the function’s stack frame. Our signature assumes that this operand byte might change in a future build, and should therefore be a wildcard.
Now, whether this is a good guess or not is up to the person creating the signature. There are times when this is basically a necessity, like if our signature included instructions that loaded the address of a string constant into a register.
Such operations usually use something called RIP-relative addressing, meaning they encode offsets relative to the program counter. For example, the operand bytes of any of the call instructions in the function disassembly shown above will most likely change across builds.
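To make that concrete, here’s a small sketch (not part of the scanner itself; the helper is hypothetical) of how a rel32 call like the E8 FC FD 1A 00 one above resolves to its target. Because the stored displacement is relative to the address of the next instruction, these operand bytes shift whenever the distance between caller and callee changes across builds.
#include <stdint.h>
#include <string.h>

// NOTE: call_insn points at the E8 opcode byte inside the mapped image
static uint8_t* resolve_rel32_call(uint8_t* call_insn) {
    int32_t rel32;
    memcpy(&rel32, call_insn + 1, sizeof(rel32));  // operand bytes, e.g. FC FD 1A 00
    return call_insn + 5 + rel32;                  // next instruction (RIP) + displacement
}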
Oh, and actually, that signature was generated by an IDA plugin. Other plugins will give different results due to having different heuristics. Here’s an arguably better signature, depending on who you ask:
40 53 56 57 48 83 EC 30 49
There are many ways to represent these signatures, but a grammar like the one shown above is very common. Due to its relative popularity, this is the grammar we will be supporting in our signature scanner today. Here’s an EBNF representation in case you’re enough of a nerd to care. I didn’t really put much effort into it, sorry. I don’t even know if it’s correct.
Signature ::= Byte { " " Byte }
Byte ::= HexPair | "?" | "??"
HexPair ::= HexDigit HexDigit
HexDigit ::= [0-9] | [a-f] | [A-F]
The gist of it is:
- Bytes must be pairs of hexadecimal digits or wildcards (no single-nibble wildcards! nobody does that because.. why)
- Bytes should be separated by spaces, but our parser won’t be strict about it
- Wildcards can be one or two question marks
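For example, the first two strings below are valid under the grammar and describe the same pattern; the third drops the spaces, which the grammar technically requires but our parser will happily accept:
// NOTE: example inputs, shown as C string literals for convenience
const char* sig_spaced   = "40 53 56 57 48 83 EC ? 49 8D 88";   // single-? wildcard
const char* sig_double_q = "40 53 56 57 48 83 EC ?? 49 8D 88";  // ??-style wildcard
const char* sig_packed   = "405356574883EC??498D88";            // no spaces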
Since we’ve now defined our problem and our constraints, it’s time to get to the part where we actually write anything.
Algorithms
We’ll start off by writing the most naive possible algorithm. It’s going to be the absolute first thing you think of writing, with no regard to performance. And if you think of an optimization mid-way through writing it (spoiler: you will), do it later.
Also, safety and sanity checks be damned.
Naive
static u64 naive_pattern_match(const u8* buffer, const u64 buffer_size, const String8 pattern) {
for (u64 i = 0; i < buffer_size; ++i) {
u64 match_offset = 0;
u64 p_cursor = 0;
while (p_cursor < pattern.size) {
char c = pattern.s[p_cursor];
if (c == ' ') {
p_cursor++;
continue;
}
if (c == '?') {
p_cursor++;
// NOTE(geni): Consume second ? if necessary
if (p_cursor < pattern.size && pattern.s[p_cursor] == '?') {
p_cursor++;
}
match_offset++;
continue;
}
u8 byte_in_buffer = buffer[i + match_offset];
u8 byte_in_pattern = hexpair_to_u8(pattern.s + p_cursor);
if (byte_in_buffer != byte_in_pattern) {
goto again;
}
match_offset++;
p_cursor += 2;
}
return i;
again:;
}
return 0xFFFFFFFFFFFFFFFF;
}
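One thing this relies on is a hexpair_to_u8 helper that isn’t shown in any of the listings here; a minimal sketch of it could look like this (assuming well-formed input, in keeping with the no-sanity-checks spirit):
static u8 hexdigit_to_u8(const u8 c) {
    if (c >= '0' && c <= '9') {
        return (u8) (c - '0');
    }
    // NOTE: assumes a valid hex digit, per the grammar above
    return (u8) ((c | 0x20) - 'a' + 10);  // lowercase the letter, then map a-f to 10-15
}

static u8 hexpair_to_u8(const u8* s) {
    return (u8) ((hexdigit_to_u8(s[0]) << 4) | hexdigit_to_u8(s[1]));
}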
There’s nothing complicated about naive_pattern_match, though it’s kind of annoying due to having to keep track of two cursors.
This algorithm is very inefficient. Its most glaring inefficiency is having to parse the string inside of the hot loop.
My first idea, probably yours and apparently everyone else’s too, is to take the string parsing out of the pattern matching algorithm and do it as part of the setup, using a bitmask for the wildcard bytes. Pretty much everyone comes up with the same solution for this:
- Parse the non-wildcard bytes into a byte array `pattern`, where wildcards will be left as `0`
- Parse the wildcard bytes into a bitmask `mask`, where wildcards will be `0` and non-wildcards will be `FF`
- For each byte `i` in `data`:
    - For each byte `j` in `pattern`:
        - If `data[i + j] AND mask[j]` doesn’t equal `pattern[j]`: return to step 3.
    - Return offset `i`
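As a concrete example, preprocessing the prologue signature from earlier this way would produce something like:
// "40 53 56 57 48 83 EC ? 49 8D 88" after preprocessing
u8 pattern[] = {0x40, 0x53, 0x56, 0x57, 0x48, 0x83, 0xEC, 0x00, 0x49, 0x8D, 0x88};
u8 mask[]    = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0xFF, 0xFF, 0xFF};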
Or since that was most likely an awful explanation, here’s a diagram that’ll maybe help:
This diagram may make the vectorization of this quite obvious. In fact, in case you haven’t made the connection yet, our problem starts to look a lot like memmem or strstr in a way.
If you think it’d be fun, I suggest not scrolling any farther and trying your hand at writing an algorithm yourself. And if your design sucks, you can look at SIMD implementations of the two aforementioned functions to get some ideas.
Masked
Parsing
static u64 parse_pattern(const String8 pattern, u8* compiled_out, u64 compiled_out_size, u8* mask_out, u64 mask_out_size) {
u8* cur = pattern.s;
u8* end = pattern.s + pattern.size;
u64 i = 0;
while (cur != end) {
if (i >= mask_out_size || i >= compiled_out_size) {
return 0;
}
// NOTE(geni): Skip spaces
if (*cur == ' ') {
++cur;
continue;
}
// NOTE(geni): Handle wildcards
if (*cur == '?') {
mask_out[i] = 0;
compiled_out[i] = 0;
++i;
++cur;
if (cur < end && *cur == '?') {
++cur;
}
continue;
}
// NOTE(geni): Malformed pattern
if (cur + 1 >= end) {
return 0;
}
// NOTE(geni): Handle bytes
u8 byte = hexpair_to_u8(cur);
mask_out[i] = 0xFF;
compiled_out[i] = byte;
cur += 2;
++i;
}
return i;
}
Here we have a basic parser that roughly adheres to the grammar described above. Spaces are optional but whatever, it’s close enough. It’s a very self-explanatory algorithm so I’ll spare you the details.
There is no point in vectorizing this. Patterns are never large enough for vectorization to make a meaningful difference… but it could be fun to do, so maybe I’ll do it as an appendix to this post one day.
Algorithm
Here’s our first implementation of the algorithm we described above:
static u64 mask_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
for (u64 i = 0; i <= data_size - pattern_size; ++i) {
for (u32 j = 0; j < pattern_size; ++j) {
if ((data[i + j] & mask[j]) != pattern[j]) {
goto again;
}
}
return i;
again:;
}
return 0xFFFFFFFFFFFFFFFF;
}
This, of course, can be optimized. If we go back to the naive algorithm, its second glaring inefficiency is that it attempts a match at every byte in the original data until it finds one! There is no heuristic to perform fewer matching attempts.
The simplest optimization you can make is to branch out of the loop if the first character doesn’t match.
This won’t actually make much of a difference on its own, as the inner loop below already does just that on its first iteration, just with more instructions:
for (u32 j = 0; j < pattern_size; ++j) {
if ((data[i + j] & mask[j]) != pattern[j]) {
goto again;
}
}
But there’s another easy optimization we can make. If the first character matches, we can also check the last character of the pattern: compare data[i + pattern_size - 1] against pattern[pattern_size - 1]. If they don’t match, don’t bother checking the rest of the bytes.
Both optimizations can be seen here:
static u64 mask_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
for (u64 i = 0; i <= data_size - pattern_size; ++i) {
// NOTE(geni): First byte check
if (data[i] != pattern[0] && mask[0] == 0xFF) {
continue;
}
// NOTE(geni): Last byte check
if (data[i + pattern_size - 1] != pattern[pattern_size - 1] && mask[pattern_size - 1] == 0xFF) {
continue;
}
for (u32 j = 0; j < pattern_size; ++j) {
if ((data[i + j] & mask[j]) != pattern[j]) {
goto again;
}
}
return i;
again:;
}
return 0xFFFFFFFFFFFFFFFF;
}
This is considerably faster than the naive version, but we can do better.
It’s an embarrassingly parallel algorithm, but rather than parallelizing it with threads, we can use SIMD instructions. This also has the obvious benefit of speeding up a multi-threaded version if we were to implement one later.
SSE2
We’ll begin by making an SSE2 version, which will be basically just a direct 1:1 vectorized version of mask_pattern_match.
But to even get started, you might be asking: how exactly will we replicate the matching first and last byte heuristic? It’s actually quite simple.
Our algorithm will iterate through data in 16-byte chunks.
- We broadcast the masked first and last bytes of the pattern to XMM registers `xmm0` and `xmm1`.
- As we don’t know beforehand whether a pattern’s first or last byte is a wildcard, we must also broadcast the first and last byte of the mask to XMM registers `xmm2` and `xmm3`.
- Begin iterating through the data in 16-byte chunks, with the current index being `i`.
    - For our first byte check, load `data[i]` into `xmm4`.
    - Bitwise AND `xmm4` and `xmm2` to mask out wildcards and store in `xmm4`.
    - Compare each byte in `xmm4` with each byte in `xmm0` and store in `xmm6`. This register now contains the bytes in the data matching our first byte.
    - For our last byte check, load `data[i + pattern_size - 1]` into `xmm5`.
    - Bitwise AND `xmm5` and `xmm3` to mask out wildcards and store in `xmm5`.
    - Compare each byte in `xmm5` with each byte in `xmm1` and store in `xmm7`. This register now contains the bytes in the data matching our last byte.
    - Bitwise AND `xmm6` and `xmm7` and store in `xmm8`. This register now contains the local offsets of matches of both the first and last byte.
If you didn’t understand, I don’t blame you. I suck at explaining these things, so here’s a diagram I made:
If you still don’t get it, just see the implementation below.
The rest is fairly straightforward. We iterate through all the matches and do our comparison virtually the exact same way we did in our non-vectorized version, just on many bytes at once this time.
Below is our implementation:
static u64 sse2_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
const __m128i first_pattern_byte = _mm_set1_epi8(pattern[0] & mask[0]);
const __m128i last_pattern_byte = _mm_set1_epi8(pattern[pattern_size - 1] & mask[pattern_size - 1]);
const __m128i first_mask_byte = _mm_set1_epi8(mask[0]);
const __m128i last_mask_byte = _mm_set1_epi8(mask[pattern_size - 1]);
u64 i = 0;
for (; i < data_size - pattern_size - 16; i += 16) {
__m128i chunk_first = _mm_loadu_si128((__m128i*) (data + i));
__m128i chunk_last = _mm_loadu_si128((__m128i*) (data + i + pattern_size - 1));
chunk_first = _mm_and_si128(chunk_first, first_mask_byte);
chunk_last = _mm_and_si128(chunk_last, last_mask_byte);
__m128i cmp = _mm_cmpeq_epi8(chunk_first, first_pattern_byte);
// NOTE(geni): Only match if BOTH first and last bytes match
cmp = _mm_and_si128(cmp, _mm_cmpeq_epi8(chunk_last, last_pattern_byte));
u32 matches = _mm_movemask_epi8(cmp);
// NOTE(geni): Try all matches
while (matches != 0) {
int local_offset = __builtin_ctz(matches);
u64 offset = i + local_offset;
for (u64 j = 0; j < pattern_size; j += 16) {
__m128i chunk = _mm_loadu_si128((__m128i*) (data + offset + j));
__m128i pattern_data = _mm_loadu_si128((__m128i*) (pattern + j));
__m128i mask_data = _mm_loadu_si128((__m128i*) (mask + j));
__m128i match = _mm_and_si128(chunk, mask_data);
match = _mm_cmpeq_epi8(match, pattern_data);
if (_mm_movemask_epi8(match) != 0xFFFF) {
goto again;
}
}
return offset;
again:
// NOTE(geni): Clear match
matches &= ~(1 << local_offset);
}
}
// NOTE(geni): Fall back to scalar method for last bytes
if (i <= data_size - pattern_size) {
u64 result = mask_pattern_match(data + i, data_size - i, mask, pattern, pattern_size);
if (result != 0xFFFFFFFFFFFFFFFF) {
return i + result;
}
}
return 0xFFFFFFFFFFFFFFFF;
}
You might have noticed that this would read out of bounds if pattern_size isn’t a multiple of 16. To overcome this, we assume that the sizes of our pattern and mask buffers are always multiples of 16, and that the buffers themselves are zero-initialized. If these conditions are met, reading past pattern_size is safe: a 0 in the mask buffer represents a wildcard byte, so the outcome won’t change.
You could add checks for the buffer sizes at the beginning of the function, but I didn’t bother here. You could also fall back to non-vectorized code, or perhaps write a small setup routine at the start of the function that copies the pattern and mask into suitably oversized buffers. It’s all up to you.
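To make that assumption concrete, here’s one way to set things up before parsing and scanning; it’s only a sketch, and the buffer size and scan_example wrapper are made up for this example:
enum { PATTERN_BUF_SIZE = 256 };  // NOTE: arbitrary cap, already a multiple of 16

static u64 scan_example(const u8* data, u64 data_size, String8 sig) {
    // NOTE: zero-initialized, so anything past pattern_size acts as a wildcard (mask byte == 0)
    u8 compiled[PATTERN_BUF_SIZE] = {0};
    u8 mask[PATTERN_BUF_SIZE]     = {0};

    u64 pattern_size = parse_pattern(sig, compiled, sizeof(compiled), mask, sizeof(mask));
    if (pattern_size == 0) {
        return 0xFFFFFFFFFFFFFFFF;  // malformed pattern or output buffers too small
    }
    return sse2_pattern_match(data, data_size, mask, compiled, pattern_size);
}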
AVX2
The AVX2 version opens up some interesting optimizations, but for the most part it doesn’t differ much from the SSE2 version.
We’ll begin by making a straightforward “port” of the SSE2 algorithm:
static u64 avx2_dumb_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
const __m256i first_pattern_byte = _mm256_set1_epi8(pattern[0] & mask[0]);
const __m256i last_pattern_byte = _mm256_set1_epi8(pattern[pattern_size - 1] & mask[pattern_size - 1]);
const __m256i first_mask_byte = _mm256_set1_epi8(mask[0]);
const __m256i last_mask_byte = _mm256_set1_epi8(mask[pattern_size - 1]);
u64 i = 0;
for (; i < data_size - pattern_size - 32; i += 32) {
__m256i chunk_first = _mm256_loadu_si256((__m256i*) (data + i));
__m256i chunk_last = _mm256_loadu_si256((__m256i*) (data + i + pattern_size - 1));
chunk_first = _mm256_and_si256(chunk_first, first_mask_byte);
chunk_last = _mm256_and_si256(chunk_last, last_mask_byte);
__m256i cmp = _mm256_cmpeq_epi8(chunk_first, first_pattern_byte);
// NOTE(geni): Only match if BOTH first and last bytes match
cmp = _mm256_and_si256(cmp, _mm256_cmpeq_epi8(chunk_last, last_pattern_byte));
u32 matches = _mm256_movemask_epi8(cmp);
// NOTE(geni): Try all matches
while (matches != 0) {
u32 local_offset = __builtin_ctz(matches);
u64 offset = i + local_offset;
for (u64 j = 0; j < pattern_size; j += 32) {
__m256i chunk = _mm256_loadu_si256((__m256i*) (data + offset + j));
__m256i pattern_data = _mm256_loadu_si256((__m256i*) (pattern + j));
__m256i mask_data = _mm256_loadu_si256((__m256i*) (mask + j));
__m256i match = _mm256_and_si256(chunk, mask_data);
match = _mm256_cmpeq_epi8(match, pattern_data);
if ((u32) _mm256_movemask_epi8(match) != 0xFFFFFFFF) {
goto again;
}
}
return offset;
again:
// NOTE(geni): Clear match
matches &= ~(1 << local_offset);
}
}
// NOTE(geni): Fall back to scalar method for last bytes
if (i <= data_size - pattern_size) {
u64 result = mask_pattern_match(data + i, data_size - i, mask, pattern, pattern_size);
if (result != 0xFFFFFFFFFFFFFFFF) {
return i + result;
}
}
return 0xFFFFFFFFFFFFFFFF;
}
While this already gives us a major speedup just from being able to process twice as much data in one go, we can still make some minor adjustments.
BLSR instruction
As we’re targeting AVX2 processors, we can also safely use instructions from the Bit Manipulation Instruction Set 1 (BMI1) extension.
BMI1 gives us the BLSR r32, r/m32 instruction, which clears the lowest set bit. Combined with TZCNT (which is what __builtin_ctz compiles to when our target supports BMI1), we can iterate through all set bits of an operand pretty quickly.
We’re going to use the BLSR intrinsic manually, but the compiler isn’t dumb; if we had just written matches &= matches - 1; instead of matches &= ~(1 << local_offset), the compiler would emit the BLSR instruction for us. You can see an example of GCC doing this using Compiler Explorer.
u32 matches = _mm256_movemask_epi8(cmp);
// NOTE(geni): Try all matches
while (matches != 0) {
...
again:
// NOTE(geni): Clear match
matches = _blsr_u32(matches);
}
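Outside the scanner, the TZCNT/BLSR pairing on its own looks something like this (a standalone sketch; compile with BMI1 enabled, e.g. -mbmi or -march=haswell):
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned int matches = 0x00008421u;  // pretend this came from _mm256_movemask_epi8
    while (matches != 0) {
        unsigned int local_offset = (unsigned int) __builtin_ctz(matches);  // TZCNT: index of lowest set bit
        printf("candidate at local offset %u\n", local_offset);
        matches = _blsr_u32(matches);  // BLSR: clear the lowest set bit
    }
    return 0;
}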
VPTEST instruction
There’s something more interesting that we can do. AVX introduces the VPTEST instruction, which is described in the Intel Intrinsics Guide as follows:
Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.
Hmm, that sounds quite convenient for our pattern matching logic, if we were to somehow move the bitwise AND to the end… which we totally can! Let’s look at it again:
__m256i match = _mm256_and_si256(chunk, mask_data);
match = _mm256_cmpeq_epi8(match, pattern_data);
if ((u32) _mm256_movemask_epi8(match) != 0xFFFFFFFF) {
goto again;
}
What if we instead XOR-ed chunk with pattern_data then masked it out? We would get something like this:
__m256i zero = _mm256_setzero_si256();
__m256i match = _mm256_xor_si256(chunk, pattern_data);
match = _mm256_and_si256(match, mask_data);
match = _mm256_cmpeq_epi8(match, zero);
if ((u32) _mm256_movemask_epi8(match) != 0xFFFFFFFF) {
goto again;
}
Here’s an updated diagram of the bitwise operations:
This approach seems more inefficient, until we make use of the VPTEST instruction:
__m256i match = _mm256_xor_si256(chunk, pattern_data);
if (!_mm256_testz_si256(match, mask_data)) {
goto again;
}
As VPTEST directly sets ZF, we don’t need to use VPMOVMSKB like before. Whether this approach is actually any faster depends on your architecture, though it should generally be a tad faster.
Analyzing with llvm-mca tells us the VPTEST version is faster on some architectures or basically on par with the original on others. While this type of thing isn’t so black-and-white outside of a vacuum, I think llvm-mca is right here. Here’s the analysis for Skylake:
[0] Code Region - Mask_Compare
Iterations: 100
Instructions: 500
Total Cycles: 132
Total uOps: 500
Dispatch Width: 6
uOps Per Cycle: 3.79
IPC: 3.79
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 vpand ymm4, ymm0, ymm2
1 1 0.50 vpcmpeqb ymm4, ymm4, ymm1
1 2 1.00 vpmovmskb eax, ymm4
1 1 0.25 cmp eax, -1
1 1 0.50 jne .L_fail1
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 1.26 1.25 - - - 1.26 1.23 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - 0.49 - - - 0.51 - - vpand ymm4, ymm0, ymm2
- - 0.25 0.75 - - - - - - vpcmpeqb ymm4, ymm4, ymm1
- - 1.00 - - - - - - - vpmovmskb eax, ymm4
- - - 0.01 - - - 0.75 0.24 - cmp eax, -1
- - 0.01 - - - - - 0.99 - jne .L_fail1
[1] Code Region - XOR_Mask_Compare
Iterations: 100
Instructions: 600
Total Cycles: 159
Total uOps: 600
Dispatch Width: 6
uOps Per Cycle: 3.77
IPC: 3.77
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 vpxor ymm4, ymm0, ymm1
1 1 0.33 vpand ymm4, ymm4, ymm2
1 1 0.50 vpcmpeqb ymm4, ymm4, ymm3
1 2 1.00 vpmovmskb eax, ymm4
1 1 0.25 cmp eax, -1
1 1 0.50 jne .L_fail2
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 1.51 1.50 - - - 1.51 1.48 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - 0.33 0.18 - - - 0.49 - - vpxor ymm4, ymm0, ymm1
- - 0.01 0.48 - - - 0.51 - - vpand ymm4, ymm4, ymm2
- - 0.16 0.84 - - - - - - vpcmpeqb ymm4, ymm4, ymm3
- - 1.00 - - - - - - - vpmovmskb eax, ymm4
- - 0.01 - - - - 0.51 0.48 - cmp eax, -1
- - - - - - - - 1.00 - jne .L_fail2
[2] Code Region - XOR_VPTEST
Iterations: 100
Instructions: 300
Total Cycles: 107
Total uOps: 400
Dispatch Width: 6
uOps Per Cycle: 3.74
IPC: 2.80
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 vpxor ymm4, ymm0, ymm1
2 3 1.00 vptest ymm4, ymm2
1 1 0.50 jne .L_fail3
[0] - SKLDivider
[1] - SKLFPDivider
[2] - SKLPort0
[3] - SKLPort1
[4] - SKLPort2
[5] - SKLPort3
[6] - SKLPort4
[7] - SKLPort5
[8] - SKLPort6
[9] - SKLPort7
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
- - 1.01 0.99 - - - 1.01 0.99 -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions:
- - - 0.99 - - - 0.01 - - vpxor ymm4, ymm0, ymm1
- - 1.00 - - - - 1.00 - - vptest ymm4, ymm2
- - 0.01 - - - - - 0.99 - jne .L_fail3
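For reference, reports like the ones above come from feeding llvm-mca the candidate sequences as assembly, each wrapped in its region markers; an invocation roughly like this reproduces them (the file name is assumed, and only the first region is shown):
# kernels.s -- analyzed with something like: llvm-mca -mcpu=skylake kernels.s
.intel_syntax noprefix

# LLVM-MCA-BEGIN Mask_Compare
vpand     ymm4, ymm0, ymm2
vpcmpeqb  ymm4, ymm4, ymm1
vpmovmskb eax, ymm4
cmp       eax, -1
jne       .L_fail1
# LLVM-MCA-END
.L_fail1: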
Here is our final function:
static u64 avx2_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
const __m256i first_pattern_byte = _mm256_set1_epi8(pattern[0] & mask[0]);
const __m256i last_pattern_byte = _mm256_set1_epi8(pattern[pattern_size - 1] & mask[pattern_size - 1]);
const __m256i first_mask_byte = _mm256_set1_epi8(mask[0]);
const __m256i last_mask_byte = _mm256_set1_epi8(mask[pattern_size - 1]);
u64 i = 0;
for (; i < data_size - pattern_size - 32; i += 32) {
__m256i chunk_first = _mm256_loadu_si256((__m256i*) (data + i));
__m256i chunk_last = _mm256_loadu_si256((__m256i*) (data + i + pattern_size - 1));
chunk_first = _mm256_and_si256(chunk_first, first_mask_byte);
chunk_last = _mm256_and_si256(chunk_last, last_mask_byte);
__m256i cmp = _mm256_cmpeq_epi8(chunk_first, first_pattern_byte);
// NOTE(geni): Only match if BOTH first and last bytes match.
cmp = _mm256_and_si256(cmp, _mm256_cmpeq_epi8(chunk_last, last_pattern_byte));
u32 matches = _mm256_movemask_epi8(cmp);
// NOTE(geni): Try all matches
while (matches != 0) {
// NOTE(geni): Find first matching byte
u32 local_offset = __builtin_ctz(matches);
u64 offset = i + local_offset;
for (u32 j = 0; j < pattern_size; j += 32) {
__m256i chunk = _mm256_loadu_si256((__m256i*) (data + offset + j));
__m256i pattern_data = _mm256_loadu_si256((__m256i*) (pattern + j));
__m256i mask_data = _mm256_loadu_si256((__m256i*) (mask + j));
__m256i match = _mm256_xor_si256(chunk, pattern_data);
if (!_mm256_testz_si256(match, mask_data)) {
goto again;
}
}
return offset;
again:
// NOTE(geni): Clear match
matches = _blsr_u32(matches);
}
}
// NOTE(geni): Fall back to scalar method for last bytes
if (i <= data_size - pattern_size) {
u64 result = mask_pattern_match(data + i, data_size - i, mask, pattern, pattern_size);
if (result != 0xFFFFFFFFFFFFFFFF) {
return i + result;
}
}
return 0xFFFFFFFFFFFFFFFF;
}
Performance
While further optimizations are certainly possible, we won’t be going any further today. I’m very far from a SIMD expert, but maybe I’ll revisit this problem one day.
Enough talking; let’s see the numbers. We’ll be testing on an Alder Lake i5-12600K.
The results were averaged over 5 runs of 10000 iterations on a data sample of 5509808 bytes, using a 92-byte signature containing 4 wildcards.
| Method | vs Naive | vs Masked | vs SSE2 | vs AVX2 dumb | vs AVX2 |
|---|---|---|---|---|---|
| Naive | 1.00x | 0.55x | 0.05x | 0.02x | 0.02x |
| Masked | 1.82x | 1.00x | 0.08x | 0.04x | 0.04x |
| SSE2 | 21.71x | 11.96x | 1.00x | 0.53x | 0.52x |
| AVX2 dumb | 41.08x | 22.62x | 1.89x | 1.00x | 0.99x |
| AVX2 | 41.63x | 22.92x | 1.92x | 1.01x | 1.00x |
Pretty acceptable results if you ask me. I haven’t had time to do more benchmarks; I’ll get around to it eventually (surely…).
Single-header library
I’ve put together a single-header library using the algorithms described in this post, available here: https://github.com/geniiii/simdscanner