
Fast signature scanning with SIMD instructions

Introduction

Ever thought “my signature scanner takes too long!”? Well, you’re probably alone.

We’re going to write a vectorized algorithm for signature scanning; not really out of necessity, but more so for fun.

I’ve also seen some codebases use algorithms that have the potential to be considered crimes against humanity, so I’d at least rather they copy-paste from here than copy-paste from UC or MPGH.

Signature Scanning

Signature scanning is a method of finding the address of a specific chunk of code in a binary. Let’s say you have a binary example.exe and you’ve found a function sub_1403a9ec8 that you want to call or hook. How would you refer to it?

The first thing you probably think of is to just hardcode the offset of the function. In a hypothetical scenario where sub_1403a9ec8 sits at offset 0x3A9EC8, you just hardcode the offset as example.exe+3A9EC8. Simple, right?

Well, what happens then if example.exe receives an update? That offset will change because all the bytes in the executable have now moved around.

The most common approach to getting around this is “signature scanning”, where you find a unique signature for the chunk of code you want an offset for.

Let’s try to visualize it:

updateWorldTitle proc near
    push    rbx                    ; 40 53
    push    rsi                    ; 56
    push    rdi                    ; 57
    sub     rsp, 30h               ; 48 83 EC 30
    lea     rcx, [r8+0B0h]         ; 49 8D 88 B0 00 00 00
    mov     rdi, r8                ; 49 8B F8
    mov     rbx, rdx               ; 48 8B DA
    call    cs:QString_clear       ; FF 15 A5 0B 38 00 
    lea     rcx, [rbx+30h]         ; 48 8D 4B 30 
    call    GetNetState            ; E8 FC FD 1A 00
    lea     rcx, [rsp+20h]         ; 48 8D 4C 24 20
    movzx   ebx, al                ; 0F B6 D8
    call    cs:QString_ctor        ; FF 15 26 13 38 00 
    mov     ecx, ebx               ; 8B CB
    test    bl, bl                 ; 84 DB
    jz      short loc_fail         ; 74 23
    sub     ecx, 1                 ; 83 E9 01
    jz      short loc_case_1       ; 74 0A
    sub     ecx, 1                 ; 83 E9 01
    jz      short loc_case_2       ; 74 0C
    cmp     ecx, 1                 ; 83 F9 01
    jnz     short loc_default      ; 75 3E
    ...
updateWorldTitle endp

If we were to look for a unique signature for the prologue of this function, we’d find:

40 53 56 57 48 83 EC ? 49 8D 88

This is a function signature: a sequence of bytes that can uniquely identify the prologue of the function.
You might be asking yourself: what is the question mark there for? Let’s first determine the purpose of the instruction whose byte we’ve replaced with a question mark.

48 83 EC 30    sub rsp, 30h

This instruction sets up the stack frame; hence, the wildcarded operand byte is the size of the function’s stack frame. Our signature assumes that this operand byte might change in a future build, and therefore wildcards it.

Now, whether this is a good guess or not is up to the person creating the signature. There are times when this is basically a necessity, like if our signature included instructions that loaded the address of a string constant into a register.
Such operations will usually use something called RIP-relative addressing, meaning their operands are offsets relative to the program counter. For example, the operand bytes of any of the call instructions in the function disassembly shown above will most likely change across builds.
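
To make that concrete, take the call to GetNetState in the listing above. E8 is a call with a 32-bit displacement relative to the next instruction, so its operand bytes are nothing more than a distance:

E8 FC FD 1A 00         call    GetNetState
^  ^^^^^^^^^^^
|  rel32 = 0x001AFDFC (little-endian)
opcode

target = address of the next instruction + rel32

If a new build moves GetNetState or the call site even slightly, that displacement changes, which is why operand bytes like these are prime wildcard candidates.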

Oh, and actually, that signature was generated by an IDA plugin. Other plugins will give different results due to having different heuristics. Here’s an arguably better signature, depending on who you ask:

40 53 56 57 48 83 EC 30 49

There are many ways to represent these signatures, but the format shown above is very commonly found. Due to its relative popularity, this is the grammar we will be supporting in our signature scanner today. Here’s an EBNF representation in case you’re enough of a nerd to care. I didn’t really put much effort into it, sorry. I don’t even know if it’s correct.

Signature ::= Byte { " " Byte }
Byte      ::= HexPair | "?" | "??"
HexPair   ::= HexDigit HexDigit
HexDigit  ::= [0-9] | [a-f] | [A-F]

The gist of it is:

  1. Bytes must be pairs of hexadecimal digits or wildcards (no single-nibble wildcards! nobody does that, because… why would you?)
  2. Bytes should be separated by spaces, but our parser won’t be strict about it
  3. Wildcards can be one or two question marks
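
For example, under this grammar the following all describe the same pattern:

48 8B ? ? E8
48 8B ?? ?? E8
48 8b ?? ?? e8

Something like 4? 8B, on the other hand, is not valid: a wildcard always covers a full byte.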

Since we’ve now defined our problem and our constraints, it’s time to get to the part where we actually write anything.

Algorithms

We’ll start off by writing the most naive possible algorithm. It’s going to be the absolute first thing you think of writing, with no regard to performance. And if you think of an optimization mid-way through writing it (spoiler: you will), do it later.

The implementations below must return the highest possible 64-bit unsigned integer (0xFFFFFFFFFFFFFFFF) if a match was not found; otherwise they must return the matching offset.
Also, safety and sanity checks be damned.

Naive

static u64 naive_pattern_match(const u8* buffer, const u64 buffer_size, const String8 pattern) {
    for (u64 i = 0; i < buffer_size; ++i) {
        u64 match_offset = 0;
        u64 p_cursor     = 0;

        while (p_cursor < pattern.size) {
            char c = pattern.s[p_cursor];

            if (c == ' ') {
                p_cursor++;
                continue;
            }

            if (c == '?') {
                p_cursor++;
                // NOTE(geni): Consume second ? if necessary
                if (p_cursor < pattern.size && pattern.s[p_cursor] == '?') {
                    p_cursor++;
                }
                match_offset++;
                continue;
            }

            u8 byte_in_buffer  = buffer[i + match_offset];
            u8 byte_in_pattern = hexpair_to_u8(pattern.s + p_cursor);
            if (byte_in_buffer != byte_in_pattern) {
                goto again;
            }

            match_offset++;
            p_cursor += 2;
        }

        return i;
        again:;
    }

    return 0xFFFFFFFFFFFFFFFF;
}
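
One thing worth pointing out: the u8/u32/u64 typedefs, String8, and hexpair_to_u8 aren’t shown anywhere in this post. If you want to follow along, something like the following is enough; this is just a minimal sketch of what I’m assuming they look like, not the exact definitions.

#include <stdint.h>

typedef uint8_t  u8;
typedef uint32_t u32;
typedef uint64_t u64;

typedef struct {
    u8* s;    // NOTE: Not necessarily null-terminated
    u64 size;
} String8;

// NOTE: Converts two hex characters (e.g. "4F") into a byte; assumes valid input
static u8 hexpair_to_u8(const u8* s) {
    u8 result = 0;
    for (int k = 0; k < 2; ++k) {
        u8 c      = s[k];
        u8 nibble = (c >= '0' && c <= '9') ? (u8) (c - '0')
                  : (c >= 'a' && c <= 'f') ? (u8) (c - 'a' + 10)
                  :                          (u8) (c - 'A' + 10);

        result = (u8) ((result << 4) | nibble);
    }
    return result;
}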

There’s nothing complicated about this algorithm, though it’s kind of annoying due to having to keep track of two cursors.

This algorithm is very inefficient. Its most glaring inefficiency is having to parse the string inside of the hot loop.

My first idea, probably yours and apparently everyone else’s too, is to take the string parsing out of the pattern matching algorithm and do it as part of the setup, using a byte mask to mark the wildcard bytes. Pretty much everyone comes up with the same solution for this:

  1. Parse the non-wildcard bytes into a byte array pattern, where wildcards will be left as 0
  2. Parse the wildcards into a byte array mask, where wildcard bytes will be 0 and non-wildcard bytes will be FF
  3. For each byte i in data:
    • For each byte j in pattern:
      • If data[i + j] AND mask[j] doesn’t equal pattern[j]:
        • Return to step 3.
    • Return offset i

Or since that was most likely an awful explanation, here’s a diagram that’ll maybe help:

data:           48 A9 30 A8 E8 6C
mask:           FF 00 00 FF FF FF
pattern:        48 00 00 A8 E8 6C

data & mask:    48 00 00 A8 E8 6C
(data & mask) == pattern  ->  match (1)

This diagram may make the vectorization opportunity quite obvious. In fact, in case you haven’t made the connection yet, our problem starts to look a lot like memmem or strstr in a way.

If you think it’d be fun, I suggest not scrolling any farther and trying your hand at writing an algorithm yourself. And if your design sucks, you can look at SIMD implementations of the two aforementioned functions to get some ideas.

Masked

Parsing

static u64 parse_pattern(const String8 pattern, u8* compiled_out, u64 compiled_out_size, u8* mask_out, u64 mask_out_size) {
    u8* cur = pattern.s;
    u8* end = pattern.s + pattern.size;

    u64 i = 0;
    while (cur != end) {
        if (i >= mask_out_size || i >= compiled_out_size) {
            return 0;
        }

        // NOTE(geni): Skip spaces
        if (*cur == ' ') {
            ++cur;
            continue;
        }

        // NOTE(geni): Handle wildcards
        if (*cur == '?') {
            mask_out[i]     = 0;
            compiled_out[i] = 0;
            ++i;
            ++cur;

            if (cur < end && *cur == '?') {
                ++cur;
            }
            continue;
        }

        // NOTE(geni): Malformed pattern
        if (cur + 1 >= end) {
            return 0;
        }

        // NOTE(geni): Handle bytes
        u8 byte         = hexpair_to_u8(cur);
        mask_out[i]     = 0xFF;
        compiled_out[i] = byte;

        cur += 2;
        ++i;
    }
    return i;
}

Here we have a basic parser that roughly adheres to the grammar described above. Spaces are optional but whatever, it’s close enough. It’s a very self-explanatory algorithm so I’ll spare you the details.

There is no point in vectorizing this. Patterns are never large enough for vectorization to make a meaningful difference… but it could be fun to do, so maybe I’ll do it as an appendix to this post one day.
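
For reference, here’s what the parser produces for a small signature. This is a hypothetical example, but the values follow directly from the code above.

u8 compiled[16] = {0};
u8 mask[16]     = {0};

const char sig_str[] = "E8 ? ? ? ? 48 8B";
String8    sig       = {(u8*) sig_str, sizeof(sig_str) - 1};

u64 n = parse_pattern(sig, compiled, sizeof(compiled), mask, sizeof(mask));
// n        == 7
// compiled == E8 00 00 00 00 48 8B
// mask     == FF 00 00 00 00 FF FF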

Algorithm

Here’s our first implementation of the algorithm we described above:

static u64 mask_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
    for (u64 i = 0; i <= data_size - pattern_size; ++i) {
        for (u32 j = 0; j < pattern_size; ++j) {
            if ((data[i + j] & mask[j]) != pattern[j]) {
                goto again;
            }
        }
        return i;
    again:;
    }

    return 0xFFFFFFFFFFFFFFFF;
}

This, of course, can be optimized. If we go back to the naive algorithm, its second glaring inefficiency is that it attempts a full match at every byte offset in the data until it finds one! There is no heuristic to perform fewer matching attempts.

The simplest optimization you can make is to branch out of the loop if the first byte doesn’t match.

This won’t actually make much of a difference, as the loop below will basically do just that, just with more instructions:

for (u32 j = 0; j < pattern_size; ++j) {
    if ((data[i + j] & mask[j]) != pattern[j]) {
        goto again;
    }
}

But there’s another easy optimization we can make. If the first byte matches, we can also check the last byte of the pattern: compare data[i + pattern_size - 1] against pattern[pattern_size - 1]. If it doesn’t match, don’t bother checking the rest of the bytes.

Both optimizations can be seen here:

static u64 mask_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
    for (u64 i = 0; i <= data_size - pattern_size; ++i) {
        // NOTE(geni): First byte check
        if (data[i] != pattern[0] && mask[0] == 0xFF) {
            continue;
        }
        // NOTE(geni): Last byte check
        if (data[i + pattern_size - 1] != pattern[pattern_size - 1] && mask[pattern_size - 1] == 0xFF) {
            continue;
        }
        for (u32 j = 0; j < pattern_size; ++j) {
            if ((data[i + j] & mask[j]) != pattern[j]) {
                goto again;
            }
        }
        return i;
    again:;
    }

    return 0xFFFFFFFFFFFFFFFF;
}

This is considerably faster than the naive version, but we can do better.

It’s an embarrassingly parallel problem, but rather than parallelizing it with threads, we can use SIMD instructions. This also has the obvious benefit of speeding up a multi-threaded version, should we ever implement one later.

SSE2

We’ll begin by making an SSE2 version, which will basically be a direct 1:1 vectorized port of mask_pattern_match.
But before we even get started, you might be asking: how exactly will we replicate the first and last byte matching heuristic? It’s actually quite simple.

Our algorithm will iterate through data in 16-byte chunks.

  1. We broadcast the masked first and last bytes of the pattern to the XMM registers xmm0 and xmm1.
  2. As we don’t know beforehand whether the pattern’s first or last byte is a wildcard, we must also broadcast the first and last byte of the mask to the XMM registers xmm2 and xmm3.
  3. Begin iterating through the data in 16-byte chunks, with the current index being i.
    1. For our first byte check, load the 16 bytes at data[i] into xmm4.
    2. Bitwise AND xmm4 with xmm2 to mask out wildcards and store the result in xmm4.
    3. Compare each byte in xmm4 with each byte in xmm0 and store the result in xmm6. This register now marks which bytes of the chunk match our first byte.
    4. For our last byte check, load the 16 bytes at data[i + pattern_size - 1] into xmm5.
    5. Bitwise AND xmm5 with xmm3 to mask out wildcards and store the result in xmm5.
    6. Compare each byte in xmm5 with each byte in xmm1 and store the result in xmm7. This register now marks which bytes of the chunk match our last byte.
    7. Bitwise AND xmm6 with xmm7 and store the result in xmm8. This register now marks the local offsets where both the first and last byte match.

If you didn’t understand, I don’t blame you. I suck at explaining these things, so here’s a diagram I made:

Setup:
    xmm0 = broadcast(Pattern P0)        ; first pattern byte, pre-masked
    xmm1 = broadcast(Pattern Pn)        ; last pattern byte, pre-masked
    xmm2 = broadcast(Mask M0)           ; first mask byte
    xmm3 = broadcast(Mask Mn)           ; last mask byte

First Byte Check:
    xmm4 = Data @ i
    xmm4 = xmm4 & xmm2                  ; masked
    xmm6 = xmm4 == xmm0                 ; matches for P0

Last Byte Check:
    xmm5 = Data @ i+len-1
    xmm5 = xmm5 & xmm3                  ; masked
    xmm7 = xmm5 == xmm1                 ; matches for Pn

    xmm8 = xmm6 & xmm7                  ; final matches

If you still don’t get it, just see the implementation below.

The rest is fairly straightforward. We iterate through all the matches and do our comparison virtually the exact same way we did in our non-vectorized version, just on many bytes at once this time.

Below is our implementation:

static u64 sse2_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
    const __m128i first_pattern_byte = _mm_set1_epi8(pattern[0] & mask[0]);
    const __m128i last_pattern_byte  = _mm_set1_epi8(pattern[pattern_size - 1] & mask[pattern_size - 1]);

    const __m128i first_mask_byte = _mm_set1_epi8(mask[0]);
    const __m128i last_mask_byte  = _mm_set1_epi8(mask[pattern_size - 1]);

    u64 i = 0;
    for (; i < data_size - pattern_size - 16; i += 16) {
        __m128i chunk_first = _mm_loadu_si128((__m128i*) (data + i));
        __m128i chunk_last  = _mm_loadu_si128((__m128i*) (data + i + pattern_size - 1));
        chunk_first         = _mm_and_si128(chunk_first, first_mask_byte);
        chunk_last          = _mm_and_si128(chunk_last, last_mask_byte);

        __m128i cmp = _mm_cmpeq_epi8(chunk_first, first_pattern_byte);
        // NOTE(geni): Only match if BOTH first and last bytes match
        cmp         = _mm_and_si128(cmp, _mm_cmpeq_epi8(chunk_last, last_pattern_byte));

        u32 matches = _mm_movemask_epi8(cmp);
        // NOTE(geni): Try all matches
        while (matches != 0) {
            int local_offset = __builtin_ctz(matches);
            u64 offset        = i + local_offset;

            for (u64 j = 0; j < pattern_size; j += 16) {
                __m128i chunk = _mm_loadu_si128((__m128i*) (data + offset + j));

                __m128i pattern_data = _mm_loadu_si128((__m128i*) (pattern + j));
                __m128i mask_data    = _mm_loadu_si128((__m128i*) (mask + j));

                __m128i match = _mm_and_si128(chunk, mask_data);
                match        = _mm_cmpeq_epi8(match, pattern_data);

                if (_mm_movemask_epi8(match) != 0xFFFF) {
                    goto again;
                }
            }
            return offset;
        again:
            // NOTE(geni): Clear match
            matches &= ~(1 << local_offset);
        }
    }

    // NOTE(geni): Fall back to scalar method for last bytes
    if (i <= data_size - pattern_size) {
        u64 result = mask_pattern_match(data + i, data_size - i, mask, pattern, pattern_size);
        if (result != 0xFFFFFFFFFFFFFFFF) {
            return i + result;
        }
    }
    return 0xFFFFFFFFFFFFFFFF;
}

You might have noticed that this would read out-of-bounds if pattern_size isn’t a multiple of 16. To overcome this, we make the assumption that the sizes of our pattern and mask buffers are always multiples of 16, and that the buffers themselves are zero-initialized. If these conditions are met, reading past pattern_size is safe because 0s in the mask buffer represent a wildcard byte, hence the outcome won’t change.

You could add checks for the buffer sizes at the beginning of the function, but I didn’t bother here. You could also write non-vectorized code as a fallback, or perhaps a small setup routine at the start of the function that copies the pattern and mask into oversized buffers. All up to you.
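
In practice that just means the caller sets things up roughly like this. This is only a sketch under the assumptions above: the 256-byte cap is an arbitrary choice for the example, and data/data_size stand for whatever memory you’re scanning.

u8 compiled[256] = {0};    // NOTE: Zero-initialized, sizes are a multiple of 16
u8 mask[256]     = {0};

const char sig_str[] = "40 53 56 57 48 83 EC ? 49 8D 88";
String8    sig       = {(u8*) sig_str, sizeof(sig_str) - 1};

u64 n = parse_pattern(sig, compiled, sizeof(compiled), mask, sizeof(mask));
if (n != 0) {
    u64 offset = sse2_pattern_match(data, data_size, mask, compiled, n);
    if (offset != 0xFFFFFFFFFFFFFFFF) {
        // NOTE: Match found at data + offset
    }
}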

AVX2

The AVX2 version has some interesting optimizations that we can make, but for the most part doesn’t differ much from the SSE2 version.

We’ll begin by making a straightforward “port” of the SSE2 algorithm:

static u64 avx2_dumb_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
    const __m256i first_pattern_byte = _mm256_set1_epi8(pattern[0] & mask[0]);
    const __m256i last_pattern_byte  = _mm256_set1_epi8(pattern[pattern_size - 1] & mask[pattern_size - 1]);

    const __m256i first_mask_byte = _mm256_set1_epi8(mask[0]);
    const __m256i last_mask_byte  = _mm256_set1_epi8(mask[pattern_size - 1]);

    u64 i = 0;
    for (; i < data_size - pattern_size - 32; i += 32) {
        __m256i chunk_first = _mm256_loadu_si256((__m256i*) (data + i));
        __m256i chunk_last  = _mm256_loadu_si256((__m256i*) (data + i + pattern_size - 1));
        chunk_first         = _mm256_and_si256(chunk_first, first_mask_byte);
        chunk_last          = _mm256_and_si256(chunk_last, last_mask_byte);

        __m256i cmp = _mm256_cmpeq_epi8(chunk_first, first_pattern_byte);
        // NOTE(geni): Only match if BOTH first and last bytes match
        cmp         = _mm256_and_si256(cmp, _mm256_cmpeq_epi8(chunk_last, last_pattern_byte));

        u32 matches = _mm256_movemask_epi8(cmp);
        // NOTE(geni): Try all matches
        while (matches != 0) {
            u32 local_offset = __builtin_ctz(matches);
            u64 offset        = i + local_offset;

            for (u64 j = 0; j < pattern_size; j += 32) {
                __m256i chunk = _mm256_loadu_si256((__m256i*) (data + offset + j));

                __m256i pattern_data = _mm256_loadu_si256((__m256i*) (pattern + j));
                __m256i mask_data    = _mm256_loadu_si256((__m256i*) (mask + j));

                __m256i match = _mm256_and_si256(chunk, mask_data);
                match        = _mm256_cmpeq_epi8(match, pattern_data);

                if ((u32) _mm256_movemask_epi8(match) != 0xFFFFFFFF) {
                    goto again;
                }
            }
            return offset;
        again:
            // NOTE(geni): Clear match
            matches &= ~(1u << local_offset);
        }
    }

    // NOTE(geni): Fall back to scalar method for last bytes
    if (i <= data_size - pattern_size) {
        u64 result = mask_pattern_match(data + i, data_size - i, mask, pattern, pattern_size);
        if (result != 0xFFFFFFFFFFFFFFFF) {
            return i + result;
        }
    }
    return 0xFFFFFFFFFFFFFFFF;
}

The same disclaimer regarding OOB in the SSE2 version also applies here, albeit with multiples of 32 rather than multiples of 16.

While this already gives us a major speedup just from being able to process twice the amount of data in one go, we can make some minor adjustments.

BLSR instruction

As we’re targeting AVX2 processors, we can also safely use instructions from the Bit Manipulation Instruction Set 1 (BMI1) extension.

BMI1 gives us the BLSR r32, r/m32 instruction, which clears the lowest set bit. When combined with TZCNT (which is what __builtin_ctz will compile to if our target supports BMI1), we can pretty quickly iterate through all set bits in a given operand.
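
Here’s what one step of that looks like on a hypothetical match mask:

matches        = 0b10010100
tzcnt(matches) = 2             ; first match at local offset 2
blsr(matches)  = 0b10010000    ; lowest set bit cleared, on to the next match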

We’re going to manually use the intrinsic for BLSR, but the compiler isn’t dumb; if we had just used matches &= matches - 1; instead of matches &= ~(1 << local_offset) the compiler would output the BLSR instruction for us. You can see an example of GCC doing this using Compiler Explorer.

    u32 matches = _mm256_movemask_epi8(cmp);
    // NOTE(geni): Try all matches
    while (matches != 0) {
        ...
    again:
        // NOTE(geni): Clear match
        matches = _blsr_u32(matches);
    }

VPTEST instruction

There’s something more interesting that we can do. AVX introduces the VPTEST instruction, which is described in the Intel Intrinsic Guide as follows:

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.

Hmm, that sounds quite convenient for our pattern matching logic, if we were to somehow move the bitwise AND to the end… which we totally can! Let’s look at it again:

__m256i match = _mm256_and_si256(chunk, mask_data);
match         = _mm256_cmpeq_epi8(match, pattern_data);

if ((u32) _mm256_movemask_epi8(match) != 0xFFFFFFFF) {
    goto again;
}

What if we instead XOR-ed chunk with pattern_data and then masked the result? Since parse_pattern already zeroes the wildcard bytes of pattern, (data & mask) == pattern holds exactly when ((data ^ pattern) & mask) == 0. We would get something like this:

__m256i zero = _mm256_setzero_si256();

__m256i match = _mm256_xor_si256(chunk, pattern_data);
match         = _mm256_and_si256(match, mask_data);
match         = _mm256_cmpeq_epi8(match, zero);

if ((u32) _mm256_movemask_epi8(match) != 0xFFFFFFFF) {
    goto again;
}

Here’s an updated diagram of the bitwise operations:

data:                      48 A9 30 A8 E8 6C
pattern:                   48 00 00 A8 E8 6C
mask:                      FF 00 00 FF FF FF

data ^ pattern:            00 A9 30 00 00 00
(data ^ pattern) & mask:   00 00 00 00 00 00
== 0  ->  match (1)

This approach seems more inefficient, until we make use of the VPTEST instruction:

__m256i match = _mm256_xor_si256(chunk, pattern_data);
if (!_mm256_testz_si256(match, mask_data)) {
    goto again;
}

As VPTEST directly sets ZF, we don’t need to use VPMOVMSKB like before. Whether this approach is actually any faster depends on your architecture, though it should generally be a tad faster.

Analyzing with llvm-mca tells us the VPTEST version is faster on some architectures or basically on par with the original on others. While this type of thing isn’t so black-and-white outside of a vacuum, I think llvm-mca is right here. Here’s the analysis for Skylake:

[0] Code Region - Mask_Compare

Iterations:        100
Instructions:      500
Total Cycles:      132
Total uOps:        500

Dispatch Width:    6
uOps Per Cycle:    3.79
IPC:               3.79
Block RThroughput: 1.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        vpand ymm4, ymm0, ymm2
 1      1     0.50                        vpcmpeqb      ymm4, ymm4, ymm1
 1      2     1.00                        vpmovmskb     eax, ymm4
 1      1     0.25                        cmp   eax, -1
 1      1     0.50                        jne   .L_fail1

[0]   - SKLDivider
[1]   - SKLFPDivider
[2]   - SKLPort0
[3]   - SKLPort1
[4]   - SKLPort2
[5]   - SKLPort3
[6]   - SKLPort4
[7]   - SKLPort5
[8]   - SKLPort6
[9]   - SKLPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.26   1.25    -      -      -     1.26   1.23    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     0.49    -      -      -     0.51    -      -     vpand     ymm4, ymm0, ymm2
 -      -     0.25   0.75    -      -      -      -      -      -     vpcmpeqb  ymm4, ymm4, ymm1
 -      -     1.00    -      -      -      -      -      -      -     vpmovmskb eax, ymm4
 -      -      -     0.01    -      -      -     0.75   0.24    -     cmp       eax, -1
 -      -     0.01    -      -      -      -      -     0.99    -     jne       .L_fail1

[1] Code Region - XOR_Mask_Compare

Iterations:        100
Instructions:      600
Total Cycles:      159
Total uOps:        600

Dispatch Width:    6
uOps Per Cycle:    3.77
IPC:               3.77
Block RThroughput: 1.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        vpxor ymm4, ymm0, ymm1
 1      1     0.33                        vpand ymm4, ymm4, ymm2
 1      1     0.50                        vpcmpeqb      ymm4, ymm4, ymm3
 1      2     1.00                        vpmovmskb     eax, ymm4
 1      1     0.25                        cmp   eax, -1
 1      1     0.50                        jne   .L_fail2

[0]   - SKLDivider
[1]   - SKLFPDivider
[2]   - SKLPort0
[3]   - SKLPort1
[4]   - SKLPort2
[5]   - SKLPort3
[6]   - SKLPort4
[7]   - SKLPort5
[8]   - SKLPort6
[9]   - SKLPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.51   1.50    -      -      -     1.51   1.48    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.33   0.18    -      -      -     0.49    -      -     vpxor     ymm4, ymm0, ymm1
 -      -     0.01   0.48    -      -      -     0.51    -      -     vpand     ymm4, ymm4, ymm2
 -      -     0.16   0.84    -      -      -      -      -      -     vpcmpeqb  ymm4, ymm4, ymm3
 -      -     1.00    -      -      -      -      -      -      -     vpmovmskb eax, ymm4
 -      -     0.01    -      -      -      -     0.51   0.48    -     cmp       eax, -1
 -      -      -      -      -      -      -      -     1.00    -     jne       .L_fail2

[2] Code Region - XOR_VPTEST

Iterations:        100
Instructions:      300
Total Cycles:      107
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    3.74
IPC:               2.80
Block RThroughput: 1.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        vpxor ymm4, ymm0, ymm1
 2      3     1.00                        vptest        ymm4, ymm2
 1      1     0.50                        jne   .L_fail3

[0]   - SKLDivider
[1]   - SKLFPDivider
[2]   - SKLPort0
[3]   - SKLPort1
[4]   - SKLPort2
[5]   - SKLPort3
[6]   - SKLPort4
[7]   - SKLPort5
[8]   - SKLPort6
[9]   - SKLPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.01   0.99    -      -      -     1.01   0.99    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     0.99    -      -      -     0.01    -      -     vpxor     ymm4, ymm0, ymm1
 -      -     1.00    -      -      -      -     1.00    -      -     vptest    ymm4, ymm2
 -      -     0.01    -      -      -      -      -     0.99    -     jne       .L_fail3
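
In case you want to reproduce this kind of analysis yourself: llvm-mca takes an assembly listing where the interesting snippets are wrapped in special region-marker comments, and the target CPU is selected with -mcpu. Something along these lines works; this is a sketch, not the exact invocation used for the numbers above.

# regions.s
.intel_syntax noprefix

# LLVM-MCA-BEGIN Mask_Compare
vpand     ymm4, ymm0, ymm2
vpcmpeqb  ymm4, ymm4, ymm1
vpmovmskb eax, ymm4
cmp       eax, -1
jne       .L_fail1
# LLVM-MCA-END
.L_fail1:

$ llvm-mca -mcpu=skylake regions.s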

Here is our final function:

static u64 avx2_pattern_match(const u8* data, const u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
    const __m256i first_pattern_byte = _mm256_set1_epi8(pattern[0] & mask[0]);
    const __m256i last_pattern_byte  = _mm256_set1_epi8(pattern[pattern_size - 1] & mask[pattern_size - 1]);

    const __m256i first_mask_byte = _mm256_set1_epi8(mask[0]);
    const __m256i last_mask_byte  = _mm256_set1_epi8(mask[pattern_size - 1]);

    u64 i = 0;
    for (; i < data_size - pattern_size - 32; i += 32) {
        __m256i chunk_first = _mm256_loadu_si256((__m256i*) (data + i));
        __m256i chunk_last  = _mm256_loadu_si256((__m256i*) (data + i + pattern_size - 1));
        chunk_first         = _mm256_and_si256(chunk_first, first_mask_byte);
        chunk_last          = _mm256_and_si256(chunk_last, last_mask_byte);

        __m256i cmp = _mm256_cmpeq_epi8(chunk_first, first_pattern_byte);
        // NOTE(geni): Only match if BOTH first and last bytes match.
        cmp = _mm256_and_si256(cmp, _mm256_cmpeq_epi8(chunk_last, last_pattern_byte));

        u32 matches = _mm256_movemask_epi8(cmp);
        // NOTE(geni): Try all matches
        while (matches != 0) {
            // NOTE(geni): Find first matching byte
            u32 local_offset = __builtin_ctz(matches);
            u64 offset       = i + local_offset;

            for (u32 j = 0; j < pattern_size; j += 32) {
                __m256i chunk = _mm256_loadu_si256((__m256i*) (data + offset + j));

                __m256i pattern_data = _mm256_loadu_si256((__m256i*) (pattern + j));
                __m256i mask_data    = _mm256_loadu_si256((__m256i*) (mask + j));

                __m256i match = _mm256_xor_si256(chunk, pattern_data);
                if (!_mm256_testz_si256(match, mask_data)) {
                    goto again;
                }
            }
            return offset;
        again:
            // NOTE(geni): Clear match
            matches = _blsr_u32(matches);
        }
    }

    // NOTE(geni): Fall back to scalar method for last bytes
    if (i <= data_size - pattern_size) {
        u64 result = mask_pattern_match(data + i, data_size - i, mask, pattern, pattern_size);
        if (result != 0xFFFFFFFFFFFFFFFF) {
            return i + result;
        }
    }
    return 0xFFFFFFFFFFFFFFFF;
}

Performance

While further optimizations are certainly possible, we won’t be going any further today. I’m very far from a SIMD expert, but maybe I’ll revisit this problem one day.

Enough talking; let’s see the numbers. We’ll be testing on an Alder Lake i5-12600K.
The results were averaged over 5 runs of 10,000 iterations on a data sample of 5,509,808 bytes, using a 92-byte signature containing 4 wildcards.

Method       vs Naive   vs Masked   vs SSE2   vs AVX2 dumb   vs AVX2
Naive        1.00x      0.55x       0.05x     0.02x          0.02x
Masked       1.82x      1.00x       0.08x     0.04x          0.04x
SSE2         21.71x     11.96x      1.00x     0.53x          0.52x
AVX2 dumb    41.08x     22.62x      1.89x     1.00x          0.99x
AVX2         41.63x     22.92x      1.92x     1.01x          1.00x

Pretty acceptable result if you ask me. I haven’t had time to do more benchmarks; I’ll get around to it eventually (surely…).
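
If you want numbers on your own machine, a timing loop along these lines is all it takes. This is a hypothetical harness assuming POSIX clock_gettime, not the exact one used for the table above.

#include <stdio.h>
#include <time.h>

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

static void bench(const char* name, u64 (*scan)(const u8*, u64, u8*, u8*, u64),
                  const u8* data, u64 data_size, u8* mask, u8* pattern, u64 pattern_size) {
    const int    iterations = 10000;
    volatile u64 sink       = 0;    // NOTE: Keeps the compiler from optimizing the scans away

    double start = now_seconds();
    for (int it = 0; it < iterations; ++it) {
        sink += scan(data, data_size, mask, pattern, pattern_size);
    }
    double elapsed = now_seconds() - start;

    (void) sink;
    printf("%-10s %10.3f us/scan\n", name, elapsed * 1e6 / iterations);
}

// NOTE: e.g. bench("AVX2", avx2_pattern_match, data, data_size, mask, compiled, n);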

Single-header library

I’ve put together a single-header library using the algorithms described in this post, available here: https://github.com/geniiii/simdscanner