Part 3 Recap

In part 3 we looked at quantization, field_crop memory access, dirty draw commands, SIMD for memory, and non-temporal stores. In part 4, we'll look at SIMD for computation, more quantization and software prefetching.

Back at it again

As always, let’s take a look at our VTune profile.

VTuneSample17

It seems that we're still fairly back-end bound, though not nearly as much as we used to be. At this point, it's not entirely clear where we could improve the performance of ai_tick; VTune tells us that most of our operations are fairly efficient.

As a result, I considered that SIMD might be the best approach to improve the performance of this function. It does a lot of computations and if we can make use of the larger SIMD registers for loading as well, we could be more efficient.

To introduce SIMD, we have quite a lot of setup to do.

Currently, most code that requires positions looks like this:

typedef struct
{
    Vec2 tilePos;
} AI_FarmerMoveStateHot;

What we need to do is split these structures into separate X and Y component arrays.

Like this:

static float* AI_FarmersMoveHotX = NULL;
static float* AI_FarmersMoveHotY = NULL;

There’s a very simple reason for doing this. We could load our registers with 2 vec2s ([x1, y1, x2, y2]) but then all our following operations are only working on 2 positions. However, by splitting the positions into 2 arrays, we can actually load 4 positions at once and process those elements at the cost of a few more registers ([x1, x2, x3, x4], [y1, y2, y3, y4]).

It also takes away the need to do horizontal adds. (A horizontal add sums adjacent pairs of elements within a register; chaining them reduces a vector like [x1, y1, x2, y2] down to the single sum x1 + y1 + x2 + y2.) We want to avoid doing horizontal adds because their performance is inferior to our usual vertical vector operations. (See here)
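To make the cost concrete, here is a scalar model of what an SSE horizontal add computes (this mirrors the semantics of _mm_hadd_ps; the function name is mine):

```c
#include <assert.h>

// Scalar model of _mm_hadd_ps: each output lane is the sum of one
// adjacent pair, two pairs from a and two pairs from b.
static void hadd_ps_model(const float a[4], const float b[4], float out[4])
{
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}
```

A full horizontal sum of four lanes needs two of these back to back, which is exactly the work the SoA layout sidesteps.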

Another approach to processing 4 positions at once is to shuffle 4 of our vectors into that X-only/Y-only format ([x1, x2, x3, x4], [y1, y2, y3, y4]), but that is extra code we can avoid because we have complete control over the layout of our data.

A few of the changes we had to make were:

Changing the instance buffer to have a positionX and positionY array to facilitate our memcpy in the instance buffer generation.

typedef struct
{
    int16_t spriteIndicesAndScales[GAME_MAX_INSTANCE_COUNT * 2];
    float positionX[GAME_MAX_INSTANCE_COUNT];
    float positionY[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;

Removal of the vec2 structure: everything is an array of floats now, and our usage of vec2 is completely erased. Our movement code is now purely scalar floating point math.

float farmerX = AI_FarmersMoveHotX[i];
float farmerY = AI_FarmersMoveHotY[i];
float genFarmerX = AI_FarmersMoveGenX[i];
float genFarmerY = AI_FarmersMoveGenY[i];

float dirVecX = farmerX - genFarmerX;
float dirVecY = farmerY - genFarmerY;
float mag = sqrtf(dirVecX * dirVecX + dirVecY * dirVecY);

float velX = dirVecX * velMag / mag;
float velY = dirVecY * velMag / mag;
AI_FarmersMoveGenX[i] = genFarmerX + velX;
AI_FarmersMoveGenY[i] = genFarmerY + velY;

Finally, another change was the removal of the positions stored in the tiles and the crops. This was done because we can determine the position of a tile or crop from its tile index. This removes 32 bytes from these structures at the cost of a little more computation.

typedef struct
{
    float grown;
    float lifetime;
    uint32_t cropType;
    uint32_t tileIndex;
} Field_Crop;

typedef struct
{
    Field_Stage stage;
} Field_Tile;
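Deriving a tile's cell from its index can be sketched like so, assuming a row-major grid (the FIELD_WIDTH constant and function name here are hypothetical, not the game's actual names):

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical grid width; the real value lives in the game's config.
#define FIELD_WIDTH 1000

// Recover a tile's 2D cell from its row-major index.
static void tile_cell_from_index(uint32_t tileIndex, uint32_t* outX, uint32_t* outY)
{
    *outX = tileIndex % FIELD_WIDTH;
    *outY = tileIndex / FIELD_WIDTH;
}
```

Mapping the cell to a world position is then just a scale and offset.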

These changes had a better performance impact than I expected.

VTuneSample18

ai_tick improved by more than a second, from 5.3s to 4.1s, and field_tick doubled in performance, from 1.5s to 0.7s.

Looking at chrome://tracing, ai_tick improved from 4.4ms to 4.1ms, field_tick improved from 0.7ms to 0.5ms and our overall tick improved from 6.6ms to 6.1ms.

Link to this change is here.

The SIMD-fication

Here is how our movement code started after we split our positions into X and Y arrays:

float farmerX = AI_FarmersMoveHotX[i];
float farmerY = AI_FarmersMoveHotY[i];
float genFarmerX = AI_FarmersMoveGenX[i];
float genFarmerY = AI_FarmersMoveGenY[i];

float dirVecX = farmerX - genFarmerX;
float dirVecY = farmerY - genFarmerY;
float mag = sqrtf(dirVecX * dirVecX + dirVecY * dirVecY);

float velX = dirVecX * velMag / mag;
float velY = dirVecY * velMag / mag;
AI_FarmersMoveGenX[i] = genFarmerX + velX;
AI_FarmersMoveGenY[i] = genFarmerY + velY;

if (velMag > mag)
{
    // Move is complete
}

The first step to SIMD-ing this code is the linear algebra:

__m256 farmerX = _mm256_load_ps(AI_FarmersMoveHotX + i);
__m256 farmerY = _mm256_load_ps(AI_FarmersMoveHotY + i);
__m256 genFarmerX = _mm256_load_ps(AI_FarmersMoveGenX + i);
__m256 genFarmerY = _mm256_load_ps(AI_FarmersMoveGenY + i);

__m256 dirVecX = _mm256_sub_ps(farmerX, genFarmerX);
__m256 dirVecY = _mm256_sub_ps(farmerY, genFarmerY);
__m256 rmag = _mm256_rsqrt_ps(_mm256_add_ps(_mm256_mul_ps(dirVecX, dirVecX), _mm256_mul_ps(dirVecY, dirVecY)));

__m256 velX = _mm256_mul_ps(dirVecX, _mm256_mul_ps(velMag, rmag));
__m256 velY = _mm256_mul_ps(dirVecY, _mm256_mul_ps(velMag, rmag));
_mm256_store_ps(AI_FarmersMoveGenX + i, _mm256_add_ps(genFarmerX, velX));
_mm256_store_ps(AI_FarmersMoveGenY + i, _mm256_add_ps(genFarmerY, velY));

This looks like a lot, but it maps almost exactly 1 to 1 to our original code. You might notice that instead of using sqrt, we're using rsqrt. The reason is simple: the reciprocal of a square root has a fast hardware approximation that reduces the precision of the operation but improves performance. rsqrt has a latency of around 5 cycles, whilst sqrt has a latency of around 13. (See the Intel Intrinsics Guide and Wikipedia)
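If the reduced precision of rsqrt ever becomes a problem, the usual remedy is a single Newton-Raphson refinement step, which roughly doubles the number of accurate bits. A scalar sketch (the function name is mine):

```c
#include <assert.h>
#include <math.h>

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
// Starting from the roughly 12-bit hardware estimate, one step like this
// brings the result close to full single precision.
static float rsqrt_refine(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```

In SIMD form this is a handful of _mm256_mul_ps/_mm256_sub_ps, typically still cheaper than a full-precision sqrt plus divide.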

The next step is more challenging. We now have 8 magnitudes to compare. Before SIMD we could have a simple branch, but now this has become more difficult.

One approach would be to loop through each element and do a single comparison… This works, but it isn't very "SIMD"; we're no longer processing multiple elements at once.

INSTEAD

We'll do our comparison and store the indices of the magnitudes that passed it. (There's an excellent collection of slides from Andreas Fredriksson detailing how to do this here.)

We're currently using AVX for our movement code. This lets us process 8 elements at once instead of 4, and it also gives us access to additional intrinsics that we wouldn't have with SSE alone.

Let’s take a look at the code and break it down line by line:

// Thanks to https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask for this great algorithm
// Uses 64bit pdep / pext to save a step in unpacking.
__m256i simd_moveMaskToIndexMask(unsigned int mask /* from movmskps */)
{
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
    expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
    // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

    const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

    __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

    return shufmask;
}

int bitMask = (1 << math_min(previousFarmerMoveCount - i, 8)) - 1;

__m256 cmpRes = _mm256_cmp_ps(rvelMag, rmag, _CMP_LT_OQ);
int indexMask = _mm256_movemask_ps(cmpRes) & bitMask;

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);
__m256i indexAdd = simd_moveMaskToIndexMask(indexMask);
indices = _mm256_add_epi32(indices, indexAdd);

_mm256_storeu_si256((__m256i*)(AI_FarmerRemovalIndices + removedFarmerCount), indices);
removedFarmerCount += _mm_popcnt_u32(indexMask);

First we need to mask out the comparison of elements that aren’t valid farmers. We can create a bit mask of valid farmers like so:

int bitMask = (1 << math_min(previousFarmerMoveCount - i, 8)) - 1;

This line simply determines how many farmers we're processing in this iteration. (If we were at index 8 but only had 14 farmers, we would get 14 - 8 = 6, meaning we're processing 6 farmers, so we create a mask with the first 6 bits set.)
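As a scalar sketch of that mask construction (the helper names are mine):

```c
#include <assert.h>

static int math_min_int(int a, int b) { return a < b ? a : b; }

// One set bit per farmer actually processed in this 8-wide iteration.
static int lane_mask(int remaining)
{
    return (1 << math_min_int(remaining, 8)) - 1;
}
```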

Then, we do our comparison:

__m256 cmpRes = _mm256_cmp_ps(rvelMag, rmag, _CMP_LT_OQ);
int indexMask = _mm256_movemask_ps(cmpRes) & bitMask;

This compares our velocity against our magnitude, similarly to how we did it before. (Since we're comparing reciprocals, the less-than comparison on rvelMag and rmag corresponds to velMag > mag.) When a velocity's magnitude is higher than our remaining distance, the element in the result register will have all of its bits set. (If the magnitudes are [0, 2, 3, 4] and the velocities are [1, 1, 1, 1], the resulting vector would be [0xFFFFFFFF, 0, 0, 0].) This isn't of much use to us yet; we need to convert this register into a mask where the bits that passed the comparison are set, which is what _mm256_movemask_ps does, mapping element 0 to bit 0. (For the previous example, the mask would be 0b0001.)
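A scalar model of the compare-and-movemask pair (4 lanes for brevity, lane 0 mapping to bit 0; here written as the logical velMag > mag comparison rather than on the reciprocals):

```c
#include <assert.h>

// Scalar model of _mm256_cmp_ps + _mm256_movemask_ps:
// bit k of the result is set when vel[k] > mag[k].
static int movemask_gt_model(const float vel[4], const float mag[4])
{
    int mask = 0;
    for (int k = 0; k < 4; ++k)
        if (vel[k] > mag[k])
            mask |= 1 << k;
    return mask;
}
```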

At this point, we want to be able to take this mask and convert it to a collection of packed indices.

We want to pack our indices to the left of a register because we only want to store the indices of the farmers we're removing. Currently our mask has a value, maybe 01001100, but what we want is a register holding the values i + 2, i + 3, i + 6 (with bit 0 as the rightmost bit of the mask).

To do this, we call simd_moveMaskToIndexMask which is a modified version of the code found here. This function returns a collection of packed integers derived from the provided mask. (For the mask 01001100 we would get 2, 3, 6.)

Then we can take these values and add them to i to get our removed farmer indices.

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);
__m256i indexAdd = simd_moveMaskToIndexMask(indexMask);
indices = _mm256_add_epi32(indices, indexAdd);

_mm256_storeu_si256((__m256i*)(AI_FarmerRemovalIndices + removedFarmerCount), indices);

Using the previous example, our resulting vector would be [i + 2, i + 3, i + 6, don't care, don't care, don't care…] We don't care about the trailing elements because they didn't pass the comparison.
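Functionally, the pdep/pext routine is equivalent to this scalar left-pack (much slower, but handy for checking results):

```c
#include <assert.h>
#include <stdint.h>

// Scalar equivalent of the pdep/pext left-pack: for every set bit k of
// mask (bit 0 first), write base + k, returning how many were written.
static int pack_indices(uint32_t mask, uint32_t base, uint32_t* out)
{
    int count = 0;
    for (uint32_t k = 0; k < 32; ++k)
        if (mask & (1u << k))
            out[count++] = base + k;
    return count;
}
```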

And then we move our write index forward by the number of farmers removed:

removedFarmerCount += _mm_popcnt_u32(indexMask); // _mm_popcnt_u32 counts the number of bits set in a u32

Finally we can just loop through our removed farmer indices:

for(uint32_t i = 0; i < removedFarmerCount; i++)
{
    int r = AI_FarmerRemovalIndices[i];
    // ... swap-remove farmer r and transition it to its next state
}

That’s it! We SIMD-ed our movement.

Let’s take a look at VTune!

VTuneSample19

We’ve shaved off another second from our ai_tick from 4.1s to 3.2s. chrome://tracing says that our ai_tick is now 3.3ms from 4.1ms. The average tick has improved from 6.1ms to 5.2ms.

You might notice that we're more back-end bound and our CPI has increased. This is to be expected: SIMD instructions have a higher CPI, but that's fine since we're now processing 8 elements at a time instead of 1. As for being more back-end bound, since we're now chewing through the data so quickly, I suspect the computation no longer takes long enough to hide the latency of the memory loads. I'm not quite sure how to address this.

Link to this change is here.

Field_tick is back?

Before I SIMD everything else, there’s one change to be made to field_tick.

The Field_Crop structure looks like this:

typedef struct
{
    float grown;
    float lifetime;
    uint32_t cropType;
    uint32_t tileIndex;
} Field_Crop;

Looking at this structure, we don't actually need both lifetime and grown! We can use a single float representing how long the crop has left to live: instead of incrementing and comparing against grown, we decrement and compare against zero. (We will also split the Field_Crop structure; it will simplify our SIMD code.)

typedef struct
{
    uint32_t cropType;
    uint32_t tileIndex;
} Field_Crop;

float* Field_CropLifetimes;
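The transformation is the classic increment-and-compare to countdown-to-zero rewrite; a scalar sketch (the function name is mine):

```c
#include <assert.h>
#include <stdbool.h>

// Before: lifetime += delta, done when lifetime >= grown.
// After: a single countdown, done when it crosses zero.
static bool crop_tick(float* lifetimeRemaining, float delta)
{
    *lifetimeRemaining -= delta;
    return *lifetimeRemaining <= 0.0f;
}
```

Besides halving the per-crop timer data, comparing against a constant zero is friendlier to vectorize than comparing two loaded values.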

With this simple change, field_tick improved from 0.49ms to 0.052ms… I doubt we will be looking at field_tick again anytime soon.

Link to this change is here.

The Great SIMD-fication

Following the field_tick improvement, I SIMD-ed the rest of the game code. chrome://tracing states that ai_tick improved from 3.3ms to 2.8ms. VTune notes a similar improvement, from 3.2s to 2.8s.

VTuneSample20

Link to this change is here.

The Useless Optimization

I then tried using the streaming instructions _mm_stream_ps and _mm_stream_load_si128 in an attempt to speed things up by not evicting anything I might need from the cache.

…This had no effect…

My reasoning for the lack of improvement: there was no data worth keeping in the cache to evict in the first place. And since I still had to read the data, I couldn't get the benefit of writing without loading.

Quantization Again? That’s half the truth

At this point, I couldn't just "SIMD" everything and get more performance (we already did that!). But I ran into something interesting: AVX has a series of float-to-half conversion intrinsics that can be used to store our data in half the space. (For the curious: half is a floating point format that takes up half the size of a typical float, 16 bits instead of 32. More information on it can be found here.)

This is very similar to the quantization I implemented in part 3 where we take a range of values and map them to an integer range.

The modifications for this were simple.

I added 2 new macros that did all the work for me on load and store:

#define SIMD_FLOAT_TO_HALF(f) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(f), _MM_FROUND_NO_EXC), 0)
#define SIMD_LOAD_PH_TO_PS(a) _mm256_cvtph_ps(_mm_load_si128((__m128i*)(a)))

And then I simply changed the storage types from floats to 16-bit integers.
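For reference, here is roughly what the hardware conversion computes, as a scalar binary16 encoder/decoder handling only normal numbers (no rounding, infinities, NaNs, or subnormals; the intrinsics handle all of those):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Scalar binary16 encode for normal numbers only: rebias the exponent
// from 8 bits (bias 127) to 5 bits (bias 15), truncate the mantissa
// from 23 bits to 10.
static uint16_t float_to_half(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint16_t sign     = (uint16_t)((bits >> 16) & 0x8000);
    uint16_t exponent = (uint16_t)(((bits >> 23) & 0xFF) - 127 + 15);
    uint16_t mantissa = (uint16_t)((bits >> 13) & 0x3FF);
    return (uint16_t)(sign | (exponent << 10) | mantissa);
}

// The inverse: widen the exponent and mantissa back to binary32.
static float half_to_float(uint16_t h)
{
    uint32_t sign     = (uint32_t)(h & 0x8000) << 16;
    uint32_t exponent = (uint32_t)(((h >> 10) & 0x1F) - 15 + 127);
    uint32_t mantissa = (uint32_t)(h & 0x3FF) << 13;
    uint32_t bits = sign | (exponent << 23) | mantissa;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The 10-bit mantissa is the precision cost we're accepting: about 3 decimal digits, plenty for positions on a small playfield.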

This change had some beneficial effects on all of the gameplay code. According to chrome://tracing ai_tick went from 2.8ms to 2.6ms, field_tick went from 0.052ms to 0.022ms and game_gen_instance_buffer went from 1.5ms to 0.8ms. On average, the tick improved from 4.4ms to 3.4ms!

Link to this change is here.

More failures

At this point, I tried a variety of other things to improve performance:

  • I attempted to store field tiles in 4 bits instead of 8 bits. This was a detriment to overall performance. I assume the read-modify-write needed to update half a byte kept the CPU from sending off the write without first loading something into registers?
  • I attempted to store the crop type in the tile index as a uint8_t instead of storing it as another uint32_t. But this also reduced the program's performance.

The prefetching chronicles

At this point, it got harder to make things faster; I was out of "tricks" for improving the performance. So I turned to VTune. According to VTune, I was still highly back-end bound, especially in the farmer removal code where farmers are transitioned from state to state.

A potential reason why removal causes our code to be back-end bound is that we have a layer of indirection that keeps track of the indices we want to remove from the farmer list. After processing all the farmers, we loop through this collection of removed indices and remove the farmers associated with those indices.

The problem with this indirection is that our memory access pattern is slightly irregular, and the hardware prefetcher can't predict where our next farmer will be. As a result, I decided to turn to software prefetching.

What is software prefetching?

Software prefetching is a mechanism built into modern CPUs that allows you to hint that you would like some memory at a given address fetched ahead of time. It's similar to hardware prefetching in the sense that memory is gathered for you, but it differs by letting you ask for memory with a more irregular access pattern.

A potential downside of asking for memory to be prefetched is that it might evict something from the cache that is currently in use, or the prefetched cache line might itself be evicted before we even get to it. (A few sources on software prefetching can be found here and here.)

Software prefetching is very tricky to get right, and I don’t know all of the intricacies of using it, but it made sense for the removal of farmers.

The change was simple. The code for transitioning a moving farmer to a farming farmer looked like this:

for(uint32_t i = 0; i < removedFarmerCount; i++)
{
    int r = SIMD_FarmerRemovalIndices[i];

    AI_FarmersFarmHot[AI_FarmerFarmCount + i] = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
    AI_FarmersFarmCold[AI_FarmerFarmCount + i] = AI_FarmersMoveCold[r];
    AI_FarmersFarmGenX[AI_FarmerFarmCount + i] = AI_FarmersMoveGenX[r];
    AI_FarmersFarmGenY[AI_FarmerFarmCount + i] = AI_FarmersMoveGenY[r];

    AI_FarmersMoveHotX[r] = AI_FarmersMoveHotX[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveHotY[r] = AI_FarmersMoveHotY[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveCold[r] = AI_FarmersMoveCold[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveGenX[r] = AI_FarmersMoveGenX[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveGenY[r] = AI_FarmersMoveGenY[AI_FarmerMoveCount - 1 - i];
}

And now the code looks like this:

for(uint32_t i = 0; i < removedFarmerCount; i++)
{
    int r = SIMD_FarmerRemovalIndices[i];

    int nextRemoval = SIMD_FarmerRemovalIndices[i + 2];
    _mm_prefetch((const char*)(AI_FarmersMoveCold + nextRemoval), _MM_HINT_T0);
    _mm_prefetch((const char*)(AI_FarmersMoveGenX + nextRemoval), _MM_HINT_T0);
    _mm_prefetch((const char*)(AI_FarmersMoveGenY + nextRemoval), _MM_HINT_T0);

    AI_FarmersFarmHot[AI_FarmerFarmCount + i] = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
    AI_FarmersFarmCold[AI_FarmerFarmCount + i] = AI_FarmersMoveCold[r];
    AI_FarmersFarmGenX[AI_FarmerFarmCount + i] = AI_FarmersMoveGenX[r];
    AI_FarmersFarmGenY[AI_FarmerFarmCount + i] = AI_FarmersMoveGenY[r];

    AI_FarmersMoveHotX[r] = AI_FarmersMoveHotX[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveHotY[r] = AI_FarmersMoveHotY[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveCold[r] = AI_FarmersMoveCold[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveGenX[r] = AI_FarmersMoveGenX[AI_FarmerMoveCount - 1 - i];
    AI_FarmersMoveGenY[r] = AI_FarmersMoveGenY[AI_FarmerMoveCount - 1 - i];
}

Notice the addition of _mm_prefetch? Those are the SSE prefetch intrinsics. They take a hint noting which cache level you would prefer the requested memory to land in. I chose _MM_HINT_T0 (L1) because I'm going to use the data very soon.

You might notice that the nextRemoval index is not the next one, but actually the second one after. Asking for the very next one would most likely not complete the prefetch before the next iteration begins. Through some experimentation, I found that two iterations ahead was an effective prefetch distance. (Note that this reads up to two entries past the end of the index array, so the array needs a little padding.)
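In portable terms, the same pattern can be written with GCC/Clang's __builtin_prefetch (their spelling of the prefetch hint); this sketch adds a bounds guard instead of relying on padded arrays:

```c
#include <assert.h>

// Gather through an index table, hinting each load PREFETCH_AHEAD
// iterations early so the line is (hopefully) in cache when we need it.
#define PREFETCH_AHEAD 2

static float gather_sum(const float* values, const int* indices, int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        if (i + PREFETCH_AHEAD < count)  /* guard; the game pads instead */
            __builtin_prefetch(values + indices[i + PREFETCH_AHEAD]);
        sum += values[indices[i]];
    }
    return sum;
}
```

The prefetch is purely a hint: the result is identical with or without it, only the cache behavior changes.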

This change was coupled with the next change. As a result, we won’t take a look at chrome://tracing just yet.

More quantization?

While looking for some ways to improve the performance of the farmers, I came upon an idea.

If I convert the timers from 16-bit halves to 16-bit integer values, I can load 16 timers at a time instead of 8. Currently we're limited to loading 8 at a time because we have to expand the halves to floats before we can operate on them. If instead we load and store them as 16-bit integers, we can use the full width of the 256-bit registers.

We store these values by multiplying the floating point seconds by 1000 and storing the result in an int16_t. Instead of decrementing a float, we reduce the precision of our numbers and treat each 16-bit integer as the number of milliseconds left. This limits our range to about 32s at 1ms precision, which is plenty for a farmer brain.

With this simple change, our timer code now looks like this:

__m256i farmerSearchTimer = _mm256_load_si256((__m256i*)(AI_FarmersSearchHot + i));
farmerSearchTimer = _mm256_sub_epi16(farmerSearchTimer, delta256);
_mm256_store_si256((__m256i*)(AI_FarmersSearchHot + i), farmerSearchTimer);
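The encode side, converting a float duration in seconds into the stored millisecond count, can be sketched like this in scalar form (the names are mine):

```c
#include <assert.h>
#include <stdint.h>

// Quantize a float duration in seconds to int16_t milliseconds:
// roughly 32.7s of range at 1ms granularity.
static int16_t timer_from_seconds(float seconds)
{
    return (int16_t)(seconds * 1000.0f);
}

// Per-tick update: subtract the frame delta, also in milliseconds.
static int16_t timer_tick(int16_t timerMs, int16_t deltaMs)
{
    return (int16_t)(timerMs - deltaMs);
}
```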

And that’s it!

With these changes in hand, chrome://tracing says that ai_tick is now 1.78ms, down from 2.7ms, and crops are now 0.013ms, down from 0.021ms. This leaves us with an average tick of 2.7ms, achieving our goal of reaching less than 3ms.

Link to this change is here.

The bug fix that saved the world

As I was thinking about the program during the week, trying to come up with different ways to improve the performance, I came upon a bug!!

In the index packing code, we were only packing the first 8 indices and forgetting the next 8. This caused our removal code to access uninitialized indices and (spoiler) severely affected the performance of our ai_tick code!

int indexMask = _mm256_movemask_ps(cmpRes) & bitMask;

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);
__m256i indexAdd = simd_moveMaskToIndexMask(indexMask);
indices = _mm256_add_epi32(indices, indexAdd);

_mm256_storeu_si256((__m256i*)(AI_FarmerRemovalIndices + removedFarmerCount), indices);
removedFarmerCount += _mm_popcnt_u32(indexMask);

Looking at this code, you'll notice that an __m256i can only hold eight 32-bit integers. However, we're processing 16 timers, and since we're telling our removal count that we're removing up to 16 farmers, this causes us to loop over indices that were never written.

The quick fix for this problem was simply to handle an additional 8 indices:

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);
__m256i indexAdd = simd_moveMaskToIndexMask(indexMask & 0x00FF);
indices = _mm256_add_epi32(indices, indexAdd);

_mm256_storeu_si256((__m256i*)(SIMD_FarmerRemovalIndices + removedFarmerCount), indices);
removedFarmerCount += _mm_popcnt_u32(indexMask & 0x00FF);

__m256i next8Indices = _mm256_set_epi32(i + 8, i + 8, i + 8, i + 8, i + 8, i + 8, i + 8, i + 8);
__m256i next8IndexAdd = simd_moveMaskToIndexMask((indexMask & 0xFF00) >> 8); // pdep consumes the low 8 bits of the mask, so shift the high byte down
next8Indices = _mm256_add_epi32(next8Indices, next8IndexAdd);

_mm256_storeu_si256((__m256i*)(SIMD_FarmerRemovalIndices + removedFarmerCount), next8Indices);
removedFarmerCount += _mm_popcnt_u32(indexMask & 0xFF00);

And the results are a slight improvement in performance. chrome://tracing says that ai_tick now runs at an average of 1.67ms instead of 1.7ms.

Link to this change is here.

That’s it for part 4!

In this part, we looked at:

  • SIMD computations
  • Half floating point
  • More quantization
  • Software prefetching

Our program now runs at an average 2.6ms per tick. Achieving our goal of less than 3ms per tick!

In part 5, we'll start wrapping up the series: we'll tackle game_gen_instance_buffer's alignment issues, attempt to use the AVX gather instructions to vectorize our removal loop, and reduce the number of timers we need to decrement every tick!

Follow the repo here!

Until next time!