What happened again?

In part 2, we looked at branch mispredictions and making effective use of cache memory. We split our farmer structure from a union into a series of smaller structures, which allowed us to make the most of each cache line and also significantly reduced our branch mispredictions.

Let’s jump right in!

Jumping off from our changes in part 2, we’ll launch VTune and see where our performance is going!

VTuneSample8

Looking at our capture, we can see that game_gen_instance_buffer completely overshadows the rest of our methods. At 77.4% back-end bound, it is severely limited by our memory access patterns. Addressing that will be our first task.

If we take a look at chrome://tracing we also notice that game_gen_instance_buffer runs at an average of 12 ms per tick.

The profiler notes that this chunk of code is largely back-end bound:

if (Field_Tiles[i].crop != NULL)
{
    uint32_t cropWriteIndex = writeIndex++;
    buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Tiles[i].crop->cropType;
    buffer->instances[cropWriteIndex].scale = 2.0f / Field_Width;

    buffer->instances[cropWriteIndex].pos[0] = Field_Tiles[i].pos.x;
    buffer->instances[cropWriteIndex].pos[1] = Field_Tiles[i].pos.y;
}

More accurately, this line is a large contributor to our memory access woes:

buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Tiles[i].crop->cropType;

Why is that? Well, let’s take a look at our Field_Tile structure:

typedef struct
{
    Field_Crop* crop;
    Field_Stage stage;
    Vec2 pos;
} Field_Tile;

Since “crop” is a pointer to a Field_Crop allocated at an arbitrary position on the heap, our hardware prefetchers can’t fetch it for us! (More information on heap allocations using malloc can be found here; see part 2 for more information about the memory hierarchy.)

As a result, we’re going to want to modify how crops are allocated and accessed. Instead of accessing the crops through their tiles, we’re going to move the crops into their own array and loop through the array directly as we generate the rendering information. This will improve our access patterns for the prefetcher and allow our memory to be nicely packed into cache lines.

Following this modification, we might notice that field_tick (inlined into game_tick) also becomes faster! This is because instead of indirectly accessing the crops through their tile, we access them linearly in memory.

While implementing this optimization, I noticed that I hadn’t correctly handled removing farmer states when they were complete: I forgot to move the iterator index back by one after the swap, which caused the swapped-in state to be skipped. This was also fixed! …oops…

Another resulting optimization was the conversion of pointers to 32 bit indices. This simplified a lot of the addressing and also shrank the structures by 32 bits on a 64 bit system!

Looking at our Field_Crop and Field_Tile structures we can see that their relationship was inverted:

typedef struct
{
    float grown;
    float lifetime;
    uint32_t cropType;
    uint32_t tileIndex;
    Vec2 pos;
} Field_Crop;

static uint32_t Field_CropCount = 0;
static Field_Crop* Field_Crops = NULL;

typedef struct
{
    Field_Stage stage;
    Vec2 pos;
} Field_Tile;

Field_Crop now references Field_Tiles through an index instead of Field_Tile holding a pointer to the Field_Crop memory, and our Field_Crop memory is now stored in a nice linear array.

Our Field_Tick code used to look like this:

void field_tick(float delta)
{
    for (uint32_t i = 0; i < Field_Width * Field_Height; ++i)
    {
        if (Field_Tiles[i].stage == FieldStage_Planted)
        {
            Field_Crop* crop = Field_Tiles[i].crop;
            crop->lifetime += delta;
            if (crop->lifetime >= crop->grown)
            {
                Field_Tiles[i].stage = FieldStage_Grown;
            }
        }
    }
}

But it now looks like this:

void field_tick(float delta)
{
    for (uint32_t i = 0; i < Field_CropCount; ++i)
    {
        Field_Crop* crop = &Field_Crops[i];
        crop->lifetime += delta;

        if (crop->lifetime >= crop->grown)
        {
            Field_Tile* tile = &Field_Tiles[crop->tileIndex];
            tile->stage = FieldStage_Grown;

            SWAP(Field_Crop, *crop, Field_Crops[Field_CropCount - 1]);
            Field_CropCount--;
            i--;
        }
    }
}

As you might notice, instead of looping through our tiles, we’re looping through our crops, which is the data that we want to access the most. Far better!

Our Field_Crop generation code in game_gen_instance_buffer used to look like so:

for (uint32_t i = 0; i < Field_Width * Field_Height; ++i)
{
    uint32_t writeLoc = writeIndex++;
    buffer->instances[writeLoc].spriteIndex = Game_FieldImageTable[Field_Tiles[i].stage];
    buffer->instances[writeLoc].scale = 2.0f / Field_Width;
    buffer->instances[writeLoc].pos[0] = Field_Tiles[i].pos.x;
    buffer->instances[writeLoc].pos[1] = Field_Tiles[i].pos.y;
    if (Field_Tiles[i].crop != NULL)
    {
        uint32_t cropWriteIndex = writeIndex++;
        buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Tiles[i].crop->cropType;
        buffer->instances[cropWriteIndex].scale = 2.0f / Field_Width;
        buffer->instances[cropWriteIndex].pos[0] = Field_Tiles[i].pos.x;
        buffer->instances[cropWriteIndex].pos[1] = Field_Tiles[i].pos.y;
    }
}

But now it looks like:

for (uint32_t i = 0; i < Field_Width * Field_Height; ++i)
{
    uint32_t writeLoc = writeIndex++;
    buffer->instances[writeLoc].spriteIndex = Game_FieldImageTable[Field_Tiles[i].stage];
    buffer->instances[writeLoc].scale = 2.0f / Field_Width;
    buffer->instances[writeLoc].pos[0] = Field_Tiles[i].pos.x;
    buffer->instances[writeLoc].pos[1] = Field_Tiles[i].pos.y;
}

for (uint32_t i = 0; i < Field_CropCount; i++)
{
    uint32_t cropWriteIndex = writeIndex++;
    buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Crops[i].cropType;
    buffer->instances[cropWriteIndex].scale = 2.0f / Field_Width;
    buffer->instances[cropWriteIndex].pos[0] = Field_Crops[i].pos.x;
    buffer->instances[cropWriteIndex].pos[1] = Field_Crops[i].pos.y;
}

The amount of pointer indirection was reduced and now we access our crops in a nice linear fashion.

VTuneSample10

Taking a look at our results, we can see quite a few things. game_gen_instance_buffer has improved by 9 seconds and game_tick (secretly field_tick) has improved by 6 seconds! All of this simply from changing our access patterns from memory allocated in random spots on the heap to memory nicely packed in an array.

When we take a look at chrome://tracing, we can see that our average tick is now 11ms, down from 21ms in part 2. field_tick is now 1.2ms (from 5ms) and game_gen_instance_buffer is now 6ms (from 11ms). Excellent results for a relatively simple change.

Link to this change is here.

Dirty what?

Up next, we’re going to tackle how much data we’re storing.

VTuneSample11

Taking a look at VTune tells us that we’re heavily store bound! How do we address this? To begin, we know that our Field_Tiles never change position, they only ever change state. As a result, we can store a series of “state” change commands instead of updating the tiles every tick.

I came across this technique when working on my NES version of GO. When drawing on the NES, you only have VBlank to render everything. If you take too long, too bad, you’re going to get major flickering. Instead, one approach is to only draw what has changed during the frame by storing a series of commands indicating a change, and that’s exactly what we’re going to do.

Our command structure looks like this:

typedef struct
{
    uint32_t writeIndex;
    float spriteIndex;
} Field_TileDrawCommand;

And our tile generation code now looks like this:

for (uint32_t i = 0; i < Field_TileDrawCommandCount; ++i)
{
    Field_TileDrawCommand* command = &Field_TileDrawCommands[i];
    buffer->instances[command->writeIndex].spriteIndex = command->spriteIndex;
}
Field_TileDrawCommandCount = 0;

As you can see, the tile generation code is much simpler and will touch far less memory.

VTuneSample12

Looking at our VTune results, you’ll notice that our performance has improved from 11s to 6s! Now our ai_tick and game_gen_instance_buffer functions are even. But we’ll keep tackling game_gen_instance_buffer for this part of the series.

chrome://tracing tells us that game_gen_instance_buffer now averages 3.3ms, down from 6ms, and our average tick is now 9ms! That’s much better than our target of 16ms. But we’re not done, there’s so much more to do. As a new goal, we’ll try to achieve less than 3ms per tick. Can we do it? I don’t know…

The crop treatment

We’re now going to do the same draw command optimization to our crops as well.

Our crop command looks like this:

typedef struct
{
    uint32_t writeIndex;
    float spriteIndex;
    Vec2 pos;
} Field_CropDrawCommand;

And our crop generation code looks like:

for (uint32_t i = 0; i < Field_CropDrawCommandCount; i++)
{
    Field_CropDrawCommand* command = &Field_CropDrawCommands[i];
    buffer->instances[command->writeIndex].spriteIndex = command->spriteIndex;
    buffer->instances[command->writeIndex].scale = 2.0f / Field_Width;
    buffer->instances[command->writeIndex].pos[0] = command->pos.x;
    buffer->instances[command->writeIndex].pos[1] = command->pos.y;
}
Field_CropDrawCommandCount = 0;

Similarly to the previous optimization, the commands allow us to only update a few bits of data instead of many. This reduces how much memory we have to load and store which relieves the pressure on our memory systems.

VTuneSample13

Looking at our VTune profile, we see that we’re now firmly behind ai_tick at 4s. We shaved off another 2s.

chrome://tracing reports that our average game_gen_instance_buffer tick is now 2.7ms instead of 3.3ms and our average tick is now 8.4ms from 9ms.

Link to this change is here.

The big guns

At this point, we’re going to pull out the big guns to improve the performance of game_gen_instance_buffer: SIMD. We’re now spending a lot of our time storing to memory, and we can improve this by using the larger SIMD registers to do our stores.

A quick background on SIMD

SIMD is a feature present in almost all modern CPUs that allows you to work on multiple pieces of data at once using registers larger than what we’re used to. Instead of 64 bit registers, we now have access to 128 bit registers that can act as “arrays” of 4 32 bit elements, 2 64 bit elements or 16 8 bit elements. SIMD instructions operate on 2 of these registers as if you were acting on the individual elements of these 2 arrays.

Say we wanted to add 2 128 bit registers of 4 32 bit elements together; in SSE we would call _mm_add_ps(…). If our registers held the values [0, 1, 2, 3] and [4, 5, 6, 7] respectively, _mm_add_ps would return [4, 6, 8, 10].

A few links that explain SIMD and ways to use it can be found here, here and on wikipedia.

Back to the game!

However, despite its use for mathematics, we’ll be using these larger registers to store memory more effectively. An interesting property of these registers is that their memory can be loaded and stored in 128 bit chunks. As a result, instead of storing our memory 64 bits at a time, we’ll be loading and storing it 128 bits at a time.

If we look at the code for generating AI_FarmerSearchState:

for (uint32_t i = 0; i < AI_FarmerSearchCount; ++i)
{
    uint32_t writeLoc = writeIndex++;
    buffer->instances[writeLoc].spriteIndex = FarmerState_Search;
    buffer->instances[writeLoc].scale = 0.025f;
    buffer->instances[writeLoc].pos[0] = AI_FarmersSearchCold[i].pos.x;
    buffer->instances[writeLoc].pos[1] = AI_FarmersSearchCold[i].pos.y;
}

We see that we’re storing spriteIndex, scale and pos one at a time. The generated assembly looks like this:

 mov         qword ptr [rcx+rsi],r11
 mov         edi,dword ptr [r9+rdx*8]
 mov         dword ptr [rcx+rsi+8],edi
 mov         edi,dword ptr [r9+rdx*8+4]
 mov         dword ptr [rcx+rsi+0Ch],edi

The “mov”s to rcx+rsi match up to spriteIndex+scale (stored together as one 64 bit qword), pos[0] and pos[1] respectively. Instead, our new code will store 4 sprite indices at once, 4 scales at once and 2 farmer positions at once.

How? We’re going to split our buffer into 3 different buffers!

If we take a look at our buffer structure, it looks like this:

#define GAME_MAX_INSTANCE_COUNT 50000000
typedef struct
{
    float spriteIndex;
    float scale;
    float pos[2];
} Game_Instance;

typedef struct
{
    Game_Instance instances[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;

We need to split this buffer because we want to be able to load our farmer positions directly from our arrays without having to shuffle things around. If we left things as they were, we would have to do extra work to gather the spriteIndex, scale and position into one register; by splitting the buffers, we allow ourselves to simply load the memory and store it directly using our larger registers.

We’re also going to need to split our positions out of our farmer states into new “Gen” states that will only hold their positions and nothing else in order to be able to copy them without the need to load the rest of the farmer state’s data into memory.

Here is our new instance buffer:

#define GAME_MAX_INSTANCE_COUNT 50000000

typedef struct
{
    float spriteIndices[GAME_MAX_INSTANCE_COUNT];
    float scales[GAME_MAX_INSTANCE_COUNT];
    float positions[GAME_MAX_INSTANCE_COUNT * 2];
} Game_InstanceBuffer;

And our new generation code without SIMD looks like this:

memcpy(&buffer->positions[writeIndex * 2], AI_FarmersSearchGen, sizeof(float) * 2 * AI_FarmerSearchCount);
for (uint32_t i = 0; i < AI_FarmerSearchCount; i++)
{
    uint32_t writeLoc = writeIndex++;
    buffer->spriteIndices[writeLoc] = FarmerState_Search;
    buffer->scales[writeLoc] = 0.025f;
}

Notice that we don’t store our positions one by one anymore. Because the “Gen” states only store positions, we can memcpy them straight into our positions buffer! A great win, because memcpy is much faster than loading and storing one position at a time.

Let’s take a look at VTune to see how this change affected our performance:

VTuneSample14

Wow! game_gen_instance_buffer is now at 1.827s! …or is it? There’s a new contender for our performance woes: an anonymous function. I suspect that this is the result of the new memcpy calls. As a result, we’ll add its time to game_gen_instance_buffer, which is now at 3.5s instead of 4s. Slightly better, but not excellent.

chrome://tracing verifies the assumption that the anonymous function is memcpy, as our game_gen_instance_buffer average tick only improved to 2.2ms from 2.7ms.

Link to this change is here.

It’s time to SIMD! (Or is it?)

Following the split, I implemented the vectorization code for the loading and storing.

Here is our new code:

float search = FarmerState_Search;
__m256 searchIndices = _mm256_set_ps(search, search, search, search, search, search, search, search);
_mm256_storeu_ps(&buffer->spriteIndices[writeIndex], searchIndices);
for (uint32_t i = 8 - writeIndex % 8; i < AI_FarmerSearchCount; i+=8)
{
    _mm256_store_ps(&buffer->spriteIndices[writeIndex + i], searchIndices);
}

At this point, I was surprised at how little the performance actually improved. Looking back, it’s because the compiler had already SIMD-ed my code! Scales and indices were already getting loaded and stored into the array with the movups instruction.

Here is the assembly pre-SIMD:

 movups      xmmword ptr [rsi+rdi*4],xmm0

Here is the assembly post-SIMD:

 vmovaps     ymmword ptr [rsi+rcx*4],ymm0

This is a lesson in looking at the generated assembly and the profiler before making assumptions about how to optimize the code. The change only made game_gen_instance_buffer run around 0.2ms faster per tick.

As a result, we’ll be reverting the changes back to the non SIMD-ed version for now, as this will keep the code easier to follow.

The optimization that wasn’t

Another attempt at improving game_gen_instance_buffer was to not call it at all and write into the buffer while processing the AI. The reasoning behind this optimization was that the data was already in memory (at least for the move code) and storing it would be best at this point. However, as a result of this, the Search and Farm code had to access data they didn’t need and the code ended up slightly slower.

Can we make it smaller?

This next optimization is fairly simple. The scale and sprite indices are stored as 32 bit floats, and we don’t need that much space for them. Instead, we’ll store them as 2 16 bit normalized integers.

The idea behind 16 bit normalized integers is simple: instead of storing the raw value in its range (0 to 11 for sprite indices, 0 to 1 for scale), we divide the value by the range and multiply by INT16_MAX. This scales our normalized value across the whole positive range of int16_t.

The math would look like this:

int16_t toInt16(float v, float min, float max)
{
    return ((v - min) / (max - min)) * INT16_MAX;
}

This will allow us to store our sprite index and scale in half the size!

The game instance buffer now looks like this:

typedef struct
{
    int16_t spriteIndicesAndScales[GAME_MAX_INSTANCE_COUNT * 2];
    float positions[GAME_MAX_INSTANCE_COUNT * 2];
} Game_InstanceBuffer;

And our chrome://tracing results tells us that our game_gen_instance_buffer average is now 1.8ms from 2.2ms.

Looking at VTune confirms this:

VTuneSample15

game_gen_instance_buffer is now 0.9s instead of 1.8s!

You might think “Why don’t we also do this for positions?”. Well, I did attempt this at first. However, due to the conversion code from float to int16_t in the movement code, the performance actually suffered! ai_tick ran in 6.9 seconds instead of 4.7s, far worse than we would like.

There’s another cost to this size reduction. Our code is no longer vectorized!

If we take a look at our assembly, we see this:

 lea         ebx,[rcx-3]
 mov         word ptr [rsi+rbx*2],0
 lea         ebx,[rcx-2]
 mov         word ptr [rsi+rbx*2],333h

That’s no good! We don’t need these scalar stores; we can vectorize this much better.

As a result, I re-implemented our vectorization code with the new 16 bit data:

__m128i searchAndScale = _mm_set_epi16(FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale);
for (uint32_t i = 0; i < AI_FarmerSearchCount; i+=4)
{
    _mm_storeu_si128((__m128i*)&buffer->spriteIndicesAndScales[(writeIndex + i) * 2], searchAndScale);
}
writeIndex += AI_FarmerSearchCount;

And our assembly now looks like this:

 vmovups     xmmword ptr [rsi+rdx*2],xmm0

Much better, we’re now storing 8 16 bit values in a single vmovups instruction.

chrome://tracing now tells us that game_gen_instance_buffer is at 1.6ms, down from 1.8ms, and our average tick is now 6.8ms.

And VTune:

VTuneSample16

Says that we’ve reduced game_gen_instance_buffer another 0.2 seconds.

Link to this change is here.

Non-Temporal Stores?

Following this optimization, I looked into ways to beat memcpy. I was attempting to improve the performance of game_gen_instance_buffer by replacing memcpy with my own implementation. Despite my best attempts, I failed to beat it. However, I stumbled onto _mm_stream_si128. What’s interesting about _mm_stream_si128 is that it’s a store with a non-temporal memory hint. If you don’t know what that means, don’t worry, I didn’t either. But after some research I have a decent grasp on the concept. When using _mm_store_ps, we have to load the target cache line into our cache. This is problematic, as we don’t need this data in the cache, and it will evict data that we might need. Instead, we use _mm_stream_si128, which uses its own write-combining buffers that don’t pollute the main cache and (if I understand correctly) doesn’t load the memory into cache at all; it writes directly to memory. (Sources can be found here, here, here and here)

The write code now looks like this:

__m128i searchAndScale = _mm_set_epi16(FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale);
_mm_storeu_si128((__m128i*)&buffer->spriteIndicesAndScales[writeIndex * 2], searchAndScale);
for (uint32_t i = (4 - writeIndex % 4); i < AI_FarmerSearchCount; i+=4)
{
    _mm_stream_si128((__m128i*)&buffer->spriteIndicesAndScales[(writeIndex + i) * 2], searchAndScale);
}

You might notice the _mm_storeu_si128 followed by the _mm_stream_si128. The reason for the unaligned store is that our crops could have left our write index unaligned to 16 bytes, so we write a few elements with an unaligned store and then stream the rest of our aligned memory.

chrome://tracing says that we’re now at an average of 1.5ms for game_gen_instance_buffer and our average tick is now 6.6ms.

VTuneSample17

VTune says that our game_gen_instance_buffer call is now at 0.4s from 0.7s, excellent! Unfortunately, memcpy is taking up quite a bit of time. We might address this in a future part of the series, but for now, we’ve reduced game_gen_instance_buffer from 20s to 2.3s and its tick from 12ms to 1.5ms.

Link to this change is here.

Bringing it all to harvest

That’s it for part 3! We made quite a few changes in this part:

  • We improved our memory access for Crops
  • We implemented draw commands for crops and tiles
  • We split the data from a single buffer into 3 buffers
  • We shrank our sprite indices and scales to 2 16 bit integers instead of 2 floats
  • We re-SIMD-ed the 16 bit stores
  • We used non-temporal stores to improve the performance of our stores

As a result, we’re now at 2.3s for game_gen_instance_buffer instead of 20s! Far better than what we started with.

Part 4?

In part 4 we’re going to take a look at ai_tick again. This time, we’ll try to improve it with computational usage of SIMD and if we manage to become compute bound, we’ll take a look at the assembly and try to reduce the amount of work done. If we remain back-end bound, we’ll take a look at various attempts to reduce the amount of memory accessed.

Find it on github here!

Until Part 4!