It begins

In Part 1 we looked at the overall implementation of the farming simulator and ran the baseline benchmark with chrome://tracing. I also skimmed over a few changes I made to the original implementation in order to gather more consistent benchmarks. The changes are (sketched below):

  • Run exactly 1000 frames of the program.
  • Force the delta time to 0.016f in order to have consistent AI behaviour even if the benchmark runs quickly or slowly.
  • Random is not seeded with the time; this keeps the randomness consistent across runs of the application.
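
A minimal sketch of what that harness might look like; BENCHMARK_FRAME_COUNT and the game_tick signature are placeholders, not necessarily the project's real names:

#include <stdint.h>

extern void game_tick(float delta); // Stand-in for the per-frame simulation update

#define BENCHMARK_FRAME_COUNT 1000 // Run exactly 1000 frames

void benchmark_run(void)
{
    // No srand(time(NULL)) call: rand() keeps its default seed, so every
    // run of the benchmark makes the same "random" decisions.
    const float delta = 0.016f; // Forced delta time for consistent AI behaviour

    for (uint32_t frame = 0; frame < BENCHMARK_FRAME_COUNT; frame++)
    {
        game_tick(delta);
    }
}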

The program ran at 42ms per tick, and 25ms of each tick was taken up by the AI state machine! Without running VTune, I had a few assumptions as to where the performance of the AI code could be going. Let's look at the code.

Our AI_Farmer structure:

typedef struct
{
    Vec2 pos;
    AI_FarmerState state;

    union
    {
        struct
        {
            Vec2 vel;
            Field_Tile* tile;
        } moveState;

        struct
        {
            float farmTimer;
            Field_Tile* tile;
        } farmState;

        struct
        {
            float searchTimer;
        } searchState;
    };
} AI_Farmer;

And our ai_tick code:

for (uint32_t ai = 0; ai < AI_FarmerCount; ++ai)
{
    AI_Farmer* farmer = &AI_Farmers[ai];

    switch (farmer->state)
    {
    case FarmerState_Search:
    {
        farmer->searchState.searchTimer -= delta;
        farmer->searchState.searchTimer = math_maxf(farmer->searchState.searchTimer, 0.0f);

        if (farmer->searchState.searchTimer <= 0.0f)
        {
            // Search code...
        }

        break;
    }
    case FarmerState_Move:
    {
        Field_Tile* tile = farmer->moveState.tile;
        float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
        Vec2 vel = vec2_mul(farmer->moveState.vel, delta);
        vel = vec2_mul(vec2_norm(vel), math_minf(vec2_mag(vel), currentDist));
        farmer->pos = vec2_add(farmer->pos, vel);

        float dist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
        if (dist < AI_FarmerCropRadius)
        {
            // Move code...
        }

        break;
    }
    case FarmerState_Farm:
    {
        farmer->farmState.farmTimer -= delta;
        if (farmer->farmState.farmTimer <= 0.0f)
        {
            // Farm Code...
        }
        break;
    }
    }
}

Looking at this code snippet, we can anticipate a few problems. The most important ones: branch mispredictions and cache misses.

Branch Prediction

That switch statement will hit us hard with branch mispredictions. The states of the farmers will get more and more disjointed as the program execution moves forward, essentially making the state variable random: a branch predictor’s worst nightmare.

A branch predictor is a hardware component of your CPU that guesses at the result of a branch. Why does it need to guess? The answer to that is complex and there are plenty of books and resources that address these topics. I will try my best to summarize the important details.

Modern CPUs use a pipelined approach to executing instructions, which means that they work on several instructions at once. (A great introduction to this topic can be found here.) However, there is a problem when one of those instructions is a conditional jump, because the conditional jump relies on the result of a compare instruction. If that compare instruction hasn’t completed yet, the conditional jump has to wait! This stalls our pipeline and reduces the overall number of instructions executed. To address this, CPUs will instead guess at the result of the comparison.

In short, the branch predictor “guesses” at the result of a conditional jump (branch) based on the local and global history of that branch. The CPU then speculatively executes the instructions along the predicted path. If the branch predictor guessed wrong, the CPU has to throw away all the work it did and start again from the correct instruction. (A few sources can be found here, here and here.) However, if the branch predictor guessed correctly, it’s almost as if there was never a branch in the first place! If the branch results are essentially random, the branch predictor will miss far more often than we’d like, which can hurt performance. (To what extent varies.) (A great introduction on branch predictors can be found here.)
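
To make this concrete, here’s a toy loop (not from the simulator) where the branch’s predictability is the whole story. With random data the condition is a coin flip and mispredictions pile up; sort the array first (e.g. with qsort) and the exact same loop typically runs much faster because the branch becomes predictable:

#include <stdint.h>

// Sums all values above a threshold. The work is identical either way;
// only the predictability of the 'if' changes.
int64_t sum_above(const int32_t* values, uint32_t count, int32_t threshold)
{
    int64_t sum = 0;
    for (uint32_t i = 0; i < count; i++)
    {
        // On random data this branch mispredicts roughly half the time;
        // on sorted data it's false for a long run, then true for a long run.
        if (values[i] > threshold)
        {
            sum += values[i];
        }
    }
    return sum;
}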

Cache Misses

One good thing to come out of this code is that the farmers are stored in an array and accessed linearly; that’s a win for our memory access speeds!

Why do we care about memory access speeds? Because memory accesses are slow! The ideal case is for our data to be in the L1 cache just before we use it. (A rundown of the memory hierarchy can be found here and here.) Our current implementation facilitates that: since we’re accessing our farmers linearly, the memory prefetcher will happily prefetch the next farmers. (More information about prefetching can be found here on wikipedia and in the intel optimization guide.) This is great, because the latency of loading our farmers from RAM is hidden by the prefetcher. However, if we look at the size of our AI_Farmer structure, we can see that it’s 32 bytes in every case. At the time of writing, cache lines are generally 64 bytes (source here), and we’re using up half of one with data that we only use some of the time.
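
Assuming Vec2 is two 32-bit floats, the state enum is 4 bytes and pointers are 8 bytes (a typical 64-bit target), we can sanity-check that size at compile time:

// 8 (pos) + 4 (state) + 4 (padding for the pointer-aligned union)
// + 16 (largest union member, moveState) = 32 bytes
_Static_assert(sizeof(AI_Farmer) == 32, "AI_Farmer expected to be 32 bytes");

// Only two farmers fit in a 64-byte cache line, even when all we
// need from each of them is a 4-byte timer.
_Static_assert(64 / sizeof(AI_Farmer) == 2, "two farmers per cache line");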

VTune

With these considerations in place, I ran the program through VTune.

[Image: VTuneSample3]

These sampling results confirm the chrome://tracing finding that ai_tick is taking up a significant portion of our execution time! I was surprised to find that only 6.5% of the execution time of ai_tick was attributed to branch mispredictions. The results do confirm, however, that a significant portion of our execution time is spent waiting on memory.

Low effort, high reward

We’re going to start addressing some of the problems with our memory access patterns. Digging into the profiling results will show us that a pair of lines has quite a dramatic effect on our performance.

Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));

Looking at the assembly for this snippet, we get:

mov         rbx,qword ptr [rbp+rdi+1E84818h]
movss       xmm2,dword ptr [rbx+0Ch]
movss       xmm4,dword ptr [rbx+10h]

VTune attributes 49% of our back-end-bound work to the first movss. Why is that? Looking at the C code, we see that we’re dereferencing a tile pointer. If you recall from Part 1, that tile was picked randomly. There’s no way the memory prefetcher could have guessed that we would access that tile, so we’re stuck waiting for its memory to be loaded into our registers.
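
You can reproduce this pattern outside the simulator: walking an array in order lets the prefetcher hide the latency, while chasing randomly ordered indices leaves every load exposed. A toy illustration (assuming Field_Tile’s pos is a plain Vec2 with an x member):

// Same amount of arithmetic, very different memory behaviour.
float sum_linear(const Field_Tile* tiles, uint32_t count)
{
    float sum = 0.0f;
    for (uint32_t i = 0; i < count; i++)
        sum += tiles[i].pos.x; // Sequential: the prefetcher streams this
    return sum;
}

float sum_shuffled(const Field_Tile* tiles, const uint32_t* order, uint32_t count)
{
    float sum = 0.0f;
    for (uint32_t i = 0; i < count; i++)
        sum += tiles[order[i]].pos.x; // Unpredictable address: likely a cache miss
    return sum;
}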

There is a very simple solution to this problem, and it works because we know that the tile won’t be moving: we cache the tile’s position in our move state instead of dereferencing the tile in our movement code.

struct
{
    Vec2 vel;
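    // Cached copy of tile->pos; spares the hot path a pointer dereference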
    Vec2 tilePos;
    Field_Tile* tile;
} moveState;
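
The hot path of the movement code then reads the cached copy instead of chasing the pointer, roughly:

// Before: Vec2 tilePos = farmer->moveState.tile->pos; // cold, randomly located memory
Vec2 tilePos = farmer->moveState.tilePos; // hot, right next to the farmer's own data
float currentDist = vec2_mag(vec2_sub(tilePos, farmer->pos));

The tile pointer stays in the struct because we still need it when we actually reach the tile.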

This simple change transformed our ai_tick function from being the slowest function to being faster than game_gen_instance_buffer (which we will look at in the future).

[Image: VTuneSample4]

Comparing against our original snapshot, ai_tick used to be 79% back-end bound; now it’s 63% back-end bound and takes only 17s instead of 24s.

A quick run through chrome://tracing shows that our average tick dropped from 42ms to 30ms, which is 71% of the original. Our ai_tick also dropped from 25ms to 11ms, only 44% of the original 25ms! Our ai_tick is now running more than 2x faster than in our original benchmark, simply from changing our memory access patterns.

However, there’s a trade-off to this caching: our structure has grown by 8 bytes and now stands at 40 bytes per farmer.

Link to this change is here.

More effort

If we take a look at the VTune sample, we notice that game_gen_instance_buffer has become our slowest function. However, we’ll ignore it for now and keep hacking away at the ai_tick function.

This next optimization will be more work than the previous one. We’re going to split the AI_Farmer structure into 3, one for each state. Once these structures are created, we’re going to store them in separate buffers. There are a couple of good reasons for doing this:

  • We will avoid the switch statement altogether; states will be processed together instead of separately. The best way to avoid branch mispredictions is to not have any branches to predict in the first place.
  • We can tightly pack our data instead of reserving enough space for all of our states. Currently, our farmer takes up the same amount of space no matter how much memory its state actually needs. Splitting the states up will allow some of them to take up less memory than others. A win for cache occupancy! We’ll be able to process more farmers per cache line.

Our structures now look like this:

typedef struct
{
    Vec2 pos;
    Vec2 vel;
    Vec2 tilePos;
    Field_Tile* tile;
} AI_FarmerMoveState;

typedef struct
{
    Vec2 pos;
    float farmTimer;
    Field_Tile* tile;
} AI_FarmerFarmState;

typedef struct
{
    Vec2 pos;
    float searchTimer;
} AI_FarmerSearchState;

Now we can see that AI_FarmerMoveState takes up 32 bytes, AI_FarmerFarmState takes up 24 bytes and AI_FarmerSearchState takes up 12 bytes, all smaller than our original 40 bytes. A great way to make more use of our cache lines.

You will notice that we’re duplicating data in order to improve our performance. This optimization would not be as straightforward if we had hard memory constraints, but thankfully our toy program doesn’t mind at all.

Our new ai_tick function looks roughly like this:

uint32_t previousFarmerMoveCount = AI_FarmerMoveCount;
{
    for (uint32_t i = 0; i < AI_FarmerSearchCount; i++)
    {
        AI_FarmerSearchState* farmer = &AI_FarmersSearch[i];

        // Search Code...
    }
}

{
    // The search loop above may have appended new movers to this buffer;
    // only tick the farmers that were moving at the start of the frame.
    for (uint32_t i = 0; i < previousFarmerMoveCount; i++)
    {
        AI_FarmerMoveState* farmer = &AI_FarmersMove[i];

        // Move Code...
    }
}

{
    for (uint32_t i = 0; i < AI_FarmerFarmCount; i++)
    {
        AI_FarmerFarmState* farmer = &AI_FarmersFarm[i];

        // Farm Code...
    }
}

You will notice that there’s no more switch statement! That’s right, it’s completely unnecessary at this point. Our number of branch mispredictions should drop significantly.
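
One detail the snippet above glosses over is how a farmer changes state now that each state lives in its own buffer. A common approach, and roughly what happens here (the real code’s bookkeeping may differ), is to append to the destination buffer and swap-remove from the source so both buffers stay densely packed:

// Inside the search loop, once farmer i has picked a tile:
AI_FarmerMoveState* moveFarmer = &AI_FarmersMove[AI_FarmerMoveCount++];
moveFarmer->pos = farmer->pos;
// ... fill in vel, tilePos and tile for the new mover ...

// Swap-remove: copy the last search farmer into slot i and shrink the buffer.
AI_FarmersSearch[i] = AI_FarmersSearch[AI_FarmerSearchCount - 1];
AI_FarmerSearchCount--;
i--; // Revisit slot i, which now holds the swapped-in farmer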

Now we can take out our trusty profiler and see that our ai_tick performance has once again improved.

[Image: VTuneSample5]

We are still heavily back-end bound; however, our branch mispredictions dropped from 13% to 1.4%, and our CPI (cycles per instruction) improved as well! Our ai_tick now takes 11s instead of 17s.

If we take a look at chrome://tracing, we see that our average tick dropped from 30ms to 25ms and our ai_tick dropped from 11ms to 8ms! Pretty good savings for changes as simple as these!

Link to this change is here.

A song of ice and fire

Now that we’ve split the AI_Farmer structure into multiple structures, we can push this even further. What we want to do next is split the data into “hot” and “cold” data. The hot data is the data that we use on every iteration (timers, velocity, target position) and the cold data is the data that we only use when a condition is met (we’ve reached a tile, we’re done farming, we’ve found a tile).

We want to do this in order to improve our cache line usage. If we’re only using 4 bytes out of 12 (search) or 24 (farm) in a hot loop, we’re wasting plenty of cache space on data that we might not even need!

We’re incurring the cost of a cache miss when we reach the cold data as a trade-off for faster access to our hot data.

typedef struct
{
    Vec2 pos;
    Vec2 vel;
    Vec2 tilePos;
} AI_FarmerMoveStateHot;

typedef struct
{
    Field_Tile* tile;
} AI_FarmerMoveStateCold;

typedef struct
{
    float farmTimer;
} AI_FarmerFarmStateHot;

typedef struct
{
    Vec2 pos;
    Field_Tile* tile;
} AI_FarmerFarmStateCold;

typedef struct
{
    float searchTimer;
} AI_FarmerSearchStateHot;

typedef struct
{
    Vec2 pos;
} AI_FarmerSearchStateCold;
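
The hot and cold buffers are kept in lockstep: the farmer at index i in a hot buffer has its cold data at index i of the matching cold buffer, so we only ever pull a cold cache line in on a state transition. A sketch of the farm loop under that scheme (the buffer names and AI_FarmerCountMax are illustrative, not the project's real names):

// Parallel buffers; index i refers to the same farmer in both.
static AI_FarmerFarmStateHot AI_FarmersFarmHot[AI_FarmerCountMax];
static AI_FarmerFarmStateCold AI_FarmersFarmCold[AI_FarmerCountMax];

for (uint32_t i = 0; i < AI_FarmerFarmCount; i++)
{
    AI_FarmerFarmStateHot* farmer = &AI_FarmersFarmHot[i];
    farmer->farmTimer -= delta; // The hot loop touches only this 4-byte timer

    if (farmer->farmTimer <= 0.0f)
    {
        // Only on a transition do we pay for the cold cache line.
        AI_FarmerFarmStateCold* cold = &AI_FarmersFarmCold[i];
        // ... use cold->tile and cold->pos to transition back to search ...
    }
}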

This change should give us slight savings by improving our cache line utilization.

[Image: VTuneSample6]

Looking at our VTune profile, we see that we only saved 1 second off of our previous 11 seconds. Not nearly as much as we saved before; however, this change will allow us to leverage SIMD later, so we’re going to keep it.

Running the program with chrome://tracing we see that our program is still at 25ms per tick but our ai_tick dropped to 7.5ms.

Link to this change is here.

Cutting out the excess

Finally, we’re going to take a look directly at AI_FarmerMoveStateHot:

typedef struct
{
    Vec2 pos;
    Vec2 vel;
    Vec2 tilePos;
} AI_FarmerMoveStateHot;

pos represents our farmer’s position, vel represents our farmer’s constant velocity towards the tile and tilePos represents our target tile’s position. What you might notice if you look back at the original ai_tick code is that all farmers move at the same speed. With a small change to our movement code, we can completely remove the velocity from this structure, saving even more cache space!

typedef struct
{
    Vec2 pos;
    Vec2 tilePos;
} AI_FarmerMoveStateHot;

Now, taking a look at our movement code:

Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
Vec2 vel = vec2_mul(farmer->moveState.vel, delta);
vel = vec2_mul(vec2_norm(vel), math_minf(vec2_mag(vel), currentDist));
farmer->pos = vec2_add(farmer->pos, vel);

float dist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
if (dist < AI_FarmerCropRadius)
{
    // Switch to farm state
}

We can simplify this a lot. We have 3 vec2_mag calls and 1 vec2_norm call that internally calls vec2_mag, and vec2_mag internally invokes sqrt. Most of the time, sqrt isn’t a huge performance bottleneck, but when you’re processing over a million sqrts per tick, you start seeing its impact.

Vec2 tilePos = farmer->tilePos;

float velMag = AI_FarmerSpeed * delta;
Vec2 dirVec = vec2_sub(tilePos, farmer->pos);
float mag = vec2_mag(dirVec);

Vec2 vel = vec2_mul(dirVec, (1.0f / mag) * velMag);
farmer->pos = vec2_add(farmer->pos, vel);

// If our velocity is larger than our magnitude, we're going to move past the tile
// This is a good enough metric to determine that we've reached the tile
if (velMag > mag)
{
    // Hardset our position to the tile, this is better than a tile radius!
    farmer->pos = farmer->tilePos;
    // Switch to farm state
}

We modified our code to keep the same functionality, but we’ve simplified the math significantly by combining terms, spotting redundant calculations and removing unnecessary logic. Our movement code is now significantly simpler, with only 1 vec2_mag call instead of 4, and we’re also using less of our cache space! Let’s take a look at VTune for our results.

[Image: VTuneSample7]

ai_tick is now cheaper than both game_tick and game_gen_instance_buffer. It runs in only 5s instead of the 10s from our last change, it’s only 50% back-end bound, and our CPI dropped below 1!

Running this through chrome://tracing, we see that our average tick is now 21ms and our average ai_tick is 4.5ms!

This is the end of our focus on ai_tick for now. We’ve done a good job improving its performance from 32s to 6s, a 5x improvement!

Link to this change is here.

Phew! What did we do again?

We did 4 things to improve the performance of the ai_tick function:

  • Cached the target tile’s position instead of dereferencing a tile pointer that was unlikely to be in the cache.
  • Split the AI_Farmer struct into 3 different structs, one for each state. This reduced our branch mispredictions and also improved our cache utilization thanks to the smaller struct sizes.
  • Split the state structs into hot and cold data. Hot data is the data we access every loop iteration; cold data is the data we access infrequently. This is also a nice setup for future SIMD work.
  • Removed the velocity vector from the movement struct. This reduced the size of that struct by 8 bytes, letting us fit even more farmers into each cache line.

What does the future look like?

In Part 3, we’re going to take a look at game_gen_instance_buffer. As we saw in our profiling results, it’s dominating our performance at 21s, compared to 8s for game_tick and 6s for ai_tick!

A little birdie told me that we were slowed down by both reading and writing memory, so we’re going to look into ways of reducing the cost of writing all that data to memory. Some tricks that we’re going to look at are compression, dirty commands, SIMD and some more data manipulation. It’s going to be fun!

Find it on github here!

Until Part 3!