Intro

I recently ran a workshop at my local tech meetup about optimizing a small farming simulator that I wrote. Unfortunately, a few people couldn’t make it to the meetup. As a result, I decided to write a blog series on optimizing this small farming simulator in order to share the insights gleamed at the workshop and also in an attempt to push the optimizations even further than the few hours provided by the workshop.

The goal of this series will be to push this simulator to execute in as small a time frame as possible. Follow along and provide your own insights as we work towards our lofty goal.

What will we be tackling?

We’ll be touching on various performance topics as we optimize this program and I hope that you’ll be able to gather some valuable insights along the way. (Or maybe you’ll provide valuable insights to me! :o). Some of the technologies we’ll be touching on are SIMD, Cache, Branch Prediction, Assembly and more! (Maybe…)

Unfortunately, we won’t be tackling the rendering performance or introducing multi-threading. Not only because this would cause the series to span far longer than I’d like but also because I am not skilled enough in these disciplines to provide valuable insights.

Code, code and more code

Let’s take a look at the code shall we?

The program is very simple. Farmer’s use a simple state machine that consists of 3 states. Searching, Moving and Farming.

The farmer structure is simple, it consists of a position, a state enum that dictates which state the farmer is currently in and a union that holds the information for each state.

typedef struct
{
    Vec2 pos;
    AI_FarmerState state;

    union
    {
        struct
        {
            Vec2 vel;
            Field_Tile* tile;
        } moveState;

        struct
        {
            float farmTimer;
            Field_Tile* tile;
        } farmState;

        struct
        {
            float searchTimer;
        } searchState;
    };
} AI_Farmer;

Search Code:

The search code simply decrements a timer, picks a random field tile and if that tile hasn’t been planted yet, targets it and finally changes to the move state. If the farmer failed to find an un-planted tile, it simply restarts the timer.

farmer->searchState.searchTimer -= delta;
farmer->searchState.searchTimer = math_maxf(farmer->searchState.searchTimer, 0.0f);

if (farmer->searchState.searchTimer <= 0.0f)
{
    uint32_t tileIndex = rand_range(0U, Field_Width * Field_Height);
    Field_Tile* tile = &Field_Tiles[tileIndex];

    if (tile->stage != FieldStage_Planted)
    {
        farmer->state = FarmerState_Move;
        farmer->moveState.tile = tile;

        farmer->moveState.vel = vec2_mul(vec2_norm(vec2_sub(tile->pos, farmer->pos)), AI_FarmerSpeed);
    }
    else
    {
        farmer->searchState.searchTimer = rand_rangef(AI_FarmerSearchSpeedMin, AI_FarmerSearchSpeedMax);
    }
}

Move Code:

The move code is also simple, the farmer will move towards it’s target and once it enters a radius from the tile, it moves to the farm state to begin farming the tile.

Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
Vec2 vel = vec2_mul(farmer->moveState.vel, delta);
vel = vec2_mul(vec2_norm(vel), math_minf(vec2_mag(vel), currentDist));
farmer->pos = vec2_add(farmer->pos, vel);

float dist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
if (dist < AI_FarmerCropRadius)
{
    farmer->state = FarmerState_Farm;
    farmer->farmState.farmTimer = rand_rangef(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
    farmer->farmState.tile = tile;
}

Farm Code:

The farm code decrements a farming timer. Once the timer is complete, the farmer will determine what to do based on the state of the tile. If the tile is already grown, it will release the crop memory (harvesting). If the next state of the field tile was planted, the farmer will allocate memory for the crop and assign it a random crop type. Finally, the farmer moves back to the search state to find another tile.

farmer->farmState.farmTimer -= delta;
if (farmer->farmState.farmTimer <= 0.0f)
{
    Field_Tile* tile = farmer->farmState.tile;

    if (tile->stage == FieldStage_Grown)
    {
        free(tile->crop);
        tile->crop = NULL;
    }

    tile->stage = math_max((tile->stage + 1) % FieldState_Max, FieldStage_Fallow);

    if (tile->stage == FieldStage_Planted)
    {
        tile->crop = (Field_Crop*)malloc(sizeof(Field_Crop));
        tile->crop->grown = rand_rangef(Crop_MinLifetime, Crop_MaxLifetime);
        tile->crop->cropType = rand_range(0, Crop_MaxCropType);
    }

    farmer->state = FarmerState_Search;
    farmer->searchState.searchTimer = rand_rangef(AI_FarmerSearchSpeedMin, AI_FarmerSearchSpeedMax);
}

Finally, crops use a very simple timer to determine if they have completed their growth cycle:

for (uint32_t i = 0; i < Field_Width * Field_Height; ++i)
{
    if (Field_Tiles[i].stage == FieldStage_Planted)
    {
        Field_Crop* crop = Field_Tiles[i].crop;
        crop->lifetime += delta;

        if (crop->lifetime >= crop->grown)
        {
            Field_Tiles[i].stage = FieldStage_Grown;
        }
    }
}

Technologies

My IDE of choice is Visual Studio 2017 community, however instead of using the Visual Studio Compiler, I will be using Clang with the -Ofast flag as my optimizing compiler. Additionally, I will be using VTune to gather performance metrics as well as my home-grown trace profiling library Mist_Profile that utilizes Chrome://tracing for trace profiling. VTune will be used for our profiling work, but Mist_Profile will allow us to get a feel for where our performance is every frame. The project also makes use of sokol for graphics, windowing and timings, as well as stb_image for well… images. We’ll also be coding in C11 instead of C++. This shouldn’t look too different from simple C++ therefore those who are comfortable with C++ should still feel at home in this series.

My CPU is an Intel Core i7 4702HQ with 4 cores running at 2.20GHZ, 4 x 32kb L1 cache (8 way associative), 4 x 32kb L1 icache (8-way), 4 x 256kb L2 cache (8 way) and 6mb L3 cache. I also have 8gb of DDR3 DRAM.

What’s next?

Some might already see plenty of optimization opportunities. But before we even determine what needs to be optimized, we’re going to run the program through VTune and determine what we should tackle first. That will be our first step in the next part of this series.

In order to get a benchmark for future work, I executed a quick run of the program. Looking at the trace can determine that our average tick time is 57.676ms. Quite far from an interactive 16ms (without rendering!). Let’s see if we can bring it down to less than 16ms as the series goes on! (JSON Trace here, plug it into Chrome://Tracing to take a look!)

In Part 2 we’ll be running VTune and tackling the AI performance which currently runs at an average of 36.166ms per frame. In order to improve it, we’ll be improving our cache utilization as well as reducing the number of branch mispredictions.

Edit: While working on part 2, I noticed that my laptop was running on power saver mode. This was rectified and I determined that the average tick is of 42ms and the average AI tick is of 25ms! …oops…

Github link is here!

Links and thanks!

I’d be remiss if I didn’t share Pierre Terdiman’s excellent series on optimizing his box pruning library as well as Aras Pranckevičius’ dod playground for unity employees that prompted me to devise the initial workshop after some performance junkies and I attempted to push the performance of his playground as much as we could.

Finally, a big thanks to everyone who came to the initial workshop and my friends and colleagues who endlessly challenge themselves and share their knowledge as well as those who proof read this post.

Until Part 2!