Cranberry Lair

Walking the programming parameter space

Diary of a Path Tracer – Math Library — August 22, 2020

Diary of a Path Tracer – Math Library



This is the first in a series of posts detailing the functionality of my hobby path tracing, we will be looking at the underpinning’s of it all, the math library cranberry_math.

The goals of cranberry_math are simple. Making SIMD accelerated mathematics easier and fun to use. There are already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few), however I decided to write my own for fun!


The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple. You can imagine that a SIMD register (or vector register) is a set of 4 “lanes”, these lanes are generally independent of each other but allow you to perform operations on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as v_n (v_0, v_1, v_2) and the element of one of those registers as v_ne_m (v_0e_0) where v_0e_0 would be element 0 of SIMD register 0.

I.e. if I take v_0 and add it to v_1 the result is the equivalent of v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3)


On it’s own, this might not be particularly useful. But vector register operations have a number of desirable properties.

Let’s imagine we have a fantasy processor, and this processor takes 1 cycle for every addition. This means that to add 4 pairs of numbers together it would take us roughly 4 cycles or 1 cycle per add. Now if our vector addition takes 4 cycles to complete, that’s still 1 cycle per add. However, consider what happens if by some magic we only take 3 cycles for our vector addition. That means overall, we only spend 3/4=0.75 cycles per add! In our fantasy processor, we can make this as low as 1 cycle for 4 adds or 1/4=0.25 cycles per add. You may ask why we can make our instruction run in less than 4 cycles even though we’re still doing 4 adds? In our fantasy processor, we can do this  by simply running 4 adders at once with our data. This means that although every adder is taking 1 cycle to complete, they are run in parallel, making the overall throughput 1 cycle/4 adds.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently, we could reduce our program latency by at most 4x! (In reality this is unlikely to happen, but we can get fairly close).

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X,Y,Z components into each component such as v_0 = (x, y, z, unused) and working with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over your original single element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles/dot product. I propose (and has been proposed elsewhere as well) that you instead use each register to store a single component of a vector. Such as v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), etc. With this in mind, we can achieve the dot product by doing v_x*v_x+v_y*v_y+v_z*v_z which is a total of 5 operations that all take 1 cycle in our fantasy processor for 4 vectors or a throughput of 1.25 cycles/dot product.

For SSE according to the intel intrinsics guide a multiply operation has a latency of 4 cycles (on Skylake), an add has a latency of 4 cycles. As a result a dot product in a fantasy processor with no out of order functionality or multiple issue, our wide dot product would take roughly 20 cycles to complete or 20/4=5 cycles per dot product. While if we look at the dot product instruction, it has a latency of 11 cycles. (See [1] and [2] for info on reading the intel intrinsics guide values) This is not indicative of real performance numbers. Modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not fool proof. It introduces difficulty if we want to add similar components together. (x with x, y with y) within one vector, I however have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
cran_align(16) typedef union
	float f[cran_lane_count];
	__m128 sse;
} cfl;

typedef union
		float x, y, z;

} cv3;

typedef struct
	cfl x;
	cfl y;
	cfl z;
} cv3l;

The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within it’s respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has it’s own set of lanes? That is a direct implementation of the explanation that was presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);

The API for cv3l is simple, it looks exactly the same as it’s cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);

One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format. It doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate locations. However this is surprisingly slow (I forget the exact reasons, a reference for that would be awesome).

Instead, what you can do is load your vectors as a set of vectors:

v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)

and then shuffle the values into a set of temporary registers

v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)

and finally shuffle them into the final registers

v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)


(There’s a mistake with the arrows for Z_2 and Z_3 they should be pointed to the 4 register)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
	__m128 loadedVectors[cran_lane_count];
	for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
		uint8_t const* vectorData = (uint8_t*)vectors;
		loadedVectors[i] = _mm_load_ps((float const*)(vectorData + indices[i] * stride + offset));

	__m128 XY0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 XY1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 Z0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(3, 2, 3, 2));
	__m128 Z1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(3, 2, 3, 2));

	return (cv3l)
		.x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
		.y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
		.z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}

With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
	cv3l rayOLanes = cv3l_replicate(rayO);
	cv3l invD = cv3l_replicate(cv3_rcp(rayD));
	cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
	cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

	cv3l tsmaller = cv3l_min(t0s, t1s);
	cv3l tbigger  = cv3l_max(t0s, t1s);

	cfl rayMinLane = cfl_replicate(rayMin);
	cfl rayMaxLane = cfl_replicate(rayMax);
	cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
	cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
	cfl result = cfl_less(tmin, tmax);
	return cfl_mask(result);

That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here:

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);

We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!






Diary of a Path Tracer – The Beginning —

Diary of a Path Tracer – The Beginning

Within the past few months I’ve been implementing a path tracer during my weekends and evenings. Having reached what I consider to be an acceptable stopping point, I thought it might be interesting to detail my current implementation alongside the references that inspired me. Let’s jump in!

This series will touch on a variety of features that were implemented in the path tracer. Notably:

These posts will try to explain every aspect of the path tracer as they stand but they might not be as comprehensive as the sources referenced to create it. As a result, I will strive to include all relevant sources for each feature.

You can find the path tracer here:

Next we’ll talk a about the underpinning of the path tracer, the cranberry_math library, see you then!

Derivation – Importance Sampling The Cosine Lobe — June 7, 2020

Derivation – Importance Sampling The Cosine Lobe


I’ve recently been diving into the world of importance sampling and I decided to share my derivation for importance sampling the cosine lobe.


When we’re shading a point in path tracing, we typically shoot a ray from our surface in a uniformity random direction contained on a hemisphere centered about our normal. This has the downside of introducing quite a bit of variance into our renders.

Imagine that we have a very bright light that only occupies a very small projected area from our shaded point.


We would be very likely to miss this light with most of our samples and our render could turn out much darker than we would expect.

This is where importance sampling comes in.

If you imagine that your illumination is a polar function


If we were to sample a random variable with a distribution that matches this function, we would be much more likely to hit the important points of our function. (Hence importance sampling)

I won’t dive deeply into this topic, as there are a variety of excellent resources detailing this topic. [1]

The essence of it however, is that you want to find a Probability Density Function (PDF) that matches the shape of your illumination function. Once you’ve define this PDF, you can sample it using the Cumulative Density Function (CDF).


Since our cosine lobe illumination will look like this:


We will use this as the basis to derive our distribution since we’re most likely to get the most light from directions arriving parallel to our normal.

Thankfully, our cosine lobe has an analytical formula that we can use as our PDF.

PDF(\omega) = C*cos(\theta) (1)

Our PDF must integrate to 1, we integrate the PDF across our hemisphere


\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi

Plug in (1)

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}C*cos\theta sin\theta d\theta d\phi

C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi

\int cos\theta sin\theta d\theta = -\frac{1}{4}cos2\theta

\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta

-\frac{1}{4}cos\pi+ \frac{1}{4}cos0


\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta=\frac{1}{2} (2)

Plug in (2)

C*\int_0^{2\pi}\frac{1}{2} d\phi


C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi=C*\pi (3)

Since our PDF has to integrate to 1,

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi = 1

Plug in (3),


C=\frac{1}{pi} (4)

Finally, plug in (4) into our PDF,

PDF(\omega) = \frac{cos(\theta)}{\pi} (5)

Now that we have our PDF, we can calculate our PDF in terms of \theta and \phi.

PDF(\theta,\phi)d\theta d\phi = PDF(\omega)d\omega

PDF(\theta,\phi)d\theta d\phi = PDF(\omega)sin\theta d\theta d\phi


PDF(\theta,\phi)=\frac{cos\theta sin\theta}{\pi} (6)

Now we integrate with respect to \phi to get PDF(\theta)

\int_0^{2\pi}\frac{cos\theta sin\theta}{\pi}d\phi = 2cos\theta sin\theta

PDF(\theta)=2cos\theta sin\theta

And then to get PDF(\phi),


\frac{cos\theta sin\theta}{2cos\theta sin\theta \pi}=\frac{1}{2\pi}


Now we want to calculate the CDF of each function,

CDF(\theta)=\int_0^\theta PDF(\theta) d\theta

CDF(\theta)=\int_0^\theta 2cos(\theta)sin(\theta) d\theta

CDF(\theta)=\int_0^\theta sin(2\theta) d\theta


CDF(\phi)=\int_0^\phi PDF(\phi) d\phi

CDF(\phi)=\int_0^\phi\frac{1}{2\pi} d\phi


Now we want to invert our CDF to sample it using our random variable y,





\frac{cos^{-1}(1-2y)}{2}=\theta (7)

For \phi,



y*2\pi = \phi (8)

Now we have our CDFs and our PDFs, we can finally calculate our direction.

In pseudo code you can simply do:



With these directions, you can now sample your scene:

\frac{SampleScene(SphericalTo3D(\theta, \phi))}{PDF(\omega)}

Plug in (5)

\frac{SampleScene(SphericalTo3D(\theta, \phi))\pi}{cos\theta}


That’s it! These formulas will sample the hemisphere where it is receiving more light defined by the cosine lobe. The results are pretty awesome.



Attempting Triangular Area Lights — May 16, 2020
A quest towards intuition – Where does that cosine come from? — April 1, 2020

A quest towards intuition – Where does that cosine come from?

I’ve recently been learning more about BRDFs and their properties and I stumbled across the concept of energy conservation. Energy conservation simply states that your BRDF should not emit more energy than it receives. To determine if a BRDF is energy conserving, you accumulate all the light emitted by a surface across a hemisphere centered at the normal of that surface and assure that the accumulated light is less than or equal to the amount of received light.

The mathematical equation for this is:


Where Li is the incoming light, i is the angle from our normal to our current accumulation direction.

Since Li is a constant, we can remove it and we’re left with:


Faced with this equation, I was struck. Where did that cosine come from?

Where did you come from? Where did you go? Where did you come from Cosine I Joe.

My first instinct was to brush it off. “It’s just the Lambert cosine factor”. False! The Lambert cosine factor was applied in Li which we removed.

I then began my slow decline into madness.


I was very confused. With a lot of stumbling I came up with various answers, each not as satisfying as I would like them to be.

Radiant exitance I yelped! Is it related to this “physics” and this “poynting vector” that you find when researching radiant flux? It might be, but I’m not proficient in the dark magic of physics. And so I moved on.

Can we instead think of it using the Helmholtz reciprocity principle that says that we can reverse our view and light vectors? This would allow us to flip the meaning of cos(i) from our viewing direction to our light direction and then simply say it’s the Lambertian cosine factor. But this didn’t feel right. We’re not talking about Lambertian materials, we’re talking about accumulating light from a generalized material. I decided to move on from this theory.

Finally, it struck me. The circle had done it! It’s always the circle.

Let’s forget everything we talked about. Forget about incoming radiance. Forget about 3D. Forget about Helmholtz. Forget about physics.

It’s just geometry

Let’s take a look at what we’re doing when we’re calculating how much light is being emitted by a surface in 2D.

If we start with the integration of the BRDF across a 2D circle without the cosine factor, we get:


Where dS is our integration segment.

What this means is that We’re splitting the circle into equal lines and multiplying our emitted light by the width of each viewing segment.



Let’s take a look at each integration part in succession. Our integration method implies that we want to multiply our current light value by the width of our integration, as a result, we’ll draw 2 parallel lines for each line segment to represent the shape that would be the result of our multiplication.

When viewed top down, everything is great!

What about at an angle?

Oops! That doesn’t line up with our observed area, our perceived area is containing more than our desired dA. What do we do about it?

Let’s look at it from the perspective of dA projecting onto a line at different angles


Notice how our area is getting smaller as our angle gets steeper? That’s what our integration segment should be doing to remove the extra area that we observed.

Let’s figure out the proportion of that shrinking area.

Notice that we can calculate the length of our projection using a right angle triangle (It always comes back to triangles).




Our hypotenuse is dA, our angle is i and our projected area is dP.
Remembering our identities, SOH CAH TOA, Sine of our angle is equal to our opposite over our hypotenuse. This gives us dP/dA = sin(i) which we can reformulate as dP=sin(i)*dA.

dP here is the area of our integration segment, it is the value by which we want to multiply our accumulated light. Now that we know how to get dP, we want to formulate dA in terms of our integration segment dS.

Since we know that dS from the top down (when i is \pi/2) matches dA we can say that dS = sin(\pi/2)*dA which simplifies to dS = dA, as a result, we can replace our dA in the above to get dP = sin(i)*dS.

We can therefore modify our integral to be \int_{\Omega}BRDF*dP and substituting dP we get \int_{\Omega}BRDF*sin(i)*dS. Close! But why is it a sine and not a cosine? This is because our angle i is the angle from the plane of our normal to our normal but the information that we have is the angle from our normal to our plane (using N dot V). Let’s call the angle from the normal to the viewing direction \theta. We know that \theta=\pi/2-i. If we solve for i we get \pi/2-\theta=i and substituting to get \int_{\Omega}BRDF*sin(\pi/2-\theta)*dS which finally becomes \int_{\Omega}BRDF*cos(\theta)*dS!


We’ve reached the end. This is where our cosine comes from!

Final thoughts

I spent quite some time trying to unravel this mystery, but it’s always exciting to get just that little bit of extra intuition. Hopefully this helps you understand it just a little bit more!

Side notes

Q: Why don’t we always do this cosine when integrating over a hemisphere?

A: Because we’re not always integrating a differential area (dA). It would be wrong to always apply that rule if we’re instead integrating a smaller hemisphere, or some other geometrical concept

Q: Why don’t we multiply our light by this cosine factor when we’re rendering?

A: Because it’s implicit in our perspective projection. When we’re viewing the surface it already covers less of our viewing surface when viewed at an angle. If we applied the cosine when rendering we would be applying it twice. We apply it here because we don’t have any perspective projection applied when doing our integration.




A Quest Towards Intuition – Why is depth interpolated as 1/z? — August 27, 2019

A Quest Towards Intuition – Why is depth interpolated as 1/z?


I have recently been attempting to improve my understanding of perspective projection. This included a variety of topics such as deriving a perspective projection matrix and understanding interpolation of vertex shader outputs in perspective. However, one topic that evaded me, was the surprising result that depth is interpolated as 1/z instead of z.

Continue reading

Think Piece – Expecting Failure — June 3, 2019
Creating an Optimized Transform Hierarchy — April 14, 2019
Sowing the seeds of speed – Part 5 – The Finale (For now?) — March 16, 2019

Sowing the seeds of speed – Part 5 – The Finale (For now?)

Recap of part 4

In part 4, we looked at SIMD for computations, more quantization, software prefetching and a bug fix. This time, we’re going to be looking at some more bug fixing, more SIMD and some surprising performance impacts!

Bug fixes

While I was debugging a new optimization for part 5, I ran into a few bugs in the program…

Farmer and crop removal had a bug in it. When we were removing these elements, we would loop from the start of the removed indices and we would take the last valid element from the back and copy it onto the current index we want removed. The problem with this solution, is that if our last valid element was also to be removed, we would copy it onto the current index. This is a problem, because we would need to remove it! It’s not a valid object anymore! Instead, if we loop from the back of the array, we will only replace our current index with an element that is guaranteed to be valid because the indices are ordered.

Farmer and crop removal also had a bug in it with the second call to simd_moveMaskToIndexMask. Before the fix, the call would only mask out the lower 8 bits from the indexMask. However, this is not correct, because the function itself only works with the lower 8 bits! As a result, the indices would all be the same. In order to modify this, I simply shifted the indexMask by 8 bits to the right.


simd_moveMaskToIndexMask(indexMask & 0xFF00);


simd_moveMaskToIndexMask((indexMask & 0xFF00) >> 8);

It would probably have been best to take a uint8_t as an argument instead of an unsigned int but I had misunderstood the functionality of the _pdep_u64 intrinsic used in simd_moveMaskToIndexMask.

Finally, there was also a bug in converting the result of a 16 bit compare to a move mask. Because there is no movemask intrinsic for 16 bit integer, I had used _mm256_packs_epi16(a, b) which I believed to convert the 16 bit integers to 8 bit integers at their respective element. However, I had not realized that these actually worked within lanes! The first 8 elements were from a and the next 8 were from b and then the next 8 were from a and then from b. I expected the format to be a, a, b, b. As a result, the movemask would end up incorrect!

Instead, I modified the code to execute the movemask on the 16 bit results and then converted the mask to an 8 bit movemask.

__m256i cmpRes = _mm256_cmpgt_epi16(zeroI, lifetime);
uint32_t moveMask = _mm256_movemask_epi8(cmpRes);
int indexMask = _pext_u32(moveMask, 0x55555555) & bitMask;

The magic happens in _pext_u32(moveMask, 0x55555555). Because we’re creating a movemask from 16 bit integers and not 8 bit integers, the mask bits will actually double! If the result of our compare was 0xFFFF, 0x0000, 0xFFFF, 0xFFFF then our movemask would be 11001111 which is not correct! We want our movemask to look like 1011. As a result, I used _pext_u32. _pext_u32 will take the bits corresponding set bits in mask (0x55555555) and pack them from least significant bit to most significant bit. This means that we’re taking all our even bits and packing them. Because 0x55555555 in binary is 01010101010101… we’re taking x1x0x1x1 from our movemask and packing them to 1011! (More cool bit tricks can be found here)

As a result to all these changes, the performance of the program degraded. I believe it is because we weren’t adding all the indices to our removal list, and we were also not adding the correct indices! Now we actually loop through all the correct indices and chrome://tracing says that our average ai_tick performance degraded from 1.67ms to 1.87ms. Our average tick is now 2.75ms instead of 2.6ms.

Bucket farming

At this point we’ve made our program very efficient by improving the speed at which we access and modify our data. But now, it’s gotten quite a bit harder to improve the performance of our algorithm this way.

Instead, we’re going to modify our algorithm. According to VTune, most of our time is spent decrementing timers and moving farmers. We’re going to be tackling the timers.

The timers are currently all decremented at a rate of 16ms per tick, that means at times we can be decrementing 1 million timers per tick! Instead of doing this, we can do something much better, we can group our timers in buckets and only decrement the global bucket timers instead of the timers themselves. Then, in order to retain our fine grained timing, we keep track of which bucket will require fine grained decrementation and we will only decrement that bucket!

Looking at our farm state, we can see that our state decrements a timer between the ranges of 3 seconds and 5 seconds. In order to split this up into buckets, I decided to split these buckets up in 6 buckets where each bucket holds the timers of a specific time range.

We start off with an index indicating which bucket needs to be finely decremented. We then increment our global bucket timer and decrement the timers in the bucket referenced by our index. Once out timer reaches 1s, we reset the timer and advance our index by 1 and modulo by the number of buckets in order to get it to wrap around.

Say our current fine bucket index is 5, then our timer reaches 1s, we advance our index by 1 to 6. Since 6 % 6 is 0, that’s our next fine decrementation bucket. We can guarantee this property because we place the timers in the buckets based on their number of seconds.

int16_t farmerTimer = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
uint32_t bucketSecond = (farmerTimer + AI_FarmersFarmBucketTransitionTimer) / AI_TimePrecision;
uint32_t bucketFarmerCount = AI_FarmersFarmHotBucketCounts[bucketIndex];

AI_FarmersFarmHotBuckets[bucketIndex][bucketFarmerCount] = farmerTimer - bucketSecond * AI_TimePrecision;

Looking at chrome://tracing indicates that ai_tick now runs at an average of 1.7ms from our original 1.8ms. Interestingly, game_gen_instance_buffer now runs 0.1ms slower. I wonder if this a result of ai_tick not allowing game_gen_instance_buffer to complete it’s work in the background.


Looking at VTune indicates that we’re now 75% back-end bound in ai_tick and we’re now retiring almost 30% of our instructions, very good results from bucketing the farm timers.

Doing the same modification to the search timers doesn’t produce exceptional results, but I will keep this modification in order to keep things consistent.

More quantization!

Now that our timers are only in the range of 0s to 1s, we can quantize our data even more! Now I’m going to quantize our values to a range of 1s to 10ms. This will cause our precision to drop significantly, but since our timers are for AI, I think this cost is appropriate.

As a result, I changed all the farmer timers to int8_t and change the AI_TimePrecision to 10ms. With this change, chrome://tracing notes that ai_tick now runs at an average of 1.25ms from our 1.7ms! However, once again, game_gen_instance buffer slowed down from 0.9ms to 1.17ms…

Big thanks to Zack Dawson (twitter) for the idea!

Streaming again?

When we left off for game_gen_instance_buffer, we were mostly using memcpy to copy all of our data from one position buffer to another. At that time, I had decided to use memcpy because memcpy is very fast and the location being written to was not guaranteed to be aligned with the location being read from, restricting me from using _mm256_stream_si256.

As a result, I decided to write the last element of an array multiple times in order to align the write index to a 64 byte boundary. This gave me the opportunity to use the stream intrinsics for our homegrown memcpy:

void simd_streamMemCpy(__m256i* dstWrite, __m256i* srcRead, size_t size)
    if (size == 0)

    __m256i* dstEnd = dstWrite + (size >> 5);
    for (; dstWrite <= dstEnd; dstWrite += 2, srcRead += 2)
        __m256i src256 = _mm256_stream_load_si256(srcRead);
        _mm256_stream_si256(dstWrite, src256);
        __m256i src256_2 = _mm256_stream_load_si256(srcRead + 1);
        _mm256_stream_si256(dstWrite + 1, src256_2);

Which compiles to:

    test    rdx, rdx
    je      .LBB0_3
    and     rdx, -32
    add     rdx, rdi
    xor     eax, eax
    vmovaps ymm0, ymmword ptr [rsi + rax]
    vmovntps        ymmword ptr [rdi + rax], ymm0
    vmovaps ymm0, ymmword ptr [rsi + rax + 32]
    vmovntps        ymmword ptr [rdi + rax + 32], ymm0
    lea     rcx, [rdi + rax]
    add     rcx, 64
    add     rax, 64
    cmp     rcx, rdx
    jbe     .LBB0_2

Taking a look at chrome://tracing shows us that gen_instance_buffer now runs at an average of 1ms instead of 1.1ms, saving us an additional 0.1ms.

And VTune:


Both our slowest functions are now taking less than 1s!

Can we avoid reading the tile stage?

At first, in order to avoid reading the tile stage and to be able to just store the index without having to access the memory, I thought storing a collection of the unplanted indices might be effective. However, I had missed the obvious that in order to get the index from this collection of indices, I would have to read that memory instead. This added complexity simply slowed down ai_tick by 0.2ms.

As a result, I simply changed the tile from a 32 bit integer, to an 8 bit integer giving speedups in ai_tick but slowing down game_gen_instance_buffer again.

The return of game_gen_instance_buffer

At this point, game_gen_instance_buffer is running at an average of 1ms per tick. Still slower than the original 0.9ms from part 4. As a result, we’re going to tackle it some more with quite a few modifications.

The first, we changed the type for sprite indices from uint16_t to uint8_t. Since the range of values for these indices is 0 to 11, we have plenty of space for these 11 values in a uint8_t.

The next modification was quite a bit more complex than the uint8_t modification. This change takes into account that the state of the farmers will dictate what sprite index will be used to render the farmers and that all farmers share the same scale.

In order to keep the blog post from becoming all about this optimization, I’m only going to go through the explanation for the scales.

The buffer generation is very simple: I write every scale into the array in the order of tiles, crops, farmers. The number of tiles is constant, thus, we don’t need to worry about it. The number of crops however, is variable. This means that were our farmer scales starts and end is determined by the number of crops rendered that tick.

While rendering, I keep track of a writing index. This index determines where the next sprite instance will be rendered.

Like this:

int writeIndex = 0;

__m256i farmerScale = _mm256_set1_epi16(AI_FarmerScale);

if (Gen_PreviousFarmerScaleStart == 0)
    Gen_PreviousFarmerScaleEnd = writeIndex + AI_FarmerCount;
    Gen_PreviousFarmerScaleEnd += 64 - (AI_FarmerSearchCount % 64);
    Gen_PreviousFarmerScaleEnd += 64 - (AI_FarmerMoveCount % 64);

    simd_memSetToValue((__m256i*)(buffer->scales + writeIndex), farmerScale, (Gen_PreviousFarmerScaleEnd - writeIndex) * sizeof(uint16_t));
    Gen_PreviousFarmerScaleStart = writeIndex;
    uint32_t newEnd = writeIndex + AI_FarmerCount;
    newEnd += 64 - (AI_FarmerSearchCount % 64);
    newEnd += 64 - (AI_FarmerMoveCount % 64);

    if (newEnd > Gen_PreviousFarmerScaleEnd)
        uint32_t extra = Gen_PreviousFarmerScaleEnd % 64;
        uint32_t writeLoc = Gen_PreviousFarmerScaleEnd - extra;
        simd_memSetToValue((__m256i*)(buffer->scales + writeLoc), farmerScale, (newEnd - Gen_PreviousFarmerScaleEnd + extra) * sizeof(uint16_t));

    if (writeIndex < Gen_PreviousFarmerScaleStart)
        simd_memSetToValue((__m256i*)(buffer->scales + writeIndex), farmerScale, (Gen_PreviousFarmerScaleStart - writeIndex) * sizeof(uint16_t));

    Gen_PreviousFarmerScaleEnd = newEnd;
    Gen_PreviousFarmerScaleStart = writeIndex;

And that’s it! This change adds quite a bit of complexity to our generation code but the results are worth it!

chrome://tracing indicates that our average game_gen_instance_buffer tick is now 0.72ms, faster than our original performance! ai_tick also runs at an average of 1.2ms and our average tick is now 1.9ms.


VTune shows us that game_gen_instance_buffer is now at

More bug fixes…

At this point, I realized that I hadn’t ran the game out of profile mode in a bit… I had gotten too enthralled in the performance side of things. And when I ran it… nothing rendered…

It turns out I had made a mistake very early on when I changed the format of Game_InstanceBuffer.

As a refresher, our buffer looks like this:

typedef struct
    uint8_t spriteIndices[GAME_MAX_INSTANCE_COUNT];
    uint16_t scales[GAME_MAX_INSTANCE_COUNT];
    uint16_t positionX[GAME_MAX_INSTANCE_COUNT];
    uint16_t positionY[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;

The code was set up in such a way that I would copy the data needed for all the rendered instances with this call:

sg_update_buffer(Render_DrawState.vertex_buffers[0], Render_InstanceBuffer, (sizeof(uint16_t) * 3 + sizeof(uint8_t)) * renderInstances);

This doesn’t work! If our renderInstances is 1, we’re rendering 7 bytes of data from the sprite indices array, not spriteIndices, scales, positionX and positionY! This caused the GPU to only get a little bit of the data that it needed and the rest was completely missing…

This could be seen as a lesson in testing all aspects of your code…

One last hoorah

At this point, it was very clear to me that I had to address the movement code. The problem with this, is that we have to access every target position and velocity in order to move the farmer.

One approach that I attempted was to store the future positions of the farmers into N buffers and read from those and only process a fraction of the active move farmers while using the buffered positions for the other fractions. A rough prototype of this approach however showed that this was slower than the simple processing of all the farmers due to needing to touch quite a bit of memory to store the future positions of the farmers.

Another consideration was to bucket the farmer movements by cardinal direction. If the farmers were close enough to a cardinal direction, they would use that direction as an approximation of their velocity. This attempt only managed to provide speedups if a large portion of the farmers were bound to predefined velocities and quickly introduced visual artifacts. As a result, this solution although potentially viable with a large amount of predefined velocities didn’t seem particularly viable for this project.

After these two attempts, I’m going to call it for this project. We made some excellent progress and I learned a ton in the process. I hope you did as well!

Where are we now?

Looking at chrome://tracing tells us that we’re now at an average tick of around 2ms from our original 42ms. 21 times faster than our original performance! We tackled a lot from memory access patterns, quantization, SIMD, software prefetching and non-temporal stores. As a result however, our program went from ~400 lines to almost 1k lines of code. Our program is now harder to change and harder to read, but code cleanliness was not one of our concerns here.

I had great fun and I hope you did too!

Until next time!

Find the repo on github here!