Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

## Intro

This is the first in a series of posts detailing the functionality of my hobby path tracing, we will be looking at the underpinning’s of it all, the math library cranberry_math.

The goals of cranberry_math are simple. Making SIMD accelerated mathematics easier and fun to use. There are already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few), however I decided to write my own for fun!

## Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple. You can imagine that a SIMD register (or vector register) is a set of 4 “lanes”, these lanes are generally independent of each other but allow you to perform operations on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as and the element of one of those registers as where would be element 0 of SIMD register 0.

I.e. if I take and add it to the result is the equivalent of

On it’s own, this might not be particularly useful. But vector register operations have a number of desirable properties.

Let’s imagine we have a fantasy processor, and this processor takes 1 cycle for every addition. This means that to add 4 pairs of numbers together it would take us roughly 4 cycles or 1 cycle per add. Now if our vector addition takes 4 cycles to complete, that’s still 1 cycle per add. However, consider what happens if by some magic we only take 3 cycles for our vector addition. That means overall, we only spend 3/4=0.75 cycles per add! In our fantasy processor, we can make this as low as 1 cycle for 4 adds or 1/4=0.25 cycles per add. You may ask why we can make our instruction run in less than 4 cycles even though we’re still doing 4 adds? In our fantasy processor, we can do this by simply running 4 adders at once with our data. This means that although every adder is taking 1 cycle to complete, they are run in parallel, making the overall throughput 1 cycle/4 adds.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently, we could reduce our program latency by at most 4x! (In reality this is unlikely to happen, but we can get fairly close).

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X,Y,Z components into each component such as and working with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over your original single element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles/dot product. I propose (and has been proposed elsewhere as well) that you instead use each register to store a single component of a vector. Such as , etc. With this in mind, we can achieve the dot product by doing which is a total of 5 operations that all take 1 cycle in our fantasy processor for 4 vectors or a throughput of 1.25 cycles/dot product.

For SSE according to the intel intrinsics guide a multiply operation has a latency of 4 cycles (on Skylake), an add has a latency of 4 cycles. As a result a dot product in a fantasy processor with no out of order functionality or multiple issue, our wide dot product would take roughly 20 cycles to complete or cycles per dot product. While if we look at the dot product instruction, it has a latency of 11 cycles. (See [1] and [2] for info on reading the intel intrinsics guide values) This is not indicative of real performance numbers. Modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not fool proof. It introduces difficulty if we want to add similar components together. (x with x, y with y) within one vector, I however have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

## In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4 cran_align(16) typedef union { float f[cran_lane_count]; __m128 sse; } cfl; typedef union { struct { float x, y, z; }; } cv3; typedef struct { cfl x; cfl y; cfl z; } cv3l;

The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within it’s respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has it’s own set of lanes? That is a direct implementation of the explanation that was presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r); cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r); cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r); cran_forceinline cv3l cv3l_min(cv3l l, cv3l r); cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);

The API for cv3l is simple, it looks exactly the same as it’s cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);

One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format. It doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate locations. However this is surprisingly slow (I forget the exact reasons, a reference for that would be awesome).

Instead, what you can do is load your vectors as a set of vectors:

and then shuffle the values into a set of temporary registers

and finally shuffle them into the final registers

(There’s a mistake with the arrows for and they should be pointed to the 4 register)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount) { __m128 loadedVectors[cran_lane_count]; for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++) { uint8_t const* vectorData = (uint8_t*)vectors; loadedVectors[i] = _mm_load_ps((float const*)(vectorData + indices[i] * stride + offset)); } __m128 XY0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(1, 0, 1, 0)); __m128 XY1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(1, 0, 1, 0)); __m128 Z0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(3, 2, 3, 2)); __m128 Z1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(3, 2, 3, 2)); return (cv3l) { .x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))}, .y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))}, .z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))} }; }

With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax) { cv3l rayOLanes = cv3l_replicate(rayO); cv3l invD = cv3l_replicate(cv3_rcp(rayD)); cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD); cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD); cv3l tsmaller = cv3l_min(t0s, t1s); cv3l tbigger = cv3l_max(t0s, t1s); cfl rayMinLane = cfl_replicate(rayMin); cfl rayMaxLane = cfl_replicate(rayMax); cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z))); cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z))); cfl result = cfl_less(tmin, tmax); return cfl_mask(result); }

That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count); cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount); cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount); uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);

We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!

## References:

[1] https://stackoverflow.com/questions/40203254/intel-intrinsics-guide-latency-and-throughput

[3] https://users.ece.cmu.edu/~franzf/teaching/slides-18-645-simd.pdf

[4] https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf