Cranberry Lair

Walking the programming parameter space

Diary of a Path Tracer – SSE optimized BVH Traversal — September 3, 2020

Diary of a Path Tracer – SSE optimized BVH Traversal

Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

Intro

In our last post, we looked at building our BVH using SAH as our guiding heuristic. Now it’s time to look at how we intersect that BVH.

High Level

Querying the BVH is actually quite simple. We test our ray against our parent node, if it intersects, we test the ray against both of it’s children. If the ray intersects either of it’s children, we test that children’s children. Once we reach a leaf node, we add it to our list of nodes that we’ve intersected.

This sounds perfect for recursion, but recursion never scales particularly well. Instead a queue or stack is a more effective method to represent the nodes to be tested.

There are three pieces that bring the BVH traversal together. The math library, the AABB intersection code and the (roughly) branchless intersection response code.

We’ve already seen the math library in this post, as a result, let’s take a dive into the intersection routine!

Ray/AABB Intersection Routine

There are quite a few excellent sources on the ray/AABB intersection routine [1][2][3]. We’ll do a small dive into how we get to the final implementation.

The intersection routine is based on the slab intersection method presented in [3]. We want to define our AABB as a set of 3 slabs in each axis. A slab is simply represented by a pair of planes, the slab being the space in between them.

In order to determine if we’re intersecting the AABB, we want to determine if our ray passes through the shared intersection of each slab.

AABBIntersection

In order to determine that, we want to determine if our ray enters each slab before it exits any of the other slabs.

A simple approach to this could be:

slabEntryX = intersect(ray, startPlaneX)
slabExitX = intersect(ray, endPlaneX)
slabEntryY = intersect(ray, startPlaneY)
slabExitY = intersect(ray, endPlaneY)

if(slabEntryX is greater than slabExitY or slabEntryY is greater than slabExitX)
    return missed collision

return hit collision

As a result, our core approach is to find if there is any entry value that is greater than our exit values. And if so, we’re not intersecting the slab.

To intersect a ray and a plane, we need to define our ray as:

\vec{p} = \vec{a}*t+\vec{b}

and our plane as

0 = \vec{n}.\vec{p}-d

We can then plug our ray function into our plane function to get:

0 = \vec{n}.(\vec{a}*t+\vec{b})-d

And solving for t

d = \vec{n}.\vec{a}*t+vec{n}.\vec{b}

d-\vec{n}.\vec{b} = vec{n}.\vec{a}*t

\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t

Now that we have t we can calculate it for each plane, since our planes are axis aligned, our intersection math gets quite simple. Here is an example for our x axis.

\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t

where n = (1, 0, 0)

\frac{d_x-b_x}{a_x} = t

And repeat for each axis.

Now let’s look at our intersection code:

cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);
return cfl_mask(result);

This code looks different than our math, but it’s essence is the same.

In this code we’re testing 4 AABBs against a single ray instead of one ray against one AABB.

Notice that in this line:

cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);

We’re implementing \frac{d_x-b_x}{a_x} = t where b_x, b_y, b_z is rayOLanes, invD is \frac{1}{a_x}, \frac{1}{a_y}, \frac{1}{a_z} and aabbMin is d_x, d_y, d_z.

The next few lines:

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger = cv3l_max(t0s, t1s);

Are there to account for the situation where our ray direction is negative. If our ray direction is negative on one axis, t0 and t1 will actually be flipped. As a result, we want to simply make sure that t0 is the smaller one and t1 is the larger one. [1]

And finally,

cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);
return cfl_mask(result);

These lines are calculating to see if any of our latest entrance is earlier than our earliest exit. If so, cfl_mask will set the appropriate bit for that AABB.

That’s it for our intersection routine! Now we can take a look at what we do with these results.

Intersection Response

The previous intersection code is nicely packaged in the caabb_does_ray_intersect_lanes routine.

In our bvh traversal, the call to it looks like this:

uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);

Where intersections hold the bits of the AABBs that intersect with our ray.

One approach to handling these bits is to simply loop through each AABB, determine if it’s bit is set and add it’s children volumes to the queue of being tested or to the final list of leaf nodes.

However, that’s no fun.

Instead it’s a lot more fun to make this code branchless using some SSE intrinsics!

Fair warning, this code can be complicated if you aren’t used to reading SSE intrinsics.

#define _MM_SHUFFLE_EPI8(i3,i2,i1,i0) _mm_set_epi8(i3*4+3,i3*4+2,i3*4+1,i3*4,i2*4+3,i2*4+2,i2*4+1,i2*4,i1*4+3,i1*4+2,i1*4+1,i1*4,i0*4+3,i0*4+2,i0*4+1,i0*4)
__m128i shuffles[16] =
{
	_MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,1), _MM_SHUFFLE_EPI8(0,0,1,0), // 0000, 0001, 0010, 0011
	_MM_SHUFFLE_EPI8(0,0,0,2), _MM_SHUFFLE_EPI8(0,0,2,0), _MM_SHUFFLE_EPI8(0,0,2,1), _MM_SHUFFLE_EPI8(0,2,1,0), // 0100, 0101, 0110, 0111
	_MM_SHUFFLE_EPI8(0,0,0,3), _MM_SHUFFLE_EPI8(0,0,3,0), _MM_SHUFFLE_EPI8(0,0,3,1), _MM_SHUFFLE_EPI8(0,3,1,0), // 1000, 1001, 1010, 1011
	_MM_SHUFFLE_EPI8(0,0,3,2), _MM_SHUFFLE_EPI8(0,3,2,0), _MM_SHUFFLE_EPI8(0,3,2,1), _MM_SHUFFLE_EPI8(3,2,1,0)  // 1100, 1101, 1110, 1111
};

__m128i queueIndices = _mm_load_si128((__m128i*)testQueueIter);
uint32_t leafLine = bvh->count - bvh->leafCount;
uint32_t childIndexMask;
uint32_t parentIndexMask;
{
	__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));
	parentIndexMask = _mm_movemask_ps(_mm_castsi128_ps(isParent));
	childIndexMask = ~parentIndexMask & 0x0F;

	parentIndexMask = parentIndexMask & intersections;
	childIndexMask = childIndexMask & intersections;
}

uint32_t leafCount = __popcnt(childIndexMask);
uint32_t parentCount = __popcnt(parentIndexMask);
__m128i childIndices = _mm_shuffle_epi8(queueIndices, shuffles[childIndexMask]);
__m128i parentIndices = _mm_shuffle_epi8(queueIndices, shuffles[parentIndexMask]);

union
{
	uint32_t i[4];
	__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
	uint32_t nodeIndex = indexUnion.i[i];
	candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
	uint32_t nodeIndex = indexUnion.i[i];
	cran_assert(nodeIndex < bvh->count);
	testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
	testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;

Now this looks like a lot, so let’s take a look at it piece by piece. As we presented in the previous post, our leaf nodes are stored at the end of our array. This allows us to determine which nodes are leaf nodes simply by comparing their indices against that boundary. The following bit of code does exactly that:

__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));
parentIndexMask = _mm_movemask_ps(_mm_castsi128_ps(isParent));
childIndexMask = ~parentIndexMask & 0x0F;
parentIndexMask = parentIndexMask & intersections;
childIndexMask = childIndexMask & intersections;

The first line here simply compares our indices against our “leaf boundary”. The result of this tells us if our index is a parent or a leaf node.

Once we know which ones are our leaf nodes, we convert this to a bitmask representing the elements of the vector.

And finally, we mask out the indices that have no intersections associated with them.

The result of this operation is that we have two bitmasks that represents the parents that have intersected the ray and the leaves that have intersected the ray.

Next we want to turn this bit mask into a series of indices.

uint32_t leafCount = __popcnt(childIndexMask);
uint32_t parentCount = __popcnt(parentIndexMask);
__m128i childIndices = _mm_shuffle_epi8(queueIndices, shuffles[childIndexMask]);
__m128i parentIndices = _mm_shuffle_epi8(queueIndices, shuffles[parentIndexMask]);

The first two lines are very simple, it tells us how many leaf nodes have intersections and how many parent nodes. The next pair of lines require a little more explaining.

Right now, our queueIndices vector represents the indices to the AABBs that we’ve tested against our rays.

One approach we could implement is to simply loop through every element, test their bits and read the values if those bits are set.

However, what we can do instead is pack all of our active indices to the front of our vector allowing us to read only the ones that have reported an intersection. Reducing our branching at the cost of shuffling our data. That’s what _mm_shuffle_epi8 is doing. It’s taking a lookup table (shuffles[]) and determining where our values should be moved in the vector to pack them to the front of the vector.

Since our masks only contain 4 bits, they have a maximum of 15 values making for a very manageable look up table.

Once we have our indices packed, we simply have to read them.

union
{
    uint32_t i[4];
    __m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
    uint32_t nodeIndex = indexUnion.i[i];
    candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
    uint32_t nodeIndex = indexUnion.i[i];
    cran_assert(nodeIndex < bvh->count);
    testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
    testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;

These parts are relatively self explanatory. We’re reading our indices and doing the appropriate work if our values are leaves or parent nodes. To understand why we’re using a union to convert our SSE vector, I recommend looking into type punning. [5]

Conclusion

That’s it! I don’t really have any future work for this part of cranberray, I’m actually quite happy with the end result.

You can find the code here: https://github.com/AlexSabourinDev/cranberries/blob/5fe9c25e1df23d558b7ef8b5475717d9e67a19fc/cranberray.c#L1233

Next we’ll be looking at the ray tracer’s sampling strategy. Until then, happy coding!

References

[1] https://medium.com/@bromanz/another-view-on-the-classic-ray-aabb-intersection-algorithm-for-bvh-traversal-41125138b525

[2] http://psgraphics.blogspot.com/2016/02/new-simple-ray-box-test-from-andrew.html

[3] http://papers.cumincad.org/data/works/att/67d2.content.pdf

[4] https://realtimecollisiondetection.net/

[5] https://www.cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html

 

Diary of a Path Tracer – Math Library — August 22, 2020

Diary of a Path Tracer – Math Library

Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

Intro

This is the first in a series of posts detailing the functionality of my hobby path tracing, we will be looking at the underpinning’s of it all, the math library cranberry_math.

The goals of cranberry_math are simple. Making SIMD accelerated mathematics easier and fun to use. There are already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few), however I decided to write my own for fun!

Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple. You can imagine that a SIMD register (or vector register) is a set of 4 “lanes”, these lanes are generally independent of each other but allow you to perform operations on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as v_n (v_0, v_1, v_2) and the element of one of those registers as v_ne_m (v_0e_0) where v_0e_0 would be element 0 of SIMD register 0.

I.e. if I take v_0 and add it to v_1 the result is the equivalent of v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3)

VectorExample

On it’s own, this might not be particularly useful. But vector register operations have a number of desirable properties.

Let’s imagine we have a fantasy processor, and this processor takes 1 cycle for every addition. This means that to add 4 pairs of numbers together it would take us roughly 4 cycles or 1 cycle per add. Now if our vector addition takes 4 cycles to complete, that’s still 1 cycle per add. However, consider what happens if by some magic we only take 3 cycles for our vector addition. That means overall, we only spend 3/4=0.75 cycles per add! In our fantasy processor, we can make this as low as 1 cycle for 4 adds or 1/4=0.25 cycles per add. You may ask why we can make our instruction run in less than 4 cycles even though we’re still doing 4 adds? In our fantasy processor, we can do this¬† by simply running 4 adders at once with our data. This means that although every adder is taking 1 cycle to complete, they are run in parallel, making the overall throughput 1 cycle/4 adds.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently, we could reduce our program latency by at most 4x! (In reality this is unlikely to happen, but we can get fairly close).

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X,Y,Z components into each component such as v_0 = (x, y, z, unused) and working with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over your original single element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles/dot product. I propose (and has been proposed elsewhere as well) that you instead use each register to store a single component of a vector. Such as v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), etc. With this in mind, we can achieve the dot product by doing v_x*v_x+v_y*v_y+v_z*v_z which is a total of 5 operations that all take 1 cycle in our fantasy processor for 4 vectors or a throughput of 1.25 cycles/dot product.

For SSE according to the intel intrinsics guide a multiply operation has a latency of 4 cycles (on Skylake), an add has a latency of 4 cycles. As a result a dot product in a fantasy processor with no out of order functionality or multiple issue, our wide dot product would take roughly 20 cycles to complete or 20/4=5 cycles per dot product. While if we look at the dot product instruction, it has a latency of 11 cycles. (See [1] and [2] for info on reading the intel intrinsics guide values) This is not indicative of real performance numbers. Modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not fool proof. It introduces difficulty if we want to add similar components together. (x with x, y with y) within one vector, I however have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
cran_align(16) typedef union
{
	float f[cran_lane_count];
	__m128 sse;
} cfl;

typedef union
{
	struct
	{
		float x, y, z;
	};

} cv3;

typedef struct
{
	cfl x;
	cfl y;
	cfl z;
} cv3l;

The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within it’s respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has it’s own set of lanes? That is a direct implementation of the explanation that was presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);

The API for cv3l is simple, it looks exactly the same as it’s cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);

One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format. It doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate locations. However this is surprisingly slow (I forget the exact reasons, a reference for that would be awesome).

Instead, what you can do is load your vectors as a set of vectors:

v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)

and then shuffle the values into a set of temporary registers

v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)

and finally shuffle them into the final registers

v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)

VectorShuffle

(There’s a mistake with the arrows for Z_2 and Z_3 they should be pointed to the 4 register)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
{
	__m128 loadedVectors[cran_lane_count];
	for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
	{
		uint8_t const* vectorData = (uint8_t*)vectors;
		loadedVectors[i] = _mm_load_ps((float const*)(vectorData + indices[i] * stride + offset));
	}

	__m128 XY0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 XY1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 Z0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(3, 2, 3, 2));
	__m128 Z1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(3, 2, 3, 2));

	return (cv3l)
	{
		.x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
		.y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
		.z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}
	};
}

With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
{
	cv3l rayOLanes = cv3l_replicate(rayO);
	cv3l invD = cv3l_replicate(cv3_rcp(rayD));
	cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
	cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

	cv3l tsmaller = cv3l_min(t0s, t1s);
	cv3l tbigger  = cv3l_max(t0s, t1s);

	cfl rayMinLane = cfl_replicate(rayMin);
	cfl rayMaxLane = cfl_replicate(rayMax);
	cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
	cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
	cfl result = cfl_less(tmin, tmax);
	return cfl_mask(result);
}

That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);

We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!

References:

[1] https://stackoverflow.com/questions/40203254/intel-intrinsics-guide-latency-and-throughput

[2] https://stackoverflow.com/questions/35859449/why-are-some-haswell-avx-latencies-advertised-by-intel-as-3x-slower-than-sandy-b

[3] https://users.ece.cmu.edu/~franzf/teaching/slides-18-645-simd.pdf

[4] https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf