# Cranberry Lair

## Walking the programming parameter space

### Intro

In our last post, we looked at the various utilities used by cranberray. in this post we’re going to return to the world of graphics and take a look at cranberray’s texture sampling and bump mapping strategy.

### Texture Sampling

Cranberray takes inspiration from graphics APIs and splits texture sampling into 2 parts. The sampler and the texture.

The sampler in cranberray stores information such as sample type (nearest, bilinear) and some other useful flags such as gamma_to_linear to convert our textures stored in sRGBA to linear RGBA. Textures store pixel data and simply support sampling a single point in them.

Sampling a texture in cranberray is quite simple, you take your UV coordinates and convert them to an array index.

static cv4 sample_r_u8(cv2 uv, uint8_t* cran_restrict image, uint32_t width, uint32_t height, uint32_t offsetX, uint32_t offsetY)
{
float readY = uv.y * (float)height;
float readX = uv.x * (float)width;

uint32_t y = (uint32_t)floorf(readY) + offsetY;
y = y >= height ? height - 1 : y;
uint32_t x = (uint32_t)floorf(readX) + offsetX;
x = x >= width ? width - 1 : x;
uint32_t readIndex = y * width + x;

float f = (float)image[readIndex] / 255.0f;
return (cv4) { f, f, f, f };
}


Once your can retrieve distinct pixel values, you can either sample nearest of use bilinear interpolation between 4 different samples.

if (sampleType == sample_type_nearest)
{
color = samplers[texture->format](uv, texture->data, texture->width, texture->height, 0, 0);
}
else if (sampleType == sample_type_bilinear)
{
cv4 s00 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 0, 0);
cv4 s01 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 0, 1);
cv4 s10 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 1, 0);
cv4 s11 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 1, 1);

float wf = cf_frac((float)texture->width * uv.x);
float hf = cf_frac((float)texture->height * uv.y);
wf = wf < 0.0f ? 1.0f + wf : wf;
hf = hf < 0.0f ? 1.0f + hf : hf;
color = (cv4)
{
cf_bilinear(s00.x, s01.x, s10.x, s11.x, wf, hf),
cf_bilinear(s00.y, s01.y, s10.y, s11.y, wf, hf),
cf_bilinear(s00.z, s01.z, s10.z, s11.z, wf, hf),
cf_bilinear(s00.w, s01.w, s10.w, s11.w, wf, hf)
};
}


And that’s pretty much it for how cranberray samples its textures!

### Bump Mapping

Bump mapping is a bit more fun than texture sampling.

A very important and interesting point about the tangent frame is that it is the set of vectors that represent the basis for our texture coordinates. Originally, I believed that any tangent and bitangent could be selected as long as they were orthonormal. In the context of bump mapping, this is not correct. We actually want to select our tangent and bitangent so as to have them represent the flow of the U and V coordinates in space. ([2] has an excellent visualization for this)

To construct your tangent and bitangent, you can imagine that your triangle edge is a construction of the tangent and bitangent vectors.

Once you know you can construct your edges from some contribution of the tangent and bitangent, you can solve for your tangent and bitangent vectors with a little bit of algebra.

// e0=du0T+dv0B (1)
// e1=du1T+dv1B (2)
// solve for B
// (e0-du0T)/dv0
// plug into (2)
// e1=du1T+dv1(e0-du0T)/dv0
// solve for T
// e1=du1dv0T/dv0+dv1e0/dv0-dv1du0T/dv0
// dv0e1=du1dv0T+dv1e0-dv1du0T
// dv0e1-dv1e0=du1dv0T-dv1du0T
// dv0e1-dv1e0=T(du1dv0-dv1du0)
// T = (dv0e1-dv1e0)/(dv0du1-dv1du0)


Calculating tangent frames caused a surprising amount of issues with degenerate triangles and NaNs. Be careful here.

And now that you have your tangent and bitangent vectors, you can “bump” them using the partial derivatives of a height map. To bump your tangent and bitangent vectors, you add a scaled normal to the tangent and bitangent vectors and recalculate the normal using the cross product of said vectors.

cv3 normal = cv3_normalize(inputs.normal);
{
cv4 partialDerivative = sampler_sample(&scene->textureStore, microfacetData.bumpSampler, microfacetData.bumpTexture, inputs.uv);
normal = cv3_normalize(normal);
cran_assert(cv3_dot(normal, inputs.normal) >= 0.0f);
}


### Conclusion

And that’s it! This part of cranberray is quite simple but was fun to write!. Next time we’ll look at the iterative path tracing of cranberray. Until then, happy coding!

### Future Work

I would like to add a few things to the texture sampling system as it’s quite barebones. Things such as mip map selection, trilinear interpolation as well as add a texture caching system. (Currently cranberray keeps all textures resident in memory)

### Intro

In our last post, we took a look at a variety of things. Namely the GGX-Smith BRDF, importance sampling and multiple importance sampling. These are really interesting aspects of path tracing, I recommend the read!

In this post, we’re going to look at all the small utilities for cranberray. Obj loading, multi-threading and our allocator. All these pieces are relatively simple but were a lot of fun to write!

### Allocator

Cranberray’s allocator is tiny. It’s roughly 60 lines of code.

typedef struct
{
// Grows from bottom to top
uint8_t* mem;
uint64_t top;
uint64_t size;
bool locked;
} crana_stack_t;

void* crana_stack_alloc(crana_stack_t* stack, uint64_t size)
{
cran_assert(!stack->locked);
uint8_t* ptr = stack->mem + stack->top;
stack->top += size + sizeof(uint64_t);
cran_assert(stack->top <= stack->size);

*(uint64_t*)ptr = size + sizeof(uint64_t);
return ptr + sizeof(uint64_t);
}

void crana_stack_free(crana_stack_t* stack, void* memory)
{
cran_assert(!stack->locked);
stack->top -= *((uint64_t*)memory - 1);
cran_assert(stack->top <= stack->size);
}

// Lock our stack, this means we have free range on the memory pointer
// Commit the pointer back to the stack to complete the operation.
// While the stack is locked, alloc and free cannot be called.
void* crana_stack_lock(crana_stack_t* stack)
{
cran_assert(!stack->locked);
stack->locked = true;
return stack->mem + stack->top + sizeof(uint64_t);
}

// Commit our new pointer. You can have moved it forward or backwards
// Make sure that the pointer is still from the same memory pool as the stack
void crana_stack_commit(crana_stack_t* stack, void* memory)
{
cran_assert(stack->locked);
cran_assert((uint64_t)((uint8_t*)memory - stack->mem) <= stack->size);
*(uint64_t*)(stack->mem + stack->top) = (uint8_t*)memory - (stack->mem + stack->top);
stack->top = (uint8_t*)memory - stack->mem;
stack->locked = false;
}

// Cancel our lock, all the memory we were planning on commiting can be ignored
void crana_stack_revert(crana_stack_t* stack)
{
cran_assert(stack->locked);
stack->locked = false;
}


As you can see, it’s a simple stack allocator. It grows upwards, stores the size of your allocation before your allocation and shrinks when you pop from it. It has very few safety checks and is simple to the point of being flawed.

The part that I thought was interesting is the lock, commit and revert API. It’s a simple concept that I really can only afford because of the simplicity of cranberray.

Using the locking API is simple, you call crana_stack_lock to retrieve the raw pointer from the allocator. This allows you to move the pointer forward to any size without needing to continuously allocate chunks of memory. The piece of mind it gives and the ease of use is excellent.

Once you’re done with your memory, you have two options. You can commit your pointer, allowing you to finalize your changes. Or you can revert, returning the top pointer to it’s location from before the call to lock. This allows you to work with a transient buffer of memory that you can simply “wish away” once you’re done with it.

Coding with this allocator, was a blast. I would consider adding an extension to any custom allocator if it could be made a bit more safely.

Maybe what could be done, is to write it on top of another more general allocator. The stack allocator could request a fixed sized buffer from the allocator and allow the user to work within that range. Once you’ve finished with the stack allocator, it could return the unused parts to the primary allocator.

When I speak of Obj, I mean the text version of the Wavefront geometry format.

I won’t be going into the details of the format. They are well explained at [1].

Cranberray’s Obj loader is simple. First we memory map the file, this allows us to easily read our memory using a raw pointer and allows us to suspend our knowledge that this is an IO operation.

Once our file is mapped, we process our file in 2 passes. We first loop through our file to determine how many of each elements we should be allocating and once we’ve done that, we parse the data into the appropriate buffers.

for (char* cran_restrict fileIter = fileStart; fileIter != fileEnd; fileIter = token(fileIter, fileEnd, delim))
{
if (memcmp(fileIter, "vt ", 3) == 0)
{
uvCount++;
}
else if (memcmp(fileIter, "vn ", 3) == 0)
{
normalCount++;
}
else if (memcmp(fileIter, "v ", 2) == 0)
{
vertexCount++;
}
else if (memcmp(fileIter, "g ", 2) == 0)
{
groupCount++;
}
else if (memcmp(fileIter, "f ", 2) == 0)
{
uint32_t vertCount = 0;
fileIter += 1;
for (uint32_t i = 0; i < 4; i++)
{
int32_t v = strtol(fileIter, &fileIter, 10);
if (v != 0)
{
strtol(fileIter + 1, &fileIter, 10);
strtol(fileIter + 1, &fileIter, 10);
vertCount++;
}
else
{
break;
}
}
assert(vertCount == 3 || vertCount == 4);
faceCount += vertCount == 3 ? 1 : 2;
}
else if (memcmp(fileIter, "usemtl ", 7) == 0)
{
materialCount++;
}
else if (memcmp(fileIter, "mtllib ", 7) == 0)
{
materialLibCount++;
}
else if (memcmp(fileIter, "# ", 2) == 0)
{
}
}


Cranberray’s Obj loader simply assumes that every space, tab or return carriage is a potential delimiter of a token. You might be thinking “this is fragile!” and you would be right! But it works for my use case.

Once done iterating through every token to count how much memory it should allocate, we reset the pointer and gather our data.

We could write this loader as a state machine, but I find this solution mildly simpler (and less effort to think about).

Something that I think is of note for the loader, is how materials are handled. Instead of storing our mesh in chunks for each material type, I simply store every material boundary. This allows me to store our mesh as one big chunk while also handling multiple materials per chunk.

Honestly, that’s pretty much it. It’s just a series of switch statements and loops.

Multi-threading cranberray was a piece of cake. Since all state being mutated is contained in a thread, there is very little to no effort needed to parallelize it.

Every thread in cranberray gets access to the immutable source render data (scene, lights, etc), it’s own render context (random seed, stack allocator, stats object) and the render queue.

Work for cranberray is split into render “chunks” which are simply rectangular segments that every thread will be rendering to. These chunks are stored in a queue (the render queue).

To get an element from the queue, every thread is simply using an atomic increment to retrieve their next chunk.

int32_t chunkIdx = cranpl_atomic_increment(&renderQueue->next) - 1;
for (; chunkIdx < renderQueue->count; chunkIdx = cranpl_atomic_increment(&renderQueue->next) - 1)


Note the “-1” here. That’s because if we simply read from next and then increment, multiple threads could have read the same value for their chunk index. Instead, we want to read the post incremented value returned by the atomic instruction. This guarantees that our queue index is unique to us.

Once a thread has it’s chunk, it simply writes to the output target at it’s specified location.

Cranberray also captures some useful metrics from it’s rendering such as rays spawned, time spent and other useful bits and pieces.

Originally this data was global, but we can’t have that if we’re going to access it with different threads. One option is to modify every item in that data with an atomic operation but atomic operations aren’t free and the user has to be careful about where and how they update the data. Instead, a much simpler approach is to store metric data per-thread and merge it at the end. This is the approach cranberray takes, and the peace of mind afforded by this strategy is sublime.

### Conclusion

This post was a simple rundown of a few small but essential things from cranberray. Next post we’ll be looking at texture mapping and bump mapping! Until next time. Happy coding!

## Intro

In our last post, we looked at building our BVH using SAH as our guiding heuristic. Now it’s time to look at how we intersect that BVH.

## High Level

Querying the BVH is actually quite simple. We test our ray against our parent node, if it intersects, we test the ray against both of it’s children. If the ray intersects either of it’s children, we test that children’s children. Once we reach a leaf node, we add it to our list of nodes that we’ve intersected.

This sounds perfect for recursion, but recursion never scales particularly well. Instead a queue or stack is a more effective method to represent the nodes to be tested.

There are three pieces that bring the BVH traversal together. The math library, the AABB intersection code and the (roughly) branchless intersection response code.

We’ve already seen the math library in this post, as a result, let’s take a dive into the intersection routine!

## Ray/AABB Intersection Routine

There are quite a few excellent sources on the ray/AABB intersection routine [1][2][3]. We’ll do a small dive into how we get to the final implementation.

The intersection routine is based on the slab intersection method presented in [3]. We want to define our AABB as a set of 3 slabs in each axis. A slab is simply represented by a pair of planes, the slab being the space in between them.

In order to determine if we’re intersecting the AABB, we want to determine if our ray passes through the shared intersection of each slab.

In order to determine that, we want to determine if our ray enters each slab before it exits any of the other slabs.

A simple approach to this could be:

slabEntryX = intersect(ray, startPlaneX)
slabExitX = intersect(ray, endPlaneX)
slabEntryY = intersect(ray, startPlaneY)
slabExitY = intersect(ray, endPlaneY)

if(slabEntryX is greater than slabExitY or slabEntryY is greater than slabExitX)
return missed collision

return hit collision


As a result, our core approach is to find if there is any entry value that is greater than our exit values. And if so, we’re not intersecting the slab.

To intersect a ray and a plane, we need to define our ray as:

$\vec{p} = \vec{a}*t+\vec{b}$

and our plane as

$0 = \vec{n}.\vec{p}-d$

We can then plug our ray function into our plane function to get:

$0 = \vec{n}.(\vec{a}*t+\vec{b})-d$

And solving for $t$

$d = \vec{n}.\vec{a}*t+vec{n}.\vec{b}$

$d-\vec{n}.\vec{b} = vec{n}.\vec{a}*t$

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

Now that we have $t$ we can calculate it for each plane, since our planes are axis aligned, our intersection math gets quite simple. Here is an example for our x axis.

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

where $n = (1, 0, 0)$

$\frac{d_x-b_x}{a_x} = t$

And repeat for each axis.

Now let’s look at our intersection code:

cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


This code looks different than our math, but it’s essence is the same.

In this code we’re testing 4 AABBs against a single ray instead of one ray against one AABB.

Notice that in this line:

cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);


We’re implementing $\frac{d_x-b_x}{a_x} = t$ where $b_x, b_y, b_z$ is rayOLanes, invD is $\frac{1}{a_x}, \frac{1}{a_y}, \frac{1}{a_z}$ and aabbMin is $d_x, d_y, d_z$.

The next few lines:

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger = cv3l_max(t0s, t1s);


Are there to account for the situation where our ray direction is negative. If our ray direction is negative on one axis, t0 and t1 will actually be flipped. As a result, we want to simply make sure that t0 is the smaller one and t1 is the larger one. [1]

And finally,

cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


These lines are calculating to see if any of our latest entrance is earlier than our earliest exit. If so, cfl_mask will set the appropriate bit for that AABB.

That’s it for our intersection routine! Now we can take a look at what we do with these results.

## Intersection Response

The previous intersection code is nicely packaged in the caabb_does_ray_intersect_lanes routine.

In our bvh traversal, the call to it looks like this:

uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


Where intersections hold the bits of the AABBs that intersect with our ray.

One approach to handling these bits is to simply loop through each AABB, determine if it’s bit is set and add it’s children volumes to the queue of being tested or to the final list of leaf nodes.

However, that’s no fun.

Instead it’s a lot more fun to make this code branchless using some SSE intrinsics!

Fair warning, this code can be complicated if you aren’t used to reading SSE intrinsics.

#define _MM_SHUFFLE_EPI8(i3,i2,i1,i0) _mm_set_epi8(i3*4+3,i3*4+2,i3*4+1,i3*4,i2*4+3,i2*4+2,i2*4+1,i2*4,i1*4+3,i1*4+2,i1*4+1,i1*4,i0*4+3,i0*4+2,i0*4+1,i0*4)
__m128i shuffles[16] =
{
_MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,1), _MM_SHUFFLE_EPI8(0,0,1,0), // 0000, 0001, 0010, 0011
_MM_SHUFFLE_EPI8(0,0,0,2), _MM_SHUFFLE_EPI8(0,0,2,0), _MM_SHUFFLE_EPI8(0,0,2,1), _MM_SHUFFLE_EPI8(0,2,1,0), // 0100, 0101, 0110, 0111
_MM_SHUFFLE_EPI8(0,0,0,3), _MM_SHUFFLE_EPI8(0,0,3,0), _MM_SHUFFLE_EPI8(0,0,3,1), _MM_SHUFFLE_EPI8(0,3,1,0), // 1000, 1001, 1010, 1011
_MM_SHUFFLE_EPI8(0,0,3,2), _MM_SHUFFLE_EPI8(0,3,2,0), _MM_SHUFFLE_EPI8(0,3,2,1), _MM_SHUFFLE_EPI8(3,2,1,0)  // 1100, 1101, 1110, 1111
};

uint32_t leafLine = bvh->count - bvh->leafCount;
{
__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));

}

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


Now this looks like a lot, so let’s take a look at it piece by piece. As we presented in the previous post, our leaf nodes are stored at the end of our array. This allows us to determine which nodes are leaf nodes simply by comparing their indices against that boundary. The following bit of code does exactly that:

__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));


The first line here simply compares our indices against our “leaf boundary”. The result of this tells us if our index is a parent or a leaf node.

Once we know which ones are our leaf nodes, we convert this to a bitmask representing the elements of the vector.

And finally, we mask out the indices that have no intersections associated with them.

The result of this operation is that we have two bitmasks that represents the parents that have intersected the ray and the leaves that have intersected the ray.

Next we want to turn this bit mask into a series of indices.

uint32_t leafCount = __popcnt(childIndexMask);


The first two lines are very simple, it tells us how many leaf nodes have intersections and how many parent nodes. The next pair of lines require a little more explaining.

Right now, our queueIndices vector represents the indices to the AABBs that we’ve tested against our rays.

One approach we could implement is to simply loop through every element, test their bits and read the values if those bits are set.

However, what we can do instead is pack all of our active indices to the front of our vector allowing us to read only the ones that have reported an intersection. Reducing our branching at the cost of shuffling our data. That’s what _mm_shuffle_epi8 is doing. It’s taking a lookup table (shuffles[]) and determining where our values should be moved in the vector to pack them to the front of the vector.

Since our masks only contain 4 bits, they have a maximum of 15 values making for a very manageable look up table.

Once we have our indices packed, we simply have to read them.

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


These parts are relatively self explanatory. We’re reading our indices and doing the appropriate work if our values are leaves or parent nodes. To understand why we’re using a union to convert our SSE vector, I recommend looking into type punning. [5]

## Conclusion

That’s it! I don’t really have any future work for this part of cranberray, I’m actually quite happy with the end result.

Next we’ll be looking at the ray tracer’s sampling strategy. Until then, happy coding!