Cranberry Lair

Walking the programming parameter space

Intro

In our last post, we took a look at texture sampling and bump mapping. In this post, we’ll be talking about the last few additions that were made to cranberray before it was wrapped up: converting from a recursive path tracer to an iterative path tracer, and implementing tone mapping with the ACES tone mapping curves.

Iterative Path Tracing

Converting cranberray from a recursive path tracer to an iterative path tracer was surprisingly simple. Before this change, cranberray would shoot a ray from the camera and run the appropriate shader at the intersection point. This shader would in turn shoot a ray in some arbitrary direction to determine how much light should illuminate that location. This would continue recursively until a ray either reached its maximum recursion depth or terminated by exiting the scene completely and sampling our skybox.

This approach makes sense if you think of every point as reflecting light that it has received from another location. As a result, you need to know how much light you’ve received before you can shade your point.

However, you can instead consider the opposite view: that every point is continuously absorbing light. From this view, the amount of light that the viewer receives is modeled by

$light = lightSource * absorb_0 * absorb_1 * absorb_2 ... * absorb_n$

With that in mind, we can simply calculate how much light we absorb at each point without needing to know the quality or quantity of the light that we are receiving.

Instead of thinking of it as the light source being recursively reflected to the viewer:

light recursive(i)
{
    return emissionAtPoint(i) + absorptionAtPoint(i) * recursive(i + 1);
}


We instead think of it as the light being absorbed at every stage:

light iterative()
{
    light = 0;
    absorption = 1;
    for(i = 0; i < maxDepth; i++)
    {
        light += absorption * emissionAtPoint(i);
        absorption *= absorptionAtPoint(i);
    }
    return light;
}


This has some interesting properties!

• If we notice that our absorption has reached 0, we can stop iterating.
• We can multi-thread this much more easily than the recursive approach.
• It has the desirable property of not easily introducing stack overflows.

I also find this version easier to think about which allows me more flexibility in the solutions I can imagine.

Tone Mapping

I will only touch briefly on tone mapping; I find that [1], [2] and [3] already provide excellent views on this topic.

Cranberray makes use of the histogram-based average luminance calculation alongside the ACES tone mapping operator. You can find more information about these in the referenced articles.

However, one portion that I would like to talk about is why we use the geometric mean to calculate the average luminance; hopefully I can provide some intuition about it.

As you can see in [1], the luminance histogram stores the logarithms of our luminance values instead of their raw values. We sum the logarithms, divide by the total number of values, and exponentiate the result back to a luminance.

This is the definition of the geometric mean. You can calculate the geometric mean either through repeated multiplication:

$\sqrt[n]{i_1*i_2*...*i_n}$

Or by using the logarithmic identities:

$e^{\frac{1}{n}(ln(i_1)+ln(i_2)+...+ln(i_n))}$

It took me quite some time to understand why the geometric mean would be a desirable value when doing tone mapping. Then, I realized that it is desirable when you are speaking in terms of multiplicative properties.

If I’m speaking of light, I’m thinking of the intensity either being twice as bright, or half as bright. In most scenarios, I don’t necessarily care if something is 2 units brighter versus 2 units dimmer. The center value of something that is twice as bright and half as bright is the geometric mean.

You can derive this by imagining that we have our center value $x$.

Imagine that we have a set of numbers, $0.5x, 1x, 2x$ and I would like to determine the value of $x$. If we repeatedly multiply our values, we get $0.5*1*2*x^3=x^3$ and we can then retrieve our value by taking the cube root.

Generalizing, given 3 variables

$y,z,w$

where

$y=ax, z=bx, w=cx$

and

$a*b*c = 1$ (1)

Then we can find a value for x by

$\sqrt[3]{y*z*w}=m$

since

$\sqrt[3]{ax*bx*cx}=m$

$\sqrt[3]{a*b*c*x^3}=m$

given (1), then

$\sqrt[3]{x^3}=m$

$x=m$

As an example, say we have the values 0.2, 0.7, 1.3. If we take the geometric mean we get $\sqrt[3]{0.2*0.7*1.3}$ which is 0.56670511081. Now if we divide our original values by that value we get 0.35291723364, 1.23521031776 and 2.2939620187 which when multiplied together give 1!

In terms of desirable properties, let’s take a look at a more visual example.

These images show the difference between using the arithmetic mean to normalize our values and using the geometric mean to normalize our values. The top row is our arithmetic mean and the bottom row is our geometric mean.

Our colors are simply generated by starting with the powers of two $2, 4, 8, 16, 32, ..., 2^n$, calculating their arithmetic and geometric means, and then dividing each value by the respective mean.

In our first example we have 2, 4 and 8. We divide by 4.6666 with the arithmetic mean and divide by 4 with the geometric mean.

You’ll notice that as our numbers get larger and larger, the arithmetic mean slowly loses all detail in the low range of our values and only maintains the brightest values at the edges.

Notice also that the geometric mean remains centered despite the ever growing magnitude of our values. This is desirable since most of our image would become very dark, very quickly.

Conclusion

That’s it for this series! Our next series will likely focus on a completely different topic, possibly in the world of real-time rendering. Cranberray was an excellent experience in various areas of path tracing. I would love to come back to it someday. Until next time, happy coding!

Intro

This is the first in a series of posts detailing the functionality of my hobby path tracer. We will be looking at the underpinnings of it all, the math library cranberry_math.

The goals of cranberry_math are simple: make SIMD-accelerated mathematics easier and more fun to use. There is already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few); however, I decided to write my own for fun!

Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple. You can imagine that a SIMD register (or vector register) is a set of 4 “lanes”; these lanes are generally independent of each other, but you can operate on all of them at once using another SIMD register. For the purpose of demonstration, we will write a SIMD register as $v_n$ (e.g. $v_0, v_1, v_2$) and an element of one of those registers as $v_ne_m$, where $v_0e_0$ would be element 0 of SIMD register 0.

I.e. if I take $v_0$ and add it to $v_1$, the result is equivalent to $v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3)$

On its own, this might not be particularly useful. But vector register operations have a number of desirable properties.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations: applied efficiently, they could reduce our program latency by up to 4x! (In reality this is unlikely to happen, but we can get fairly close.)

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X, Y, Z components in the first three lanes, such as $v_0 = (x, y, z, unused)$, and work with vectors in that manner. For some common operations such as vector addition, this can give you a nice hypothetical improvement of 3x over single-element addition. However, we’re still leaving 25% of our potential performance on the table! This layout also causes problems for common operations such as the dot product.

Let’s say that in our fantasy processor, a dot product operation takes 2 cycles to complete when applied to 1 vector; that gives a throughput of 2 cycles/dot product. I propose (as has been proposed elsewhere) that you instead use each register to store a single component of several vectors, such as $v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3)$, etc. With this layout, we can compute the dot product as $v_x*v_x+v_y*v_y+v_z*v_z$, which is a total of 5 operations that each take 1 cycle in our fantasy processor, for 4 vectors at once, or a throughput of 1.25 cycles/dot product.

For SSE, according to the Intel intrinsics guide, a multiply operation has a latency of 4 cycles on Skylake and an add also has a latency of 4 cycles. In a fantasy processor with no out-of-order execution or multiple issue, our wide dot product would then take roughly 20 cycles to complete, or $20/4=5$ cycles per dot product, while the dedicated dot product instruction has a latency of 11 cycles. (See [1] and [2] for info on reading the Intel intrinsics guide values.) This is not indicative of real performance numbers; modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not foolproof. It introduces difficulty if we want to add similar components together (x with x, y with y) within one vector; however, I have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
cran_align(16) typedef union
{
    float f[cran_lane_count];
    __m128 sse;
} cfl;

typedef union
{
    struct
    {
        float x, y, z;
    };
} cv3;

typedef struct
{
    cfl x;
    cfl y;
    cfl z;
} cv3l;


The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within its respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has its own set of lanes? That is a direct implementation of the explanation presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);


The API for cv3l is simple: it looks exactly the same as its cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);


One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format; it doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result, your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to achieve our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate lane. However, this is surprisingly slow (I forget the exact reasons; a reference for that would be awesome). Instead, we first load each vector whole:

$v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)$

and then shuffle the values into a set of temporary registers

$v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)$

and finally shuffle them into the final registers

$v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)$


Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
{
    __m128 loadedVectors[cran_lane_count];
    for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
    {
        uint8_t const* vectorData = (uint8_t const*)vectors;
        // Gather each (x, y, z, u) vector by index. _mm_loadu_ps avoids assuming 16 byte alignment.
        loadedVectors[i] = _mm_loadu_ps((float const*)(vectorData + indices[i] * stride + offset));
    }

    // First shuffle pass: pair up the XY and Z components.
    __m128 XY0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(1, 0, 1, 0)); // (x0, y0, x1, y1)
    __m128 XY1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(1, 0, 1, 0)); // (x2, y2, x3, y3)
    __m128 Z0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(3, 2, 3, 2));  // (z0, u, z1, u)
    __m128 Z1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(3, 2, 3, 2));  // (z2, u, z3, u)

    // Second shuffle pass: split into the final x, y and z lane registers.
    return (cv3l)
    {
        .x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
        .y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
        .z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}
    };
}


With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
{
    cv3l rayOLanes = cv3l_replicate(rayO);
    cv3l invD = cv3l_replicate(cv3_rcp(rayD));
    cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
    cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

    cv3l tsmaller = cv3l_min(t0s, t1s);
    cv3l tbigger  = cv3l_max(t0s, t1s);

    cfl rayMinLane = cfl_replicate(rayMin);
    cfl rayMaxLane = cfl_replicate(rayMax);
    cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
    cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
    cfl result = cfl_less(tmin, tmax);
    // Collapse the per-lane comparison into a bitmask, one bit per intersecting lane
    // (cfl_mask is the library's wrapper around _mm_movemask_ps).
    return cfl_mask(result);
}


That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!


Within the past few months I’ve been implementing a path tracer during my weekends and evenings. Having reached what I consider to be an acceptable stopping point, I thought it might be interesting to detail my current implementation alongside the references that inspired me. Let’s jump in!

This series will touch on a variety of features that were implemented in the path tracer. Notably:

These posts will try to explain every aspect of the path tracer as they stand but they might not be as comprehensive as the sources referenced to create it. As a result, I will strive to include all relevant sources for each feature.

You can find the path tracer here: https://github.com/AlexSabourinDev/cranberries/tree/cranberray

Next we’ll talk about the underpinnings of the path tracer, the cranberry_math library. See you then!

Introduction

I’ve recently been diving into the world of importance sampling and I decided to share my derivation for importance sampling the cosine lobe.

When we’re shading a point in path tracing, we typically shoot a ray from our surface in a uniformly random direction on a hemisphere centered about our normal. This has the downside of introducing quite a bit of variance into our renders.

Imagine that we have a very bright light that only occupies a very small projected area from our shaded point.

We would be very likely to miss this light with most of our samples and our render could turn out much darker than we would expect.

This is where importance sampling comes in.

Imagine that your illumination is a polar function of direction. If we were to sample a random variable with a distribution that matches this function, we would be much more likely to hit the important points of our function (hence, importance sampling).

I won’t dive deeply into this topic, as there are a variety of excellent resources detailing this topic. [1]

The essence of it, however, is that you want to find a Probability Density Function (PDF) that matches the shape of your illumination function. Once you’ve defined this PDF, you can sample it using its Cumulative Distribution Function (CDF).

Derivation

Our cosine lobe illumination peaks along the normal, so we will use it as the basis to derive our distribution, since we’re most likely to get the most light from directions arriving parallel to our normal.

Thankfully, our cosine lobe has an analytical formula that we can use as our PDF.

$PDF(\omega) = C*cos(\theta)$ (1)

Our PDF must integrate to 1, so we integrate the PDF across our hemisphere:

$\int_{\Omega}PDF(\omega)d\omega$

$\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi$

Plug in (1)

$\int_0^{2\pi}\int_0^{\frac{\pi}{2}}C*cos\theta sin\theta d\theta d\phi$

$C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi$

$\int cos\theta sin\theta d\theta = -\frac{1}{4}cos2\theta$

$\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta$

$-\frac{1}{4}cos\pi+ \frac{1}{4}cos0$

$\frac{1}{4}+\frac{1}{4}$

$\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta=\frac{1}{2}$ (2)

Plug in (2)

$C*\int_0^{2\pi}\frac{1}{2} d\phi$

$C*\frac{1}{2}*2\pi$

$C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi=C*\pi$ (3)

Since our PDF has to integrate to 1,

$\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi = 1$

Plug in (3),

$C*\pi=1$

$C=\frac{1}{\pi}$ (4)

Finally, plug in (4) into our PDF,

$PDF(\omega) = \frac{cos(\theta)}{\pi}$ (5)

Now that we have our PDF, we can calculate our PDF in terms of $\theta$ and $\phi$.

$PDF(\theta,\phi)d\theta d\phi = PDF(\omega)d\omega$

$PDF(\theta,\phi)d\theta d\phi = PDF(\omega)sin\theta d\theta d\phi$

$PDF(\theta,\phi)=PDF(\omega)sin\theta$

$PDF(\theta,\phi)=\frac{cos\theta sin\theta}{\pi}$ (6)

Now we integrate with respect to $\phi$ to get $PDF(\theta)$

$\int_0^{2\pi}\frac{cos\theta sin\theta}{\pi}d\phi = 2cos\theta sin\theta$

$PDF(\theta)=2cos\theta sin\theta$

And then to get $PDF(\phi)$,

$\frac{PDF(\theta,\phi)}{PDF(\theta)}=PDF(\phi)$

$\frac{cos\theta sin\theta}{2cos\theta sin\theta \pi}=\frac{1}{2\pi}$

$PDF(\phi)=\frac{1}{2\pi}$

Now we want to calculate the CDF of each function,

$CDF(\theta)=\int_0^\theta PDF(\theta) d\theta$

$CDF(\theta)=\int_0^\theta 2cos(\theta)sin(\theta) d\theta$

$CDF(\theta)=\int_0^\theta sin(2\theta) d\theta$

$CDF(\theta)=\frac{1}{2}-\frac{cos(2\theta)}{2}$

$CDF(\phi)=\int_0^\phi PDF(\phi) d\phi$

$CDF(\phi)=\int_0^\phi\frac{1}{2\pi} d\phi$

$CDF(\phi)=\frac{\phi}{2\pi}$

Now we want to invert our CDF to sample it using our random variable $y$,

$y=CDF(\theta)$

$y=\frac{1}{2}-\frac{cos(2\theta)}{2}$

$\frac{1}{2}-y=\frac{cos(2\theta)}{2}$

$1-2y=cos(2\theta)$

$\frac{cos^{-1}(1-2y)}{2}=\theta$ (7)

For $\phi$,

$y=CDF(\phi)$

$y=\frac{\phi}{2\pi}$

$y*2\pi = \phi$ (8)

Now we have our CDFs and our PDFs, we can finally calculate our direction.

In pseudocode, you can simply do:

$\theta=\frac{cos^{-1}(1-2rand01())}{2}$

$\phi=rand01()*2\pi$

With these directions, you can now sample your scene:

$\frac{SampleScene(SphericalTo3D(\theta, \phi))}{PDF(\omega)}$

Plug in (5)

$\frac{SampleScene(SphericalTo3D(\theta, \phi))\pi}{cos\theta}$

Conclusion

That’s it! These formulas will sample the hemisphere where it is receiving more light defined by the cosine lobe. The results are pretty awesome.