
Diary of a Path Tracer – Wrapping Up — September 27, 2020


Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

Intro

In our last post, we took a look at texture sampling and bump mapping. In this post, we’ll be talking about the last few additions that were made to cranberray before it was wrapped up: converting from a recursive path tracer to an iterative path tracer, and implementing tone mapping using the ACES tone mapping curves.

Iterative Path Tracing

Converting cranberray from a recursive path tracer to an iterative path tracer was surprisingly simple. Before this change, cranberray would shoot a ray from the camera and run the appropriate shader at the intersection point. This shader would in turn shoot a ray in some arbitrary direction to determine how much light should illuminate that location. This would continue recursively until a ray either reached its maximum recursion depth or terminated by exiting the scene completely and sampling our skybox.

This approach makes sense if you think of every point as reflecting light that it has received from another location. As a result, you need to know how much light you’ve received before you can shade your point.

However, you can instead consider the opposite view: that every point is continuously absorbing light. From that view, the amount of light that the viewer receives is modeled by

light = lightSource * absorb_0 * absorb_1 * absorb_2 ... * absorb_n

With that in mind, we can simply calculate how much light we absorb at each point, without needing to know the quantity or quality of the light that we will eventually receive.

Instead of thinking of it as the light source being recursively reflected to the viewer:

light recursive(i)
{
    if(i >= maxDepth)
        return 0;
    return emissionAtPoint(i) + absorptionAtPoint(i) * recursive(i + 1);
}

We instead think of it as the light being absorbed at every stage:

light iterative()
{
    light = 0;
    absorption = 1;
    for(i = 0; i < maxDepth; i++)
    {
        light += absorption * emissionAtPoint(i);
        absorption *= absorptionAtPoint(i);
    }
    return light;
}

This has some interesting properties!

  • If we notice that our absorption has reached 0, we can stop iterating (see the sketch after this list).
  • We can multi-thread this much more easily than the recursive approach.
  • It has the desirable property of not easily introducing stack overflows.
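As a sketch, that early-out is a small addition to the loop above (the epsilon test is my own framing, since a floating-point absorption value rarely reaches exactly 0):

light += absorption * emissionAtPoint(i);
absorption *= absorptionAtPoint(i);
if(absorption < epsilon)
    break; // nothing beyond this point can contribute meaningfully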

I also find this version easier to think about which allows me more flexibility in the solutions I can imagine.

Tone Mapping

I will only touch briefly on tone mapping; I find that [1], [2] and [3] already provide excellent views on this topic.

Cranberray makes use of a histogram-based average luminance calculation alongside the ACES tone mapping operator. You can find more information about these in the referenced articles.
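For reference, the widely used rational-curve fit of the ACES operator (Krzysztof Narkowicz’s approximation, in the same family of curves discussed in [3]) is compact enough to sketch here. Note that this is the fitted approximation applied per channel, not the full ACES reference pipeline:

float aces_approx(float x)
{
	// Narkowicz's rational fit to the ACES filmic curve.
	float mapped = (x * (2.51f * x + 0.03f)) / (x * (2.43f * x + 0.59f) + 0.14f);
	// Clamp to [0, 1].
	return mapped < 0.0f ? 0.0f : (mapped > 1.0f ? 1.0f : mapped);
}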

However, one portion that I would like to talk about is why we use the geometric mean to calculate the average luminance; hopefully I can provide some intuition about the geometric mean.

As you can see in [1], the luminance histogram stores the logarithms of our luminance values instead of their raw values. We sum up the logarithms, divide by the total number of values, and convert the result back to a luminance.

This is the definition of the geometric mean. You can calculate the geometric mean either through repeated multiplication:

\sqrt[n]{i_0*i_1*...*i_{n-1}}

Or by using the logarithmic identities:

e^{\frac{1}{n}(ln(i_0)+ln(i_1)+...+ln(i_{n-1}))}
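As a minimal sketch in C, the logarithmic route looks like this (assuming strictly positive values, since the log of zero is undefined):

#include <math.h>
#include <stdint.h>

float geometric_mean(float const* values, uint32_t count)
{
	// Sum the logs, average them, then exponentiate back.
	float logSum = 0.0f;
	for (uint32_t i = 0; i < count; i++)
	{
		logSum += logf(values[i]);
	}
	return expf(logSum / (float)count);
}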

It took me quite some time to understand why the geometric mean would be a desirable value when doing tone mapping. Then, I realized that it is desirable when you are speaking in terms of multiplicative properties.

If I’m speaking of light, I’m thinking of the intensity either being twice as bright, or half as bright. In most scenarios, I don’t necessarily care if something is 2 units brighter versus 2 units dimmer. The center value of something that is twice as bright and half as bright is the geometric mean.

You can derive this by imagining that we have our center value x.

Imagine that we have a set of numbers, 0.5x, 1x, 2x and I would like to determine the value of x. If we repeatedly multiply our values, we get 0.5*1*2*x^3=x^3 and we can then retrieve our value by taking the cube root.

Generalizing, given 3 variables

y,z,w

where

y=ax, z=bx, w=cx

and

a*b*c = 1 (1)

Then we can find a value for x by

\sqrt[3]{y*z*w}=m

since

\sqrt[3]{ax*bx*cx}=m

\sqrt[3]{a*b*c*x^3}=m

given (1), then

\sqrt[3]{x^3}=m

x=m

As an example, say we have the values 0.2, 0.7, 1.3. If we take the geometric mean we get \sqrt[3]{0.2*0.7*1.3} which is 0.56670511081. Now if we divide our original values by that value we get 0.35291723364, 1.23521031776 and 2.2939620187 which when multiplied together give 1!

In terms of desirable properties, let’s take a look at a more visual example.

These images show the difference between using the arithmetic mean to normalize our values and using the geometric mean to normalize our values. The top row is our arithmetic mean and the bottom row is our geometric mean.

Our colors are simply generated by starting with the values 2, 4, 8, 16, 32, ..., 2^n, calculating their arithmetic and geometric means, and then dividing each value by the respective mean.

In our first example we have 2, 4 and 8. We divide by 4.6666 with the arithmetic mean and divide by 4 with the geometric mean.

You’ll notice that as our numbers get larger and larger, the arithmetic mean slowly loses all detail in the low range of our values and only maintains the brightest values at the edges.

Notice also that the geometric mean remains centered despite the ever growing magnitude of our values. This is desirable since most of our image would become very dark, very quickly.

Conclusion

That’s it for this series! Our next series will likely focus on a completely different topic, possibly in the world of real-time rendering. Cranberray was an excellent learning experience across various areas of path tracing, and I would love to come back to it someday. Until next time, happy coding!

Resources

[1] http://alextardif.com/HistogramLuminance.html

[2] https://bruop.github.io/exposure/

[3] https://64.github.io/tonemapping/

Diary of a Path Tracer – Math Library — August 22, 2020


Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

Intro

This is the first in a series of posts detailing the functionality of my hobby path tracer. We will be looking at the underpinnings of it all: the math library, cranberry_math.

The goals of cranberry_math are simple: make SIMD-accelerated mathematics easier and more fun to use. There is already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few); however, I decided to write my own for fun!

Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple: you can imagine that a SIMD register (or vector register) is a set of 4 “lanes”. These lanes are generally independent of each other, but you can perform operations on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as v_n (v_0, v_1, v_2) and an element of one of those registers as v_ne_m (v_0e_0), where v_0e_0 would be element 0 of SIMD register 0.

I.e. if I take v_0 and add it to v_1, the result is the equivalent of v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3).

[Figure: lane-by-lane addition of two vector registers]
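In SSE terms, that four-wide add is a single intrinsic. A minimal sketch (standard SSE intrinsics, arbitrary values):

#include <xmmintrin.h>

__m128 v_0 = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // lanes hold 1, 2, 3, 4
__m128 v_1 = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f); // lanes hold 5, 6, 7, 8
__m128 v_2 = _mm_add_ps(v_0, v_1);               // lanes hold 6, 8, 10, 12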

On its own, this might not be particularly useful. But vector register operations have a number of desirable properties.

Let’s imagine we have a fantasy processor, and this processor takes 1 cycle for every addition. This means that to add 4 pairs of numbers together, it would take us roughly 4 cycles, or 1 cycle per add. Now, if our vector addition takes 4 cycles to complete, that’s still 1 cycle per add. However, consider what happens if by some magic we only take 3 cycles for our vector addition. That means overall, we only spend 3/4=0.75 cycles per add! In our fantasy processor, we can make this as low as 1 cycle for 4 adds, or 1/4=0.25 cycles per add. You may ask how we can make our instruction run in less than 4 cycles even though we’re still doing 4 adds. In our fantasy processor, we can do this by simply running 4 adders at once with our data. This means that although every adder takes 1 cycle to complete, they run in parallel, making the overall throughput 1 cycle/4 adds.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently: we could reduce our program latency by up to 4x! (In reality this is unlikely to happen, but we can get fairly close.)

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X, Y, Z components in a single register, such as v_0 = (x, y, z, unused), and work with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over your original single-element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say that in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles/dot product. I propose (as has been proposed elsewhere as well) that you instead use each register to store a single component of multiple vectors, such as v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), etc. With this layout, we can achieve the dot product by computing v_x*v_x+v_y*v_y+v_z*v_z, which is a total of 5 operations that each take 1 cycle in our fantasy processor, giving us 4 dot products, or a throughput of 1.25 cycles/dot product.
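As a raw SSE sketch of those 5 operations, with each __m128 holding one component of 4 vectors (the function and variable names here are mine for illustration):

#include <xmmintrin.h>

// Computes 4 dot products at once: 3 multiplies + 2 adds.
__m128 dot4(__m128 ax, __m128 ay, __m128 az, __m128 bx, __m128 by, __m128 bz)
{
	__m128 xx = _mm_mul_ps(ax, bx);
	__m128 yy = _mm_mul_ps(ay, by);
	__m128 zz = _mm_mul_ps(az, bz);
	return _mm_add_ps(xx, _mm_add_ps(yy, zz));
}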

For SSE, according to the Intel intrinsics guide, a multiply operation has a latency of 4 cycles on Skylake, and an add also has a latency of 4 cycles. As a result, in a fantasy processor with no out-of-order functionality or multiple issue, our wide dot product would take roughly 20 cycles to complete, or 20/4=5 cycles per dot product, while the dedicated dot product instruction has a latency of 11 cycles. (See [1] and [2] for info on reading the Intel intrinsics guide values.) This is not indicative of real performance numbers; modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way is that it scales trivially to SIMD registers of size 8 or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not foolproof. It introduces difficulty if we want to add similar components together (x with x, y with y) within one vector. However, I have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
cran_align(16) typedef union
{
	float f[cran_lane_count];
	__m128 sse;
} cfl;

typedef union
{
	struct
	{
		float x, y, z;
	};

} cv3;

typedef struct
{
	cfl x;
	cfl y;
	cfl z;
} cv3l;

The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within its respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has its own set of lanes? That is a direct implementation of the explanation presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);

The API for cv3l is simple: it looks exactly the same as its cv3 counterpart. This makes it relatively easy to switch from one to the other, simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);

One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format; it doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result, your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate location. However, this is surprisingly slow (I forget the exact reasons; a reference for that would be awesome).

Instead, what you can do is load your vectors as a set of vectors:

v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)

and then shuffle the values into a set of temporary registers

v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)

and finally shuffle them into the final registers

v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)

[Figure: shuffling four loaded vectors into per-component registers]

(There’s a mistake with the arrows for Z_2 and Z_3: they should point to the fourth register.)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
{
	__m128 loadedVectors[cran_lane_count];
	for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
	{
		uint8_t const* vectorData = (uint8_t*)vectors;
		loadedVectors[i] = _mm_load_ps((float const*)(vectorData + indices[i] * stride + offset));
	}

	__m128 XY0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 XY1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 Z0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(3, 2, 3, 2));
	__m128 Z1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(3, 2, 3, 2));

	return (cv3l)
	{
		.x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
		.y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
		.z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}
	};
}

With all this in place, I can write a vectorized ray/AABB intersection test using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
{
	cv3l rayOLanes = cv3l_replicate(rayO);
	cv3l invD = cv3l_replicate(cv3_rcp(rayD));
	cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
	cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

	cv3l tsmaller = cv3l_min(t0s, t1s);
	cv3l tbigger  = cv3l_max(t0s, t1s);

	cfl rayMinLane = cfl_replicate(rayMin);
	cfl rayMaxLane = cfl_replicate(rayMax);
	cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
	cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
	cfl result = cfl_less(tmin, tmax);
	return cfl_mask(result);
}

That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);
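The returned value is a lane mask with one bit per AABB (produced by _mm_movemask_ps through cfl_mask), so consuming it might look like this sketch (push_node is a hypothetical helper here, not cranberray’s actual traversal code):

for (uint32_t i = 0; i < activeLaneCount; i++)
{
	if (intersections & (1 << i))
	{
		// Lane i's AABB was hit; continue traversal into that node.
		push_node(testQueueIter[i]);
	}
}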

We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!

References:

[1] https://stackoverflow.com/questions/40203254/intel-intrinsics-guide-latency-and-throughput

[2] https://stackoverflow.com/questions/35859449/why-are-some-haswell-avx-latencies-advertised-by-intel-as-3x-slower-than-sandy-b

[3] https://users.ece.cmu.edu/~franzf/teaching/slides-18-645-simd.pdf

[4] https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Diary of a Path Tracer – The Beginning

Within the past few months I’ve been implementing a path tracer during my weekends and evenings. Having reached what I consider to be an acceptable stopping point, I thought it might be interesting to detail my current implementation alongside the references that inspired me. Let’s jump in!

This series will touch on a variety of features that were implemented in the path tracer.

These posts will try to explain every aspect of the path tracer as it stands, but they might not be as comprehensive as the sources referenced to create it. As a result, I will strive to include all relevant sources for each feature.

You can find the path tracer here: https://github.com/AlexSabourinDev/cranberries/tree/cranberray

Next, we’ll talk a bit about the underpinning of the path tracer, the cranberry_math library. See you then!

Derivation – Importance Sampling The Cosine Lobe — June 7, 2020


Introduction

I’ve recently been diving into the world of importance sampling and I decided to share my derivation for importance sampling the cosine lobe.

Shade

When we’re shading a point in path tracing, we typically shoot a ray from our surface in a uniformly random direction contained in a hemisphere centered about our normal. This has the downside of introducing quite a bit of variance into our renders.

Imagine that we have a very bright light that only occupies a very small projected area from our shaded point.

[Figure: a bright light subtending only a small projected area from the shaded point]

We would be very likely to miss this light with most of our samples and our render could turn out much darker than we would expect.

This is where importance sampling comes in.

Imagine that your illumination is a polar function:

[Figure: illumination plotted as a polar function]

If we were to sample a random variable with a distribution that matches this function, we would be much more likely to hit the important points of our function (hence, importance sampling).

I won’t dive deeply into this topic, as there are a variety of excellent resources detailing it [1].

The essence of it, however, is that you want to find a Probability Density Function (PDF) that matches the shape of your illumination function. Once you’ve defined this PDF, you can sample it using the Cumulative Distribution Function (CDF).

Derivation

Since our cosine lobe illumination will look like this:

[Figure: the cosine lobe about the surface normal]

We will use this as the basis to derive our distribution since we’re most likely to get the most light from directions arriving parallel to our normal.

Thankfully, our cosine lobe has an analytical formula that we can use as our PDF.

PDF(\omega) = C*cos(\theta) (1)

Our PDF must integrate to 1, so we integrate it across our hemisphere:

\int_{\Omega}PDF(\omega)d\omega

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi

Plug in (1)

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}C*cos\theta sin\theta d\theta d\phi

C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi

\int cos\theta sin\theta d\theta = -\frac{1}{4}cos2\theta

\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta

-\frac{1}{4}cos\pi+ \frac{1}{4}cos0

\frac{1}{4}+\frac{1}{4}

\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta=\frac{1}{2} (2)

Plug in (2)

C*\int_0^{2\pi}\frac{1}{2} d\phi

C*\frac{1}{2}*2\pi

C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi=C*\pi (3)

Since our PDF has to integrate to 1,

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi = 1

Plug in (3),

C*\pi=1

C=\frac{1}{\pi} (4)

Finally, plug in (4) into our PDF,

PDF(\omega) = \frac{cos(\theta)}{\pi} (5)

Now that we have our PDF, we can express it in terms of \theta and \phi.

PDF(\theta,\phi)d\theta d\phi = PDF(\omega)d\omega

PDF(\theta,\phi)d\theta d\phi = PDF(\omega)sin\theta d\theta d\phi

PDF(\theta,\phi)=PDF(\omega)sin\theta

PDF(\theta,\phi)=\frac{cos\theta sin\theta}{\pi} (6)

Now we integrate with respect to \phi to get PDF(\theta)

\int_0^{2\pi}\frac{cos\theta sin\theta}{\pi}d\phi = 2cos\theta sin\theta

PDF(\theta)=2cos\theta sin\theta

And then to get PDF(\phi),

\frac{PDF(\theta,\phi)}{PDF(\theta)}=PDF(\phi)

\frac{cos\theta sin\theta}{2cos\theta sin\theta \pi}=\frac{1}{2\pi}

PDF(\phi)=\frac{1}{2\pi}

Now we want to calculate the CDF of each function,

CDF(\theta)=\int_0^\theta PDF(\theta) d\theta

CDF(\theta)=\int_0^\theta 2cos(\theta)sin(\theta) d\theta

CDF(\theta)=\int_0^\theta sin(2\theta) d\theta

CDF(\theta)=\frac{1}{2}-\frac{cos(2\theta)}{2}

CDF(\phi)=\int_0^\phi PDF(\phi) d\phi

CDF(\phi)=\int_0^\phi\frac{1}{2\pi} d\phi

CDF(\phi)=\frac{\phi}{2\pi}

Now we want to invert our CDFs so that we can sample them using a uniform random variable y,

y=CDF(\theta)

y=\frac{1}{2}-\frac{cos(2\theta)}{2}

\frac{1}{2}-y=\frac{cos(2\theta)}{2}

1-2y=cos(2\theta)

\frac{cos^{-1}(1-2y)}{2}=\theta (7)

For \phi,

y=CDF(\phi)

y=\frac{\phi}{2\pi}

y*2\pi = \phi (8)

Now that we have our CDFs and our PDFs, we can finally calculate our direction.

In pseudocode, you can simply do:

\theta=\frac{cos^{-1}(1-2rand01())}{2}

\phi=rand01()*2\pi

With these directions, you can now sample your scene:

\frac{SampleScene(SphericalTo3D(\theta, \phi))}{PDF(\omega)}

Plug in (5)

\frac{SampleScene(SphericalTo3D(\theta, \phi))\pi}{cos\theta}
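Putting (7) and (8) together, here is a minimal C sketch of the sampling step, where r0 and r1 are two draws from rand01() (assumed to return a uniform float in [0, 1)); the direction is generated in tangent space about the normal, and transforming it to world space is left out:

#include <math.h>

// Cosine-lobe sample about the +Z (normal) axis, in tangent space.
static void sample_cosine_lobe(float r0, float r1, float out[3])
{
	float theta = acosf(1.0f - 2.0f * r0) * 0.5f; // equation (7)
	float phi = 2.0f * 3.14159265f * r1;          // equation (8)
	out[0] = sinf(theta) * cosf(phi);
	out[1] = sinf(theta) * sinf(phi);
	out[2] = cosf(theta); // also the numerator of PDF (5)
}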

Conclusion

That’s it! These formulas will sample the hemisphere more heavily where it receives more light, as defined by the cosine lobe. The results are pretty awesome.

Bibliography

[1] https://www.scratchapixel.com/lessons/3d-basic-rendering/global-illumination-path-tracing/global-illumination-path-tracing-practical-implementation