Intro

In this post, we’ll be looking at the steps taken to create an optimized transform hierarchy. This is by no means the be all end all of transform hierarchies. If anything, it’s more of a log of considerations that have taken place as the creation of the transform hierarchy took place.

How do we store our Transform?

The Ops

The first question we have is determining how we intend to represent our transform, do we want to represent our transform as [vector, quaternion, scale] or do we want to represent our transform as a 4×4 matrix?

It was presented to me that representing the transform as [vector, quaternion, scale] will facilitate the user’s usage of the transforms as it’s difficult to extract the rotation and scale from a 4×4 matrix. As a result, I will maintain the local transform as [vector, quaternion, scale].

But now we have to decide where do we want to do the matrix conversion? If we look to the pipeline as a series of transformations being applied to a transform we would have a selection from these functions:

  • C(T) – Conversion from transform to 4×4 matrix
  • Tt(T) – Transformation from one transform to another (applies scale, rotation and translation)
  • Tm(M) – 4×4 matrix multiplication operation for a matrix (applies scale, rotation and translation.

If we take these operations, there are to pipelines that make sense to me:

P1(T) = Tm(C(T))

This pipeline converts the transform immediately to a matrix and applies the transformation through a 4×4 matrix multiplication.

P2(T) = Tt(T)

This pipeline only applies the transformation without any conversion to a matrix. Where Tt can be see as the composition of:

  • Rt(T) – Applies rotation
  • St(T) – Applies scale
  • Pt(T) – Applies translation to the transform which applies rotation and scale to the translation. (P for position, as to not confuse with the T for transform)

Which results in the function:

Tt(T) = Pt(Rt(St(T)))

Now that we have our transformation pipeline, we can determine what the most effective pipeline is in terms of the number of operations.

For P1(T), Tm is a 4×4 matrix multiplication which amounts to 16 dot products. A dot product of a 4 element vector results in 4 multiplies and 3 adds or 7 operations. As a result, Tm would be equal to 7 * 16 = 112 ops.

As for P2(T), Rt is a rotation from one quaternion to another resulting in 28 operations. Assuming a non-uniform scale for St makes it 3 operations and Pt is an application of translation, rotation and scale, which respectively have 3 ops, 30 ops and 3 ops or 36 ops. (Numbers from wikipedia)

Calculating the result gives us:

P1(T) = 112 ops

P2(T) = 28 + 3 + 36 = 67 ops

This gives us a clear winner for P2. As a result, we’re going to stick to our [vector, quaternion, scale] representation for the transform from local to global space!

The Storage

Another perk of the transform representation is that our transform will only take 10 floats instead of 12 (assuming we wouldn’t store the bottom row of the 4×4 matrix transform representation) which we could crunch down to 9 floats if we only store the quaternion as a 3 float vector, reconstructing the fourth element when we need it. However, I will only make that change if we end up being severely backend bound.

The decisions

Once I decided how I was storing my transforms, I moved on to deciding how I was going to store the hierarchy in memory. There’s quite a few options here and we’ll go through each of the ones I considered.

The pointer tree

The first and simplest option, is to represent the transform hierarchy as a tree structure. Each parent would hold pointers to it’s children and as it updated, it would simply update it’s children recursively.

One perk of this approach is that access to the children is simple, you only have to follow the pointers. Another, is that you can easily add and remove a child with very little effort. For added usability, the children could also hold a pointer to their parent. The problem with this approach, is that the memory overhead of the pointers increases as a transform has more children. Another downside is that this is typically implemented using the standard new/delete (malloc/free) and the memory has no guarantees of being tightly packed. Another issue is that traversing the tree deeper requires us to wait on the memory of our child transform to be loaded before we can queue up the next load (A good article about this can be found here). There are many things you can do to mitigate these issues but as you do them, you move closer and closer to the next approaches.

The flat hierarchy

Another approach, is to flatten the tree into an array. You do this by simply deciding on one traversal of the tree and inserting the nodes of the tree into the array as you go. (Wikipedia has a good reference on tree traversals) This allows your memory to be relatively fresh in the cache (see wikipedia for a good link on locality). This is excellent in terms of performance when transforming from local space to global space as your parent’s transform is nearby. It also allows you to reference your transforms by index instead of by pointer, giving you some solid memory savings if you can use a smaller data type such as a uint32_t or a uint16_t. One downside of this approach is that adding a transform after the tree is flattened requires shifting all of the memory in order to accommodate the new transform in it’s appropriate location. As a result, references to the transforms that were relocated need to be updated. This can easily be done by adding an extra level of indirection. We would have an array of references to real indices, and as the indices are changed, we update them in that array. This approach is good, it allows us to iterate through our transforms efficiently however it also adds a fairly large cost to adding a new transform to the hierarchy since a lot of memory might need to be shifted around to accommodate the traversal as well as accessing the transforms through their references due to the cost of having to wait for the memory from the index map to be loaded into memory before being able to access the transform. (Which could be mitigated by caching the real index and using that if the map isn’t dirty).

The hybrid

The approach I decided on using was a simple one. Instead of enforcing a hierarchy, I simply add the transforms to the array and require that the children be placed after the parents. All children hold on to the index of their parents and when transforming from local to global, I simple read the transform from the parent. This approach has quite a few limitations: you can’t re-parent a transform to a transform that is further in the array without moving the transform (which subsequently might require the children to be moved and so on and would also require changing the handle to all of those bringing us back to the index map of the previous approach) this approach also doesn’t inherently support removal of transforms but that could be added with another array that keeps track of the free indices. The perks of this approach are that accessing an element in the array is fast as you don’t need an index map, you can simply index into the array and transforming the array is fast as long as transforms are near each other in memory.

Decision

I believe that in most cases, adding, removing and re-parenting is not something that is done commonly. As a result, the flat hierarchy might be the most effective method in most circumstances. It allows you to make sure that your transform is nearby in memory. However, I will be sticking to the hybrid method for the sake of this post, as it is simple and allows me to work quickly and effectively. It also has the benefit of still allowing me to place transforms near each other at creation and another benefit of it’s simplicity is that it would be fairly simple to change it from this hybrid method to the flat hierarchy method.

It’s also far less code and I’m lazy…

But overall, I believe that it comes down to use case. If you need fast addition, removal and reparenting, the first option might be good for you. If you want fast transformations from locals to globals the flat hierarchy is probably best at the cost of slower addition, removal and reparenting. As for the hybrid, if your game is simple and doesn’t require much, it’s easy to debug, understand and can perform as well as the flat hierarchy (if not better, i.e. accessing transforms) for some use cases and if the user is careful with it’s use.

Aside

One thing to mention, is that I don’t believe that these approaches are mutually exclusive. I don’t see why you would need to enforce a single hierarchy, you could use a different hierarchy for different use cases. Say animations, where you won’t be removing any transforms, then you can store your memory in any way you like.. Or if you have a use case where transforms are added and removed frequently, then the first option might be good for that piece of code.

Maybe if you don’t enforce any hierarchy and you use a mix as needed, you can get the benefit of all with relatively low performance overhead. (Of course, the maintenance overhead grows)

Transformation Order

Another important note about how I decided to handle transformations is that all transforms are fixed for a step of the hierarchy. What I mean by this, is that as you write to the hierarchy, you will only be able to read those changes once you call cranh_transform_locals_to_globals. My reasoning for this choice is simple, instead of having unclear positions determined by whether the parent’s logic was run first or the child’s logic was run first, I decided to define a fixed point in time where all transforms are transitioned from one step to the other. This allows the user to have a clear picture of the transforms in one step of their system instead of having the transforms vary based on the order at which the other systems’ logic is run.

This means that logic will either use the previous frame’s position or the current frame’s position with no ambiguity between both.

Here’s an example where ambiguity could have arisen:

Say a ship has a turret that always points at a target by using it’s position and it’s target position. If the turret updates first, it will aim to it’s target. If the ship then moves, the turret is now failing to point to the target correctly. In order to fix this, we could force parent’s to always be updated before their children, but this introduces additional complexity, makes it hard to reason about the state of the system and introduces a dependency if we wanted to turn some of these logics into jobs. However, if we have a fixed transform update state, we could update all the physics, then update the transforms and finally update the turret’s logic leaving it to point at the right location.

Dirty Scheme

At this point, I needed a scheme for only updating the dirty transforms. My first thought on how to approach this was by using an interval set; a simple data structure that would merge overlapping intervals as they are inserted. However the complexity of merging the intervals every time a transform was updated seemed too large. One approach could have been to only do the merging at the start of the cranh_transform_locals_to_globals but instead I decided on something different.

Dirty Stream

The concept is simple, I use 2 bits as indicators. If the 2 bits are equal to 0x02 that means that all the following transforms are dirty. If the 2 bits are equal to 0x01 that means the following transforms are not dirty. This simple scheme allows me to mark a transform and it’s children as dirty in constant time by simply setting the bits at the current transform’s index to 0x02 and the bits at the last child’s index to 0x01. Then, to update the transformations, we simply loop through the stream, if we read a 0x02, then we increment our dirty stack and if we read a 0x01 then we decrement our dirty stack. If our dirty stack number is over 0, that means we’ve read more dirty flags than clean flags and we want to update the current transforms. If it’s 0, we don’t update anything and move along.

An example of this could be:

0b10 00 00 00 01 00 00

If we read each pair we see that the first one is 0x02, that means we increment our dirty stack.

Dirty stack = 1

Then we move forward, and since our dirty stack is 1, we update all the transforms from index 0 to index 5.

At index 5, we see that the value is 0x01, that means we want to decrement our dirty stack.

Dirty stack = 0

Reading indices 6 and 7, our dirty stack is 0 thus we don’t update anything.

A special case is the value 0b11, this one simple means that only this transform will be updated.

SIMD

At this point, our transform performance is significantly core bound. We spend a significant amount of time multiplying and adding elements of vectors and Quaternions. There were 2 approaches I could think of to handle this problem. One of them, was to take 4 vectors and split them into arrays for X, Y and Z. This approach would work well if I had my memory in this format, but I had decided to keep my transforms together instead of splitting them into component arrays.

The perks of the split buffer approach were that I could have 100% register occupancy, allowing for maximum throughput. However, instead I decided to keep my format in a vec3, vec3, Quaternion format in order to be able to quickly retrieve a transform as a whole.

The downside of keeping my memory in this format, is that I now can only occupy 75% of my registers with the vec3s, not as good as 100% and it also doesn’t scale to larger registers such as AVX or AVX512.

Note: I attempted to use a hybrid of the full component array approach and the single transform approach with a batch transform approach, however, the cost of building a single transform from these batches was too steep. I expect many systems to request transforms from the hierarchy, and making this process slower will cause these systems to perform poorly in order to make a single system slightly faster. As a result, I’d rather pay the cost of a slightly slower transformation step for the performance of other systems interacting with the hierarchy. (See the BatchTransform branch for these changes)

Threading

The last piece for this post is threading. We got our single threaded performance to be quite good. But we can only get so far with a single core. As a result, I figured that introducing a mechanism to multithread the transform hierarchy would be quite effective. Due to the way we structured things, it’s actually pretty easy to change our hierarchy to support threading.

Since our hierarchy is self contained in one large buffer, we can simply split the hierarchy into multiple “groups”. Groups are then going to be fed into jobs as needed.

A hierarchy from root to child should never be split across groups. That would introduce a dependency between threads and we definitely don’t want that. Groups are also 64 byte aligned in order to avoid false sharing.

That’s it really…

Implementing things this way allows the external systems to decide how they synchronize the groups. For this example, I simply split the groups into different threads.

With this, our performance is now down to around 8ms for 1 million transforms. Pretty good, we’re now completely bound by the memory subsystem.

You can find the threading change on the threading branch of the transform hierarchy github.

Conclusion

That’s it!

Here’s what I’ve learned when working on this hierarchy:

  • Keeping your hierarchy in a flat array is a pretty effective way to go. As stated above, I decided to go with the hybrid approach of just adding transforms to the end of the array. This has been effective for adding, but makes removing and re-parenting very difficult.
  • It’s difficult to properly SIMD a transform hierarchy if it has to be accessed by external system. Gathering the data from the various batch transforms takes time and the cost is moved from the computation to the retrieval and writing of the transforms. SIMD still adds some gain to the system, but only 128 bit registers. Larger registers become harder to fill as we go.
  • An explicit transformation step makes it easy to multi-thread the hierarchy. You can simply work in these “groups” and as long as you remain within those buffers you have very little synchronization to worry about.
  • There’s plenty of methods to handle dirty schemes. My method works decently, but interval sets are another approach that could be taken that might have more benefit for the computation step while adding a cost to the external writing step.

Overall, there are many tradeoffs to be made. And I don’t think there’s one obvious answer. I found myself having to choose between convenience and performance quite a few times, as well as choosing where some performance losses will go in order to improve the performance in other areas.

The results are pretty good. We can spread our transform step to all threads and currently my example code runs at an average 7ms for 1 million transforms being constantly changed.

Future Work?

This was an exercise in determining an effective approach to transform hierarchies. I’m pretty happy with my approach, but there’s no way to know how it would hold up in production. As a result, I probably won’t be coming back to it again. But if I do, here’s some additional things that could be improved:

  • Can we improve the SIMD usage without increasing the read and write costs of single transforms?
  • Should we remove non-uniform scaling? I did this in the uniform_scaling branch for a pretty decent performance boost. (2ms less per transform tick and some improvements in the example code’s physics step as well)
  • Removing and re-parenting a transform isn’t currently supported. I didn’t need it for this example, but it might be useful later on. There’s a few potential approaches to this but I’m not sure if the hassle is worth adding it to this hybrid method. Maybe the stricter flat tree method would benefit more.
  • Resizing a hierarchy might be necessary. If the hierarchy needs to be dynamic, it would be possible to add the possibility to resize the hierarchy into a larger one.

Link to repo here.

Also take a peak at molecular-matters’ post that sparked this blog post here.