Victor's Devblog

Amazing Performance Gain?

March 22^nd, 2026 at 18:40

Today's devblog is on time or, dare I say, even early? It was actually ready on Thursday as early week has been prolific and there was so much to share, but I didn't want to extend this post even more with what I have been doing since (expect a great one next week, very visual work incoming).

Short:
In first part, I wrote a brief recap of last week's work and an analysis of 2 main remaining issues regarding physics performance.
In second and third part, I explain how each issue was tackled.

From Last Week's Work

Recap Of Current State

Here is a video explaining, with visuals, what I did last week and how my physics engine worked until Monday. I thought this would help follow what's going on more easily?

Analyzing Sources Of Improvement

This week started with a physics execution time of 6ms average with spikes at 12ms.
Therefor I decided to look at the following 2 issues:

Before:
- At map spawn:
    - Organize each static collider's triangles into its own BVH tree (AABB)

- Every tick:
    - Broadphase:
        - Find static colliders maybe collided with
        - Use BVH trees of those colliders to find potentially colliding triangles
        - Organize those triangles into BSP tree (separator planes) => Issue 1
            - Use main triangle spread direction for triangles partitioning => Issue 2
    - Narrow phase, 10x4 sub-steps per tick:
        - For each car's ellipsoid, use BSP tree to find potentially colliding triangles
        - Test those triangles and find deepest penetrating one

Issue 1

Computing BSP tree every physics update is a trade off between doing it fast and finding a good separating plane that will truly make narrow phase faster. Right now, it improves overall performances of physics system but it still costs a lot (about 2.5ms).

Issue 2

If you watched above video (or read last week's devblog update) you should have realized that the axis (and therefor the separating planes) I use to separate the triangles in 2 groups is not very good. Ideally, both planes would be facing away from each others or at least very close, to maximize space dichotomy (i.e. create two regions of space that barely intersect each other such that a shape rarely has to test intersection with triangles in both regions).

A better Separating Axis

I started on Monday by tackling issue #2 first.

Why Chosen Axis Is Wrong

The main axis along which triangles are split in two groups is generated from triangle centroids only. This means that it doesn't account for the triangular nature of the data: their size and proportions. While this algorithm is (almost?) perfect for classifying points in space in two distinct regions, applied to triangles -in particular non equilateral triangles of various sizes- it may lead to put into a group a triangle whose vertices extend in the other group's region of space.

Contextualizing The Problem

Here, we are not talking about random triangles in space: this is a game and triangles usually respect a certain logic. In particular, most of the time, the triangles are organized along strips (a line of contiguous triangles connected together through their edges). If you think about a flat straight road composed of many quads, each quad split in two triangles, you can imagine that a lateral cut of the road should easily split it in two group of triangles that barely overlap in space.

So how do we find this plane to cut the road along?

A Few Contenders

The approach I went for is to take the most central triangle(s). I compute mean centroid and select whichever triangle is the closest to it. The chosen triangle has a normal and 3 edges that we can work with.

Off course we could attempt to split space along planes parallel to triangle, but if road is flat (a lot of the time that's the case) it would be a poor choice. So I decided to consider the 3 directions perpendicular to edges and triangle's normal.

And for each direction, just like what I explained last week, I split triangles in two groups. This is quite costly since each split requires sorting the triangles (but I have hope to save this cost later, don't worry). Finally, I pick whichever axis gave me the best split of space into distinct regions, here it would be the following (which happens to be perfect since both planes are the same) :

Also note, this is a coincidence that triangles are grouped just as before, only with a different pair of planes.

Visualizing The Outcome

This video shows you the result of this new triangle partitioning algorithm. Just in case it's not clear: the more "red" the planes are the more they are at the "root" of the binary partitioning tree, and the more "blue" they are the more they are near leaves. Also, the plane-sausage following the car is, in my opinion, this week's eye-satisfying artifact.

As you can see, it is much better at finding a good splitting plane axis. You can see this because both first and second plane of each split (one is smaller and darker color) are often right on top of one another, meaning it somewhat "recognizes" how the road was split into thin lateral triangles without falling into the pit of attempting to separate them left right. Only when the car moves near the road's edges (or when enough sideways splits have been performed in earlier nodes of the tree) you can start seeing a longitudinal split which is actually great for distinguishing triangles that could collide with wheels against those that should collide with the car's side bumper.

With this change, the tree (still constructed during broad phase) is significantly better at narrowing down what triangles could be colliding during one of the narrow phase sub-steps. Performance seem to stay within 1ms and 2ms in most usual cases. However, as can be seen around 34s into the video, it reaches 5ms and spikes to 8ms in areas with a higher triangle count. While this works for simpler maps like this one, it will require more optimizations for the type of maps I want to support.

Baking Triangle's Binary Search Tree

What Is Costing Time

With previous improvement, we can see how increasing the number of triangles collected by broad phase will still quickly affect performance negatively. This is because of those 3 sorts I perform before choosing the best separating axis. And there could be even more axes to test if I could allow spending time testing them. For instance, I could allow non-even split of the two spaces (there may be 10 triangles forming the road and 20 forming its border, so maybe it is beneficial to allow a 10-20 split at first to then have more relevant separating planes).

Baking Binary Search Tree

The nice thing about the approach I had so far is that the binary search tree is computed at runtime exactly from triangles that are collected by broad phase (from the BVH tree, that binary search tree using bounding boxes instead of separating planes). The BSP tree partitions no more triangles than strictly necessary, even if one of the overlapped entity has million others far from the car. However, this last statement is rarely ever an issue and the exponential searching speed inside a binary tree means using a baked binary search tree with more triangles should just win over computing the binary tree every frame.

So this is what I've been doing: replacing the BVH tree by a BSP-like tree computed at map load. Then broad phase, using a single bounding box per static collider not represented in below schematic, can solely return the list of static colliders to consider this update.

And narrow phase's sub-steps can use the already-constructed BSP tree for each of those collected static collider for an efficient search of deepest colliding triangle.

Even Faster?

I tested, it was great, but I didn't stop there and added a little bit more: instead of returning just the list of static colliders, what if broad phase would also start pruning the tree off branches that really have no chance colliding during any sub-step of this frame's update? Narrow phase can then start exactly where broad phase stopped, from non-pruned nodes, for an even faster search.

In above example this means that narrow phase would only consider BSP branches starting at nodes 2 and 4 of collider 1, and node 1 of collider N. The algorithm to do so is fairly simple; broad phase tests tree's separating planes against car's extended bounding box, and:
- if bounding box is completely below the plane, no collision possible this update with this side of the tree => branch is ignored
- if bounding box is intersecting the the plane, result will be uncertain at any sub-step => branch is returned
- if bounding box is completely above the plane, testing it every sub-step is pointless => process its children recursively

Addendum On Memory Usage

I forgot to talk about a concern I had implementing this and how satisfying it was looking for a solution: memory usage.

The tree nature of the problem, especially the part where it is pruned by broad phase, sounds like a lot of memory indirection (e.g.: std::vector of std::vector). It also means that emptying the outer vector will release memory held by inner vectors, forcing a lot of allocations and deallocations. While this could be solved with an Arena Allocator (custom very fast memory allocator for temporary/frame allocations), I thought it was more interesting finding a way to output broad phase result differently (it also makes debugging easier since I can keep the data "alive" more easily and still without new allocations).

Solution is nothing crazy but I like it, just 2 flat vectors (that can be cleared every frame without deallocation) like so:

// Broad phase writes:
std::vector<BroadPhaseResult> broadPhaseResults;
std::vector<int32_t> broadPhaseResultStartingNodes;

// With:
struct BroadPhaseResult
{
    entt::entity staticCollider;
    int32_t startingNodeBeginIndex;
    int32_t startingNodeCount;
};

(all static collider's non-pruned nodes/branches are stored in the same vector, just need to remember which node indices are associated to each static collider)

Visualizing The Outcome

Here is the outcome; you may realize that it is less satisfying than previous one (Sausage, where? Non mais allô quoi?!) although planes are a little better place around the blocker's geometry (I allow myself to test more axes during map load). At times I show various depth of the branches returned by broad phase, and sometimes I also display all triangles that could potentially be tested during this frame's update (all triangles that are below the branches returned by broad phase). Your eagle eyes may even spot some triangles showing even when the option is disabled; can you guess why?

One more thing before we talk about performance: do you see how, sometimes, as the car moves forward, suddenly less separating planes are shown ahead of it? (this is part of the less satisfying reason btw). You should wonder: how is that the case and how would it be that going forward may mean less triangles are considered ahead of car?

The reason is quite complex but interesting; it relates to how separating planes are stored in the binary search tree and why I always use 2 parallel planes rather than 2 distinct planes tailored to their triangle groups. Instead of storing the plane in the node to be visited, they are stored in the parent. This is helpful for 2 reasons:
- root node doesn't need a plane, it is not a split from anything
- planes being parallel, storing them is just a point and a normal for first plane, and a single floating point offset for second plane

So when I explained how branch pruning was performed I kinda lied. Both branches must be tested by broad phase, and if one test is uncertain (narrow phase could find the car below or above it depending on what happens really this update) then the parent node must be returned instead. So if ahead of a car is a branch that we may or may not be above the plane of, its parent node is returned (meaning the entire sub-tree ahead is returned by broad phase, even if other child branch could have been pruned). But if the car advances a little bit forward, suddenly it may be that both halves have certain results (e.g. one can be pruned and the other is always "true" so only child branches need to be tested).

What about performance?

This (final?) update's performance was actually spoiled in this post's first recap video (did you spot it?). With this final change it is now far below 1ms in most normal condition, and even hugging the most complex walls in the test map I struggled to make it reach above 1ms with very rare spikes above 2ms. This is far above the expectations I had when I started worrying about performance two weeks ago, and makes me think that maybe a split-screen version of the game will be possible in the future?

For now, I think I am done with physics performance for a while. I hope I didn't lose too many of you with so much math. Next week's update will be focused on visual features again!