Richard Meredith allowed us to repost his article on basic multithreading in Unity.
Richard Meredith allowed us to repost his article on basic multithreading in Unity that allowed him to reduce the cost of the flow field algorithm in Bad North.
I recently took a look into basic multithreading in Unity, to reduce the cost of the flow field algorithm in Bad North. Having got the main thread time on mobile down from 2.75ms to 0.25ms, I thought I’d write it up as a bit of a case study/tutorial on very basic threading in Unity. I’m hoping this will be accessible to anyone with a bit of Unity/C# scripting experience and an idea of what threading is, even if you’ve never written any threaded code before.
- I did not parallelise the algorithm itself. All I’ve done is push it to a different thread than the rest of the game.
- This is my first time threading anything with c#/.NET (I’m more experienced in c++), so there may well be some things that could be done better.
The Flow Field Algorithm
I’m not going to talk about the algorithm in detail, but as an overview the flow field is data spread across the navmesh that indicates:
- How many agents are nearby.
- How close they are.
- What direction they are in.
Each frame, every agent drops some “liquid” onto the flow field, which stacks up on the nearby vertices. This liquid then flows around the navmesh vertices and also “evaporates”. The effect is visualised by the blue / orange debug graphics on the gif below.
The following actions occur on to the flow field data
- Systems sample data from the field.
- Agents add data to the field.
- The propagation (or flow update) occurs.
This occurs in the following way in the main thread
Note that all of the thread diagrams in this post are simplified, and the horizontal axis (time) is not to scale.
Various systems/scripts read and write to the flow field throughout the frame (green arrows). Then during the flow update, a lot of reads and writes are done to the flow field. The flow update is is the significant costly part and that’s what we want to run on a separate thread. There are some implications of doing this, but we’ll worry about that later.
First let’s make a thread!
Creating The Thread
I went through a few iterations of implementing the threaded system, and discarded some apparently obvious solutions as unworkable. My first attempt was to create a new thread each frame, let it do its work and then die. However, creating a thread is very slow (almost exactly the same time as the flow field update) and generates around 500B of garbage.
My second attempt was to use the BackgroundWorker system, which pools threads to avoid the setup time. However, while fast, using a BackgroundWorker still produced around 500B of garbage each frame.
My solution was to go a little bit lower level. I create a thread, that runs a simple, infinite loop, then synchronise that loop with the main Update() loop in Unity. Which gives me something looking a bit like this:
The important thing here are the two EventWaitHandle variables, which are used to sync the threads. If a thread requests to wait on a wait handle that is Reset (e.g. line 13), it will block until Set() is called on that EventWaitHandle by another thread. The SignalAndWait() function is the same as calling Set() and WaitOne() on the two parameters (effectively releasing the other thread, and blocking the current one).
On Awake() our ThreadedBehaviour class will create the child thread and start it, which will begin running the code in ChildThreadLoop() and immediately wait on the ChildThreadWait. It will remain in that blocked state until the Update() function is called.
In Update(), we unblock the child thread and block ourselves until the child has completed (line 34). Once the thread completes its work, it will unblock the main thread and block itself again (line 21). Which looks like:
So right now, we are doing the flow update on another thread, but we still wait for the work to be done. We haven’t saved any time on the main thread; we’re actually a little slower because there is a small overhead in the synchronisation actions. But we have pushed some work to another thread, and we know how to synchronise threads. So this is a couple of big steps on the road to getting them running in parallel.
Now the Flow Update is on a separate thread, but we’re not really “multithreading” or saving any time yet, because we block the main thread for it to work:
To get the threads running in parallel, we can just stop the main thread from blocking, by changing the Update() function to:
This is very simple, and gets our threads looking like the following:
At least, that’s what we hope it’s going to do, but you can’t actually be sure. This is probably the most important part of multithreading, so we’ll work through a thought experiment to identify and resolve the issue.
Problems with Parallelisation
We can’t know how long that flow update is going to take. In particular, you can’t know how long it will take in relation to the other thread(s). You might think that it’ll probably finish in time; you might be able to test and see that it seems to be finished in time. But multithreaded systems are non-deterministic and you can’t really prove them safe by testing. They need to be safe by design.
For this example, let’s see what happens if the flow update takes longer than expected:
In the second frame, we have both threads reading and writing from the flow field at the same time, which is totally undefined and problematic behaviour. We also have the main thread trying to restart the child before it’s even finished, which again, is not OK.
Re-Synchronising the Threads
So let’s start by ensuring the child thread can’t take longer than the main thread, which we achieve by blocking the main thread at the beginning of the Update() function:
Note that on the first Update, we don’t wait as the MainThreadWait variable starts in its “set” state. But on subsequent frames, if the child thread is still running, the main thread will get held up, like so:
Now our two loops are in sync, but we still have both threads interacting with the same data in parallel. There are a few ways to deal with this, but the approach that I chose involves restructuring data so that nothing is directly shared between threads.
In the original implementation, we just had one set of data that we worked on. That was the data upon which we did our three interactions, which (to recap) were:
- Systems sample data from the field.
- Agents add data to the field.
- The propagation (or flow update) occurs.
We will now split our data up into 3 sets of data that correspond to the actions:
- Results of the algorithm (used for sampling).
- Pending changes (used for adding).
- Working data (various data used for the propagation).
This is snapshot of the state of your data, i.e. the flow field. Both the main thread and the child keep their own versions. The child thread will update its version of the data, which is then copied into the main thread every frame.
Important: Use value types or perform deep copies for both the results and the pending changes. Copying references to objects in the world is not OK, unless you can guarantee their data will not change.
The main thread does not write to its copy of the data, as this would just be overwritten during the copy operation. Instead, it generates a list of pending changes, to be given to the child thread. This will require some changes, if you are used to modifying the data on the fly, but it should be easy to identify the minimal data you need for your changes.
For the flow field, the changes ends up being a list of “liquid applied” by each agent. What your changes are is very application-specific. It will be all the data you need from the world for your algorithm to update correctly, as the child thread will not read from world data.
This is going to be some combination of results, pending changes, and any other state required for producing the results.
At this point I won’t get into the specifics of what the data structures looks like, because that is application-specific, but the order of operations in the main thread Update() is now:
Which gives a more complicated thread diagram:
Remember that the time axis is not to scale and your copy operations should be extremely fast. I use the main thread to do the copying in and out of child thread, because it’s easy to know that won’t cause any conflicts and it allows the main thread to act as a master, controlling the key sync operations.
In general, we don’t expect the flow update to take longer than the main thread, we just artificially extended it as a thought experiment. But, if it does, we no longer have any conflicts between the threads. The child thread is only reading and writing from it’s own data, and the two threads are properly synchronised. We are now multithreaded and thread-safe!
Typically, if the flow update is quick, we should have each frame looking like the following:
Summary / Thoughts
This is the technique I used to process our flow field updates off the main thread. It’s not the only way to push things onto another thread and there are a few things to bear in mind.
Main Thread Cost
There is still a small amount of main thread cost, made up of:
- Posting pending changes.
- Copying changes in.
- Copying results out.
- Thread synchronisation.
The bulk of which is thread sync. I’m a little bit surprised by this, as my understanding of EventWaitHandles is that they are simple thread signalling, so maybe there is something faster.
It’s also important to note that this technique adds a pipelining delay of 1 frame. Changes that are queued up in frame x, get updated and the results are available in frame [x+1], but are not actually used until frame [x+2]. For the case of the flow field this is fine as it’s a slowly propagating thing that settles over several frames anyway. This would also be true of most cases.
One thing to watch out for is that your copy behaviours are not producing garbage. My pending changes were Lists and the obvious way to copy them across is to simply copy the list reference and create a new List for the main thread. I avoided this by copying the individual elements of the lists and treating them as reusable buffers.
Exiting the Thread
In the example I’ve posted here, I never actually kill off the thread. In the main game, I’ve implemented a slightly clumsy method to allow the thread to exit (there is a bool variable that drives the while loop), but I would like to find a clean way to exit the thread, within Unity’s frameworks.
I kick off the flow update at the very end of LateUpdate(). This is because it gives very predictable behaviour with regards to the other scripts in the project. However, it does mean that it’s likely to overlap with Rendering and Physics updates, both of which Unity is already Multithreading (on most platforms). It seems that in Unity 5.5 at least, both the main thread and child thread are running with priority set to Lowest, so it would seem that there would be no improvements to be made.
Despite the copy overhead and the frame delay, I like this method. It’s very simple and very clean. Your background tasks get a whole frame to update and the interactions between the threads is limited and contained. In addition, outside of that one copy/sync point, the main thread can interact with its copy of the data without consideration for the multithreadedness of the system.
There is an alternative approach that removes the one-frame delay and the need for copying, but has a number of other drawbacks. I think I’ll write up in a separate post, along with some answers to the questions on how to exiting the thread.