Danny Weinbaum allowed us to repost his article full of tips and tricks on optimizing foliage inside the Unity engine.
In my last blog post, Art Tips for Building Forests, I outlined some things I keep in mind when building the look of 3D forests. I stuck only to art tips, that is, tips for what assets you should have and things to think about when placing them. One of the most important things about a lush forest is density, and something that comes part and parcel with density is performance optimization.
Since there are many ways to tackle forest optimization, there aren’t many universal guidelines for how to author your assets. However, one thing that is (almost) universal with regard to forest optimization is that your number one enemy will be draw calls (known in Unity as batches).
While poly count is also important, the problem is not as complex. One just needs to know the reasonable targets. Here are some polycounts for a few of my assets for reference on what might be reasonable for a given type of vegetation. If you find yourself needing too many alpha cards to reach your desired canopy fullness, you may look into increasing leaf coverage in the texture.
Finally, there is overdraw. This is when you have lots of overlap between alpha cards and objects. I don’t even think about this or look at this, because frankly I don’t know what I can do about it. I’m not going to modify the silhouette of my trees to reduce overlapping. In fact, I’m not going to modify the silhouette of my trees for any reason other than to make them look as good as possible artistically. I’m also not going to modify the layout of my forest to prevent overlapping vegetation. It’s hard enough to make a forest believable without worrying about overdraw. I just concentrate on keeping down poly counts and draw calls.
If you came to this article hoping to find hidden Unity settings you can tune to make things more “optimized”, I am sorry, but I don’t have any of those. To tackle this problem you do need a plan. Most plans will require your forest to be built from the ground up with your optimization strategy in mind. If your forest is a smorgasbord of assets from disparate sources, you may have a hard time with this. As I said, there are many different strategies to reduce draw calls. In this article, I will attempt to show you mine. My goal, in a sentence, was to combine the LOD1 meshes (second LOD stage) into mega meshes so draw calls consolidate as the player moves further away.
Step one in my plan was getting every gosh darn thing in the forest (or as close as possible) to use a single material. That’s right. All my forest assets share a single material. To that end I needed all my veggies to sample a single texture.
The reason I needed to do this was so I could safely assume any cluster of foliage assets could consolidate to a single draw call when combined. Had each tree or grass clump a unique material, the resulting combined mesh might have many submeshes, and therefore many draw calls. Authoring all assets to a single texture wasn’t difficult since I planned it from the start. I designated a 2048×2048 and allotted areas on the sheet for assets I knew I’d need, and then when I authored them I simply kept adding to the texture. It’s possible to automate this process. It would require modifying the UVs of all your foliage assets through script though, and sometimes you can organize things more efficiently if you do it by hand.
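For anyone who wants to try the scripted route, the core of it is rescaling each asset's own 0..1 UVs into the rectangle it was given on the shared sheet. Below is a minimal, engine-agnostic Python sketch of that idea. The function name `remap_uvs_to_atlas` and the rectangle layout are hypothetical, it assumes the asset's UVs do not tile outside 0..1, and it is not the tooling used for the forest described here.

```python
def remap_uvs_to_atlas(uvs, rect):
    """Map UVs from an asset's own 0..1 space into an atlas sub-rectangle.

    rect is (u_min, v_min, width, height) in normalized atlas space.
    Assumes the source UVs stay inside 0..1 (no tiling).
    """
    u_min, v_min, width, height = rect
    return [(u_min + u * width, v_min + v * height) for (u, v) in uvs]


# Hypothetical example: a grass card assigned the lower-left quarter of the sheet.
grass_uvs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
grass_rect = (0.0, 0.0, 0.5, 0.5)
atlas_uvs = remap_uvs_to_atlas(grass_uvs, grass_rect)
```

The texture pixels would have to be copied into the same rectangle, and in practice you would leave some padding between rectangles to limit bleeding at lower mip levels.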
The eventual goal is combining, but first we need to determine how we’re going to combine things. It is unwise to make one giant super mesh, for two reasons. The first is that Unity uses a 16-bit index buffer for its meshes, meaning each mesh can only have 64k verts max (however, it sounds like 2017.3 will use 32-bit index buffers, so this will be a moot point soon). Secondly, and most importantly, you won’t have a means of LODing individual groups or of taking advantage of frustum or occlusion culling, since your entire forest will be one mesh. The draw calls will be nice and low, but the triangle count will be absurd. We can afford a few extra draw calls to save a boatload of triangles.
Enter the hex grid. Basically I wrote a script that automatically groups all my veggies into a hex grid. I chose hexes over squares since hexes are more circular, making a simple per-group position check for player distance more accurate. A square will have corners that may be sticking out closer to the player.
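To make the grouping step concrete, here is a small Python sketch of one common way to bucket positions into a hex grid: convert each ground position to axial hex coordinates and round to the nearest hex. This is a generic technique rather than the actual script. The pointy-top layout, the `size` parameter (hex center to corner) and all function names are assumptions.

```python
import math
from collections import defaultdict


def axial_round(q, r):
    """Round fractional axial coordinates to the nearest hex (cube rounding)."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    # Recompute the component with the largest rounding error so x + y + z == 0.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)


def world_to_hex(x, z, size):
    """Ground position (x, z) to axial hex coordinates, pointy-top layout."""
    q = (math.sqrt(3) / 3 * x - z / 3) / size
    r = (2 / 3 * z) / size
    return axial_round(q, r)


def hex_center(q, r, size):
    """Ground position of a hex center, useful for per-group distance checks."""
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def group_by_hex(positions, size):
    """Return {hex coordinate: [indices of the objects inside that hex]}."""
    groups = defaultdict(list)
    for index, (x, z) in enumerate(positions):
        groups[world_to_hex(x, z, size)].append(index)
    return dict(groups)


# Hypothetical positions: two plants near the origin, one in the next hex over.
groups = group_by_hex([(0.0, 0.0), (1.0, 1.0), (17.32, 0.0)], size=10.0)
```

Each key in `groups` then stands for one hex group, and `hex_center` gives the single point to measure player distance against.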
You will see later that having everything grouped into a grid is helpful for a few other optimization tricks beyond just combining.
With me so far? Here is one more layer to the madness. The hex groups are then grouped into super hexes. When every regular hex in a given super hex is at its LOD2 state (the furthest LOD, which for me is basically two planes, sometimes referred to as an ‘imposter’ mesh), the super hex switches to a combined version of the LOD2 hexes, further consolidating draw calls. If you’re wondering why I didn’t simply make my first hex grid have larger hexes, the reason is that having smaller hexes allows me to capitalize on more granular transitioning to the far LODs, and more accurate culling (as mentioned in the opening paragraph of this section). When things are far enough and low poly enough, they can be combined into larger hexes. This transition is unnoticeable, since by this point all the small hexes are already at their furthest LOD; we’re just switching to a big combined version of them.
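The switching rule for super hexes can be written down in a few lines. The Python sketch below only illustrates the rule as described: a super hex draws its combined mesh when every child hex is at the far LOD, otherwise the children draw themselves. The data layout and the name `resolve_draw_list` are made up for the example.

```python
def resolve_draw_list(super_hexes, hex_lod, far_lod=2):
    """Decide what to draw for each super hex.

    super_hexes: {super_id: [hex_id, ...]}
    hex_lod:     {hex_id: current LOD index of that hex group}
    Returns a list of ("super", super_id) or ("hex", hex_id, lod) entries.
    """
    draws = []
    for super_id, children in super_hexes.items():
        if children and all(hex_lod[h] == far_lod for h in children):
            # Every child is at its furthest LOD: one combined draw replaces them.
            draws.append(("super", super_id))
        else:
            # At least one child is closer, so each hex keeps drawing itself.
            draws.extend(("hex", h, hex_lod[h]) for h in children)
    return draws


# Hypothetical state: super hex A is entirely far away, B still has a LOD1 hex.
super_hexes = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
hex_lod = {"a1": 2, "a2": 2, "b1": 1, "b2": 2}
draws = resolve_draw_list(super_hexes, hex_lod)
```

Here super hex A collapses to a single entry while B still contributes one entry per child hex.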
Initially when I began to build my system, I combined the LOD0s as well as the LOD1s. I have found this to be too memory heavy. It makes every vertex on the highest poly version of all your foliage assets have a unique memory footprint, since every combined mesh is a unique mesh. Additionally, the larger your meshes are the less granular frustum culling is, thereby drawing more triangles than necessary. You tend to be standing close to or right on top of the LOD0 hexes, so the issue of inaccurate frustum culling is exacerbated. I found combining only the LOD1s to be the perfect balance. The LOD1 is low poly enough to use very little memory, but has the most massive savings since most of my loaded world will be in a LOD1 state.
Unfortunately, combining meshes in Unity is not straightforward. I had to write my own script, since the scripts I tried on the Asset Store don’t handle submeshes or vertex colors. As for how you structure the data for your LODs, that is up to you. I opted not to use the built-in LOD component, since it would merely be a data container anyway (the hex groups handle the LOD switching). I simply use a MonoBehaviour with references to deactivated child GameObjects, so I can easily get the materials and submeshes from their renderers.
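For readers unfamiliar with what a combine script has to do, the heart of it is concatenating vertex data and shifting each mesh's triangle indices by the number of vertices already written. The Python sketch below shows just that, with vertex colors carried along and a guard for the 16-bit index limit. It is deliberately simplified and is not the Unity script: meshes are plain dictionaries, placement is translation only (no rotation or scale), and normals, UVs and submeshes are left out. The names and the `MAX_VERTS_16BIT` constant are assumptions for illustration.

```python
MAX_VERTS_16BIT = 65535  # assumed vertex ceiling for a 16-bit index buffer


def combine_meshes(instances):
    """Merge several placed meshes into one vertex/color/index set.

    instances: list of (mesh, offset), where mesh is a dict with
    'vertices' [(x, y, z)], 'colors' [(r, g, b, a)] and 'triangles' [int],
    and offset is the (x, y, z) placement of that instance.
    """
    vertices, colors, triangles = [], [], []
    for mesh, (ox, oy, oz) in instances:
        base = len(vertices)  # indices of this mesh shift by this amount
        if base + len(mesh["vertices"]) > MAX_VERTS_16BIT:
            raise ValueError("combined mesh would exceed the 16-bit vertex limit")
        vertices.extend((x + ox, y + oy, z + oz) for (x, y, z) in mesh["vertices"])
        colors.extend(mesh["colors"])  # keep per-vertex colors aligned with vertices
        triangles.extend(i + base for i in mesh["triangles"])
    return {"vertices": vertices, "colors": colors, "triangles": triangles}


# Hypothetical single-triangle "leaf card" placed twice.
leaf = {
    "vertices": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "colors": [(1.0, 1.0, 1.0, 1.0)] * 3,
    "triangles": [0, 1, 2],
}
combined = combine_meshes([(leaf, (0.0, 0.0, 0.0)), (leaf, (10.0, 0.0, 0.0))])
```

Because every input shares one material, the result needs only one index list. With mixed materials you would keep one index list per material, which is where the extra submeshes and draw calls come from.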
LODing and Culling using the Hex Groups
Here’s an important thing to know: the built-in Unity LOD component is expensive, especially if you have one active on every clump of grass. That is a lot of distance checks happening on the CPU. One of the wonderful benefits of having your world divided into tidy hex groups is that you can distance check on each group instead of each child object. Then the entire group LODs as one unit. I even have an animated transition that uses the alpha cutoff property in the shader (a simple way to get a nice blend!).
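As a rough illustration of the per-group check, the Python sketch below picks a LOD index from one distance measurement per hex center. The switch distances are placeholder values, not the ones used in the game, and `select_group_lod` is a hypothetical name.

```python
import math


def select_group_lod(player_pos, group_center, lod_distances):
    """Pick one LOD index for a whole group from a single distance check.

    lod_distances: ascending switch distances, e.g. (30.0, 120.0) means
    LOD0 below 30 units, LOD1 below 120 units, LOD2 beyond that.
    """
    distance = math.dist(player_pos, group_center)
    for lod, limit in enumerate(lod_distances):
        if distance < limit:
            return lod
    return len(lod_distances)


# Placeholder switch distances, chosen only for the example.
SWITCH_DISTANCES = (30.0, 120.0)
near = select_group_lod((0.0, 0.0), (10.0, 0.0), SWITCH_DISTANCES)
mid = select_group_lod((0.0, 0.0), (50.0, 0.0), SWITCH_DISTANCES)
far = select_group_lod((0.0, 0.0), (500.0, 0.0), SWITCH_DISTANCES)
```

With 20 to 50 objects per group, this replaces that many per-object checks with a single one.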
Another use for the groups is a sort of dynamic occlusion culling. I do not use Unity’s built-in occlusion culling, because last time I checked, it does not mix gracefully with multi-scene. The groups are few enough that a few raycasts at runtime can be done to see if a hex is entirely occluded from view. I do not do these raycasts every frame; I just give enough extra leeway in the hex bounds to make sure things appear in time when going over hills and around corners. You could never do this for every object. It would be way too expensive. But a few raycasts for a group of 20-50 objects will be worth it, especially when it is only happening every few frames. I only check for occlusion against terrain, since nothing else in a forest is substantial enough to reliably occlude.
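The two ideas in this paragraph, testing a sight line against terrain and spreading the tests over several frames, can be sketched without an engine. In the Python below, a sampled height function stands in for a physics raycast, and a modulo schedule decides which groups are tested on a given frame. Both functions are hypothetical stand-ins. A real version would test several points on the padded hex bounds rather than a single target point.

```python
def terrain_blocks(eye, target, height_at, samples=32):
    """Return True if the terrain rises above the straight line from eye to target.

    eye, target: (x, y, z) points with y up.
    height_at:   function (x, z) -> terrain height, standing in for a raycast.
    """
    for i in range(1, samples):
        t = i / samples
        x = eye[0] + (target[0] - eye[0]) * t
        y = eye[1] + (target[1] - eye[1]) * t
        z = eye[2] + (target[2] - eye[2]) * t
        if height_at(x, z) > y:
            return True
    return False


def groups_due_this_frame(frame, group_count, interval):
    """Spread the checks so each group is tested once every `interval` frames."""
    return [g for g in range(group_count) if g % interval == frame % interval]


def hill(x, z):
    """Hypothetical terrain: a ridge 10 units high between x = 40 and x = 60."""
    return 10.0 if 40.0 <= x <= 60.0 else 0.0


def flat(x, z):
    """Hypothetical flat terrain at height 0."""
    return 0.0


hidden = terrain_blocks((0.0, 2.0, 0.0), (100.0, 2.0, 0.0), hill)
visible = not terrain_blocks((0.0, 2.0, 0.0), (100.0, 2.0, 0.0), flat)
```

A group would only be culled once all of its test points come back blocked, and the padded bounds are what keep it from popping in late between checks.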
Authoring for Smooth LOD Transitions
A smooth LOD0 to LOD1 transition is usually better than a low polycount on the LOD0. If your transition is nice, you can move the LOD distance closer, thereby reducing total polys on-screen. One thing I do, particularly for my trees, is author with the LOD1 in mind. I build my trees from branch instances in my 3D package. This allows me to author a LOD model for a few single branches, and then update all the instances with this lower poly version to create my complete LOD1.
Terrain (by terrain I mean the ground itself) is a little outside the scope of this article, but I feel it’s important to note that I don’t use the built-in Unity terrain system. Therefore all my veggies are regular GameObjects. I opted to use regular meshes over the terrain system for three reasons: The first is that the default Unity terrain system is very poor on performance for the blobby mesh it gives you, and creates hundreds of extra draw calls and thousands of poorly distributed polygons I can’t afford to waste. Using a regular mesh allows for controlled distribution of polygon resolution where you need it. Secondly, authoring shaders for the default terrain system is very restrictive, and there are a lot of idiosyncrasies about it which are poorly documented. Lastly, I have plenty of holes and overhangs. The shader I use for my ground is fairly straightforward. It’s a three channel vertex splat with a macro overlay and normals.
It’s important that your vegetation shader’s general performance (known as pixel fillrate) is reasonably optimized. If you are using a deferred rendering path, getting your vegetation shader to be fully deferred can offer huge savings. Mine used to be forward rendered, and when I finally figured out how to get the same shader deferred, I shaved 30% off my render time. Creating a fully deferred vegetation shader with the required translucency was not at all straightforward, as you need access to the light attenuation, which can’t normally be accessed in a deferred shader program. I realize this next part is getting far into the weeds of Unity specifics, but for the curious, I use a surface shader with a custom lighting model that writes a very low fidelity translucency mask into the unused 2 bits in the G-buffer (the alpha of RT2). Then I added the translucency function to Internal-DeferredShading.shader. Here is the thread where a kind soul eventually helped me figure it out. It was so brutally difficult for me to figure out (it took two years) that I’d gladly help anyone with this if they reach out.
Baking lighting for all the plants in a forest results in massive lightmap memory usage, production unfriendly bake times, and I found it to not look that great anyway, since alpha cards don’t generally make nice bakes. I use light probes for anything smaller than a building. I use Light Probe Proxy Volumes for the trees, so there is a nice gradient to lighter values in the canopies. Since the trees are not static, and aren’t seen by the light mapper, I needed a way to darken the probes of shrouded areas manually. I wrote a simple script that tints all the probes within a given volume by a color of my choosing.
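The probe tinting idea is simple enough to sketch. The Python below multiplies the color of every probe inside an axis-aligned box by a tint. It is a simplification with made-up names: Unity light probes store spherical harmonics coefficients rather than a single color, so a real script would scale each coefficient per channel instead of one RGB value.

```python
def tint_probes_in_volume(probes, box_min, box_max, tint):
    """Multiply the color of every probe inside an axis-aligned box by a tint.

    probes: list of (position, color) with position (x, y, z) and color (r, g, b).
    Returns a new list and leaves the input untouched.
    """
    def inside(position):
        return all(lo <= c <= hi for c, lo, hi in zip(position, box_min, box_max))

    tinted = []
    for position, color in probes:
        if inside(position):
            color = tuple(c * t for c, t in zip(color, tint))
        tinted.append((position, color))
    return tinted


# Hypothetical probes: one under the canopy volume, one out in the open.
probes = [((0.0, 1.0, 0.0), (1.0, 1.0, 1.0)), ((50.0, 1.0, 0.0), (1.0, 1.0, 1.0))]
shaded = tint_probes_in_volume(
    probes, box_min=(-5.0, 0.0, -5.0), box_max=(5.0, 10.0, 5.0), tint=(0.5, 0.5, 0.25)
)
```

Only the first probe falls inside the box, so only it is darkened.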
- LOD1 half way up a tree – Some trees are tall enough that you can get away with the upper canopy being lower poly. This is where my packing all LOD stages into the same atlas and material is handy. It enables me to do this without adding extra materials to the LOD0.
- Dead trees, or trunks without canopies – I tend to reach my desired canopy density long before I reach my desired trunk density, so adding trunks without canopies is a thrifty way to make a forest look thicker.
- Mega patch assets – In flatter areas of your map, you can make large patches of grass as a single object, thereby reducing draw calls even when things are in the LOD0/uncombined state. Every one of my grass and undergrowth assets has a large patch version.
A Note About GPU Instancing
Unity introduced GPU instancing in 5.4. To use it you must draw the mesh from script. It’s different from combining meshes, and in some respects it’s better, since you can draw many meshes in a single draw call without paying the memory overhead associated with uniquely combined meshes. There are, however, a few disadvantages. Since you need to draw the meshes from script, if you want to do culling of any kind, you need to maintain a list of which meshes should be visible. Beyond the simple fact that you have to write all this yourself, keeping this list maintained in C# can be expensive. Furthermore, passing this massive list (or multiple lists) to the GPU is expensive too.
I have tested GPU instancing fairly extensively. I even replaced my entire grouping system with a GPU instancing system. I found that it was not as performant as my combining system, and had more limitations (such as not being able to use light probes, or Light Probe Proxy Volumes, which are essential for my forest lighting).
There is a new method introduced more recently, called DrawMeshInstancedIndirect, wherein you can use a compute buffer to make maintaining your instance lists more performant. It is possible this is an even better solution than my combining system. However, there isn’t much documentation on it, and I am not a good enough programmer to figure out how to do it. I tried. I failed.
The TLDR for performance optimization in lush forests:
- Draw calls will likely be your biggest problem. You need a plan to keep them reduced.
- Poly count is an easier problem; just make sure the triangle count of each asset is reasonable.
- I ignore overdraw considerations because there’s nothing I can do without ruining the look.
- All my foliage textures are atlased to one material, to ensure combined meshes become a single draw call.
- I use a grouping system that combines the LOD1s of all the meshes in a group.
- I don’t combine LOD0s because it uses too much memory as the meshes are higher poly.
- The groups can’t be too big, because then you can’t take advantage of frustum or occlusion culling.
- I use my own LODing script since I do LOD switching by group rather than by object.
- I use my own mesh combine script, to gracefully handle vertex color and submeshes.
- I author my foliage assets by hand, often with a plan for the LOD1 in mind.
- You can animate alpha cutoff to make a smoother transition.
- I use regular meshes rather than Unity’s built-in terrain system.
- I wrote a fully deferred vegetation shader to keep pixel fillrate down.
- Baking lighting for foliage was not feasible for me. I use light probes to light my forests, with a custom tinter volume for shrouded areas.
And that’s about it! My current combining system (I call it the hex grid) is my third iteration of a combining system, so these are the things that ended up working for me after a bit of trial and error. I use it for many of the objects in my world, not just foliage. It works well whenever there are a lot of objects of one kind (like barrels or rocks, for instance). At best, multiple instances of an object will consolidate to a single draw call, and at worst there will be the same number of draw calls as there were before, but with higher memory overhead. Please don’t hesitate to reach out if you have questions. Please follow me on Twitter @eastshade!