What is the graphics tech stack of modern video games?
It's been more than seven years since the release of War Robots. In 2014, the mobile gaming market was smaller than it is now, and even fewer devices could handle 3D graphics without problems. Although the shader revolution had already happened and vertex and fragment shaders had replaced the fixed-function pipeline, mobile hardware still fell far short of desktop-level graphics quality.
At the base of any graphics system lies a set of standardized methods that wrap calls to the video driver, which in turn sends commands to the graphics processor. In other words, it is a contract describing how to use the hardware: a graphics application programming interface (API). During the original War Robots' development, the only mobile graphics API was OpenGL ES 2.0. Its next version, OpenGL ES 3.0, had only just been announced, and no devices supported it yet. Thus, the graphics of Walking War Robots were built on the technologies and capabilities provided by OpenGL ES 2.0.
The original War Robots had a typical graphics stack supporting one directional light source, with shading computed through a simple material system. At the same time, the game made some rather bold decisions. For example, War Robots already rendered its terrain with a splat map; together with smooth blending, this allowed drawing large, complex surfaces without visibly repeating small patterns. However, not all popular techniques of the time found a place in the project: there was no light baking (lightmaps and light probes), nor a multi-material system.
War Robots' graphics improved over time, but development was soon constrained by OpenGL ES 2.0. For example, working in a linear color space was necessary to use the full potential of physically based rendering (PBR), and that was an unsolvable problem for GLES 2.0. The moment came when we needed a comprehensive, revolutionary approach for a bold graphics breakthrough.
Prototype for Steam, or the First Iteration of the Remastered Version
Despite popular belief, one does not simply improve game graphics by cranking the settings to the maximum. It is an even greater challenge to improve the graphics of a game that has been live for many years, constantly changing and developing. At the very least, you must consider players' expectations, the best of the previous solutions, the existing gameplay and atmosphere, and many details that are far from obvious. A long history of active operation also means an impressive amount of code (the so-called legacy). Over the years of War Robots' existence, the game accumulated many highly non-obvious solutions: some elegant, unique, and necessary, others nontrivial workarounds for technical problems that came with each new version of Unity.
Because of that, we couldn't define the quantity and complexity of the work precisely, and the uncertainty of the final result made the problem even worse. Consequently, it would have been reckless to immediately throw significant resources into full-fledged preproduction: developing a vision, writing documentation, prototyping, and so on. Given the tight schedule of features and releases, it was simply unrealistic, because you can't leave players without new game versions for something so vague.
Fortunately, we had an opportunity to test our ideas and go over many details with minimal involvement of the core development team. We decided to experiment on the Steam version of War Robots. It was well suited for this purpose due to its similar code base and art and its relatively small number of players. Those players' noticeably more powerful hardware made it possible not to worry about optimization and to focus on the graphics instead.
The idea of the experiment included the following:
- We decide how deeply and by what means we improve the picture.
- We rework the graphics within one game map (renderer, models, textures, shaders, lighting, shadows, processing, effects, animation, etc.) so the picture comes to life.
- We define the scope of work on the complete graphics improvement.
- As a result of the experiment, we decide whether such an upgrade is necessary at all.
As you can see, most stages of this task could be handled by a single technical artist, and some of the results could be reused anyway.
We chose the Canyon map because:
- There is not much geometry on it.
- It is not the first map a new player encounters.
- The potential for improvement is significant.
Before the work began, we converted the project's color space to linear. We could have achieved almost realistic graphics in the sRGB (gamma) color space, but there would have been so many difficulties and inconveniences at every stage that it would only make sense if you had to support devices limited to OpenGL ES 2.0, which are entirely outdated today.
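For reference, the standard sRGB transfer functions (taken from the IEC 61966-2-1 specification, not code from the game itself) can be sketched in a few lines:

```python
def srgb_to_linear(c: float) -> float:
    """Convert an sRGB-encoded channel value in [0, 1] to linear light
    (piecewise curve from the IEC 61966-2-1 sRGB specification)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c: float) -> float:
    """Inverse transform: linear light back to sRGB encoding."""
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
```

Lighting math (additions, multiplications, blending) only behaves physically plausibly on the linear side of this transform, which is why the conversion had to happen before everything else.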
To get the first quick results, we replaced part of the environment geometry with ready-made assets and modified the textures of one robot to technically correct PBR. We also made a few lighting fixes and a minor revision of a couple of shaders, and the scene started to look a little more modern:
This screenshot also lacks dynamic range. The lighting is too flat, as if on a cloudy day or in an old game. In reality, the luminance of the bright areas of such a scene is thousands or even tens of thousands of times higher than the luminance of the shadowed areas and may reach hundreds of thousands of lux. Render buffers in formats like R8G8B8A8_UNorm (8-bit unsigned integer per channel) cannot represent such a range (or any value above 1.0), which is why PBR practically requires rendering in a high dynamic range. For this, we use floating-point buffers of various precisions, from R32G32B32A32_FLOAT down to R11G11B10_FLOAT.
Accuracy of the data representation of different bit formats:
The table shows that, for example, the R32G32B32A32_FLOAT format offers about 7.22 significant decimal digits of precision and an enormous maximum value, but it requires four times more storage than R8G8B8A8_UNorm. In practice, the maximum value and precision of R16G16B16A16_FLOAT are enough to store frame values across a wide dynamic range, and we used it in the Steam version. Moreover, in most cases and with proper handling, the precision of R11G11B10_FLOAT is sufficient. This format takes up half as much memory as R16G16B16A16_FLOAT and fits into the same bit budget as R8G8B8A8_UNorm. The absence of a fourth channel and the possibility of a slight color shift due to the lower precision of the blue channel do not matter much in practice, so in the mobile version we use it whenever possible.
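To get a feel for these precision trade-offs, here is an illustrative sketch (not production code) that rounds a value to a given number of mantissa bits, roughly mimicking the channels of the small-float formats, and shows how an 8-bit UNorm channel simply clamps HDR values:

```python
import math

def quantize_float(value: float, mantissa_bits: int) -> float:
    """Round a positive value to a float with the given number of explicit
    mantissa bits (ignoring exponent-range limits). The 10-bit channel of
    R11G11B10_FLOAT has 5 mantissa bits, the 11-bit channels have 6, and
    half floats (R16G16B16A16_FLOAT) have 10."""
    if value == 0.0:
        return 0.0
    m, e = math.frexp(value)          # value == m * 2**e, with m in [0.5, 1)
    scale = 2 ** (mantissa_bits + 1)  # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

def unorm8(v: float) -> float:
    """An 8-bit UNorm channel: clamps to [0, 1], then snaps to 1/255 steps,
    so any HDR value above 1.0 is simply lost."""
    return round(min(max(v, 0.0), 1.0) * 255) / 255
```

For a luminance of 1000, a half-float channel stores the value exactly, while a 5-bit-mantissa channel lands on the nearest representable step (992 here): a relative error on the order of 1%, which is usually invisible, especially in the blue channel.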
After rendering a frame and post-processing the resulting HDR image, we must map the range back to values that can be displayed on the screen. For this, we apply tone mapping. In both the Steam and mobile versions, we use the ACES tone mapping algorithm, which produces a cinematic picture with slightly burned-out highlights in bright areas.
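The exact curve used in the game is not shown here, but a widely used single-expression fit of the ACES filmic curve (Krzysztof Narkowicz's approximation) illustrates the idea:

```python
def aces_tonemap(x: float) -> float:
    """Narkowicz's curve fit of the ACES filmic tone mapper: maps an HDR
    luminance value in [0, inf) to a displayable value in [0, 1]."""
    a, b, c, d, e = 2.51, 0.03, 2.43, 0.59, 0.14
    return min(max((x * (a * x + b)) / (x * (c * x + d) + e), 0.0), 1.0)
```

The curve is nearly linear in the shadows and rolls off smoothly toward 1.0, which is exactly what produces the "slightly burned-out" cinematic highlights.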
The landscape received the most changes in the scene. We could not affect gameplay, so the terrain geometry in the playable area had to stay exactly the same. To achieve this, we measured a heightmap of the playable area with an accuracy of ±4 cm, and then procedurally generated a new landscape on its basis, changing only the background. Of course, we also redrew the terrain textures, generated the masks, added tessellation, poured in a stream and then dried it up. Dynamic clouds appeared in the sky, dust and haze in the air, and the bridge in the center of the map was redesigned once more (but not for the last time). It was enough to understand how to proceed.
Robots received improvements too: more than a dozen popular robots were remade. First of all, the robots that didn't need significant geometry changes got a brand-new individual skeleton on a generic rig instead of the standard Humanoid skeleton, as well as a unique animation set. Textures originally drawn in Painter were modified to fully correct PBR. In other cases, we redrew textures from scratch based on the existing ones or used the original masks (if any).
The lack of dedicated jumping and landing animation clips also caused problems. To get rid of the Styrofoam-clown effect, the robots received a set of animation clips for preparing to jump, preparing to land, and the landing itself, along with full logic for selecting and playing them. They also got effects for engine ignition and subsequent smooth cooling, and even nozzles that change direction to match the jump. Not that we needed all that for the experimental version, but once you get locked into serious attention to detail, the tendency is to push it as far as you can.
For fun, we redesigned the inverse kinematics system for correct leg placement on slopes. Previously we had used it only on four-legged robots, but now we applied it to all remastered two-legged robots.
Also, almost two dozen weapons were remastered. Here is what we did:
- Replaced textures with PBR.
- Changed some animations.
- Modified the FX shaders.
- Added light sources for shots and explosions.
In addition, while remastering the Canyon scene, it became clear that we couldn't leave the rest of the scenes, robots, and weapons as they were. At the very least, the lighting had to be adjusted after the transition to a linear color space. Besides, the difference between the redesigned Canyon map and the others was too striking, so we decided to modify all the other maps at least minimally.
For each scene, we generated a new set of textures based on the old one: higher resolution, including all the maps needed for correct PBR shading. Some of the new textures were hand-tweaked to better represent the physical properties of the materials, so the final result was quite accurate given its semi-automatic nature. We also adjusted the light sources, post-processing, and other lighting-related parameters of the scene and environment on each map. On some maps we modified models, placed additional light sources, and replaced the sky with a dynamic one. At the final stage, with the artists' help, we remade the landscape in some scenes and, finally, enabled real-time global illumination on all maps and baked the lights. The result was not striking, of course, but at least the difference from Canyon was no longer so huge.
Before and after comparison:
For the remaining robots and other assets, we created a shader that converts the properties from the old texture set (diffuse, specular, and normal maps) to those needed for a full-fledged PBR calculation, using some heuristics.
The experimental version for Steam was released at the end of 2018. Based on it, we made the following conclusions:
- Do not underestimate the negative user reaction to the increased hardware requirements.
- Approach the familiar robot animations carefully, because people can spot neurological diseases by gait. Especially Griffin's.
- We don’t need a motion blur in our genre.
- Physically correct lighting is good, but we should approach lighting scenes with the contrast effect on gameplay in mind.
- Bright flashes from explosions and gunfire interfere with the gunfire and explosions themselves.
- We need a remaster for mobile platforms.
Transferring the Remaster to Mobile Devices
After the Steam remastered version was ready, we got an even more ambitious task and had to transfer the desktop-level graphics from Steam to mobile devices.
Dynamic ocean surface rendering, water mirrors with reflections and refractions, and extended terrain layer blending require not only modern graphics API capabilities but also serious computing power. At the same time, the remaster had to run as well as the original War Robots on three- or even five-year-old devices. As a result, we faced a dilemma: how do we implement desktop-level graphics on mobile devices without reducing overall graphics performance?
The compromise was to divide devices by performance level and prepare several quality presets. Each preset has its own content set (the number and resolution of textures, object polygon counts, the density of game-level filling) and its own stack of graphics solutions (lighting model, object rendering detail, post-processing complexity). Thus, the entire range of mobile devices was divided into three categories: HD, LD, and ULD.
Modern top-end mobile devices can easily handle the high-load graphics stack; they form the HD quality preset group. The most popular segment, mid-range devices, maps to the LD preset. The ULD preset was made specifically for devices that could barely handle even the original game. Thus, we formed three groups of devices, each with its own reference device (or graphics processor), and prepared the content and configured the graphics stack for each group.
Target devices for different quality presets:
But the variation in performance within each group was still quite broad and required finer adjustment for comfortable gaming. Therefore, we began scaling the scene rendering resolution to smooth out frame rate fluctuations. Changing the resolution directly affects the GPU load and can vary in real time depending on game events and the history of the last few frames. Multisample anti-aliasing (MSAA) effectively masks the moments when the resolution changes, reducing frame aliasing and blurring.
Thus, the remaster received a two-stage performance scaling scheme:
- At the first stage, there is a rough adjustment, and we select content and a graphic stack of varying complexity (one of three quality presets).
- At the second stage, we scale the rendering resolution for fine-tuning, providing smooth and comfortable gameplay.
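As an illustration of the second stage, a toy resolution controller might look like this (the thresholds, step size, and class name are invented for the example; the game's actual heuristics are more involved):

```python
from collections import deque

class ResolutionScaler:
    """Toy dynamic-resolution controller: keeps a short history of frame
    times and nudges the render scale toward the target frame time."""

    def __init__(self, target_ms: float = 33.3,
                 min_scale: float = 0.5, max_scale: float = 1.0):
        self.target_ms = target_ms
        self.min_scale, self.max_scale = min_scale, max_scale
        self.scale = max_scale
        self.history = deque(maxlen=4)   # "last few frames", as in the text

    def on_frame(self, frame_ms: float) -> float:
        self.history.append(frame_ms)
        avg = sum(self.history) / len(self.history)
        if avg > self.target_ms * 1.1:    # consistently over budget: shrink
            self.scale = max(self.min_scale, self.scale - 0.05)
        elif avg < self.target_ms * 0.9:  # comfortably under budget: grow
            self.scale = min(self.max_scale, self.scale + 0.05)
        return self.scale
```

Averaging over a small window instead of reacting to single frames avoids oscillating on one-off spikes, and MSAA hides the remaining visible steps when the scale changes.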
OpenGL ES vs. Metal & Vulkan. What is Better?
To work with the updated content, the graphics pipeline required support for features such as multi-pass rendering, texture array fetching, and writing to floating-point color buffers. The original War Robots API, OpenGL ES 2.0, supported none of this. Hence, we needed to switch to a more modern graphics API, gaining both usability and performance.
On iOS, the obvious choice was the Metal API, which was supported on 99.2% of devices and fully met our requirements. In addition, replacing OpenGL ES with the Metal API is recommended by Apple. It has much more flexible functionality and high performance, as well as convenient debugging and profiling tools.
On Android, there is a choice between two graphics APIs: OpenGL ES 3.0+ and Vulkan. The first significantly improved on version 2.0's functionality, has well-debugged drivers, and works on 97.1% of user devices; still, it is quite dated and its prospects are cloudy. The alternative is Vulkan, a low-level graphics API ideologically close to Metal.
Experiments with Vulkan on mobile devices confirmed that the remaster's graphics pipeline ran more smoothly than on OpenGL ES. At the same time, the availability of the Vulkan API and its stable functioning are confirmed on only 63% of devices, which makes relying on it too risky. Unity's Scriptable Render Pipeline has recently gained the RenderPass/SubPass model, which unlocks the full potential of the Vulkan API on mobile platforms; unfortunately, at the time of the remaster's release this model did not exist yet, so Vulkan's performance compared to OpenGL ES 3.0 was mixed and differed from device to device.
As a result, we chose OpenGL ES 3.0 as the graphics API on Android; Vulkan support remains experimental.
The Stages of Rendering: Shadows, Lighting, Post-Processing
The two-stage performance scaling scheme and new modern graphics APIs laid the foundation for a flexible and efficient graphics pipeline. As a result, a single multi-pass rendering pipeline is used for all quality presets (ULD, LD, HD); the presets differ in their input data and processing options. Texture resolution, geometry detail, shader complexity, and render pass configuration all vary between presets.
Differences between texture sets in presets:
Technologically, each frame rendering consists of three main stages: a shadow pass, a lighting pass, and a post-processing pass.
Differences in rendering passes in presets:
The shadow pass uses two shadow maps. The first is a cascade that is updated every frame and includes objects in the camera's field of view at close range. For objects far from the camera, a global shadow mask is used, with less precise umbra and penumbra boundaries. When calculating lighting, samples from both shadow maps are blended for a subtle, smooth transition and then used for shading. Thus, each frame we process only a small subset of objects near the camera, significantly reducing the geometry processing load, while the pre-calculated shadow map lets us display shadows from objects across the entire scene.
While rendering geometry into the shadow map, the order of the triangle indices is reversed for more efficient culling; this is called front-face culling. As a result, the load on the rasterizer drops significantly, because about 90% of the triangles are discarded at an early stage. The resulting shadow map is also relatively sparse, which further improves performance when fetching and comparing values from it. Below you can compare shadow maps produced by classic and inverted rendering.
Shadow maps with back-face (left) and front-face (right) triangle culling:
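The winding-order mechanics behind face culling can be shown with a tiny sketch: the sign of a screen-space triangle's signed area encodes its winding, and reversing the index order flips it, so a rasterizer configured to cull front faces discards exactly the triangles it previously kept:

```python
def signed_area(p0, p1, p2):
    """Twice the signed area of a 2D (screen-space) triangle; the sign
    encodes its winding order, which is what face culling tests."""
    return ((p1[0] - p0[0]) * (p2[1] - p0[1])
            - (p2[0] - p0[0]) * (p1[1] - p0[1]))

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # counter-clockwise winding
assert signed_area(*tri) > 0
# Reversing the index order flips the winding to clockwise:
assert signed_area(*reversed(tri)) < 0
```

In the shadow pass this means the sun-facing front surfaces are rejected before rasterization, and only the back surfaces, the ones that actually bound the shadowed volume, reach the map.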
The lighting pass calculates the illumination intensities of objects in a scene from a single primary light source. Its contribution traditionally includes two components:
- Direct lighting, calculated in real-time.
- Indirect lighting, selected from either pre-calculated lightmaps for static objects or nearby light probes for dynamic objects.
For both types of lighting, we render in a high dynamic range with a per-fragment evaluation of the bidirectional reflectance distribution function (BRDF).
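The article does not spell out the exact BRDF, but a common choice for a physically based model of this kind combines a Lambertian diffuse term with a GGX specular lobe; the core terms can be sketched as:

```python
import math

def ggx_ndf(n_dot_h: float, roughness: float) -> float:
    """GGX/Trowbridge-Reitz normal distribution term, the heart of the
    specular lobe in most physically based BRDFs. n_dot_h is the cosine
    between the surface normal and the half-vector."""
    a2 = (roughness * roughness) ** 2
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)

def lambert_diffuse(albedo: float) -> float:
    """Energy-conserving Lambertian diffuse term (albedo / pi)."""
    return albedo / math.pi
```

The distribution peaks when the half-vector aligns with the normal and falls off faster for smoother surfaces, which is what produces tight, bright highlights on low-roughness materials.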
We use the same physically based model when calculating lighting for every object in the scene. This lets us store material parameters in texture arrays shared by all objects: albedo, smoothness, metalness, normal, and ambient occlusion. As a result, only five main shaders render the entire scene: the mechs/drones/equipment shader, the terrain shader, the decal shader, the unique props shader, and the detail props shader.
Some shaders use partial derivatives of the heightmap instead of the usual tangent-space normal map, and the terrain shader uses a heightmap instead of a metalness map.
The post-processing pass compresses the wide dynamic range into one suitable for display. It also includes tone mapping, color grading, a vignette, and, if necessary, conversion from linear to sRGB color space. Then the user interface (UI) elements are drawn on top of the resulting image, and finally the frame is output to the screen.
Not Only an Improved Picture but Also Better Performance: How to Reduce Frame Render Time and Offload the CPU
Modern graphics APIs make it possible to improve both image quality and performance: they significantly reduce render time and CPU load and provide a smooth, stable frame rate.
A wide variety of materials in a scene usually means frequent resource swaps between draw calls. Since the GPU cannot render while resources are being bound, stalls occur and the hardware sits underutilized. To avoid this, draw calls with the same materials are traditionally grouped for batching. However, if every material in the scene is unique, this grouping approach is ineffective: resources must be rebound on virtually every call. Therefore, the number of unique materials in a scene is usually capped, which restricts level design decisions and content preparation.
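The effect of grouping can be illustrated with a toy model that counts material rebinds across a sequence of draw calls (the material names are invented for the example):

```python
def count_bindings(draw_calls):
    """Count how many times the bound material changes across a sequence
    of draw calls; each call is represented by just its material id."""
    bindings, current = 0, None
    for material in draw_calls:
        if material != current:
            bindings += 1
            current = material
    return bindings

calls = ["rock", "mech", "rock", "mech", "rock", "mech"]
assert count_bindings(calls) == 6          # interleaved: rebind every call
assert count_bindings(sorted(calls)) == 2  # grouped by material: two binds
```

Sorting only helps while materials repeat; with all-unique materials the count equals the call count no matter the order, which is exactly the problem the remaster's approach below sidesteps.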
The remaster graphics pipeline imposes no restrictions on the number of unique materials in the scene, since it does not require per-material resource binding. All textures from unique materials are collected into texture arrays, and the additional parameters are combined into a single uniform buffer.
Scheme for combining materials (constants and textures):
Such "assembled" resources are bound once, together with the corresponding shader, and do not change across all of that shader's draw calls. The required material is selected directly in the shader: the corresponding layer of the texture array is addressed by index, and the corresponding element of the constant buffer through a computed offset. Thus, the number of resource bindings is strictly equal to the number of unique shaders used in the scene. In the remaster, each scene uses an average of 12 shader variants, derived from the five main shaders through keywords. As a result, 300-400 draw calls per frame require about 20 resource bindings on average, so frame preparation time does not depend on the number of unique materials.
The technical implementation of this solution requires two features from the graphics API: texture array fetching and dynamic uniform buffers. The remaster's graphics pipeline fully supports SRP batching on Metal and Vulkan, and partially on OpenGL ES 3.0. Remember that texture arrays require the same format, resolution, and number of mip levels in every layer, and dynamic constant buffers require strict adherence to device-specific alignment when setting offsets: 16 bytes for most Mali GPUs and 64 bytes for Adreno GPUs.
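Computing dynamic-offset strides under these alignment rules is a one-liner; here is a sketch (the 80-byte record size is an invented example):

```python
def aligned_offset(offset: int, alignment: int) -> int:
    """Round a byte offset up to the device's minimum uniform-buffer
    offset alignment (e.g. 16 on many Mali GPUs, 64 on Adreno)."""
    return (offset + alignment - 1) // alignment * alignment

# Packing per-material records of 80 bytes into one dynamic buffer:
stride_mali = aligned_offset(80, 16)    # already a multiple of 16 -> 80
stride_adreno = aligned_offset(80, 64)  # padded up to 128
```

Using the worst-case alignment across target GPUs as the single stride keeps one buffer layout working everywhere, at the cost of some padding on the more lenient devices.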
Mobile platforms have little video memory and limited power capacity compared to other platforms. To address these two issues, virtually all modern mobile GPUs use a tile-based architecture. Its key features are the following:
- To minimize pixel overdraw, the geometry is pre-distributed over small areas in screen space (tiles).
- Each tile is processed independently, localizing rendering to a small screen area.
- While processing a tile, its geometry is first sorted and then rendered, achieving the same effect as a preliminary depth pass (z-prepass) on desktop platforms.
Note that using the features of the tile architecture requires no actions or settings through the graphics API; the video driver applies them implicitly.
Another important advantage of tile architectures is tile memory. It is located directly on the GPU and makes it possible to avoid energy-intensive export/import of intermediate data to video memory. Typical scenarios for using tile memory are:
- Multi-sample anti-aliasing resolving without an additional pass.
- Creating "virtual" buffers with memoryless attachments.
- Using transient attachments between render passes.
Modern APIs (Metal/Vulkan) provide explicit methods for working with tile memory, while OpenGL ES 3.0 only lets you give the video driver certain hints. In the remaster, we use memoryless MSAA without a separate resolve pass on all supported graphics APIs.
Debugging and Diagnostics: How to Track that Everything is OK
Any high-load project requires constant performance monitoring; ideally, graphics profiling should be planned as soon as the graphics stack takes shape. So, while working on the remaster, we updated our automatic benchmarks, which measure performance while rendering scenes, mechs/equipment, and FX. This let us quickly identify and respond to regressions and, when necessary, study the causes of performance drops more closely.
We use tools such as Qualcomm Snapdragon Profiler, ARM Graphics Analyzer, and Xcode Frame Debugger. With their help, we diagnose problems specific to individual GPUs that may tangibly affect performance:
- GPU register spilling, typical for complex calculations.
- Unnecessary import/export of data from tile memory to global memory when working with full-screen anti-aliasing and frame buffer attachments.
- Scene geometry made of long, thin triangles, which is unfriendly to tile architectures.
- Excessive use of transparent and translucent materials, which significantly increases pixel overdraw.
- An imbalance between arithmetic and read/write operations, leading to incomplete GPU utilization.
- Inefficient use of the L1/L2 data caches for texture prefetching, increasing data access latency.
These are the basic performance metrics of the various GPU units and the most useful ones for monitoring efficient load balancing. Regular profiling allowed us to identify bottlenecks and prevent performance drops from the very beginning of the work on the remaster.
ARM Graphics Analyzer and Qualcomm Snapdragon Profiler:
In addition to monitoring how efficiently specific mobile hardware works, it is equally important to pay attention to graphics API calls when working with Unity. We use RenderDoc to inspect GLES 3.0 and Vulkan, and the Xcode Frame Debugger for Metal. With their help, you can check:
- The structure of the rendered frame (shader setup and resource binding).
- Intermediate graphics pipeline state (depth test settings, alpha blending parameters, and frame buffer switches).
- Resource consistency (format, resolution, compression).
Such inspection of graphics API calls turned out to be essential for debugging various crashes and freezes across a wide range of mobile devices.
Xcode GPU Frame Debug and RenderDoc:
Summing Up: Our Experience of Working on a New Graphics Pipeline
Now that War Robots Remastered has been released, we can say that updating the game's graphics was fascinating work. Experiments with the capabilities of mobile GPUs allowed us to find solutions that scale well across a wide range of devices:
- Texture arrays and combined constant buffers made it possible to significantly reduce the number of draw calls and remove the limitation on the maximum number of unique materials in the scene.
- Floating-point color buffers opened up new possibilities for computing physically correct lighting in a wide dynamic range and applying post-processing effects.
- Quality presets and dynamic resolution brought the project's visuals to a new level on modern devices while ensuring good performance on devices four to five years old.
Working on this project gave us an invaluable and unique experience in preparing content, reworking the graphics pipeline, developing test methods, debugging, and improving performance.