This article is the third and final part of a three-part series on machine learning for game devs. If you’re new to the series, start here with Part 1.
We have also added a video covering all three parts of this series to the SEED YouTube channel! Watch it here.
By Alan Wolfe – SEED Future Graphics Team
The C++ recipe used in this article is an adaptation of a Python recipe shared in Michael A. Nielsen’s “Neural Networks and Deep Learning.”
You will need the following to execute the C++ recipe:
We recommend cloning the full repo to your computer for easy access to everything.
Let’s start by examining the topology of our network.
The images in our data set are 28x28, or 784 pixels per image. That means our input layer will have 784 neurons.
We will also have one hidden layer with 30 neurons, plus 10 output neurons — one for each digit 0 through 9. Whichever output neuron has the highest value will be the number detected.
Multiplying 784 * 30 gives us 23,520 connections between the input and hidden layers. Multiplying 30 * 10 gives us 300 connections between the hidden and output layers. There are also 30 bias values from the hidden layer and 10 from the output layer. Adding those together gives us a total of 23,860 trainable parameters.
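To make that bookkeeping concrete, here's a tiny C++ sketch of the same arithmetic (the constant names are illustrative, not taken from the repo):

```cpp
#include <cstdio>

// Network topology used in this article (constant names are illustrative).
static const int c_numInputNeurons  = 28 * 28; // 784 pixels per MNIST image
static const int c_numHiddenNeurons = 30;
static const int c_numOutputNeurons = 10;      // one per digit, 0 through 9

int main()
{
    // Weights: one per connection between adjacent layers.
    const int inputToHiddenWeights  = c_numInputNeurons  * c_numHiddenNeurons; // 23,520
    const int hiddenToOutputWeights = c_numHiddenNeurons * c_numOutputNeurons; // 300

    // Biases: one per hidden neuron and one per output neuron.
    const int biases = c_numHiddenNeurons + c_numOutputNeurons; // 40

    const int totalParameters = inputToHiddenWeights + hiddenToOutputWeights + biases;
    printf("Trainable parameters: %d\n", totalParameters); // prints 23860
    return 0;
}
```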
Now, instead of using the sign neuron activation function we talked about in previous parts of this series, we’re going to use the sigmoid function. The sigmoid function takes arbitrary values and scales them into the 0 to 1 range, just like a tone mapper, and for the same reasons.
The sigmoid activation function, and derivative
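In C++, the sigmoid and its derivative are only a few lines. Here's a minimal sketch (function names are illustrative):

```cpp
#include <cmath>

// Sigmoid squashes any input into the (0, 1) range.
float Sigmoid(float x)
{
    return 1.0f / (1.0f + std::exp(-x));
}

// The derivative can be written in terms of the sigmoid's own output, s * (1 - s),
// which is convenient during backpropagation because the activations were
// already computed on the forward pass.
float SigmoidDerivative(float sigmoidOutput)
{
    return sigmoidOutput * (1.0f - sigmoidOutput);
}
```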
When neural networks are successfully trained on a data set, there is a risk that they've simply memorized the data instead of learning something that generalizes. The MNIST data set uses a common solution for detecting this problem: it contains 60,000 images for training and 10,000 images for testing. If the network does well on the training data, you can then evaluate it on the testing data (which it has never been trained on) to see whether it can generalize what it's learned.
Example MNIST image data.
For this training we will be using:
The network cost function will be the sum of the output neuron costs. The output neuron costs will be 1/2 (desiredOutput − actualOutput)². The derivative of that cost function ends up being (actualOutput − desiredOutput).
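In code, the per-neuron cost and its derivative are tiny functions. Here's a minimal C++ sketch (function names are illustrative):

```cpp
// Cost of a single output neuron: 1/2 * (desired - actual)^2.
float NeuronCost(float desiredOutput, float actualOutput)
{
    float error = desiredOutput - actualOutput;
    return 0.5f * error * error;
}

// Derivative of that cost with respect to the actual output.
// The 1/2 cancels the 2 from the power rule, leaving (actual - desired).
float NeuronCostDerivative(float desiredOutput, float actualOutput)
{
    return actualOutput - desiredOutput;
}

// The network cost is the sum of the output neuron costs.
float NetworkCost(const float* desired, const float* actual, int count)
{
    float cost = 0.0f;
    for (int i = 0; i < count; ++i)
        cost += NeuronCost(desired[i], actual[i]);
    return cost;
}
```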
After training the network using the MNIST training data, the network achieved a 95.27% accuracy on the 10,000 testing data images. That means it gets about 1 in 20 wrong. More sophisticated neural networks can do better than that, but this accuracy rate is fine for an introductory project like ours.
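Measuring that accuracy is straightforward: run each test image through the network, take the output neuron with the highest value as the predicted digit, and compare it against the label. Here's a sketch of that loop, where EvaluateNetwork() is a hypothetical stand-in for however your implementation runs one image through the network:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical: EvaluateNetwork() stands in for running one 784-pixel image
// through the trained network and returning the 10 output neuron values.
std::array<float, 10> EvaluateNetwork(const std::array<float, 784>& pixels);

// Count how many test images the network classifies correctly.
int CountCorrect(const std::vector<std::array<float, 784>>& testImages,
                 const std::vector<int>& testLabels)
{
    int correct = 0;
    for (std::size_t i = 0; i < testImages.size(); ++i)
    {
        std::array<float, 10> output = EvaluateNetwork(testImages[i]);

        // The predicted digit is the index of the output neuron with the largest value.
        int predicted = int(std::max_element(output.begin(), output.end()) - output.begin());

        if (predicted == testLabels[i])
            correct++;
    }
    return correct; // accuracy = correct / 10,000 on the MNIST test set
}
```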
We ran all three gradient calculation methods on the CPU to get the following timing comparisons:
Finite Differences:
Dual Numbers:
Backpropagation:
As you can see, backpropagation is the clear winner! If we were to run this same test on the GPU instead of the CPU, backpropagation would win by a similarly substantial margin.
This demonstration uses source code from our GitHub repo:
https://github.com/electronicarts/cpp-ml-intro/tree/main/Demo
Below is a render graph diagram showing how the demo works.
The blue nodes are resources and the orange nodes are actions. Here’s what they do:
The neural network weights are 93KB of floats, and come into the shader as an SRV buffer (NN Weights in the diagram above). Hidden Layer Activations is a buffer that holds 30 floats — one for each hidden layer neuron. Output Layer Activations is a buffer that holds 10 floats — one for each output layer neuron.
Below is the whole compute shader which evaluates the hidden layer. It’s very fast, but could probably be sped up by using group shared memory or actual matrix multiplies.
Evaluating the hidden layer.
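If you'd rather follow the math without reading HLSL, here is a plain C++ sketch of the same evaluation the shader performs. The weight and bias layout shown here is an assumption for illustration, not necessarily the repo's exact buffer layout:

```cpp
#include <cmath>

// Each hidden neuron computes a weighted sum of all 784 input pixels,
// adds its bias, and runs the result through the sigmoid.
void EvaluateHiddenLayer(const float inputPixels[784],
                         const float weights[784 * 30],
                         const float biases[30],
                         float hiddenActivations[30])
{
    for (int neuron = 0; neuron < 30; ++neuron)
    {
        float sum = biases[neuron];
        for (int input = 0; input < 784; ++input)
            sum += inputPixels[input] * weights[neuron * 784 + input];

        hiddenActivations[neuron] = 1.0f / (1.0f + std::exp(-sum));
    }
}
```

In the compute shader, one thread can handle each hidden neuron, and the inner loop over the 784 inputs is the part that group shared memory or real matrix multiplies could accelerate.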
When running the demo on an RTX 3090, the entire render graph executes in about half a millisecond, with the neural network part coming in at only 283 microseconds. The important takeaway here is that while training can take a long time, using the resulting network (inference) can be very fast. Training can be a tool-time operation, much like baking lighting. And if you've ever baked lighting, you'll know that while it can take a frustrating amount of time, it's not a runtime cost. You just don't want to "rebake" frequently.
Now let’s take a look at the demo below.
To the right of the green drawing area, you'll see a small number 2. This is the input layer of the network, arranged as a 28x28 image. To the right of that small image are 30 boxes stacked in a vertical line. These are the hidden layer neurons, where black is 0, white is 1, and gray is somewhere in between. All of these went through a sigmoid activation function, so they're conveniently tone-mapped to the display!
To the right of our hidden layer neuron boxes are our output layer neuron boxes. As you can see, the output neuron with the largest value (brightest white) is the one lit up in yellow.
This technique is fast, but it is still only a single invocation of a neural network. If we wanted to run a neural network per pixel, this network would be out of budget, even if we ran it at a smaller resolution. Doing per-pixel neural network operations thus presents us with a challenge: making a network that is small enough to fit in memory and perf budgets, but still large enough to be useful.
Neural networks can find solutions to problems where no good algorithm is yet known, such as machine vision or language tasks. However, not all problems are best solved by a chain of matrix multiplies (or convolutions for CNNs), interleaved with nonlinear functions.
At SIGGRAPH 2023, there was a paper that was competitive with the other NeRF papers but didn't use neural networks. Gradient descent was used to adjust 3D Gaussian splats in world space to match photographs of a 3D scene, reconstructing the original scene so that novel viewpoints can be rendered. Besides training quickly, the splats can be rasterized pretty efficiently, so the technique renders in real time. They do take up quite a bit of memory, though. You can find more details — plus some traditional game dev compression tricks — at https://aras-p.info/blog/2023/09/13/Making-Gaussian-Splats-smaller/
There are several works of research that try to break open the black box of successful neural networks to understand what was learned. One paper talks about how to convert a neural network to a decision tree.
While training our network on MNIST only took 90 seconds, training larger, state-of-the-art networks can take days or weeks, even with a cluster of high-end computers! The good news is that because game development teams often already have both a light baking process and a cluster of machines with nice GPUs, the same setups can be used for ML training. The bad news is they would then be competing for the same resources!
Have you heard the term “deep learning” come up in discussions around ML?
From the 1980s until the 2000s, we were in an “AI Winter” where machine learning faced a plateau of advancement it couldn’t overcome. One of the problems was that if you had too many layers in a network, the gradients would go to zero (vanishing gradients) or infinity (exploding gradients). “Deep Learning” was the term used to describe the goal of training a deep neural network with many layers. Researchers finally achieved this by changing how training worked, using different activation functions and different network architectures to allow derivatives to make it through deeper networks.
If you’ve made it this far, congratulations!
You should now have a solid understanding of the basics of machine learning and differentiable programming, and should be able to start experimenting with them on your own.
You may also be wondering whether using plain C++ to do machine learning is a good idea. As far as training goes, it would probably make the most sense to go with something like PyTorch and use the "industry standard" tools. Going this route would give you access to the latest and greatest algorithms and allow you to more easily find people to help you when you get stuck or run into problems. It's also easier to hire for.
Runtime is a different story though. There are inference libraries and middleware to do inference in C++, but these tend not to be appropriate for use by game dev out of the box. In game development, we need to use our own allocators, we usually need exceptions turned off, we need to manage our own threads, we would prefer not to use the STL, we have sub-millisecond budgets, and so on. These requirements are not common outside of game development, and game developers don’t make up the largest group of coders out there, so the best practice at the moment seems to be to roll our own runtime inference.
NVIDIA did recently release differentiable Slang, however. Slang is a shader language that extends HLSL and can be compiled down to plain HLSL and other shader languages. The differentiable portion also survives this transformation. Differentiable Slang opens the door to doing learning — not just inference — at runtime on the GPU, in DX12, Vulkan, and even on consoles.
Want to continue exploring gradient descent and machine learning topics? Here are a few breadcrumbs to follow as you continue on your journey: