Here I attempt to provide a high-level overview of how Neural Networks learn over time. There are many variations on the techniques I discuss, and the theory of some newer architectures far surpasses the methods we are going to cover. However, the goal here is to provide some familiarity for the process by which data moves through a neural network and the process of how a network adjusts its own parameters to better fit the data.
If you’re familiar with machine learning algorithms and implementing them with tools like Scikit-Learn or XGBoost, you might remember the model.fit(X, y) step, or the “training” step. At a high level, this is where the algorithm is fed data and coefficients are developed in order to produce a model capable of making predictions on new data.
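To make the idea of “coefficients developed from data” concrete, here’s a minimal stand-in for that fit step. It uses a plain NumPy least-squares solve rather than an actual Scikit-Learn model, and the data is made up, but the spirit is the same: hand it X and y, get back coefficients you can use to predict.

```python
import numpy as np

# A minimal stand-in for the "training" step: develop coefficients w
# so that X @ w approximates y, analogous in spirit to model.fit(X, y).
# (This is a least-squares solve, not an actual Scikit-Learn model.)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])     # a bias column plus one feature
y = np.array([2.0, 3.0, 4.0])  # targets happen to follow y = 1 + 1 * x

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # "fit": solve for coefficients
pred = X @ w                               # "predict": apply them to data

print(w)  # coefficients recovered from the data, approximately [1.0, 1.0]
```

The neural-network version of this step is far more elaborate, but the contract is identical: data in, learned parameters out.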
In neural networks, this training phase occurs gradually over a number of rounds called epochs (pronounced “epics”). You can think of each epoch as one round of training. In each epoch, training data is passed through the network, starting with the input layer. As the data moves through the system, weights and activation functions are applied, transforming the values at each node. Once the data reaches the final layer, the values have been altered in such a way that they can now be interpreted as a prediction. These predictions are then used to evaluate a loss function (more on this later), and the model’s weights are adjusted in a way that reduces that loss in the next epoch. It’s okay if this makes absolutely no sense right now; we’re going to dive deeper into each step and clarify using a real-world example.
Components of a neural network
First, let’s familiarize ourselves with the parts that make up a neural network. In the diagram below, we see 3 columns of circles, with arrows connecting each circle in a column to every circle in the next column. These columns represent layers, and each circle represents a node within a layer. For now, you can think of a node as something that holds a value. Layers are what give our neural net its architecture, and they outline the path by which our data moves through the network. This path is visualized by the arrows in the diagram. Every single connection (arrow) has what’s called a weight (a small numeric value) attached to it, which is applied to values as they pass through the system. These weights are set arbitrarily upon initialization. Once all the incoming weights are applied, a final value to be passed along is computed for that node. This final value depends on a few things, the most important being the activation function. For now, we aren’t going to talk about how this activation function is applied; we just need to know that some function (or combination of functions) aggregates the weighted values at each node and then passes a final value along to the next layer.
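Here’s a small sketch of what one layer of that diagram does to the values: each node sums its weighted inputs and then runs the result through an activation function. The sigmoid activation and the specific numbers are illustrative choices, not something the diagram prescribes.

```python
import numpy as np

def sigmoid(z):
    # One common activation function: squashes any value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x = np.array([0.5, -1.2, 3.0])          # values held by the 3 input nodes
W = rng.normal(scale=0.1, size=(4, 3))  # one weight per arrow: 3 inputs feeding 4 nodes
b = np.zeros(4)                         # a per-node bias term

z = W @ x + b   # each node sums its weighted inputs
a = sigmoid(z)  # the activation function computes the final value to pass along

print(a)  # 4 values, one per node in the next layer, each between 0 and 1
```

Stacking several of these layers, each feeding the next, is all a forward pass through the network is.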
We can already see how the application of these different activation functions and weights could produce output that looks very different from our input. But how do these randomly assigned weights generate accurate predictions? Remember, training occurs over a series of epochs, and at each epoch our weights are updated. How does our model decide on the new weights?
The learning process
This weight update process is the whole mechanism for how neural networks “learn”. It has to do with what’s called an optimization function. One of the most common of these (and the one we will focus on) is called Stochastic Gradient Descent (SGD), and it’s the basis for many currently favored optimizers (such as Adam). At a high level, this function helps our model quantify the difference between our predictions and our expected values, and then make an educated adjustment to our weights.
Here’s what’s happening at a deeper level. We know that our output layer holds our predictions after each epoch. After each training round, a loss function is evaluated, representing the difference between our expected values and our predicted values. Our model then computes the gradient of the loss function with respect to every single weight in the model (this would look something like d(loss)/d(weight)). The gradient for each weight is then multiplied by another value that we set upon initialization, called the learning rate. This value is usually very small (often between 0.00001 and 0.001), so subtracting the resulting product from the weight produces a new value close to the original weight, but slightly closer to the optimum. Over the entire training cycle (multiple epochs), these weight values converge toward the ideal values, and SGD steadily works to minimize our loss function.
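That update rule is compact enough to watch in action. Below is a toy version on a single weight, with a made-up loss function whose gradient we can write down by hand; the learning rate is larger than the typical range quoted above purely because this toy problem is so easy.

```python
# A toy SGD update on one weight. Assume the loss is
# loss(w) = (w - 3)**2, so d(loss)/d(weight) = 2 * (w - 3)
# and the optimum weight is exactly 3. (A real network repeats this
# same update for every weight, with gradients from backpropagation.)
w = 0.0              # weight set arbitrarily at initialization
learning_rate = 0.1  # larger than typical, since this toy problem is easy

for epoch in range(100):
    grad = 2 * (w - 3)            # gradient of the loss w.r.t. the weight
    w = w - learning_rate * grad  # nudge the weight against the gradient

print(w)  # converges toward the optimum value 3.0
```

Each pass shrinks the distance to the optimum by a constant factor, which is exactly the “close to the original weight, but slightly closer to the optimum” behavior described above.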
Let’s try to solidify our understanding with a concrete example.
Let’s say you want to build a neural network to classify images. In this case, you would have a bunch of images of dogs and a bunch of images of cats, each with a label assigned to it. You would feed these images, along with their labels, to your model, and ideally your model would return a prediction. Now, your model isn’t going to directly tell you “hey, this is a picture of a dog”; instead, it’s going to output a list of probabilities between 0 and 1. These probabilities tell you, for each category (in our case, cat or dog), how sure the model is that the image belongs to that category.
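Those probabilities are typically produced by running the output layer’s raw values through a softmax function. The article doesn’t name softmax, so treat this as one common way to get such an output, with made-up raw values:

```python
import numpy as np

def softmax(z):
    # Turns raw output-layer values into probabilities that are each
    # between 0 and 1 and sum to 1.
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

raw_output = np.array([1.1, 0.0])  # hypothetical raw values at the two output nodes
probs = softmax(raw_output)        # roughly [0.75, 0.25]: 75% "cat", 25% "dog"

print(probs)
```

Whatever the raw values are, the result always sums to 1, which is what lets us read it as “how sure the model is” about each category.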
Let’s say we receive an output of [0.75, 0.25] for an image of a cat in our first round of training. We’ll assume we know which output node is which, so our model is telling us it thinks there is a 75% chance this picture is a cat. Our algorithm then checks the correct label (in this case, a 1 in the first node) and evaluates the loss. After all the data has passed through, our model takes this loss and uses it to calculate the new weights. After training, our model would ideally output something closer to
[0.94, 0.06], representing that it is more sure the image is a cat. If we look at the predictions for a single image over multiple epochs, what we would hope to see is our output getting closer and closer to [1, 0], the true label.
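The whole cycle can be sketched end to end for a single “cat” example: forward pass, loss gradient, weight update, repeat. The features, zero-initialized weights, and learning rate below are all made up for illustration (real networks start from random weights and train on many images), but the “cat” probability climbing over epochs is exactly the behavior described above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Tiny end-to-end sketch for one "cat" example (all values illustrative).
x = np.array([0.5, -1.0, 2.0, 0.3, -0.7])  # stand-in features for one cat image
target = np.array([1.0, 0.0])              # true label: 1 in the "cat" node
W = np.zeros((2, 5))                       # weights (zeros here; normally random)
learning_rate = 0.5

p_before = softmax(W @ x)[0]  # initial "cat" probability (0.5 with zero weights)

for epoch in range(50):
    probs = softmax(W @ x)                # forward pass: current prediction
    grad_W = np.outer(probs - target, x)  # gradient of cross-entropy loss w.r.t. W
    W -= learning_rate * grad_W           # SGD weight update

p_after = softmax(W @ x)[0]
print(p_before, p_after)  # the "cat" probability climbs from 0.5 toward 1
```

Every piece we covered appears here: the forward pass, the loss gradient, the learning rate, and the epoch loop that lets the weights converge.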
In conclusion, we just learned about the basic components that make up a neural network and how they interact with data. We learned that when we say neural networks “learn”, we mean that they are designed in such a way that they optimize themselves over time. And we learned that the effectiveness of this optimization depends on the model architecture, the optimization method chosen, and the learning rate. Hopefully, some of this article helped to demystify neural networks for you. You can treat them as a black-box algorithm and hope for the best, but if you dig a little deeper, you can achieve some pretty phenomenal results.