Machine Learning drives much of the technology we interact with nowadays, with applications in everything from search results on Google to ETA prediction on the road to tumor diagnosis. But despite its clear importance to our everyday life, most of us are left wondering how this stuff works. We might have heard the term “artificial neural network,” but what does that really mean? Is it a robot that thinks in the same way a human does? Is it a supercomputer owned by Apple? Or is it just a fancy math equation?
Machine Learning actually covers everything from simple decision trees (similar to the ones you made in your Intro to Business Management course) to neural networks, complex models loosely inspired by how a brain functions. This article will dive into neural networks, since they are behind most of the very impressive machine learning these days.
First, an Illustrative Example
To understand what machine learning is, consider the task of trying to predict the height of a tree based on the soil content in the ground. Now, since this is machine learning we are talking about, let’s assume we can get some really good data on this task: thousands of soil samples from all over the world.
There are a lot of measurements you can make on soil contents. Things like moisture levels, iron levels, grain size, acidity, etc. They all have some effect on the health of a tree, and how tall it grows. So let’s say that we examine thousands of trees in the world (all of the same kind, of course) and collect both data about their soil contents as well as the trees’ heights. We have just created a perfect dataset for machine learning, with both features (the soil contents) as well as labels (the heights). Our goal is to predict the labels using the features.
That definitely seems like a daunting task. Even if there is a relationship between soil contents and tree height, it certainly seems impossible to be able to make accurate predictions, right? Well, machine learning isn’t always perfectly analogous to how our brains work, even if neural networks are modeled on brains. The important thing to remember is that these models aren’t making wild guesses as we humans might. Instead, they are coming up with exact equations that determine their predictions. Let’s start with simplifying the problem a bit first.
It’s quite easy to imagine that a single feature like moisture will have a significant effect on tree height. Too dry, and the tree won’t grow, but too moist and the roots may rot. We could make an equation based on this single measurement, but it wouldn’t be very accurate because there are many many more factors that go into the growth of a tree.
See how the hypothetical relationship above is not a great estimate? The line follows the general trends of the dots, but if that’s what you use to make your predictions on height you’ll be wrong most of the time. Consider the case where there is a perfect amount of moisture, but the soil is way too acidic. The tree won’t grow very well, but our model only considers moisture, so it will assume that it will. If we consider both measurements, however, we might get a more accurate prediction. That is, we would only say that the tree will be very tall when both the moisture and acidity are at good levels, but if one or both of them are at bad levels we may predict that the tree will be short.
So what if we consider more factors? We could look at the effect of moisture and acidity at the same time by combining the relationships into one equation.
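As a rough sketch of what “combining the relationships” could look like, here is one way to do it in Python. Everything in it is an assumption made for illustration: the bell-shaped curves, the optimum values (50% moisture, a pH of 6.5), the maximum height, and the choice to multiply the two scores so that the tree is only predicted to be tall when both are good.

```python
import math

def bell(value, best, width):
    """Score in (0, 1] that peaks when value equals best (hypothetical shape)."""
    return math.exp(-((value - best) / width) ** 2)

def predict_height(moisture_pct, ph, max_height_ft=100.0):
    """Combine two single-feature relationships into one two-feature model."""
    moisture_score = bell(moisture_pct, best=50.0, width=20.0)  # assumed optimum
    acidity_score = bell(ph, best=6.5, width=1.0)               # assumed optimum
    # Multiplying means a tall prediction requires BOTH scores to be high.
    return max_height_ft * moisture_score * acidity_score

good = predict_height(50.0, 6.5)        # ideal moisture, ideal acidity
too_acidic = predict_height(50.0, 4.0)  # ideal moisture, far too acidic
print(round(good, 1), round(too_acidic, 1))
```

With perfect moisture but very acidic soil, this toy model predicts a short tree, which is exactly the case a moisture-only equation gets wrong.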
Excellent. Now we have a more complex equation that describes the tree’s height, and it considers two features (measurements). Now we can combine even more features to make an even more complex equation. For the sake of clarity, I will call the final, combined equation our “model”. It models how the features affect height. Combining simple equations like this into a multi-dimensional model is pretty straightforward, and we can create a very complex model pretty fast. But for every tweak you can make to one of the simple equations (choosing a slightly different equation for the relationship between height and moisture), there are now thousands if not millions more “models” that we have to try, all slightly different from one another. One of these models might be great at modeling the relationship between soil content and height, but most are probably really bad at it.
This is where machine learning comes in. It will create a model composed of many simpler equations, and then test how well it works. Based on its error (that is, how wrong the predictions are) it then tweaks the simpler equations only slightly, and tests how well that one works. When it tweaks the simpler equations, it is simply altering one of the graphs in the image above to look slightly different. It may shift the graph to the right or up and down, or it could slightly elongate peaks or increase the size of the valleys. Through a process similar to evolution, it will arrive at the best — or at least a good — solution. In fact, that’s why it’s called “machine learning”. The machine learns the pattern on its own, without humans having to tell it even simple information like “moisture is good for trees”.
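The tweak-and-test loop described above can be sketched in a few lines of Python. This is a deliberately naive version that makes random tweaks and keeps only the ones that reduce the error; real neural networks use gradient-based training instead. The toy data, the one-feature model, and the names `train` and `mean_abs_error` are all made up for illustration.

```python
import random

# Toy data (made up): (moisture %, height in ft), roughly height = 1.5 * moisture
samples = [(20.0, 31.0), (40.0, 59.0), (60.0, 92.0), (80.0, 118.0)]

def mean_abs_error(weight, bias):
    """Average distance between predicted and actual heights."""
    return sum(abs(weight * x + bias - y) for x, y in samples) / len(samples)

def train(steps=2000, step_size=0.05, seed=0):
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    weight, bias = 0.0, 0.0            # untrained starting point
    best = mean_abs_error(weight, bias)
    for _ in range(steps):
        # Tweak the current model only slightly...
        cand_w = weight + rng.uniform(-step_size, step_size)
        cand_b = bias + rng.uniform(-step_size, step_size)
        err = mean_abs_error(cand_w, cand_b)
        # ...and keep the tweak only if the predictions got less wrong.
        if err < best:
            weight, bias, best = cand_w, cand_b, err
    return weight, bias, best

start_error = mean_abs_error(0.0, 0.0)
weight, bias, final_error = train()
print(start_error, final_error)
```

Because a tweak is only accepted when it lowers the error, the final error can never be worse than the starting one.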
If you’re curious about how the machine learning model picks the next combination of equations, you should read further about model training. Specifically, the concepts to master are stochastic gradient descent and backpropagation.
Sidenote: If you ever studied the Fourier series at university, it is useful to think of it as an analogy for a neural network. In school, we learn that you can create complex waves like a square wave using a combination of simple sine waves. Well, we can also create a machine learning model from many simple equations in a similar fashion.
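To make the analogy concrete, here is a small sketch that builds a square wave (one that alternates between -1 and +1) out of plain sine waves, using the standard Fourier series for it. The function name `square_wave_approx` is my own.

```python
import math

def square_wave_approx(x, n_terms):
    """Partial Fourier sum for a square wave alternating between -1 and +1."""
    total = 0.0
    for k in range(n_terms):
        harmonic = 2 * k + 1                      # only odd harmonics appear
        total += math.sin(harmonic * x) / harmonic
    return 4.0 / math.pi * total

# At x = pi/2 the true square wave equals 1; more sine waves get closer to it.
for n_terms in (1, 5, 50):
    print(n_terms, square_wave_approx(math.pi / 2, n_terms))
```

A single sine wave overshoots, and each extra term nudges the sum closer to the flat top of the square wave, much like each extra simple equation can nudge a model closer to the data.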
What are the Components of a Neural Network?
Neural networks are specifically designed based on the inner workings of biological brains. These models imitate the functions of interconnected neurons by passing input features through several layers of what are referred to as perceptrons (think ‘neurons’), each transforming the input using a set of functions. This section will explain the components of a perceptron, the smallest component of a neural network.
A perceptron (above) is typically made up of three main math operations: scalar multiplication, a summation, and then a transformation using a distinct equation called an activation function. Since a perceptron represents a single neuron in the brain, we can put together many perceptrons to represent a brain. That would be called a neural network, but more on that later.
The inputs are simply the measures of our features. For a single soil sample, this would be an array of values for each measurement. For example, we may have an input of:
representing 58% moisture, 1.3mm grain size, and 11 micrograms iron per kg soil weight. These inputs are what will be modified by the perceptron.
Weights represent scalar multiplications. Their job is to assess the importance of each input, as well as directionality. For example, does more iron contribute a lot or a little to height? Does it make the tree taller or shorter? Getting these weights right is a very difficult task, and there are many different values to try.
Let’s say we tried values for all three weights at 0.1 increments on the range of -10 to 10. The weights that showed the best results were w0 = 0.2, w1 = 9.6, w2 = -0.9. Notice that these weights don’t have to add up to 100. The important thing is how large and in what direction they are compared to one another. If we then multiply these weights by the inputs we had from before, we get the following result:
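Here is that multiplication as a short Python sketch. It assumes the inputs are entered as the raw numbers 58, 1.3, and 11, and that the weights pair up with them in the order listed (moisture, grain size, iron).

```python
# Inputs for one soil sample (assumed order and scale):
# moisture in %, grain size in mm, iron in micrograms per kg
inputs = [58.0, 1.3, 11.0]
weights = [0.2, 9.6, -0.9]

# Each input is scaled by its own weight.
weighted = [w * x for w, x in zip(weights, inputs)]
print([round(v, 2) for v in weighted])  # [11.6, 12.48, -9.9]
```

Notice that the iron term comes out negative: with these weights, more iron pushes the result down.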
These values will then be passed onto the next component of the perceptron, the transfer function.
The transfer function is different from the other components in that it takes multiple inputs. The job of the transfer function is to combine multiple inputs into one output value so that the activation function can be applied. This is usually done with a simple summation of all the inputs to the transfer function.
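As a minimal sketch, the summation looks like this in Python. It reuses the example inputs and weights from earlier, under the same assumption about their order; `transfer` is just a name I picked.

```python
inputs = [58.0, 1.3, 11.0]    # moisture %, grain size mm, iron (assumed order)
weights = [0.2, 9.6, -0.9]

def transfer(weighted_values):
    """Combine all weighted inputs into a single number by adding them up."""
    return sum(weighted_values)

weighted = [w * x for w, x in zip(weights, inputs)]
total = transfer(weighted)
print(round(total, 2))  # 14.18
```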
On its own, this scalar value is supposed to represent some information about the soil content. This value has already factored in the importance of each measurement, using the weights. Now it is a single value that we can actually use. You can almost think of this as an arbitrary weighted index of the soil’s components. If we have a lot of these indexes, it might become easier to predict tree height using them. Before the value is sent out of the perceptron as the final output, however, it is transformed using an activation function.
An activation function will transform the number from the transfer function into a value that dramatizes the input. Oftentimes, the activation function will be non-linear. If you haven’t taken linear algebra in university you might think that non-linear means that the function doesn’t look like a line, but it’s a bit more complicated than this. For now, just remember that introducing non-linearity to the perceptron keeps the output from varying only linearly with the inputs, and therefore allows for greater complexity in the model. Below are two common activation functions.
ReLU is a simple function that compares zero with the input and picks the maximum. That means that any negative input comes out as zero, while positive inputs are unaffected. This is useful in situations where negative values don’t make much sense, or for removing linearity without having to do any heavy computations.
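In code, ReLU is about as short as a function gets. A minimal sketch:

```python
def relu(x):
    """Return x for positive inputs and 0 for everything else."""
    return max(0.0, x)

print([relu(v) for v in (-3.0, 0.0, 14.18)])  # [0.0, 0.0, 14.18]
```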
The sigmoid function does a good job of separating values into different thresholds. It is particularly useful for values such as z-scores, where values towards the mean (zero) need to be looked at carefully since a small change near the mean may significantly affect a specific behavior, but where values far from the mean probably indicate the same thing about the data. For example, if soil has lots and lots of moisture, a small addition to moisture probably won’t affect tree height, but if it has a very average level of moisture then removing some small amount of moisture could affect the tree height significantly. It emphasizes the difference in values if they are closer to zero.
When you think of activation functions like the sigmoid, just remember that they are nonlinear functions that make the input more dramatic. That is, differences between inputs close to zero are preserved far better than differences between inputs far away from zero. (ReLU behaves differently: it leaves positive values untouched and flattens all negative values to zero.) The sigmoid squeezes values like 4 and 4.1 almost together, while values like 0 and 0.1 stay comparatively far apart. The purpose of this is to allow us to pick more distinct decision boundaries. If, for example, we are trying to classify a tree as either “tall,” “medium,” or “short,” values of 5 or -5 are very obviously representing tall and short. But what about values like 1.5? Around these numbers, it may be more difficult to determine a decision boundary, so by dramatizing the input it may be easier to split the three categories.
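The sigmoid’s behaviour near and far from zero can be checked numerically. This sketch uses the standard logistic formula 1 / (1 + e^-x) and compares how much a step of 0.1 changes the output at 0 and at 4.

```python
import math

def sigmoid(x):
    """Standard logistic function: squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

near_zero = sigmoid(0.1) - sigmoid(0.0)   # change caused by a 0.1 step at 0
far_out = sigmoid(4.1) - sigmoid(4.0)     # change caused by the same step at 4
print(near_zero, far_out)
```

The same 0.1 step moves the output more than ten times as much near zero as it does near 4, which is the “values far from the mean indicate the same thing” behaviour.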
We pick an activation function before training our model, so the function itself is always the same. It is not one of the parameters we toggle when testing thousands of different models. That only happens to the weights. The output of the ReLU activation function will be:
Up until now, I have ignored one element of the perceptron that is essential to its success. It is an additional input of 1, called the bias. This input always stays the same, in every perceptron. It is multiplied by a weight just like the other inputs are, and its purpose is to allow the value before the activation function to be shifted up and down, independent of the inputs themselves. This allows the other weights (for the actual inputs, not the weight for the bias) to be more specific since they don’t have to also try to balance the total sum to be around 0.
To be more specific, bias might shift graphs like the left graph to something like the right graph:
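Here is a sketch of the same shift in numbers rather than graphs. It reuses the earlier example inputs and weights (with the same assumed order) and a ReLU activation; the bias weight of -10 is a made-up value purely for illustration.

```python
def relu(x):
    return max(0.0, x)

def perceptron(inputs, weights, bias_weight):
    # The bias is an extra input fixed at 1, multiplied by its own weight.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias_weight * 1.0
    return relu(total)

inputs = [58.0, 1.3, 11.0]
weights = [0.2, 9.6, -0.9]

no_bias = perceptron(inputs, weights, 0.0)     # about 14.18
shifted = perceptron(inputs, weights, -10.0)   # hypothetical bias weight
print(round(no_bias, 2), round(shifted, 2))
```

The inputs and their weights are identical in both calls; only the bias weight moved the value before the activation down by 10.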
And that’s it! We’ve now built a single perceptron. We’ve now created a model that imitates the brain’s neuron. We also understand that while that sounds fancy, it really just means that we can create complex multi-dimensional equations by altering a few weights. As you saw, the components are surprisingly simple. In fact, they can be summarized by the following equation:
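Written out as a sketch, using symbol names I am choosing here for illustration ($x_i$ for the inputs, $w_i$ for their weights, $w_b$ for the weight on the constant bias input of 1, and $f$ for the activation function), a single perceptron with $n$ inputs computes:

```latex
y = f\left( \sum_{i=0}^{n-1} w_i x_i + w_b \cdot 1 \right)
```

The sum is the transfer function, and wrapping it in $f$ is the activation step.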
From here on out I will be representing this equation (i.e. a single perceptron) with a green circle. All of the components we have seen so far: inputs, bias, weights, transfer function, and an activation function are all present in every single green circle. When an arrow points into this green circle, it represents an individual input node, and when the arrow points out of the green circle it represents the final output value.
To represent a network of perceptrons we simply plug the output of one into the input of another. We connect many of these perceptrons in chains, flowing from one end to another. This is called a Multi-Layer Perceptron (MLP), and as the name suggests there are multiple layers of interconnected perceptrons. For simplicity, we will look at fully connected MLPs, where every perceptron in one layer is connected to every perceptron in the next layer.
You might be wondering what a ‘layer’ is. A layer is just a row of perceptrons that are not connected to each other. Each perceptron in an MLP is connected to every perceptron in the layer before it and every perceptron in the layer after it, but not to any of the perceptrons within the same layer. Let’s look at an MLP with two input values, two hidden layers, and an output of a single value. Let’s say the first hidden layer has two perceptrons and the second hidden layer has three.
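One quick way to get a feel for this structure is to count its connections. The sketch below describes the network purely by its layer sizes (two inputs, then 2, 3, and 1 perceptrons) and ignores bias weights; `count_connections` is a name I made up.

```python
# Two input values, hidden layers of 2 and 3 perceptrons, one output perceptron
layer_sizes = [2, 2, 3, 1]

def count_connections(sizes):
    """Fully connected: every node in one layer feeds every node in the next."""
    return [a * b for a, b in zip(sizes, sizes[1:])]

per_layer = count_connections(layer_sizes)
print(per_layer, sum(per_layer))  # [4, 6, 3] 13
```

Each connection carries one weight, so even this tiny network already has 13 weights to tune.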
The perceptrons here will all take in the inputs (arrows pointing towards the circle), perform the operations described in the previous section, and then push the output forward (arrow pointing out of the circle). This is done many times to create more and more complex equations, all considering the same information multiple times to make an accurate prediction. Now, although this article is meant to remove “the magic” from neural networks, it is very difficult to explain why this helps make more accurate predictions. In fact, the method I am describing is often referred to as a “black box” approach, because we don’t know why the equations it picks are important. It is currently an active area of research. What we can understand, however, is what the neural network is doing. That is as simple as following the weights through each and every perceptron.
The reason we call the layers between the input layer and output layer “hidden” is that once the values are fed in from the input, it doesn’t serve us well to look at how they are transformed until they exit the last output node. This is because these intermediary values are never used to evaluate the performance of our model (i.e. getting error values for predictions made on sample data).
And that’s really it. Combining many of these perceptrons helps us create even more sophisticated equations than a single perceptron can create.
The output value of an MLP like this is capable of making predictions on height using soil content measurements. Of course, picking the correct weights inside every single perceptron takes a lot of computational power, but this is exactly what a ‘neural network’ does.
Let’s see it in Action!
Here I will take two measurements from before through an entire neural network. The structure will be the same as the network I showed above. This will be very tedious, but you may follow along if you wish. I will be ignoring the bias for the sake of simplicity.
Here are the values of the two features I will use. They represent 58% moisture and 1.3mm grain size.
I will use the following (random) weights and activation functions for each perceptron. Recall that the ReLU activation function turns negative values into 0 and does not transform positive values:
So let’s get to it! The first two perceptrons both take the two inputs (blue), multiply them by the associated weights (yellow), add them (purple), and then apply the ReLU function (green):
These outputs become the inputs for each perceptron in the next layer, the second hidden layer. So each of its three perceptrons will use 338.9 and 42 as inputs. Those perceptrons use the following equations:
For the next layer, however, notice that we now have three, not two, inputs: 89.9, 16.22, and 0. All three inputs have to be included in the equation of the last perceptron, and therefore it will have three weights (in yellow below). Its equation is still as straightforward as the others.
As a summary, here are the values each perceptron produced given its inputs:
And there you have it! This neural network predicted a tree with a height of 165.72 feet! Now we have to compare the predicted results to the actual height of the sample tree in our data. Calculating some error value can be as straightforward as taking the difference between our predicted height and the actual height. Then we repeat this process with slightly different weights over and over until we find weights that predict tree height well for many samples. But that takes much too long for a human to do, so we need a machine to compute the optimal weights.
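As a small sketch of that error calculation: the predicted height of 165.72 ft comes from the walkthrough, while the “actual” height of 150 ft is a made-up number, since no measured value is given for this sample.

```python
def absolute_error(predicted_ft, actual_ft):
    """How far off a single prediction is, ignoring direction."""
    return abs(predicted_ft - actual_ft)

def mean_absolute_error(predictions, actuals):
    """Average error across many samples, used to judge a set of weights."""
    errors = [absolute_error(p, a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)

predicted = 165.72
actual = 150.0   # hypothetical measured height, not from the walkthrough
print(round(absolute_error(predicted, actual), 2))
```

Averaging the error over many samples is what lets us say whether one set of weights is better than another overall, not just for a single tree.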
The weights were totally random to simulate the starting point of a neural network. This model is clearly not ‘trained’ and therefore won’t do well once we put another sample into it. We would use the results above to determine how to alter the weights.
The intermediary values don’t tell us much at all. For example, the output from the top node in the first hidden layer is 338.9, but that’s nowhere close to the value that the neural network predicted, 166ft. It’s important to not try to interpret the intermediary values as having a real-world meaning. This is why we call these layers ‘hidden.’
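The whole forward pass can be condensed into a few lines of Python. Treat this as a sketch: the weights below are hypothetical values I picked so the arithmetic is easy to follow, not the weights from the walkthrough, so the prediction will not match 165.72. As in the walkthrough, the bias is ignored and every perceptron uses ReLU.

```python
def relu(x):
    return max(0.0, x)

def layer(inputs, weight_rows):
    """One fully connected layer: each row holds one perceptron's weights."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weight_rows]

def forward(inputs, network):
    """Feed the values through every layer in order and return the last outputs."""
    values = inputs
    for weight_rows in network:
        values = layer(values, weight_rows)
    return values

# Hypothetical weights, NOT the ones used in the walkthrough above
network = [
    [[1.0, 2.0], [-1.0, 0.5]],               # hidden layer 1: 2 perceptrons
    [[0.5, 1.0], [1.0, -1.0], [-1.0, 0.0]],  # hidden layer 2: 3 perceptrons
    [[1.0, 1.0, 1.0]],                       # output layer: 1 perceptron
]

prediction = forward([58.0, 1.3], network)   # 58% moisture, 1.3mm grain size
print(prediction)
```

Training a network means changing the numbers inside `network` until predictions like this one land close to the measured heights.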
That’s how neural networks work.