Machine Learning - Neural Networks Tutorial
Published 20 August 2015 by Paul Tero of Existor Ltd
Earlier this year, Existor committed to developing a new version of the software behind Evie and Cleverbot, based on machine learning techniques. Machine learning is a fairly old branch (1950s) of computer science, which uses even older statistical and mathematical techniques to get computers to learn the solutions to complicated problems. However, to be really effective, machine learning needs lots of data and very fast computers, both of which have only become widely available in the last few years.
At Existor, we are particularly interested in the text and natural language possibilities, and we currently have lots of data and some very fast GPU enabled servers. This tutorial comes out of our initial research, because to use a tool, you first have to understand it, and one of the main tools in machine learning is the neural network. So this article describes how a neural network works, both visually and mathematically. It uses a very simple text problem as the example: guessing the next letter in the alphabet.
Throughout this article, important phrases and jargon are italicised. You will need an HTML5 capable browser to view most of the graphics, which are displayed using SVG (Scalable Vector Graphics). And you'll need JavaScript to see the equations, which are written in TeX and rendered using MathJax. While reading this article, you may also like to follow our corresponding article with the actual code for the network described below.
Visualising a neural network
I'll start by graphing the network. Imagine that each of the circles below is a cell in your brain. To be honest, you probably have more brain cells than this. In fact, one of the smallest nervous systems in the animal kingdom belongs to Caenorhabditis elegans, a very small roundworm with just 302 neurons. In neural networks, the neurons are called nodes. The red nodes on the left are inputs, such as your eyes and ears. The green ones are outputs, such as your muscles or vocal cords. The yellow ones are extra ones in the middle that help to learn things. In neural networks, they are called hidden nodes, and they are part of a hidden layer. A neural network can have any number of hidden layers, but this one only has one. This is a three layer neural network:
In neural networks all the nodes in one layer are connected to all the nodes in the next layer. In real brains, the connecting lines are a lot more squiggly. In this graph, I've made most of the lines a faint grey colour or else it gets a bit overwhelming visually:
In order to discuss these things, we have to give them labels. Input nodes are generally given the label x and output nodes y. Hidden nodes are usually given h, which gets confusing if there is more than one hidden layer, but is okay here:
And now we'll label the connections. This gets tricky. In neural networks each connection has a weight associated with it, so normally the letter w is used. But there are lots of these connections, 18 in this very small graph, about 7500 in the roundworm and 100,000,000,000,000 in a human adult. So in this graph, the weight has a superscript indicating its position (such as xh for a weight between the x input layer and h hidden layer) and a subscript for its number (21 means between the second node on the left and first node on the right). For clarity I have only labelled the black lines here:
Inputting letters into a network
Neural networks can learn things. We will train this one to tell us the next letter in the alphabet. So if we present it with an "A", it should output "B", etc. The first problem is how to give it the letter "A". Computers and neural networks only deal with numbers. So we could say that A=1, B=2, C=3. But this implies an order and structure which really isn't there. It implies that B lies between A and C, in the same way that a roundworm (302 neurons) is between our neural network (9 nodes) and a human (86 billion neurons) in brain sizes. This is not the case. A and B are just as different as A and C.
So instead we'll use binary inputs. Since our network only has three input nodes, we can only represent 3 letters. The letter A will be represented as 100, B as 010 and C as 001. This is called the 1-of-k encoding, also known as one-hot encoding. Each input is a bunch of 0s and a single 1. Let's send in the letter "A":
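In code, a 1-of-k input is just an array of 0s with a single 1. Here is a minimal C sketch (the helper name encode_letter is my own, not from the companion article):

```c
#include <stdio.h>

#define N_LETTERS 3

/* Fill x with the 1-of-k encoding of a letter: A=100, B=010, C=001 */
void encode_letter(char letter, double x[N_LETTERS]) {
    for (int i = 0; i < N_LETTERS; i++)
        x[i] = (i == letter - 'A') ? 1.0 : 0.0;
}

int main(void) {
    double x[N_LETTERS];
    encode_letter('A', x);
    printf("%g %g %g\n", x[0], x[1], x[2]); /* prints: 1 0 0 */
    return 0;
}
```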
Random starting weights
The weights in a neural network are usually initialised to be small random values between -1 and 1. This is very important because it ensures that the different connections learn different things. I've assigned some small random weights to the connections in the first layer.
Weighted sums
Now we will compute the values of the nodes in the second layer, starting with the first hidden node \(h_1\). First we have to compute the weighted sum of the inputs. This means we multiply each input by its weight and add the results together. Sometimes this weighted sum is given the label z. So we compute the weighted sum z for \(h_1\):
\(z_{h1} = x_1 w^{xh}_{11} + x_2 w^{xh}_{21} + x_3 w^{xh}_{31} = 1 * 0.2 + 0 * 0.01 + 0 * 0.14 = 0.2 \)
We took the value of the first input \(x_1\) (1) and multiplied it by the weight connecting it to \(h_1\) (0.2). We did this for the second and third inputs as well and added the results together to get 0.2. This can be written more succinctly using summation notation:
\(z_{h1} = \sum_{i=1}^{3} x_i w^{xh}_{i1}\)
The \(\sum\) is the Greek letter Sigma and means "take a sum". It has a variable i which goes from the values 1 up to 3. So this equation takes the sum over all three inputs, just as the longer equation above.
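In C, this weighted sum is a simple loop. A minimal sketch using the numbers above (the variable names are my own):

```c
#include <stdio.h>

int main(void) {
    double x[3] = {1, 0, 0};              /* the input: letter A */
    double w_col1[3] = {0.2, 0.01, 0.14}; /* weights going into h1 */

    double z_h1 = 0;                      /* the weighted sum */
    for (int i = 0; i < 3; i++)
        z_h1 += x[i] * w_col1[i];

    printf("z_h1 = %g\n", z_h1);          /* prints: z_h1 = 0.2 */
    return 0;
}
```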
Activation values
In animal brains, a neuron either fires, passing on an electrochemical signal to other neurons, or it doesn't fire. It is thought that neurons perform a similar operation to the one above, multiplying their inputs by a set of weights and adding them up, then "deciding" whether to fire or not, depending on whether the answer exceeds some preset threshold.
A neural network does something similar using an activation function. It is needed because, if the raw weighted sums computed from the first layer were passed directly to the second layer, the whole network would collapse into a single linear calculation, no matter how many layers it had. It needs a non-linear function to separate each layer.
So the sum computed above is sent through an activation function such as sigmoid or tanh, which squashes the incoming sum towards -1 (to mimic a neuron not firing) or +1 (for firing). Unlike a real neuron, these functions also produce a continuous range of values in the middle, so they can result in -0.25 or 0.01 or just 0. This helps make the neural network less rigid.
Our network will use the tanh function, graphed below. Notice that for numbers around 0, tanh is also around 0. For numbers greater than 2, tanh gets ever closer to 1. And for less than -2 it approaches -1:
So to get the actual value, the activation value, of \(h_1\) we have to take the tanh of the number above:
\(h_1 = \tanh\ (z_{h1}) = \tanh\ (0.2) = 0.19738 \)
I'll now add the weighted sum and activation value to the graphs:
Our weights were initialised to very small random values, so at first all the activation values will be around 0. But after the network learns and becomes more confident, the weights will rise or fall, and many of the activation values may go to near -1 or +1.
Matrix notation
Now we need to compute the other activation values. We can rewrite the equation above so it applies to all the activation nodes in the h layer. Here j refers to the number of the hidden node, so this equation applies to \(h_1\), \(h_2\) and \(h_3\). I've also replaced the 3 with \(n_x\), which says that the sum goes from 1 to the number of xs, which makes the equation more generic if more inputs are added.
\(h_j = \tanh\ (z_{hj}) = \tanh\ (\sum_{i=1}^{n_x} x_i w^{xh}_{ij}) \)
To compute the other activation values in the hidden layer, we need to know the other weights. Rather than adding lots of numbers to the graph, I'll present the 9 random weights between x and h as a matrix:
\(w^{xh} = \begin{bmatrix} 0.2 & 0.15 & -0.01 \\ 0.01 & -0.1 & -0.06 \\ 0.14 & -0.2 & -0.03 \end{bmatrix} \)
In this matrix the first row represents the weights coming from \(x_1\) going into the three hidden \(h\) nodes. The second row is for \(x_2\) and so on. The first column therefore represents weights going into \(h_1\) (these are the weights shown in the graph), the second column into \(h_2\) and so on.
I can also show the input as a vector. The input vector x has three components. In neural networks, the components of the input are also known as features:
\(x = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \)
The three weighted sums for h can now be computed using matrix multiplication. When vectors and/or matrices are multiplied, you multiply the rows of the first thing by the columns of the second and add up the results (see here for a very friendly description of matrix multiplication). Like this:
\(z_h = x w^{xh} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0.2 & 0.15 & -0.01 \\ 0.01 & -0.1 & -0.06 \\ 0.14 & -0.2 & -0.03 \end{bmatrix} = \begin{bmatrix} 1*0.2 + 0*0.01 + 0*0.14 & 1*0.15 + 0*-0.1 + 0*-0.2 & 1*-0.01 + 0*-0.06 + 0*-0.03 \end{bmatrix} = \begin{bmatrix} 0.2 & 0.15 & -0.01 \end{bmatrix} \)
Which basically makes a copy of the first row of weights. And now we can compute the three activation values for h:
\(h = \tanh\ (x w^{xh}) = \tanh\ (\begin{bmatrix} 0.2 & 0.15 & -0.01 \end{bmatrix}) = \begin{bmatrix} 0.197 & 0.149 & -0.01 \end{bmatrix} \)
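Putting the weighted sums and the activation function together, the whole hidden layer can be computed with two nested loops and the tanh function from the standard maths library. A minimal C sketch (compile with -lm; the array names are my own):

```c
#include <stdio.h>
#include <math.h>

#define NX 3  /* number of input nodes */
#define NH 3  /* number of hidden nodes */

int main(void) {
    double x[NX] = {1, 0, 0};  /* the letter A */
    /* w_xh[i][j] is the weight from input i to hidden node j */
    double w_xh[NX][NH] = {{0.2,   0.15, -0.01},
                           {0.01, -0.1,  -0.06},
                           {0.14, -0.2,  -0.03}};
    double z_h[NH], h[NH];

    for (int j = 0; j < NH; j++) {
        z_h[j] = 0;
        for (int i = 0; i < NX; i++)
            z_h[j] += x[i] * w_xh[i][j]; /* weighted sum */
        h[j] = tanh(z_h[j]);             /* activation value */
    }

    for (int j = 0; j < NH; j++)
        printf("h[%d] = %.3f\n", j, h[j]); /* 0.197, 0.149, -0.010 */
    return 0;
}
```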
Output layer
We now do a similar thing for the output layer of the network. But first we need some initial random values for the weights between the hidden and output layers:
\(w^{hy} = \begin{bmatrix} 0.08 & 0.11 & -0.3 \\ 0.1 & -0.15 & 0.08 \\ 0.1 & 0.1 & -0.07 \end{bmatrix} \)
Now we can compute the weighted sums for the output layer y. In this sum the weights connect the \(i^{th}\) hidden node with the \(j^{th}\) output node:
\(z_{yj} = \sum_{i=1}^{n_h} h_i w^{hy}_{ij} \)
Which can be written even more concisely in matrix form as:
\(z_y = h w^{hy} = \begin{bmatrix} 0.197 & 0.149 & -0.01 \end{bmatrix} \begin{bmatrix} 0.08 & 0.11 & -0.3 \\ 0.1 & -0.15 & 0.08 \\ 0.1 & 0.1 & -0.07 \end{bmatrix} = \begin{bmatrix} 0.03 & -0.0017 & -0.0465 \end{bmatrix} \)
These are the weighted sums, and we could use them directly as the "answer", because unlike the hidden layer, the output layer does not have an activation function. The output nodes are like the direct signals output by the brain. In an animal brain, outputs like these could be interpreted as: look 0.03cm up, move -0.0017cm to the left and rotate the head -0.0465 degrees.
There is a type of neural network which does use the outputs directly, known as linear regression. But our network needs to produce a 1-of-k output, so we can convert it back into a letter. So we need something else.
The probable answer
Our original aim was that the network should tell us the next letter of the alphabet. So given an input of A=100, we want an output of 010=B. But the output above doesn't look anything like that. Instead it looks like this:
This is called a classification problem, because we need to know which class this output belongs to: A, B or C. For classification, neural networks often use a softmax function. This gives a probability distribution. Applied to the weighted sums above:
\(p = \text{softmax}\ (z_y) = \text{softmax}\ (\begin{bmatrix} 0.03 & -0.0017 & -0.0465 \end{bmatrix}) = \begin{bmatrix} 0.345 & 0.335 & 0.32 \end{bmatrix} \)
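Softmax itself is short to implement: exponentiate each weighted sum and divide by the total. A C sketch (subtracting the maximum first is a standard trick to avoid overflow with large sums, though it isn't needed for numbers this small):

```c
#include <stdio.h>
#include <math.h>

/* Convert n weighted sums z into a probability distribution p */
void softmax(const double *z, double *p, int n) {
    double max = z[0], sum = 0;
    for (int j = 1; j < n; j++)
        if (z[j] > max) max = z[j];
    for (int j = 0; j < n; j++) {
        p[j] = exp(z[j] - max); /* subtracting max avoids overflow */
        sum += p[j];
    }
    for (int j = 0; j < n; j++)
        p[j] /= sum;
}

int main(void) {
    double z_y[3] = {0.03, -0.0017, -0.0465};
    double p[3];
    softmax(z_y, p, 3);
    printf("%.3f %.3f %.3f\n", p[0], p[1], p[2]); /* 0.345 0.335 0.320 */
    return 0;
}
```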
This means the network "thinks" there is a 34.5% chance the answer is A, 33.5% for B and 32% for C. Basically, it doesn't know, but if it had to give an answer, it would probably plump for A. We can add these results to the graph, as percentages to indicate that they represent probabilities:
Feed forward
That represents one full cycle through a very small neural network. It was supposed to answer B given the input A, but it got it wrong. This is not at all surprising since all its weights were completely random and we haven't actually taught it anything. The whole process above is known as the feed forward phase, because input values are fed forward into the network, until they get to the end, and output values pop out.
Output error
Soon we will be training the network, going back through it and correcting our mistakes. But first we need to know where we went wrong. Our network is an example of supervised learning because we know the right answer. We know that given A, it should answer B. We can therefore compute the output error, how much it got each answer wrong by. This error is just the actual output minus the expected output.
I conspicuously avoided using the label y in the sections above. That's because y is reserved for the expected output, which in this case is the letter B=010:
\(y = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \)
We can compute the output error as the difference between the probabilities and the expected output:
\(e_y = p - y = \begin{bmatrix} 0.345 & 0.335 & 0.32 \end{bmatrix} - \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0.345 & -0.665 & 0.32 \end{bmatrix} \)
I'll add these to the graph, but in red to indicate they are errors:
Loss
We would also like a figure which represents the total error for all the outputs combined. This is called the loss or cost of the network and is often labelled with the capital letter J. I'm not sure why. This number is very useful as it enables us to very quickly see how well the network is doing, and how fast it is learning.
To compute the loss, we need to find an appropriate loss function or cost function. We could just add up the absolute values of the output errors:
\(J = \sum_{j=1}^{n_y} |e_{yj}| = 0.345 + 0.665 + 0.32 = 1.33 \)
Absolute values aren't very mathematically-friendly though (because they can't be differentiated). So instead the squared error is often used:
\(J = \sum_{j=1}^{n_y} e_{yj}^2 = 0.664 \)
But for softmax classification networks a more efficient alternative is cross-entropy which is a type of log loss. We know that our actual output p is a probability distribution. We can also view y as a probability distribution. In this example, the probability of A is 0%, B is 100% and C is 0%. In information theory, cross entropy is a way of measuring the difference between two probability distributions and is often used for classification networks. The formula looks like:
\(J = - \sum_{j=1}^{n_y} y_j \log p_j \)
Since each y is either 0 or 1, only one output will end up contributing to the loss. The sum will equal the negative log of p where y=1:
\(J = - y_1 \log p_1 - y_2 \log p_2 - y_3 \log p_3 = - 0 - 1 * \log (0.335) - 0 = 1.0936 \)
The negative of the log function between 0 and 1 looks like the graph below. So if we have 0 error (if y=1 and we predicted p=1), then the loss will also be zero. If we got it very wrong (y=1 and we predicted p=0), the loss will be very high, approaching infinity.
That loss number of 1.0936 is valuable. It is a real and concrete estimate of how well our network performed. We can measure it after every training example. If after hours of training, the loss is about the same, then we know we're not getting anywhere. If the loss falls dramatically at first and then settles at 0.04, then we know that we may as well stop. If it goes up, then we probably made a mistake.
Note that the log loss above is for networks with more than two outputs, which use softmax as the output function. Softmax is used because the outputs are treated as probabilities and have to add up to 1. Networks with only two outputs can instead use the sigmoid function to calculate the answer (where the higher of the two outputs wins and they don't have to add up to 1). Such logistic regression networks can then use the following as the loss formula:
\(J = - \sum_{j=1}^{n_y} y_j \log p_j + (1 - y_j) \log (1 - p_j) \)
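Returning to our softmax network, the cross-entropy loss is especially simple in code, because with a 1-of-k expected output it reduces to the negative log of a single probability. A minimal C sketch:

```c
#include <stdio.h>
#include <math.h>

/* Cross-entropy loss: the negative sum of y * log(p). With a 1-of-k
   target y, only the term where y=1 contributes anything. */
double cross_entropy(const double *p, const double *y, int n) {
    double J = 0;
    for (int j = 0; j < n; j++)
        if (y[j] > 0)
            J -= y[j] * log(p[j]);
    return J;
}

int main(void) {
    double p[3] = {0.345, 0.335, 0.32}; /* the network's output */
    double y[3] = {0, 1, 0};            /* the expected output: B */
    printf("J = %.4f\n", cross_entropy(p, y, 3)); /* J = 1.0936 */
    return 0;
}
```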
Adjusting output weights intuitively
Now we are ready to go back and correct our network's mistakes. There are 18 adjustable weights in our network, and we have to change each of them a little bit so that the next time we input a letter A, the network is more likely to answer B. We'll start with the output weights.
Intuitively, it feels like all the weights going into \(y_1\) and \(y_3\) should be lowered a bit, because their estimates were too high. And the weights going into \(y_2\) should be raised because they were way too low and caused a large negative error. And the bigger the error, the more the weights should be changed. Mathematically this is fairly easy to do. Each weight's contribution to the error is proportional to the value on the other end of its connection (the activation value of the hidden node). So we can just multiply the value of the hidden node by the error.
The Greek letter delta \(\delta\) is used for these weight changes. For example, the adjustment on the top weight connecting the first hidden node to the first output node could just be:
\(\delta w^{hy}_{11} = h_1 e_{y1} = 0.197 * 0.345 = 0.068 \)
This would be a mammoth adjustment though. That weight is only 0.08 to begin with. Changing it by 0.068 would change it to 0.08-0.068 = 0.012. We are in danger of overcompensating. So in practice we also multiply by a learning rate, represented by the Greek letter alpha \(\alpha\). This is typically set to some small value like 0.01 or 0.001. Then we have:
\(\delta w^{hy}_{11} = \alpha h_1 e_{y1} = 0.01 * 0.197 * 0.345 = 0.00068 \)
This would change the weight to 0.08-0.00068=0.07932. That would in turn lower the weighted sum of the first output node to 0.0295 and lower the probability of the answer being A by about 0.004%. Which is not much, but this process will be repeated thousands of times before the network is done learning.
In fact we can compute all the adjustments with one matrix operation. We multiply the hidden values by the error values. Except we flip the hidden values to make them into a vertical vector (by adding the T which stands for "transpose"). Then when we multiply the 3 rows of \(h^T\) by the 3 columns of \(e_y\), we get a 9 element matrix:
\(\delta w^{hy} = \alpha h^T e_y = 0.01 \begin{bmatrix} 0.197 \\ 0.149 \\ -0.01 \end{bmatrix} \begin{bmatrix} 0.345 & -0.665 & 0.32 \end{bmatrix} = \begin{bmatrix} 0.00068 & -0.00131 & 0.00063 \\ 0.00051 & -0.00099 & 0.00047 \\ -0.00003 & 0.00007 & -0.00003 \end{bmatrix} \)
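As a C sketch, that outer product is just two nested loops (the names are my own):

```c
#include <stdio.h>

#define NH 3
#define NY 3

int main(void) {
    double alpha = 0.01;                    /* learning rate */
    double h[NH]   = {0.197, 0.149, -0.01}; /* hidden activations */
    double e_y[NY] = {0.345, -0.665, 0.32}; /* output errors */
    double delta_w_hy[NH][NY];

    /* outer product: the adjustment for the weight from h[i] to y[j] */
    for (int i = 0; i < NH; i++)
        for (int j = 0; j < NY; j++)
            delta_w_hy[i][j] = alpha * h[i] * e_y[j];

    for (int i = 0; i < NH; i++)
        printf("%9.5f %9.5f %9.5f\n", delta_w_hy[i][0],
               delta_w_hy[i][1], delta_w_hy[i][2]);
    return 0;
}
```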
I can add those adjustments to the graph:
Gradient descent
However, that was just an informed guess. Mathematically we want to reduce or minimise the loss. If we could figure out exactly how each of the 18 weights in the network impacts that loss, then we could adjust the weight optimally. For example, if we found that increasing one of the weights (such as the one between the first hidden node and second output node) by 0.01 made the loss go up by 0.0001, then we would know we need to decrease that weight by a very small amount.
To do this, we need a tool called gradient descent which involves figuring out the partial derivatives of the loss with respect to each weight. If you're not familiar with calculus, then this may sound like jargon, but it's not too difficult.
For example, take the equation \(y=x^2\). According to the rules of calculus we can find out how fast y is changing with respect to x. It is written like this:
\(\frac {\partial y}{\partial x} = 2x\)
So for example, when x=4, y=16. At that point y is increasing 2x=2*4=8 times faster than x. So if x goes up by 0.01, then y should go up by 0.08. And it does (approximately): when x=4.01, y=16.0801. It's approximate because calculus deals with infinitesimally small changes, and 0.01 is small, but not that small. (This is also a mathematical justification for needing a very small learning rate, because the gradient is only an approximation.)
In our case, to figure out how a weight affects the loss, we could just change the weight a little bit and recalculate the loss. But as you've seen above, this involves an awful lot of calculation, and there are 18 of these weights. If we could instead figure out the partial derivatives, we would have a neat shortcut for adjusting the weights.
In the example above, 2x is also the gradient or slope of the equation \(y=x^2\). If we want to minimise the value of y, then we need to find the gradient and descend it, hence the name gradient descent. So when x=4, we can reduce x by \(\alpha 2 x = 0.01 * 2 * 4 = 0.08\). That leaves us with x=4-0.08=3.92. And then we do that again and again, and eventually we will get to x=0, which is the value which minimises y. At that point the gradient will be 2x=2*0=0 and won't change any more.
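This whole procedure fits in a few lines of code. A C sketch of gradient descent on \(y=x^2\):

```c
#include <stdio.h>

int main(void) {
    double x = 4.0, alpha = 0.01;
    /* repeatedly descend the gradient of y = x^2, which is 2x */
    for (int step = 0; step < 1000; step++)
        x -= alpha * 2 * x;
    printf("x = %g\n", x); /* very close to 0, the minimum of y = x^2 */
    return 0;
}
```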
Partial derivatives
Therefore we need to calculate how the loss changes as each weight changes. This section gets very intense. If the word "calculus" sends shivers down your spine, then look away now. Unless they are pleasurable shivers. In which case, please continue.
First of all some notation. We can mathematically write the derivative of the loss with respect to each output weight as:
\(\frac {\partial J} {\partial w^{hy}_{ij}}\)
This represents the "change in the loss J" with respect to the change in the weight between the \(i^{th}\) hidden node and \(j^{th}\) output node. So first we need to take a more careful look at J:
\(J = - \sum_{j=1}^{n_y} y_j \log p_j \)
This does not even refer to the weight in question, so we need to proceed according to the chain rule of calculus, differentiating one step at a time. The chain rule allows us to chain together a bunch of different derivatives like this:
\(\frac {\partial J} {\partial w^{hy}_{ij}} = \frac {\partial J} {\partial p_j} \frac {\partial p_j} {\partial z_j} \frac {\partial z_j} {\partial w^{hy}_{ij}}\)
We now need to figure out all the things on the right. We'll start with the second one (which is softmax) because it will reveal something that is needed to compute the first one. First I should present the innards of the softmax function. It uses the number e to the power of the weighted sum \(z_j\). The reason for this is that e is a very special number, because the derivative of \(e^z\) is also \(e^z\). It is a function which changes at the same rate as itself. For this reason, in maths circles, e is right up there with \(\pi\). Here's softmax:
\(\text{softmax} (z_j) = \frac {e^{z_j}} {\sum_{k=1}^{n_y} {e^{z_k}}} \)
Note that the input to softmax is the vector of weighted sums (with components \(z_j\)) and the output of softmax is a vector of probabilities (with components \(p_q\)). But unfortunately the derivative is different when \(j=q\) versus \(j \ne q\). We'll do the more difficult case where j=q first. The derivative of this starts out relatively easily. We break apart the top and bottom halves of the fraction and treat them separately and then apply the product rule for things that are being multiplied together. For clarity, I'm dropping the extra stuff on the \(\sum\):
\(\frac {\partial p_j} {\partial z_{j}} = \partial (e^{z_j} \frac1{\sum e^{z_k}}) = \partial (e^{z_j}) \frac1{\sum e^{z_k}} + e^{z_j} \partial (\frac1{\sum e^{z_k}}) \)
We have to apply the chain rule again to find the derivative of the sum. The sum contains the thing we are differentiating \(z_j\) along with all the other \(z_k\)s which we are not interested in. Those other \(z_k\)s are constants and their derivatives are 0, which reduces away the sum:
\(\frac {\partial p_j} {\partial z_{j}} = \partial (e^{z_j}) \frac1{\sum e^{z_k}} - e^{z_j} \frac1{(\sum e^{z_k})^2} \partial (\sum e^{z_k}) = \partial (e^{z_j}) \frac1{\sum e^{z_k}} - e^{z_j} \frac1{(\sum e^{z_k})^2} \partial (e^{z_j}) \)
Because of the wonderful magic of e this simplifies to:
\(\frac {\partial p_j} {\partial z_{j}} = e^{z_j} \frac1{\sum e^{z_k}} - e^{z_j} \frac1{(\sum e^{z_k})^2} e^{z_j} = \frac {e^{z_j}} {\sum e^{z_k}} - \frac{(e^{z_j})^2}{(\sum e^{z_k})^2} \)
And then we can resubstitute back in the probability p:
\(\frac {\partial p_j} {\partial z_{j}} = p_j - {p_j}^2\)
And now for the other case, where \(j \ne q\). In this case the derivative of the numerator \(e^{z_q}\) with respect to \(z_j\) is 0, and the derivative of the sum with respect to \(z_j\) is \(e^{z_j}\):
\(\frac {\partial p_q} {\partial z_{j \ne q}} = \partial (e^{z_q} \frac1{\sum e^{z_k}}) = \partial (e^{z_q}) \frac1{\sum e^{z_k}} + e^{z_q} \partial (\frac1{\sum e^{z_k}}) = - e^{z_q} \frac1{(\sum e^{z_k})^2} \partial (\sum e^{z_k}) = - \frac{e^{z_q} e^{z_j}} {(\sum e^{z_k})^2} = -p_q p_j\)
So now we have two derivatives:
\(\frac {\partial p_j} {\partial z_{j}} = p_j - {p_j}^2\)
\(\frac {\partial p_j} {\partial z_{q \ne j}} = - p_j p_q \)
Now we can return to the first term on the right and see how it combines with the second term. To do this we have to split the sum that makes J into two parts, one for j and the other for \(q \ne j\), and then combine with the equations above:
\( \frac {\partial J} {\partial p_j} \frac {\partial p_j} {\partial z_j} = \partial (- \sum_{j=1}^{n_y} y_j \log p_j) \frac {\partial p_j} {\partial z_j} = - \partial (y_j \log p_j) \frac {\partial p_j} {\partial z_j} - \partial (\sum_{q \ne j} y_q \log p_q) \frac {\partial p_{q \ne j}} {\partial z_j} = \) \( - \frac {y_j}{p_j} (p_j - {p_j}^2) - \sum_{q \ne j} (\frac {y_q}{p_q} (- p_j p_q)) = - y_j + y_j p_j + \sum_{q \ne j} y_q p_j = p_j \sum_q y_q - y_j = p_j - y_j \)
The last couple steps recombine all the ys back into a sum, which nicely equals 1, because there is only one 1 in the outputs. This Maths StackExchange answer about the derivative of softmax helped me a lot with this derivation.
And the last derivative is fairly easy:
\(\frac {\partial z_j} {\partial w^{hy}_{ij}} = h_i\)
Multiplying all these we get:
\(\frac {\partial J} {\partial w^{hy}_{ij}} = (\frac {\partial J} {\partial p_j} \frac {\partial p_j} {\partial z_j}) \frac {\partial z_j} {\partial w^{hy}_{ij}} = (p_j - y_j) h_i = h_i e_{yj} \)
Not so coincidentally, this is the same as the intuitive guess above! It's amazing that it all works out so nicely despite all the gory steps in the middle. In fact, softmax and cross entropy were specifically chosen to go together because they have such a nice friendly derivative. There may be simpler and less computationally expensive formulas which give a nice probability distribution, but softmax is popular because of its derivative. In a similar way, tanh is also a very complicated function with a nice derivative, as is its alternative sigmoid. The same goes for other versions of log loss and the squared error loss. All are carefully selected for their graceful derivatives.
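If you would rather trust arithmetic than calculus, you can check the result with the nudge-a-weight experiment described earlier: change one output weight by a tiny amount, recompute the loss, and compare the slope against \(h_i e_{yj}\). A self-contained C sketch (all names are my own):

```c
#include <stdio.h>
#include <math.h>

#define NH 3
#define NY 3

/* Feed hidden activations through the output layer and softmax,
   and return the cross-entropy loss against the target y. */
double loss(double h[NH], double w[NH][NY], double y[NY]) {
    double p[NY], sum = 0, J = 0;
    for (int j = 0; j < NY; j++) {
        double z = 0;
        for (int i = 0; i < NH; i++)
            z += h[i] * w[i][j];
        p[j] = exp(z);
        sum += p[j];
    }
    for (int j = 0; j < NY; j++)
        if (y[j] > 0)
            J -= y[j] * log(p[j] / sum);
    return J;
}

int main(void) {
    double h[NH] = {0.197, 0.149, -0.01};
    double y[NY] = {0, 1, 0};
    double w[NH][NY] = {{0.08, 0.11, -0.3},
                        {0.1, -0.15, 0.08},
                        {0.1,  0.1, -0.07}};
    double eps = 1e-6, J0 = loss(h, w, y);

    /* nudge one weight, remeasure the loss, and estimate the slope */
    w[0][0] += eps;
    printf("numerical:  %.5f\n", (loss(h, w, y) - J0) / eps);

    /* the analytical gradient h_i * e_yj for the same weight */
    printf("analytical: %.5f\n", 0.197 * 0.345);
    return 0; /* both print approximately 0.068 */
}
```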
Going further back
Phew! That takes care of the output weights. But we must now also adjust the weights between the input and hidden layer. This gets even more complicated but follows a similar path and ends up simplifying quite nicely. Feel free to skip this section too.
In the following equations, the output nodes are indexed by j and the hidden nodes by i, to be consistent with the above. The input nodes are indexed by m. So we need to compute how the loss J changes with respect to the weight between the \(m^{th}\) input node and \(i^{th}\) hidden node. Note first of all that all three of the probabilities at the end of the network depend on these weights, so the derivative will be an immense sum:
\(\frac {\partial J} {\partial w^{xh}_{mi}} = \sum_{j=1}^{n_y} (\frac {\partial J} {\partial p_j} \frac {\partial p_j} {\partial z_{yj}} \frac {\partial z_{yj}} {\partial h_i} \frac {\partial h_i} {\partial z_{hi}} \frac {\partial z_{hi}} {\partial w^{xh}_{mi}}) \)
The first thing to notice is that we have already computed the first two terms. And the third term is fairly easy. It calculates how the \(j^{th}\) weighted sum changes with respect to the \(i^{th}\) hidden node activation value. And this changes according to the weight connecting them:
\( \frac {\partial z_{yj}} {\partial h_i} = w^{hy}_{ij} \)
Only those first three terms have a j index and are involved in the sum. So we can take the other two out of the sum:
\(\frac {\partial J} {\partial w^{xh}_{mi}} = \sum_{j=1}^{n_y} (e_{yj} w^{hy}_{ij}) \frac {\partial h_i} {\partial z_{hi}} \frac {\partial z_{hi}} {\partial w^{xh}_{mi}} \)
Next we have to calculate how the hidden node's activation changes according to its weighted sum. You may remember that the activation value is the hyperbolic tangent of the weighted sum. tanh is a complicated function, but it's got a pleasing derivative (which I won't derive):
\(\frac {\partial h_i} {\partial z_{hi}} = \partial\ (\tanh\ (z_{hi})) = \partial\ (\frac {e^{z_{hi}} - e^{-z_{hi}}} {e^{z_{hi}} + e^{-z_{hi}}}) = 1 - \tanh^2 (z_{hi}) \)
And finally, the last one is fairly easy too:
\(\frac {\partial z_{hi}} {\partial w^{xh}_{mi}} = x_m \)
Putting all these together:
\(\frac {\partial J} {\partial w^{xh}_{mi}} = \sum_{j=1}^{n_y} (w^{hy}_{ij} e_{yj}) (1 - \tanh^2 (z_{hi})) x_m \)
One very helpful aspect of this formula is that the sum at the beginning is really just the network running in reverse using the errors as inputs. So that sum represents the error propagated back to the hidden nodes. This is very important in neural networks and is called backward propagation:
\(e_{hi} = \sum_{j=1}^{n_y} (w^{hy}_{ij} e_{yj})\)
This sum is then multiplied by the derivative/gradient of the tanh function. This is sort of like taking the error back through the activation function:
\(z_{ehi} = e_{hi} (1 - \tanh^2 (z_{hi}))\)
And that whole value is multiplied by the corresponding input. And just as we formed a matrix above by multiplying the transposed hidden nodes by the error vector, we can also form a matrix by multiplying the transposed input by the calculation above (and by a learning rate as well):
\(\delta w^{xh} = \alpha x^T z_{eh} \)
What's especially nice about this is that it would be the same process if there were more hidden layers.
Backward propagation
That's the maths! If you skipped those two sections, then what you missed was a thorough mathematical treatment of something called backward propagation. Or backprop for short if you want to impress your machine learning friends. It's called this because we must propagate the errors back to the previous layer, so that we can adjust the weights between the input and hidden layers. Now we'll look at that visually.
The first part of this is fairly easy. We run the neural network in reverse, multiplying the output errors by the transposed weight matrix (so that each hidden node collects the errors flowing back along the weights leaving it, as in the formula for \(e_{hi}\) above), to get the hidden nodes' errors:
\(e_h = e_y (w^{hy})^T = \begin{bmatrix} 0.345 & -0.665 & 0.32 \end{bmatrix} \begin{bmatrix} 0.08 & 0.1 & 0.1 \\ 0.11 & -0.15 & 0.1 \\ -0.3 & 0.08 & -0.07 \end{bmatrix} = \begin{bmatrix} -0.142 & 0.16 & -0.054 \end{bmatrix} \)
I'll add those to the graph. Notice that something spectacular has happened to the connections. The bold black connections are now going the other way around feeding 3 output errors into the top hidden node. I bet you weren't expecting that level of excitement:
Now we have to multiply those hidden errors by something. The something comes from the maths above. It is the derivative of the activation function applied to the weighted sums computed earlier. It is akin to undoing the activation function to find the corresponding weighted sum. It's not truly running the network in reverse, but can be thought of in that way.
I wasn't really sure what to call this, so I've gone with \(z_{eh}\) because z is for weighted sums and this is like the weighted sum which leads to the hidden error. In this equation the \(\odot\) signifies that each element in one vector is multiplied by the corresponding element in the other vector. So it isn't a normal vector/vector multiplication. It is elementwise multiplication. In the programming package Octave or Matlab, this is done using .*:
\(z_{eh} = e_h \odot (1 - \tanh^2 (z_h)) = e_h \odot (1 - \tanh^2 (\begin{bmatrix} 0.2 & 0.15 & -0.01 \end{bmatrix})) =\) \(\begin{bmatrix} -0.142 & 0.16 & -0.054 \end{bmatrix} \odot \begin{bmatrix} 0.961 & 0.978 & 0.9999 \end{bmatrix} = \begin{bmatrix} -0.136 & 0.156 & -0.054 \end{bmatrix}\)
Adding these to the graph:
And as before we can use this to create a 9 element matrix of adjustments:
\(\delta w^{xh} = \alpha x^T z_{eh} = 0.01 \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} -0.136 & 0.156 & -0.054 \end{bmatrix} = \begin{bmatrix} -0.00136 & 0.00156 & -0.00054 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\)
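Here is the whole backward step for the first layer as a C sketch, reproducing the numbers above (the names are my own):

```c
#include <stdio.h>
#include <math.h>

#define NX 3
#define NH 3
#define NY 3

int main(void) {
    double alpha = 0.01;
    double x[NX]   = {1, 0, 0};          /* the input: letter A */
    double z_h[NH] = {0.2, 0.15, -0.01}; /* hidden weighted sums */
    double e_y[NY] = {0.345, -0.665, 0.32};
    double w_hy[NH][NY] = {{0.08, 0.11, -0.3},
                           {0.1, -0.15, 0.08},
                           {0.1,  0.1, -0.07}};
    double e_h[NH], z_eh[NH], delta_w_xh[NX][NH];

    for (int i = 0; i < NH; i++) {
        /* propagate errors back along the weights leaving h[i] */
        e_h[i] = 0;
        for (int j = 0; j < NY; j++)
            e_h[i] += w_hy[i][j] * e_y[j];
        /* elementwise multiply by the tanh gradient */
        z_eh[i] = e_h[i] * (1 - tanh(z_h[i]) * tanh(z_h[i]));
    }

    /* outer product with the input gives the weight adjustments */
    for (int m = 0; m < NX; m++)
        for (int i = 0; i < NH; i++)
            delta_w_xh[m][i] = alpha * x[m] * z_eh[i];

    for (int i = 0; i < NH; i++)
        printf("e_h[%d] = %.3f  z_eh[%d] = %.3f\n", i, e_h[i], i, z_eh[i]);
    return 0;
}
```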
Changing weights
Once we have computed the weight adjustments, we subtract them from the actual weights in the network (just as we subtracted the adjustment from the first output weight above):
And then it is ready to receive the next training example. (Though in practice, many networks are fed their training examples in batches rather than one at a time.)
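To show the shape of the whole process, here is my own schematic C sketch of a complete training loop for this 3-3-3 network - feed forward, backward propagation, weight updates - repeated over the three examples. It is a simplified sketch, not the code from the companion article; the loss printed every thousand epochs should fall towards 0:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 3 /* 3 inputs, 3 hidden nodes and 3 outputs */

/* a small random weight between -0.2 and 0.2 */
double small_rand(void) { return rand() / (double)RAND_MAX * 0.4 - 0.2; }

int main(void) {
    double w_xh[N][N], w_hy[N][N], alpha = 0.1;
    srand(1);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            w_xh[i][j] = small_rand();
            w_hy[i][j] = small_rand();
        }

    for (int epoch = 0; epoch < 5000; epoch++) {
        double J = 0;
        for (int letter = 0; letter < N; letter++) {
            int target = (letter + 1) % N; /* A->B, B->C, C->A */
            double x[N] = {0}, h[N], p[N], e_y[N], z_eh[N], sum = 0;
            x[letter] = 1;

            /* feed forward: input -> tanh hidden layer -> softmax */
            for (int j = 0; j < N; j++) {
                double z = 0;
                for (int i = 0; i < N; i++) z += x[i] * w_xh[i][j];
                h[j] = tanh(z);
            }
            for (int j = 0; j < N; j++) {
                double z = 0;
                for (int i = 0; i < N; i++) z += h[i] * w_hy[i][j];
                p[j] = exp(z);
                sum += p[j];
            }
            for (int j = 0; j < N; j++) p[j] /= sum;
            J -= log(p[target]); /* cross-entropy loss */

            /* backward propagation */
            for (int j = 0; j < N; j++) e_y[j] = p[j] - (j == target);
            for (int i = 0; i < N; i++) {
                double e_h = 0;
                for (int j = 0; j < N; j++) e_h += w_hy[i][j] * e_y[j];
                z_eh[i] = e_h * (1 - h[i] * h[i]); /* tanh gradient */
            }

            /* descend the gradients */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    w_hy[i][j] -= alpha * h[i] * e_y[j];
            for (int m = 0; m < N; m++)
                for (int i = 0; i < N; i++)
                    w_xh[m][i] -= alpha * x[m] * z_eh[i];
        }
        if (epoch % 1000 == 0) printf("epoch %d, loss %.4f\n", epoch, J);
    }
    return 0;
}
```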
Final network
Our network only has three possible training examples: A->B, B->C and C->A. Once we've given these to the network thousands of times it should end up learning the appropriate weights. The final learned weights might look something like this (highlighting different weights this time so I can show it in action):
In fact, this is a very simple problem and doesn't require a hidden layer. What happens in this network is that an input of A=100 will cause the first hidden node to have a weighted sum of 10, and the other nodes to have weighted sums of -10. The tanh will give the hidden nodes activation values of +1, -1 and -1. The output layer will have weighted sums of -10, 10, -10, which will lead to probabilities of close to 0%, 100% and 0%, which is an output of 010.
In the graph above you can see the input A=100 flows through the network along the high positive weights and outputs approximately 010=B. In reality, the weights in a real network would probably be quite different from the examples above, but the result should hopefully be the same.
What else
This article has covered some stuff in great depth, and left a lot of other stuff completely alone. One major omission is the bias node. In an actual neural network, the hidden and output layers would have an additional weight coming into them, from a fake node whose value is always 1. This allows the network to learn an offset or bias. It wasn't needed in this simple example, but would be used in any useful neural network.
There is also a whole science to choosing the structure of the network (how many layers, how many nodes per layer), the activation function to use (tanh isn't the only one) and the loss function. Along with this, there are many approaches to gradient descent, including different optimisation algorithms, ways to dynamically set the learning rate, whether to train and descend in batches or one at a time (as we did here), how to numerically verify the process is learning correctly, and so on.
This was a supervised learning network. In supervised learning, the training data is usually split into training, cross-validation and testing data sets, so you can test how well your network does on data it has never seen before. But there are also unsupervised learning networks which find their own structure in data. And there are methods for graphing the loss in the network over time to determine if it is overfitting or underfitting the data.
Finally, this is only one type of neural network. There are more complex networks such as recurrent neural networks. And there are many other algorithms that come under the heading of machine learning but that do different things - such as support vector machines, principal component analysis and k-means clustering.
Conclusion
This has been an in-depth taster of a small slice of neural networks and machine learning. We hope you found it useful. Please keep an eye on our websites and technology, as we hope to release a first version of a neural-network-enabled Evie and Cleverbot within the next few months.
There is a corresponding article which provides the C code for this neural network and a further article which extends the maths and code to cover recurrent neural networks.