Programming Code For A Simple Neural Network
Published 1 September 2015 by Paul Tero of Existor Ltd
This article is a continuation of our neural networks tutorial. It provides a practical implementation of a neural network in C. Each section of code cross-references a section in the tutorial so you can relate the equations to the code.
Bias in a neural network
The previous article presented a three layer neural network, but we left out the bias to keep it simple. But the bias is needed for a real network. For example, imagine you have a friend who is 150cm tall and weighs 75kg. We might guess an equation to relate these two variables: weight = 0.5 * height. But this would imply that when your friend was a newborn baby and only 50cm long, s/he weighed 25kg, which would be an extremely heavy baby. So we can add a number to the equation: weight = 0.6 * height - 25. That equation roughly fits both a newborn baby and an adult. The 0.6 in this equation is like the weight in a neural network. And the number -25 is the bias. You can also think of it like an offset. In algebra it is the point where the line crosses the y-axis.
This bias can be connected into the neural network fairly easily. We can add it as a new fake node to the input and hidden layers. The bias nodes will always have a value of +1 and their connection weights will represent the number like -25 above. Then the network will look like this:
It will operate pretty much the same as before with a forward pass and backward propagation. The only difference is that the bias nodes will always be set to 1 and will never be updated.
Programming a basic neural network
This article gives the programming code for a basic neural network in C. I chose C because many other languages are based on it (JavaScript, PHP, Java, C++, CUDA, OpenCL). It is also fast and (if you already have a C compiler) doesn't require any extra installations.
The objective of this code is to show all the operations of the network in full. If you were really programming this, you would do the matrix operations using a highly optimised linear algebra library. And you would dynamically allocate memory, and pass in options from the command line, and so on.
The code starts with some library includes:
#include <stdio.h>  //for printf
#include <stdlib.h> //for rand
#include <string.h> //for memcpy and memset
#include <math.h>   //for tanh, exp, log and pow
Then some compiler definitions for the number of training examples and number of nodes in each layer. These could be set dynamically at run-time, but it's clearer to do them in advance.
#define NUM_TRAINING 3
#define NUM_INPUTS 3
#define NUM_HIDDEN 3
#define NUM_OUTPUTS 3
And every C program has a "main" function:
int main (int argc, char **argv) {
    return 0;
}
Save this to a file called neuralnetwork.c. You'll need a C compiler to compile it such as gcc or g++ or Visual Studio. On Mac or Linux, you can compile and run on the command line with:
gcc neuralnetwork.c -o neuralnetwork.out; ./neuralnetwork.out
This first command compiles your file and creates an executable called neuralnetwork.out. The second command runs that file. At this point it won't do anything, but shouldn't give any errors.
The input and output
Within the main function, we'll first create the training examples. These are just two dimensional arrays of our binary representations of the letters. So an input of A=100 should give an output B=010.
float inputs[NUM_TRAINING][NUM_INPUTS] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}; //the 3 possible inputs ABC
float outputs[NUM_TRAINING][NUM_OUTPUTS] = {{0, 1, 0}, {0, 0, 1}, {1, 0, 0}}; //the corresponding outputs BCA
Then we'll set a couple of parameters for the neural network. The learning rate specifies how fast it learns, and the number of iterations is how many times it goes through the inputs and outputs above before it has finished learning:
float learningrate = 0.01;
int iterations = 10000;
Initialise the weights
In code, the weights can be stored in a two dimensional array (a matrix). The size of the array depends on the number of nodes. For example, the weights between the input and hidden nodes have NUM_INPUTS+1 rows and NUM_HIDDEN columns. The +1 accounts for the bias node:
float Wxh[NUM_INPUTS+1][NUM_HIDDEN]; //weights between inputs x and hidden nodes, including an extra one for bias
float Why[NUM_HIDDEN+1][NUM_OUTPUTS]; //weights between hidden nodes and output y, including an extra one for bias
Now we randomly initialise these weights to small values around 0. The C rand function can do this. It gives a random number from 0 to the built-in constant RAND_MAX. We divide by that to get a number from 0 to 1, then reduce that to the range -0.1 to 0.1. As we are not seeding the random number generator, it may well produce the same "random" numbers every time you run it. In a real network, this would be undesirable (and we'd probably also randomise according to a Gaussian distribution). For testing it's useful:
int t, i, h, o; //looping variables for iterations, input nodes, hidden nodes, output nodes
for (i=0; i<NUM_INPUTS+1; i++) for (h=0; h<NUM_HIDDEN; h++)
    Wxh[i][h] = ((float)rand() / (float)RAND_MAX) * 0.2 - 0.1;
for (h=0; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++)
    Why[h][o] = ((float)rand() / (float)RAND_MAX) * 0.2 - 0.1;
Then we need to make space for a bunch of other variables used in the training:
float x[NUM_INPUTS+1]; //input vector including one extra for bias
float y[NUM_OUTPUTS]; //output vector
float zhWeightedSums[NUM_HIDDEN]; //weighted sums for the hidden nodes
float hActivationValues[NUM_HIDDEN+1]; //activation values of the hidden nodes, including one extra for the bias
float zyWeightedSums[NUM_OUTPUTS]; //weighted sums for the output nodes
float probabilities[NUM_OUTPUTS]; //activation values of the output nodes
float outputErrors[NUM_OUTPUTS]; //error in the output
float deltaWxh[NUM_INPUTS+1][NUM_HIDDEN]; //adjustments to weights between inputs x and hidden nodes
float deltaWhy[NUM_HIDDEN+1][NUM_OUTPUTS]; //adjustments to weights between hidden nodes and output y
float loss, sum; //for storing the loss
Training Loop
The code then begins an iterations loop. Each iteration specifies one input/output cycle. This code copies an input and output into the x and y arrays. x has NUM_INPUTS+1 entries for each input node. The first entry x[0] is the bias node and is always set to 1. The C memcpy copies the other 3 input nodes into x. y is for the output nodes, but it doesn't have a bias:
for (t=0; t<iterations; t++) { //this is how many times we will train the network
    x[0] = 1; //the bias node is always 1
    memcpy (&x[1], inputs[t % NUM_TRAINING], sizeof (inputs[0])); //copy the input into the rest of x
    memcpy (y, outputs[t % NUM_TRAINING], sizeof (outputs[0])); //copy the output into y
Weighted sums for hidden nodes
Now the forward pass begins by computing the weighted sums for the hidden nodes. This is the product of the input vector x and the matrix of weights between the input and hidden nodes. It is computed as:
\(z_{hj} = \sum_{i=0}^{n_x} x_i w^{xh}_{ij}\)
In C code, we first set all the weighted sums to 0, and then loop through the inputs and weights adding up the products. In a real neural network, this would be done by an optimised linear algebra function or library. It's shown here explicitly for clarity:
memset (zhWeightedSums, 0, sizeof (zhWeightedSums)); //set all the weighted sums to zero
for (h=0; h<NUM_HIDDEN; h++) for (i=0; i<NUM_INPUTS+1; i++)
    zhWeightedSums[h] += x[i] * Wxh[i][h]; //multiply and sum inputs * weights
Activation values
Next we compute the activation values using the tanh function. The variable hActivationValues stores the values of the NUM_HIDDEN+1 hidden nodes. The 0th entry is the bias node and is always set to 1. The other entries are the tanh of the weighted sums from above:
\(h_{j+1} = \tanh\ (z_{hj})\)
hActivationValues[0] = 1; //set the bias for the first hidden node to 1
for (h=0; h<NUM_HIDDEN; h++)
    hActivationValues[h+1] = tanh (zhWeightedSums[h]); //apply the activation function to the other hidden nodes
Weighted sums for output nodes
Now we do the weighted sums for the output layer by computing the product of the hidden node activation values and the weights between the hidden and output nodes:
\(z_{yj} = \sum_{i=0}^{n_h} h_i w^{hy}_{ij} \)
memset (zyWeightedSums, 0, sizeof (zyWeightedSums)); //set all the weighted sums to zero
for (o=0; o<NUM_OUTPUTS; o++) for (h=0; h<NUM_HIDDEN+1; h++)
    zyWeightedSums[o] += hActivationValues[h] * Why[h][o]; //multiply and sum activations * weights
Probable answer
Then we turn the weighted sums into output probabilities using the softmax function.
\(p = \text{softmax}\ (z_y)\)
for (sum=0, o=0; o<NUM_OUTPUTS; o++) {probabilities[o] = exp (zyWeightedSums[o]); sum += probabilities[o];} //compute exp(z) for softmax
for (o=0; o<NUM_OUTPUTS; o++) probabilities[o] /= sum; //apply softmax by dividing by the sum of all the exps
Output error
The output error is the computed softmax probability minus the expected output y:
\(e_y = p - y\)
for (o=0; o<NUM_OUTPUTS; o++) outputErrors[o] = probabilities[o] - y[o]; //the error for each output
Loss
The loss is a single number which tells us how well the network is doing. When the network is operating correctly, this loss should decrease as training progresses and eventually get close to zero. This is the cross-entropy log loss for more than 2 outputs:
\(J = - \sum_{j=1}^{n_y} y_j \log p_j \)
for (loss=0, o=0; o<NUM_OUTPUTS; o++) loss -= y[o] * log (probabilities[o]); //the loss
Finding the gradient
This concludes the feed forward stage. Now we have to calculate how to adjust the weights to improve the network. This uses an algorithm called gradient descent, which is based on the partial derivatives of the loss function.
\(\frac {\partial J} {\partial w^{hy}_{ij}} = h_i e_j \)
We use the C variable deltaWhy to store the partial derivatives. For the weights between the hidden and output layers, it is computed by multiplying each hidden node activation value times the output errors:
for (h=0; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++)
    deltaWhy[h][o] = hActivationValues[h] * outputErrors[o];
Backward propagation
In order to adjust the weights between the input and hidden nodes, we need to backward propagate the error. This involves multiplying the output error by the hidden/output weight matrix:
\(e_h = e_y \, (w^{hy})^T\)
In the code, we reuse the hActivationValues variable for this. Note that it starts at index 1 because hActivationValues[0] is the bias node and isn't involved in backwards propagation. We first set it to all zeros before doing the sums:
memset (hActivationValues, 0, sizeof (hActivationValues)); //zero it before accumulating the back propagated error
for (h=1; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++)
    hActivationValues[h] += Why[h][o] * outputErrors[o];
Then we back prop through the tanh activation function by calculating its gradient and multiplying by the backwards propagated error above:
\(z_{eh} = e_h \odot (1 - \tanh^2 (z_h))\)
//apply the activation function gradient
for (h=0; h<NUM_HIDDEN; h++)
    zhWeightedSums[h] = hActivationValues[h+1] * (1 - pow (tanh (zhWeightedSums[h]), 2));
Finally the changes to the weights between the input and hidden nodes are the input times the back propagated weighted sums:
\(\delta w^{xh} = x^T z_{eh}\)
//multiply x * zh to get the adjustments to deltaWxh, this does not include the hidden bias node
for (i=0; i<NUM_INPUTS+1; i++) for (h=0; h<NUM_HIDDEN; h++)
    deltaWxh[i][h] = x[i] * zhWeightedSums[h];
Changing the weights
We've now computed the gradients and we can use them to change the weights. We multiply by the learning rate and then subtract from the existing weights:
for (h=0; h<NUM_HIDDEN+1; h++) for (o=0; o<NUM_OUTPUTS; o++) Why[h][o] -= learningrate * deltaWhy[h][o];
for (i=0; i<NUM_INPUTS+1; i++) for (h=0; h<NUM_HIDDEN; h++) Wxh[i][h] -= learningrate * deltaWxh[i][h];
} //end of the training loop
Results
And that completes a single iteration! Once we have done this hundreds or thousands of times, the loss settles down to a value close to zero and the network has learned. You can download the full code which also includes a debugging variable and a function for displaying vectors and matrices. If you run it as is, you should see a decreasing loss:
Iteration: 1, loss: 1.99973
Iteration: 2, loss: 1.96918
Iteration: 3, loss: 1.76910
Iteration: 4, loss: 1.99869
Iteration: 5, loss: 1.96801
Iteration: 6, loss: 1.77050
Iteration: 7, loss: 1.99766
Iteration: 8, loss: 1.96684
Iteration: 9, loss: 1.77189
Iteration: 10, loss: 1.99664
Iteration: 11, loss: 1.96568
Iteration: 12, loss: 1.77326
Iteration: 13, loss: 1.99563
Iteration: 14, loss: 1.96452
Iteration: 15, loss: 1.77461
...
Iteration: 9997, loss: 0.01497
Iteration: 9998, loss: 0.01678
Iteration: 9999, loss: 0.02031
Iteration: 10000, loss: 0.01496
Followed by the ending weights. These don't look much like the weights I predicted, but they work:
input/hidden weights:
-0.07408 -0.00894 0.21300
-0.15978 -1.23593 -1.67984
-1.59281 0.94939 0.94431
1.76659 0.37032 0.88094
hidden/output weights:
0.00746 0.25910 -0.38276
2.66296 -0.20895 -2.51230
0.38154 -1.59694 1.08895
1.15777 -2.25131 1.23477
And then you'll see the predicted output for the last input/output pair, which proves that it is working:
last input: 1.00000 0.00000 0.00000
last output: 0.00000 1.00000 0.00000
predicted output: 0.00349 0.99254 0.00397
For the input A=100, it predicted the correct answer B=010 with over 99% confidence. If you set it to run for a million iterations it will reach 99.99% confidence. You can also increase the learning rate to 0.1 to make it learn more quickly. This is such a simple network that a relatively high learning rate still works.
Running the network
In a real network, this training process could take hours or days rather than seconds, so you would be careful to save the computed weights to a file. When you were ready to run the network, you would just need to load the file and do the feed forward part of the algorithm above to get the predicted probabilities.
Stochastic gradient descent
The type of gradient descent used here is called stochastic gradient descent, where training examples are fed into the network one at a time. For a simple network like this, you would usually use batch gradient descent, giving it all the training examples at once. In batch gradient descent, x and y and other variables in the code above would be matrices rather than vectors. Also, with batch gradient descent the loss should decrease every iteration, rather than jumping around as it does above. There are also other approaches such as mini-batch gradient descent, which does the descent in small batches.
Conclusion
This article has provided some C code for the network described in the neural networks tutorial.