Wait, what? Teeny tiny neurons wire together to create one of the most remarkable aspects of Machine Learning?
Hell, yeah!
But how exactly? Aha, that's exactly what you are gonna find out in this article!
Basic Terminology
Neuron - A function that takes in a value, modifies it and holds the result.
Connection - Link between 2 neurons.
Hyperparameters - The model adjusts its own parameters during the learning process. The settings you specify yourself before training (such as the number of layers or the learning rate) are called hyperparameters.
Input Layer - The initial layer of neurons that takes in the external input.
Output Layer - The final layer of neurons that spits out the hopefully desired output.
Hidden Layer(s) - Layers of neurons between the input and output layers that derive different values from the inputs to activate the output layer.
What is a Neural Network?
Think of a Neural Network (NN) as layers of neurons with each neuron connected to another neuron (in a different layer) through connections.
Imagine we are building a neural network to help a farmer sort their fruits into three categories: Apples, Bananas, and Oranges based on their weight (grams), colour (red, yellow, orange, green), texture (smooth or rough) and size (small, medium or large).
When every neuron in one layer is connected to every neuron in the next layer, we call it a Dense Network. There are different types of neural networks. We'll delve into them in later articles.
Input data is fed forward through the network using these connections to activate neurons in the output layer.
The hidden layers between the input and output layers are what determine the final activations.
In the lingo, the hidden layers are collectively called a black box. Yet, as a smart beginner, you don't have to think about them that way. That's because we'll break down the process step by step!
How does an NN work?
When you input data into a neural network, it is stored in the neurons of the input layer.
How a Neuron gets Activated
Every connection is associated with a specific weight (w). So, to activate one neuron in the next layer, we take the weighted sum of the previous layer.
Let's take a look at the first two layers of our neural network and calculate the weighted sum for one neuron in the first hidden layer.
Let a = the activations of the previous layer's neurons (i.e. the values they hold). In this case, the weighted sum is,
$$\text{Weighted Sum} = (a_1\times w_1) + (a_2\times w_2) + (a_3\times w_3) + (a_4\times w_4)$$
Then, we add a constant called the bias (b) to the weighted sum before passing it on. The bias itself can be positive or negative.
$$\text{Value Passed to the Next Neuron}= \text{Weighted Sum} + b$$
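To make that concrete, here's a tiny NumPy sketch of the weighted sum plus bias for one hidden neuron. The activations, weights and bias below are made-up numbers, purely for illustration:

```python
import numpy as np

# Made-up activations of the 4 input neurons (weight, colour, texture, size encoded as numbers)
a = np.array([0.5, 0.2, 0.8, 0.1])

# Made-up weights of the 4 connections into one hidden neuron, plus its bias
w = np.array([0.4, -0.3, 0.7, 0.9])
b = 0.05

# Weighted sum + bias -- the value handed on to the activation function
z = np.dot(a, w) + b
print(z)  # 0.84
```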
In fact, we could just use the weighted sum + bias
without any modifications. That's called a linear combination, and it can only describe straight-line relationships (like y = mx + c on a graph).
Yet, that isn't helpful when it comes to finding complex patterns in data.
So, we pass the weighted sum + bias
to an activation function. The value it returns is the activation of the neuron. That way, the network can fit a fancy non-linear model to the data.
There's a variety of activation functions, each suited to a different purpose:

| Activation Function | Process | Pros | Cons |
|---|---|---|---|
| ReLU (Rectified Linear Unit) | `if input < 0: return 0 else: return input` | Cheap and efficient to compute | Discards information by turning every negative value into 0. |
| Sigmoid | Squashes the input into the range 0-1 | Easy to compute | Can introduce the vanishing gradient problem. |
| Binary Step | `if input < 0: return 0 else: return 1` | Yay, just 0 and 1 to deal with! | Information about the activation's magnitude is lost :( |
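For reference, here's a minimal sketch of those three activation functions in plain NumPy:

```python
import numpy as np

def relu(x):
    # Negative inputs become 0; everything else passes through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def binary_step(x):
    # 0 for negative inputs, 1 otherwise
    return np.where(x < 0, 0, 1)

z = np.array([-1.2, 0.84, 3.0])
print(relu(z))         # [0.   0.84 3.  ]
print(sigmoid(z))      # roughly [0.23 0.70 0.95]
print(binary_step(z))  # [0 1 1]
```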
Don't forget! Your purpose defines the activation function.
For the input layer, you don't need an activation function.
However, there is a special activation function that is only used in the output layer. It's called softmax.
It takes in all the raw values passed down to the output layer and calculates an activation for each neuron.
Let $z_1, z_2, \dots, z_n$ be the raw values passed down to the $n$ output neurons. The activation of the $i$-th output neuron is then,
$$P_i = \frac{e^{z_i}}{\sum^{n}_{j=1} e^{z_j}}$$
For example, if the raw values passed down to the three output neurons were [34, 32, 25], with the Apple neuron receiving 32;
The final activation of the Apple neuron would be,
$$P_{\text{Apple}} = \frac{e^{32}}{e^{34}+ e^{32} + e^{25}} \approx 0.12$$
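Here's a small sketch of softmax in NumPy; subtracting the maximum before exponentiating is a common trick to avoid overflow, and it doesn't change the result:

```python
import numpy as np

def softmax(raw_values):
    # Exponentiate (shifted by the max for numerical stability), then normalise
    exps = np.exp(raw_values - np.max(raw_values))
    return exps / exps.sum()

raw = np.array([34.0, 32.0, 25.0])  # raw values reaching the 3 output neurons
print(softmax(raw))  # roughly [0.88, 0.12, 0.00] -- the Apple neuron gets ~0.12
```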
Giving the NN Feedback
At first, all the weights and biases are selected randomly. So, our NN sucks at making predictions. We need a way to give it feedback and alter its direction towards success.
To achieve that, we use loss functions:
| Loss Function | Process | Pros | Cons |
|---|---|---|---|
| Mean Squared Error | The mean of the squared differences between predicted and actual values. | Smooth optimization | Sensitive to outliers. |
| Cross-Entropy Loss | Quantifies the dissimilarity between predicted and true class probabilities. | Effective for classification tasks. | Sensitive to class imbalance. |
| Mean Absolute Error | The mean of the absolute differences between predicted and actual values. | Less sensitive to outliers. | Slower convergence. |
$$\text{Cost of one prediction} = \sum \text{Loss for each output neuron}$$
To find the total cost of the neural network, we average this cost over every training example.
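As a rough sketch of the first two losses (the predicted and actual values here are placeholders, not taken from our fruit example):

```python
import numpy as np

def mse(predicted, actual):
    # Mean of the squared differences between predictions and targets
    return np.mean((predicted - actual) ** 2)

def cross_entropy(predicted, actual):
    # Dissimilarity between predicted probabilities and the true one-hot labels
    return -np.sum(actual * np.log(predicted + 1e-12))

predicted = np.array([0.7, 0.2, 0.1])  # made-up class probabilities
actual    = np.array([1.0, 0.0, 0.0])  # one-hot: the true class is the first one
print(mse(predicted, actual))            # ~0.047
print(cross_entropy(predicted, actual))  # ~0.36
```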
Ultimate Neural Network
Let's assume we input data that suits an apple. The whole process would be,
Loss = Squared Error
Activation Function of Both Hidden Layers = ReLU (Rectified Linear Unit)
Output Layer Activation Function = Softmax
$$\text{Cost of our NN} = 0.7744 + 0.7744 + 1 \times 10^{-8}$$
You see, because we haven't trained the network, it predicts the data represents an orange when the correct answer is an apple. So, let's train it!
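To make the walkthrough concrete, here's a tiny sketch that reproduces those numbers. It assumes the raw values reaching the output layer were [34, 32, 25], with the first neuron being the Orange one and the Apple neuron receiving 32, and it sums the squared error over the three output neurons:

```python
import numpy as np

def softmax(raw_values):
    exps = np.exp(raw_values - np.max(raw_values))
    return exps / exps.sum()

raw_outputs = np.array([34.0, 32.0, 25.0])  # Orange, Apple, Banana neurons
target      = np.array([0.0, 1.0, 0.0])     # the input really was an Apple

activations = softmax(raw_outputs)            # roughly [0.88, 0.12, 0.00]
squared_errors = (activations - target) ** 2  # roughly [0.77, 0.77, 0.00]
print(squared_errors.sum())                   # ~1.55 -- the cost for this example
```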
Training an NN
During the training process, the neural network constantly tries to minimise the cost function for each prediction.
To do so, it uses Backpropagation along with a fancy algorithm called Gradient Descent.
Backpropagation
During this process, we are trying to tweak the weights and biases to get the minimum possible cost function of a neural network.
The cost of the neural network is determined by the activations of the output layer neurons. So, to change the activations of the output layer, we can,
Change the weights applied to the previous layer's activations
Change the bias added to the weighted sum
Change the activations of the previous layer
(These last ones can't be tweaked directly, since we'd have to change their own weights and biases.)
Each neuron's activation depends on its incoming weights, the bias added to its weighted sum, and the activations of the previous layer.
Let a = the activation of a neuron and f = its activation function;
$$a = f\Big(\big(\sum{w\times a_{\text{previous}}}\big) + b\Big)$$
Note how the current activation depends on previous activations. Those previous activations are even influenced by their own weights and biases.
This way, layer by layer, we work our way deeper into the network, adjusting each weight and bias in proportion to the influence it has on the cost function. Because we move backwards from the output layer towards the input, the process gets the name Backpropagation.
Take the prediction of our neural network for example.
Proportional changes are illustrated through the thickness and length of arrows in the image.
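Backpropagation works out each weight's influence on the cost analytically, layer by layer, using the chain rule. As a rough intuition-builder only (this is not real backpropagation), the sketch below estimates those influences numerically for a made-up cost function of two "weights":

```python
import numpy as np

def cost(weights):
    # Placeholder cost: a real network would do a full forward pass + loss here
    w1, w2 = weights
    return (w1 - 1.0) ** 2 + 3 * (w2 + 0.5) ** 2

weights = np.array([2.0, 2.0])
eps = 1e-6

# Nudge each weight a little and see how much the cost changes
gradients = np.zeros_like(weights)
for i in range(len(weights)):
    nudged = weights.copy()
    nudged[i] += eps
    gradients[i] = (cost(nudged) - cost(weights)) / eps

print(gradients)  # roughly [2.0, 15.0] -- the second weight matters far more here
```

The bigger a weight's gradient, the more it gets adjusted, which is exactly the proportional tweaking described above.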
Optimization Functions
These are fancy functions to find the optimal values for different weights and biases in the network.
These are built on some wonderful mathematics, especially calculus (e.g. the chain rule).
We will take a look at the Maths in a future article.
For now, you just have to know some basic optimization functions like,
| Optimization Function | Process | Pros | Cons |
|---|---|---|---|
| Gradient Descent | Iteratively updates weights & biases by moving in the direction of the steepest decrease in the loss function. | Simple and widely applicable | Needs a lot of computation + can get stuck in local minima |
| Stochastic Gradient Descent (SGD) | A variant of gradient descent that randomly samples a subset (mini-batch) of the data for each update. | Faster computation | Can be noisy (i.e. random) + may require fine-tuning of the learning rate. |
| Adam (Adaptive Moment Estimation) | A smarter variant of SGD that adapts the learning rate for each weight and bias as training progresses. | Fast convergence | Sensitive to the choice of hyperparameters. |
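Here's a minimal sketch of the core idea behind gradient descent, using a single weight and a made-up quadratic cost (so the gradient can be written by hand instead of computed by backpropagation):

```python
# Toy cost with a single weight: cost(w) = (w - 3)^2, minimised at w = 3
# Its gradient is d(cost)/dw = 2 * (w - 3)

w = 10.0            # randomly chosen starting weight
learning_rate = 0.1

for step in range(50):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step in the direction of steepest decrease

print(w)  # very close to 3.0, the weight that minimises the cost
```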
Conclusion
With this knowledge, you can fluently use a library like TensorFlow Keras and create fantastic neural networks.
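For instance, a Keras version of our fruit classifier might be sketched like this (the layer sizes are arbitrary placeholders, and you'd have to prepare the features and one-hot labels yourself):

```python
import tensorflow as tf

# 4 input features (weight, colour, texture, size -- encoded as numbers),
# two ReLU hidden layers, and a 3-way softmax output (Apple, Banana, Orange)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(features, one_hot_labels, epochs=10)  # train on your own prepared data
```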
With user-friendly, high-level APIs like TensorFlow and PyTorch, you can get remarkably far in machine learning without being a math savant. That's the reality.
So, don't be afraid! Over time, you'll start to realise what is going on under the hood, intuitively.
Keep in mind that a neural network is like a team of Pokémon. You have to choose your players (i.e. the different functions, and the number of layers and neurons in each layer) to win the battle. So, developing an NN is more of an art than heavy math :)
Good luck with developing stunning NNs... You've got this!