Demystifying Neural Networks

A Step-by-Step Guide for Beginners


Wait, what? Teeny tiny neurons wire together to create one of the most remarkable aspects of Machine Learning?

Hell, yeah!

But how exactly? Aha, that's exactly what you are gonna find out in this article!

📦 Basic Terminology

Neuron - A small unit that takes in a value, transforms it, and holds the result (its activation).

Connection - Link between 2 neurons.

Hyperparameters - The settings you specify yourself (e.g. the number of layers or neurons), as opposed to the parameters (weights and biases) the model adjusts on its own during learning.

Input Layer - The initial layer of neurons that takes in the external input.

Output Layer - The final layer of neurons that spits out the hopefully desired output.

Hidden Layer/s - Layers of neurons inside a neural network that derive different values from the inputs to activate the output layer.

โ” What is a Neural Network?

Think of a Neural Network (NN) as layers of neurons with each neuron connected to another neuron (in a different layer) through connections.

Imagine we are building a neural network to help a farmer sort their fruits into three categories: Apples, Bananas, and Oranges based on their weight (grams), colour (red, yellow, orange, green), texture (smooth or rough) and size (small, medium or large).

When every neuron is connected in this manner, we call it a Dense Network. There are different types of neural networks. We'll delve into them in later articles.

Input data is fed forward through the network using these connections to activate neurons in the output layer.

The hidden layers between the input and output layers are what determine the final activations.

In the lingo, the hidden layers are collectively called a black box. Yet, as a smart beginner, you don't have to think about them that way. That's because we'll break down the process step by step!

โš™๏ธ How does an NN work?

When you input data into a neural network, it is stored in the neurons of the input layer.
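To make that concrete, here's a minimal sketch of how one fruit's features could be turned into the numbers the input neurons hold. The encoding scheme below (one-hot colours, 0/1 texture, a 0-1 size scale, weight divided by 200 g) is my own assumption for illustration, not something fixed by the article:

```python
# Hypothetical encoding of one fruit's features into numbers for the input layer.
colour_codes = {"red": [1, 0, 0, 0], "yellow": [0, 1, 0, 0],
                "orange": [0, 0, 1, 0], "green": [0, 0, 0, 1]}
texture_codes = {"smooth": 0.0, "rough": 1.0}
size_codes = {"small": 0.0, "medium": 0.5, "large": 1.0}

def encode_fruit(weight_grams, colour, texture, size):
    """Turn one fruit's raw features into a flat list of numbers."""
    return [weight_grams / 200.0] + colour_codes[colour] + \
           [texture_codes[texture], size_codes[size]]

print(encode_fruit(150, "red", "smooth", "medium"))
# [0.75, 1, 0, 0, 0, 0.0, 0.5]  -> 7 numbers for 7 input neurons
```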

How a Neuron gets Activated

Every connection is associated with a specific weight (w). So, to activate one neuron in the next layer, we take the weighted sum of the activations in the previous layer.

Let's take a look at the first two layers of our neural network and calculate the weighted sum of a neuron in the first hidden layer;

In this case, the weighted sum is,

a = activation of the previous neurons (i.e. the values they hold)

$$\text{Weighted Sum} = (a_1\times w_1) + (a_2\times w_2) + (a_3\times w_3) + (a_4\times w_4)$$

Then, we add a constant called the bias (b) before passing the result on to the next neuron. The bias itself can be positive or negative.

$$\text{Value Passed to the Next Neuron}= \text{Weighted Sum} + b$$
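In code, the calculation for a single neuron could look like this tiny sketch (the activations, weights, and bias below are made-up numbers, not values from the article):

```python
# Weighted sum + bias for a single neuron (illustrative values only).
activations = [0.75, 1.0, 0.0, 0.5]   # a1..a4: values held by the previous layer
weights     = [0.2, -0.4, 0.9, 0.1]   # w1..w4: one weight per connection
bias        = 0.05                    # b: constant added at the end

weighted_sum = sum(a * w for a, w in zip(activations, weights))
value_passed_on = weighted_sum + bias
print(value_passed_on)  # 0.2*0.75 - 0.4*1.0 + 0.9*0.0 + 0.1*0.5 + 0.05 = -0.15
```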

💡
Don't forget: every connection has its own unique weight. So, even though the activations in one layer are fixed, each neuron in the next layer receives a different weighted sum and is activated differently.

In fact, we could just pass on the weighted sum + bias without any modifications. That's called a linear combination (just like the straight line y = x on a graph).

Yet, this isn't helpful when it comes to finding complex patterns in data.

So, we pass the weighted sum + bias to an activation function. The value it returns is the activation of the neuron. That way, we can find a fancy non-linear model to fit the data.

There's a variety of activation functions, each suited for their purpose:

| Activation Function | Process | Pros | Cons |
| --- | --- | --- | --- |
| ReLU (Rectified Linear Unit) | if input < 0: return 0, else: return input | Cost-effective & efficient | Leaves behind important data in converting negatives to 0s. |
| Sigmoid | Squishes the input into the range 0-1 | Easy to compute | Introduces the vanishing gradient problem. |
| Binary Step Function | if input < 0: return 0, else: return 1 | Yay, just 0 and 1 to deal with! | Activation information is lost :( |
💡
There are modified versions of ReLU like Parametric ReLU and Leaky ReLU to address the issue with classic ReLU.
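As a quick sketch, here's how the three activation functions in the table could be written by hand in plain Python:

```python
import math

def relu(x):
    """Rectified Linear Unit: negatives become 0, positives pass through."""
    return max(0.0, x)

def sigmoid(x):
    """Squishes any input into the range 0-1."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_step(x):
    """Returns 0 for negative inputs and 1 otherwise."""
    return 0 if x < 0 else 1

print(relu(-0.15), sigmoid(-0.15), binary_step(-0.15))
# 0.0 0.4626... 0
```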

Don't forget! Your purpose defines the activation function.

For the input layer, you don't need an activation function.

However, there is a special activation function that is only used in the output layer. It's called softmax.

It takes in all the raw values passed down to the output layer and calculates an activation for each neuron.

let z_1, ..., z_n = the raw values passed down to the n output neurons
let z_i = the raw value passed down to the current neuron

$$P_i = \frac{e^{z_i}}{\sum^{n}_{j=1} e^{z_j}}$$

For example, if the raw values passed down to all output neurons were [34, 32, 25];

The final activation of the Apple neuron (whose raw value is 32) would be,

$$P_{\text{Apple}} = \frac{e^{32}}{e^{34}+ e^{32} + e^{25}} \approx 0.12$$

💡
We raise e (Euler's number) to the power of each value to pronounce the differences between them.
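Here's a small softmax sketch in plain Python, using the same raw values as above, so you can check the 0.12 result yourself:

```python
import math

def softmax(raw_values):
    """Turn raw output values into probabilities that sum to 1."""
    exps = [math.exp(v) for v in raw_values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([34, 32, 25]))
# roughly [0.88, 0.12, 0.0001] -> the neuron with raw value 32 gets about 0.12
```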

Giving the NN Feedback

At first, all the weights and biases are selected randomly. So, our NN sucks at making predictions. We need a way to give it feedback and alter its direction towards success.

To achieve that, we use loss functions:

| Loss Function | Process | Pros | Cons |
| --- | --- | --- | --- |
| Mean Squared Error | Mean of the squared differences between predicted and actual values. | Smooth optimization | Sensitive to outliers, may not be robust. |
| Cross-Entropy Loss | Quantifies the dissimilarity between predicted and true class probabilities. | Effective for classification tasks. | Sensitive to class imbalance. |
| Mean Absolute Error | The mean absolute difference between predicted and actual values. | Less sensitive to outliers. | Slower convergence. |
💡
When you add up all the losses for each prediction, you get the cost function of the predictions.

$$\text{Cost Function} = \sum \text{Loss for each prediction}$$

To find the total cost of the neural network, we average this cost over all the training examples.
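Assuming predictions and targets are plain lists of numbers, minimal hand-rolled versions of these loss functions might look like this:

```python
import math

def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def mean_absolute_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(predicted_probs, true_probs):
    # Assumes predicted_probs are valid probabilities (e.g. straight out of softmax).
    return -sum(t * math.log(p) for t, p in zip(true_probs, predicted_probs) if t > 0)

prediction = [0.12, 0.88, 0.00]   # made-up softmax output
target     = [1.0, 0.0, 0.0]      # the fruit really was an apple
print(mean_squared_error(prediction, target))   # ~0.516
print(mean_absolute_error(prediction, target))  # ~0.587
print(cross_entropy(prediction, target))        # ~2.12
```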

Ultimate Neural Network

Let's assume we input data that suits an apple. The whole process would be,

Loss = Squared Error

Activation Function of Both Hidden Layers = ReLU (Rectified Linear Unit)

Output Layer Activation Function = Softmax

$$\text{Cost of our NN} = 0.7744 + 0.7744 + 1 \times 10^{-8}$$
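The exact outputs of the untrained network come from a figure not reproduced here, so the softmax values below are assumptions chosen to match the squared errors above; the sketch just shows how those three terms arise:

```python
# Hedged reconstruction: the predicted probabilities are assumed, not from the article.
predicted = [0.12, 0.0001, 0.88]   # Apple, Banana, Orange (assumed softmax output)
target    = [1.0, 0.0, 0.0]        # the input really was an apple

squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
print(squared_errors)       # roughly [0.7744, 1e-08, 0.7744]
print(sum(squared_errors))  # ~1.5488
```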

You see, because we haven't trained the network, it predicts the data represents an orange when the correct answer is an apple. So, let's train it!

๐Ÿง‘โ€๐Ÿซ Training an NN

During the training process, the neural network constantly tries to minimise the cost function for each prediction.

To do so, it uses Backpropagation along with a fancy algorithm called Gradient Descent.

Backpropagation

During this process, we are trying to tweak the weights and biases to get the minimum possible cost function of a neural network.

The cost of the neural network is determined by the activations of the output-layer neurons. So, to change those activations, we can,

  1. Change the weights of previous activations

  2. Change the bias added to the weighted sum

  3. Change the activation of the previous neuron

    This can't be directly tweaked since we have to change its weighted sum and bias.

Each neuron's activation depends on its incoming weights, the bias added to the weighted sum, and the activations of the previous layer.

let a = the activation of a neuron anywhere in the network, and f = its activation function

$$a = f\left(\sum{w\times a_{\text{previous}}} + b\right)$$

Note how the current activation depends on previous activations. Those previous activations are even influenced by their own weights and biases.

In this way, layer by layer, we move backwards through the network, adjusting each weight and bias in proportion to the effect it has on the cost function. That's how the process gets its name, Backpropagation.
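To make "tweak a weight in proportion to its effect on the cost" concrete, here's a toy sketch with a single weight and a single training example (a drastic simplification of full backpropagation; all numbers are made up):

```python
# Toy gradient descent on one weight, for one training example.
# Cost = (a_prev * w - target)^2, so dCost/dw = 2 * (a_prev * w - target) * a_prev.
a_prev, target = 0.5, 1.0   # made-up previous activation and desired output
w = 0.1                     # start from a random-ish weight
learning_rate = 0.5

for step in range(20):
    prediction = a_prev * w
    gradient = 2 * (prediction - target) * a_prev
    w -= learning_rate * gradient   # nudge the weight against the gradient

print(w, a_prev * w)  # w approaches 2.0, so the prediction approaches 1.0
```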

💡
We start from the last layer because that's where the cost function of the neural network is calculated.

Take the prediction of our neural network for example.

Proportional changes are illustrated through the thickness and length of arrows in the image.

Optimization Functions

These are fancy functions to find the optimal values for different weights and biases in the network.

They are based on some wonderful mathematics, especially calculus (e.g. the chain rule).

We will take a look at the Maths in a future article.

For now, you just have to know some basic optimization functions like,

| Optimization Function | Process | Pros | Cons |
| --- | --- | --- | --- |
| Gradient Descent | Iteratively updates weights & biases by moving in the direction of the steepest decrease in the loss function. | Simple and widely applicable | High computation power is required + Can get stuck in local minima |
| Stochastic Gradient Descent (SGD) | A variant of gradient descent that randomly samples a subset (mini-batch) of the data for each iteration. | Faster computation | Can be noisy (i.e. random) + May require fine-tuning of the learning rate. |
| Adam (Adaptive Moment Estimation) | A variant of SGD that adapts the learning rate for each weight and bias using running estimates of the gradients. | Fast convergence | Sensitive to the choice of hyperparameters. |
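In Keras you rarely implement these yourself; you just pick one. A minimal sketch of how the optimizers from the table appear in TensorFlow Keras:

```python
import tensorflow as tf

# The optimizers from the table, as they appear in TensorFlow Keras.
# (Whether training behaves like full-batch gradient descent or SGD depends on the
# batch size you pass to model.fit, not on the optimizer object itself.)
sgd  = tf.keras.optimizers.SGD(learning_rate=0.01)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)

# You hand one of these to model.compile(optimizer=...).
```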

👋 Conclusion

With this knowledge, you can fluently use a library like TensorFlow Keras and create fantastic neural networks.
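For instance, a minimal sketch of the fruit-sorting network from this article in Keras might look like the following (the 7-feature input and the hidden-layer sizes are assumptions based on the earlier encoding example):

```python
import tensorflow as tf

# A small dense network for the fruit example: 7 input features
# (weight, one-hot colour, texture, size), two hidden ReLU layers,
# and a 3-neuron softmax output for Apple / Banana / Orange.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(7,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=20)  # once you have encoded fruit data
model.summary()
```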

With user-friendly, high-level APIs like TensorFlow and PyTorch, you can accomplish most machine learning tasks without being maths-savvy. That's the reality.

So, don't be afraid! Over time, you'll start to realise what is going on under the hood, intuitively.

Keep in mind that a neural network is like a team of Pokémon. You have to choose your players (i.e. the different functions and the number of layers/neurons in each layer) to win the battle. So, developing an NN is an art rather than heavy maths :)

๐Ÿ€ Good luck with developing stunning NNs... You've got this!
