Neural networks are a class of machine learning algorithms inspired by the structure and functioning of the human brain. A neural network consists of interconnected nodes, also known as neurons, that work together to solve complex problems. The number of neurons used in a neural network can significantly impact its performance and accuracy. In this article, we’ll cover how to estimate the number of neurons and weights in a network and how forward propagation produces predictions. By the end, you’ll have a better understanding of how neural networks work and how to tune them for your needs. So, let’s dive into the estimation of neurons and forward propagation in neural networks.
In the context of neural networks, estimating neurons refers to determining the optimal number of neurons to use in each network layer. This is an important step in designing and training neural networks, as the number of neurons can significantly impact the network’s performance. Too few neurons can result in underfitting, where the model cannot capture the complexity of the data. Conversely, too many neurons can result in overfitting, where the model fits the training data too closely and performs poorly on new data. Various methods can be used to estimate the optimal number of neurons, including trial and error, cross-validation, and more advanced techniques such as pruning.
Let’s start with a binary classification problem where we want to predict whether a customer will churn or not. To keep things easy to follow, we will use a small dummy dataset with four input variables and eight observations.
We will use a neural net with an architecture of [4, 5, 3, 2], depicted below:
Let’s label the neurons in our hidden layers for reference. In Hidden Layer L2, we’ll call them N1 through N5. In Hidden Layer L3, they’ll be N6, N7, and N8. The output layer in a classification problem can be structured in two ways. It can either have a single node, or it can have one node for each class or category.
This network is called a Fully Connected Network (FNN) or Dense Network, since every neuron is connected to every node in the previous layer. It is also known as a Feedforward Neural Network or Sequential Neural Network.
The equation for the neural network is a linear combination of the independent variables and their respective weights and bias (or the intercept) term for each neuron. The neural network equation looks like this:
Z = Bias + W1X1 + W2X2 + …+ WnXn
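where Z is the weighted sum computed by a neuron, W1 through Wn are the weights (or beta coefficients), X1 through Xn are the independent variables (the inputs), and Bias is the intercept term.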
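As a minimal sketch of the equation above, the snippet below computes Z for a single neuron and a single observation. The function name `neuron_linear_output` and all numeric values are hypothetical and chosen only for illustration; they are not taken from the article's dataset.

```python
def neuron_linear_output(x, weights, bias):
    """Return Z = Bias + W1*X1 + W2*X2 + ... + Wn*Xn for one neuron."""
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical values: one observation with four input variables
x_example = [1.0, 0.0, 2.0, 3.0]       # X1..X4
w_example = [0.5, -0.2, 0.1, 0.4]      # W1..W4
z_example = neuron_linear_output(x_example, w_example, bias=0.3)
print(z_example)
```

Each neuron in the network gets its own set of weights and its own bias, so this calculation is repeated once per neuron.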
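There are three steps to perform in any neural network: compute the output (the predicted Y values) from the input variables using the linear combination above, calculate the loss or error term, and minimize that error term.
First, we will understand how to calculate the output of a neural network, and then we will look at the approaches that help it converge to the optimum solution with the minimum error term.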
The output layer receives information from hidden layer L3, which connects to hidden layer L2 and ultimately the input variables. The hidden layers automatically create features without requiring manual derivation. This automatic feature generation is what distinguishes deep learning from traditional machine learning.
So, to compute the output, we have to calculate the values of all the nodes in the previous layers. Let us look at the mathematics behind any kind of neural net.
From the architecture above, we can see that a single general equation cannot describe every neuron. Instead, there is one such equation per neuron, for both the hidden layers and the output layer.
The nodes in hidden layer L2 depend on the Xs in the input layer; therefore, their equations will be the following:
Similarly, the nodes in hidden layer L3 are derived from the neurons in the previous hidden layer L2; hence their respective equations will be:
The output layer nodes are derived from hidden layer L3, which gives the equations:
Now, how many weights or betas need to be estimated to reach the output? Counting all the weights (the Wi terms) in the above equations gives 51. However, no real model will have only four input variables to start with!
Additionally, the number of neurons and the number of hidden layers are themselves tuning parameters. In that case, how will we know how many weights to estimate to calculate the output? Is there a more efficient way than manual counting to know the number of weights needed? The weights here refer to the beta coefficients of the input variables together with the bias term (and the same convention is followed in the rest of the article).
The structure of the network is 4, 5, 3, 2. The hidden layer L2 has 25 weights in total. This comes from (4 + 1) * 5, where 4 is the number of input variables in L1 and 5 is the number of neurons in L2. Each neuron in L2 has one weight per input plus one bias term, which is the + 1 in (4 + 1), so the layer has 5 bias terms in total.
The weight count for each layer follows a specific formula. Take the number of nodes in the previous layer and add 1 for the bias term. Then multiply this sum by the number of neurons in the layer being counted.
Similarly, the number of weights for hidden layer L3 = (5 + 1) * 3 = 18, and for the output layer the number of weights = (3 + 1) * 2 = 8.
The total number of weights for this neural network is the sum of the weights from each of the individual layers which is = 25 + 18 + 8 = 51
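The counting rule above can be sketched as a small helper, which avoids counting by hand when the architecture changes. The function name `count_weights` is hypothetical; the only input is the list of layer sizes, and "weights" includes the bias terms, following the convention used in this article.

```python
def count_weights(layer_sizes):
    """Count weights (including biases) per layer of a fully connected network."""
    per_layer = [
        (n_prev + 1) * n_next  # +1 accounts for the bias term of each neuron
        for n_prev, n_next in zip(layer_sizes[:-1], layer_sizes[1:])
    ]
    return per_layer, sum(per_layer)


per_layer, total = count_weights([4, 5, 3, 2])
print(per_layer, total)  # [25, 18, 8] 51
```

Changing the list, for example to `[10, 8, 4, 2]`, gives the count for a different architecture without recounting the equations.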
We now know how many weights we will have in each layer, and the weights from the above neuron equations can also be represented in matrix form. The weights of each layer will take the following form:
Hidden layer L2 will have a 5 * 5 matrix, since its number of weights is (4 + 1) * 5 = 25:
Hidden layer L3 will have a 6 * 3 matrix, since its number of weights is (5 + 1) * 3 = 18:
Lastly, the output layer will have a 4 * 2 matrix, with (3 + 1) * 2 = 8 weights:
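A minimal sketch of these matrix shapes, assuming NumPy and assuming the convention that each matrix has one row per node of the previous layer plus one bias row, and one column per neuron. The matrices are filled with zeros as placeholders; only their shapes matter here.

```python
import numpy as np

layer_sizes = [4, 5, 3, 2]

# Rows = previous-layer nodes + 1 bias row, columns = neurons in the layer
weight_shapes = [
    (n_prev + 1, n_next)
    for n_prev, n_next in zip(layer_sizes[:-1], layer_sizes[1:])
]
weight_matrices = [np.zeros(shape) for shape in weight_shapes]

for name, w in zip(["L2", "L3", "output"], weight_matrices):
    print(name, w.shape, w.size)  # layer name, matrix shape, number of weights
```

The sizes of the three matrices add up to the same total obtained from the counting formula.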
Okay, so now we know how many weights we need to compute for the output but then how do we calculate the weights? In the first iteration, we assign randomized values between 0 and 1 to the weights. In the following iterations, these weights are adjusted to converge at the optimal minimized error term.
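The first-iteration initialization can be sketched as follows, assuming NumPy's random generator and a uniform distribution on [0, 1). Keeping the biases in separate column vectors is an assumption made for convenience, and the seed is fixed only so the example is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed, for reproducibility only

layer_sizes = [4, 5, 3, 2]
weights, biases = [], []
for n_prev, n_next in zip(layer_sizes[:-1], layer_sizes[1:]):
    # One weight per (previous node, neuron) pair, drawn from [0, 1)
    weights.append(rng.uniform(0.0, 1.0, size=(n_prev, n_next)))
    # One bias per neuron, stored as a column vector
    biases.append(rng.uniform(0.0, 1.0, size=(n_next, 1)))

print([w.shape for w in weights])
print([b.shape for b in biases])
```

Later iterations would replace these random starting values with adjusted ones; that update step is not shown here.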
We are so persistent about minimizing the error because the error tells how much our model deviates from the actual observed values. Therefore, to improve the predictions, we constantly update the weights so that loss or error is minimized.
This adjustment of weights is also called the correction of the weights. There are two methods, Forward Propagation and Backward Propagation, to correct the betas or weights and reach convergence. We will go into the depth of each of these techniques; however, before that, let’s close the loop on what the neural net does after estimating the betas.
The next step on the ladder of computing the output is to apply a transformation to these linear equations. As we have a classification neural net at hand, how does this linear equation help when categorizing the output into classes?
For a binary classification problem, we need the Sigmoid to transform the linear equation into a nonlinear one. If you are not sure why we use the Sigmoid for this, I would suggest revisiting logistic regression.
For a particular node, the transformation is as follows:
N1 = W11*X1 + W12*X2 + W13*X3 + W14*X4 + W10
After implementing the Sigmoid transformation, it becomes:
h1 = sigmoid(N1)
where,
sigmoid(N1) = exp(W11*X1 + W12*X2 + W13*X3 + W14*X4 + W10)/(1+ exp(W11*X1 + W12*X2 + W13*X3 + W14*X4 + W10))
This transformation applies to the hidden layers and the output layer and is known as the Activation or Squashing Function. It adds non-linearity to the network, because not every business problem can be solved linearly.
There are various types of activation functions, and each is suited to different uses. On the output layer, the activation function depends on the type of business problem. The squashing function for the output layer in binary classification is the Sigmoid.
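As a small sketch of the Sigmoid transformation for node N1, the snippet below uses only the standard library. The input values and weights are hypothetical. The function computes exp(z) / (1 + exp(z)), rearranged for non-negative z into the equivalent form 1 / (1 + exp(-z)) to avoid overflow for large values.

```python
import math


def sigmoid(z):
    """Sigmoid, equal to exp(z) / (1 + exp(z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical inputs and weights for node N1
x = [1.0, 0.0, 2.0, 3.0]        # X1..X4
w = [0.5, -0.2, 0.1, 0.4]       # W11..W14
w10 = 0.3                       # bias term W10

n1 = sum(wi * xi for wi, xi in zip(w, x)) + w10   # linear part
h1 = sigmoid(n1)                                  # squashed into (0, 1)
print(n1, h1)
```

Whatever the value of the linear part, the output of the Sigmoid always lies strictly between 0 and 1, which is why it can be read as a class probability.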
Hence, to find the output we estimate the weights and perform the mathematical transformation. The output of a node is the outcome of this activation function.
Up to this point, we have completed only step 1 of the neural network, which is taking the input variables and finding the output. Then we calculate the error term. And mind you, so far this has been done for only one record! We perform this entire cycle again for every record.
Relax, we don’t have to do this manually. The network carries out these steps in the background; the idea here is simply to understand how it works.
In the neural network, we can move from left to right and right to left as well. The right to left process of adjusting the weights from the Output to the Input layer is Backward Propagation (I will cover this in the next article).
The process of going from left to right, i.e. from the Input layer to the Output layer, is Forward Propagation. We move from left to right to compute the output, and the resulting error guides how the weights are corrected. We will see how this works mathematically and how the weights are updated to minimize the loss function.
Our binary classification dataset has an input X that is a 4 * 8 matrix, with 4 input variables and 8 records, and a Y variable that is a 2 * 8 matrix, with one entry each for class 1 and class 0 across the 8 records. The data had some categorical variables; after converting them to dummy variables, we have the set as below:
We begin with a 4 * 8 input matrix and aim for a 2 * 8 output. The number of hidden layers and the number of neurons in each layer are hyperparameters defined by the user. We reach the output through matrix multiplication between the inputs and the weights of each layer.
We have seen above that each layer has its own weight matrix. Let’s begin with an input matrix of 4 * 8. We multiply this by the weight matrix between layers L1 and L2. This gives us the matrix for layer L2. We repeat these matrix multiplications through each layer until we reach the final 2 * 8 output layer.
Note that the above explanation of neuron estimation applies to a single observation. The network repeats this process for all observations.
Now, let’s break down the steps to understand how the matrix multiplication in Forward propagation works:
1. We can multiply element by element, but that result will cover only one observation or record. To get the result for all 8 observations in one go, we need to multiply the two matrices.
2. For matrix multiplication, the number of columns in the first matrix must match the number of rows in the second. Our input matrix (4 * 8) has 8 columns, but the weight matrix (4 * 5) has 4 rows, so we cannot multiply them as they are.
3. So, what do we do? We take the transpose of one of the matrices. Transposing the weight matrix to 5 * 4 and placing it first resolves the issue: (5 * 4) * (4 * 8) gives a 5 * 8 result.
4. After adding the bias term, the result between the input layer and hidden layer L2 becomes Z1 = Wh1^T * X + bh1, where Wh1 is the weight matrix between L1 and L2 and bh1 is the bias term. Z1 has shape 5 * 8.
5. The next step is to apply the activation function to Z1. Applying the activation function does not change the shape, so h1 = activation function(Z1) also has shape 5 * 8.
6. In the same way as in the five steps above, forward propagation computes the outcome of each subsequent layer:
Note that for the next layer, between L2 and L3, the input is no longer X but h1, the result of the step between L1 and L2.
Z2 = Wh2^T * h1 + bh2,
where Wh2 is the weight matrix between L2 and L3, h1 is the activated output of hidden layer L2, and bh2 is the bias term.
In terms of matrix dimensions:
Z2 = [(3*5) * (5*8)] + bh2, which gives Z2 the dimension 3*8. Applying the activation function again gives h2 = activation function(Z2), also of shape 3*8.
7. We repeat these steps for the computation of the last layer.
This time, for the layer between L3 and L4, the input is h2, the result of the step between L2 and L3.
Z3 = Wh0^T * h2 + bh0
In terms of matrix dimensions:
Z3 = [(2*3) * (3*8)] + bh0, which gives Z3 the dimension 2*8. We apply the activation function once more, this time the Sigmoid because we need the final output, which gives O = Sigmoid(Z3) of shape 2*8.
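The steps above can be sketched end to end with NumPy. This is an illustration under stated assumptions, not the article's exact implementation: the input matrix and weights are random placeholder values, the Sigmoid is assumed as the activation for the hidden layers as well as the output layer, and the names `forward_propagation`, `weights`, and `biases` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    # Equivalent to exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + np.exp(-z))


def forward_propagation(X, weights, biases):
    """Apply Z = W^T * h + b and the activation for each layer in turn.

    X has shape (n_inputs, n_records). Each weight matrix has shape
    (n_prev, n_next) and each bias has shape (n_next, 1).
    Returns the list of activated outputs [h1, h2, O].
    """
    activations = []
    h = X
    for W, b in zip(weights, biases):
        Z = W.T @ h + b          # (n_next, n_prev) @ (n_prev, n_records)
        h = sigmoid(Z)           # the activation keeps the shape of Z
        activations.append(h)
    return activations


rng = np.random.default_rng(seed=0)
layer_sizes = [4, 5, 3, 2]
X = rng.uniform(0.0, 1.0, size=(4, 8))     # placeholder 4 * 8 input matrix
weights = [rng.uniform(0.0, 1.0, size=(p, n))
           for p, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.uniform(0.0, 1.0, size=(n, 1)) for n in layer_sizes[1:]]

h1, h2, O = forward_propagation(X, weights, biases)
print(h1.shape, h2.shape, O.shape)  # (5, 8) (3, 8) (2, 8)
```

Because the multiplication is done on whole matrices, all 8 records pass through the network in a single call rather than one at a time.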
After estimating the output through forward propagation, we calculate the error. The process of adjusting weights to minimize this error continues until we find the optimal solution.
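The article does not specify which error measure is used, so the sketch below assumes binary cross-entropy, a common choice when the output layer uses the Sigmoid. The targets and predictions are made-up values for three records, and the function name `binary_cross_entropy` is hypothetical.

```python
import numpy as np


def binary_cross_entropy(Y, O, eps=1e-12):
    """Average binary cross-entropy between targets Y and outputs O.

    Y and O have the same shape, (n_output_nodes, n_records).
    """
    O = np.clip(O, eps, 1.0 - eps)   # avoid taking log(0)
    losses = -(Y * np.log(O) + (1.0 - Y) * np.log(1.0 - O))
    return float(np.mean(losses))


# Made-up targets (one row per class) and two sets of predictions
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
O_close = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.8, 0.2]])
O_far = np.array([[0.4, 0.7, 0.3],
                  [0.6, 0.3, 0.7]])

error_close = binary_cross_entropy(Y, O_close)
error_far = binary_cross_entropy(Y, O_far)
print(error_close, error_far)
```

Predictions that sit closer to the targets give a smaller error, which is the quantity the weight adjustments try to minimize.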
The other, preferred method to adjust weights is Backward Propagation, which we will explore in the next article.
The estimation of neurons and forward propagation are fundamental concepts in neural networks. Estimating the required number of neurons in a neural network is crucial to prevent overfitting and underfitting, which can harm performance. Forward propagation is the process of moving data through the neural network, allowing it to make predictions. This article has only introduced these concepts briefly. There is much more to explore in the field of neural networks.
A. The cost formula in a neural network, also called the loss function, measures the difference between predicted outputs and target values during training. It quantifies the network’s performance and aids in adjusting the model’s parameters to minimize errors. Common examples include mean squared error for regression tasks and cross-entropy for classification tasks.
A. The formula for deep network calculation involves sequentially computing the output of each layer. The process begins with input data fed into the first layer. Activation functions are then applied to the weighted sum of inputs at each layer. This continues layer by layer until the output layer is reached. Finally, the output layer provides the prediction or result of the computation by the deep neural network.
Analytics Vidhya does not own the media shown in this article, and the author uses it at their discretion.
A lot of thanks, Neha Seth! This article is written in a very clear and fluent language. I wish you further development in your creativity.