The activation function is an integral part of a neural network. Without an activation function, a neural network is essentially a linear regression model, because stacking linear layers only produces another linear function. The activation function is what introduces non-linearity into the network. In this article, we will discuss the SoftMax activation function, which is popularly used for multiclass classification problems. Let’s first understand the neural network architecture for a multiclass classification problem and why other activation functions cannot be used in this case.
The SoftMax activation function is commonly used in machine learning, particularly in neural networks for classification tasks. It converts a vector of raw prediction scores (logits) into probabilities.
Suppose we have the following dataset: For every observation, we have five features from FeatureX1 to FeatureX5, and the target variable has three classes.
Now, let’s create a simple neural network to solve this problem. Here, we have an Input layer with five neurons, as we have five features in the dataset. Next, we have one hidden layer with four neurons. Each of these neurons uses inputs, weights, and biases to calculate a value, which is represented as Zij here.
For example, the first neuron of the first layer is represented as Z11. Similarly, the second neuron of the first layer is represented as Z12, and so on.
We apply an activation function, let’s say tanh, to these values and send the results to the output layer.
The number of neurons in the output layer depends on the number of classes in the dataset. Since we have three classes in the dataset, we will have three neurons in the output layer. Each of these neurons will give the probability of individual classes. This means the first neuron will give you the probability that the data point belongs to class 1. Similarly, the second neuron will give you the probability that the data point belongs to class 2.
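The architecture described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights are random, the observation is made up, and the helper names `dense` and `random_layer` are hypothetical. It stops at the raw output scores (Z21 to Z23), since the choice of output activation is discussed later.

```python
import math
import random

random.seed(0)  # fixed seed so the illustrative weights are reproducible

N_FEATURES, N_HIDDEN, N_CLASSES = 5, 4, 3

def dense(inputs, weights, biases):
    """Compute z_j = sum_i(w_ji * x_i) + b_j for every neuron j in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def random_layer(n_out, n_in):
    """Hypothetical initialisation: small random weights, zero biases."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

w1, b1 = random_layer(N_HIDDEN, N_FEATURES)   # input layer -> hidden layer
w2, b2 = random_layer(N_CLASSES, N_HIDDEN)    # hidden layer -> output layer

x = [0.5, -1.2, 3.0, 0.7, 2.1]    # made-up observation (FeatureX1..FeatureX5)

z1 = dense(x, w1, b1)             # Z11..Z14
a1 = [math.tanh(z) for z in z1]   # tanh activation on the hidden layer
z2 = dense(a1, w2, b2)            # Z21..Z23, raw scores of the output layer

print(len(z1), len(a1), len(z2))  # 4 4 3
```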
SoftMax is a mathematical function often used in machine learning, especially in classification tasks. It converts a vector of numbers (or scores) into probabilities, where each probability corresponds to the likelihood of a particular class. This makes it easier to interpret the output of a model, especially in multi-class classification problems.
Here’s how SoftMax works in simple terms:
1. Start with the raw scores (logits), for example [2.0, 1.0, 0.1].
2. Apply the exponential function (e^x) to each score. This ensures all values are positive and amplifies the differences between them: [e^2.0, e^1.0, e^0.1], which is approximately [7.39, 2.72, 1.11].
3. Add up the exponentiated values: 7.39 + 2.72 + 1.11 = 11.22.
4. Divide each exponentiated value by the sum: [7.39/11.22, 2.72/11.22, 1.11/11.22], which is approximately [0.66, 0.24, 0.10].
5. Interpret the result as probabilities: [0.66, 0.24, 0.10] means the first class has a 66% chance, the second class has a 24% chance, and the third class has a 10% chance.
Here’s how the softmax function in machine learning works in the last layer of a neural network:
Output: The result is a new vector the same size as the input vector z. Each element of the output vector is a probability between 0 and 1, and the elements sum to 1. The argmax function is typically used to select the index of the highest probability, which determines the predicted class.
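The steps above can be reproduced with a minimal sketch in plain Python. The function name `softmax` and the argmax one-liner are illustrative choices, not taken from any particular library.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# argmax: index of the largest probability (0-based, so 0 is the first class)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
print(predicted)                     # 0
```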
Suppose we calculate the Z values using the weights and biases of the output layer and apply the sigmoid activation function to them. The sigmoid activation function returns a value between 0 and 1 for each neuron. Suppose these are the values we get as output.
There are two problems in this case:
First, if we apply a threshold of 0.5, this network says the input data point belongs to two classes. Secondly, these probability values are independent of each other. That means the probability that the data point belongs to class 1 does not consider the probability of the other two classes.
For these reasons, the sigmoid activation function is not preferred in multi-class classification problems.
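Both problems are easy to see in a small sketch. The Z values below are an assumption: they are borrowed from the worked example later in this article and are not the outputs from the original illustration.

```python
import math

def sigmoid(z):
    """Squash a single score into the range (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))

z_values = [2.33, -1.46, 0.56]   # assumed output-layer scores Z21, Z22, Z23
outputs = [sigmoid(z) for z in z_values]

# Classes whose sigmoid output exceeds the 0.5 threshold (1-based class labels)
above_threshold = [i + 1 for i, p in enumerate(outputs) if p > 0.5]

print([round(p, 2) for p in outputs])  # [0.91, 0.19, 0.64]
print(above_threshold)                 # [1, 3]
print(sum(outputs) > 1.0)              # True
```

Two classes pass the threshold, and the three values do not sum to 1, so they cannot be read as a probability distribution over the classes.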
In the above example, we will use the Softmax activation function in the output layer instead of sigmoid. The Softmax activation function calculates relative probabilities. That means it uses the values of Z21, Z22, and Z23 to determine the final probability value.
Let’s see how the softmax activation function works. Like the sigmoid activation function, the SoftMax function in machine learning returns the probability of each class. Here is the equation for the SoftMax activation function.
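In standard notation, for an output layer with K neurons, the function can be written as follows. The symbol K and the indices i and j are introduced here for illustration; in the network above, K = 3 and the inputs are Z21, Z22, and Z23.

$$
\mathrm{softmax}(Z)_i = \frac{e^{Z_i}}{\sum_{j=1}^{K} e^{Z_j}}, \qquad i = 1, \dots, K
$$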
Here, Z represents the values from the neurons of the output layer. The exponential acts as the non-linear function. The exponentiated values are then divided by the sum of all the exponentials, which normalizes them into probabilities.
Note that when the number of classes is two, Softmax reduces to the sigmoid activation function: the probability of the first class equals the sigmoid of the difference between the two scores. In other words, sigmoid can be seen as a special case of the Softmax function.
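A quick numerical check of the two-class case, as a sketch with arbitrarily chosen scores: the Softmax probability of the first class should match the sigmoid of the difference between the two scores.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z1, z2 = 1.7, -0.4                 # arbitrary scores for the two classes

p_softmax = softmax([z1, z2])[0]   # probability of class 1 under Softmax
p_sigmoid = sigmoid(z1 - z2)       # sigmoid of the score difference

print(abs(p_softmax - p_sigmoid) < 1e-12)  # True
```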
Let’s understand with a simple example how the softmax function works. We have the following neural network.
Suppose the values of Z21, Z22, and Z23 are 2.33, -1.46, and 0.56, respectively. Applying the SoftMax activation function to these values yields probabilities of approximately 0.84, 0.02, and 0.14 for the three classes.
These are the probabilities that the data point belongs to the respective classes. Note that, in this case, the sum of the probabilities is equal to 1.
In this case, class 1 has the highest probability, so the input is assigned to class 1. Because the probabilities are computed jointly, a change in the score of any one class also changes the probability of the first class.
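The example can be checked with a short sketch. The second half changes only Z23 (the replacement value 2.0 is made up) to show that the probability of class 1 moves even though Z21 stays the same.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.33, -1.46, 0.56]            # Z21, Z22, Z23 from the example
probs = softmax(z)
print([round(p, 3) for p in probs])  # [0.838, 0.019, 0.143]

# Raise only the third score; the first score is untouched
z_changed = [2.33, -1.46, 2.0]
probs_changed = softmax(z_changed)
print(probs_changed[0] < probs[0])   # True: class 1 probability drops
```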
In practice, the softmax function is straightforward to implement in Python, for example with NumPy. Through exponentiation and normalization, it transforms the logits into a probability distribution. During training, this distribution is what the loss function (typically cross-entropy) compares against the true label, so it plays a central role in optimization. A CNN’s ability to make precise class predictions relies on the same principles.
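As a rough sketch of how Softmax feeds into a loss, the code below computes a cross-entropy loss for one example. It is kept to the standard library, and the same steps map directly onto NumPy array operations. Two things here are additions not covered in the text above: subtracting the maximum logit before exponentiating (a common trick to avoid overflow that does not change the result) and the choice of class index 0 as the true label.

```python
import math

def stable_softmax(logits):
    """Softmax with the maximum subtracted first, so exp() cannot overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

logits = [2.0, 1.0, 0.1]
probs = stable_softmax(logits)
loss = cross_entropy(probs, true_class=0)   # assumed true label: first class
print(round(loss, 3))                        # 0.417

# A naive math.exp(1000.0) would raise OverflowError; the shifted version works
big = stable_softmax([1000.0, 999.0, 998.1])
print(abs(sum(big) - 1.0) < 1e-12)           # True
```

The loss shrinks toward 0 as the probability of the correct class approaches 1, which is what training pushes the network toward.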
In a CNN, Softmax is typically applied to the scores produced by the final fully connected layer, turning them into class probabilities.
This article covered the SoftMax activation function. We saw why activation functions like sigmoid are not suitable for the output layer in multiclass classification problems, and we walked through an example of how the SoftMax function works. Different input scores produce different output vectors, but the SoftMax outputs always form a probability distribution that sums to 1.
The softmax activation function is a mathematical function that converts a vector of real numbers into a probability distribution. It exponentiates each element, making them positive, and then normalizes them by dividing by the sum of all exponentiated values. This ensures that the output probabilities add up to one, making it suitable for multiclass classification tasks.
The sigmoid function is used for binary classification, mapping any real value to a range between 0 and 1. It’s suitable for independent predictions. The softmax function, on the other hand, converts a vector of real numbers into a probability distribution for multiclass classification tasks, ensuring that the sum of the probabilities is equal to one.
Softmax: A function that converts logits into probabilities, ensuring the output sums to 1. It’s used in classification tasks.
ReLU (Rectified Linear Unit): An activation function that sets negative values to zero, enabling non-linearity and reducing vanishing gradients in deep learning models.
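The contrast can be seen on a single vector of scores. In this sketch, the input values are reused from the worked example earlier in the article: ReLU acts on each element independently, while Softmax normalizes across the whole vector.

```python
import math

def relu(values):
    """Set negative values to zero; leave positive values unchanged."""
    return [max(0.0, v) for v in values]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.33, -1.46, 0.56]

print(relu(z))                     # [2.33, 0.0, 0.56]
print(round(sum(softmax(z)), 6))   # 1.0
```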
Softmax is used in CNNs for multi-class classification. It converts the output layer’s scores into probabilities, helping the model predict the most likely class.