The softmax activation function is one of the most popular terms we come across while resolving problems related to machine learning, or, more specifically, deep learning. We place softmax activation function at the end of a neural network in the deep learning model. Why? Because it normalizes the network output to a probability distribution over the estimated output classes.

We leverage this as an activation function to resolve multiclass classification problems. It can also be perceived as a generalization of the sigmoid function since it is used to present a probability distribution for a binary variable.

**Note:** If we look for softmax in the output layer in the Keras deep learning library with three-class classification activity, it might look like the example given below.

```
model.add(Dense(3, activation=’softmax’))
```

In order to implement it in Python, we will help you to briefly learn it from scratch. Let’s start with the mathematical expression for the same in Python.

The softmax function scales logits/numbers into probabilities. The output of this function is a vector that offers probability for each probable outcome. It is represented mathematically as:

Where:

**- Z =** It is the input vector of the softmax activation function. It comprises n elements for n classes.

**- Z(i) =** It is i-th element of the input vector that takes up any value between negative infinity to positive infinity.

**- exp [Z(i)] =** It is the standard exponential function applied to Z(i). The resulting values may be tiny but they are never zero.

**- n=** It represents the number of classes

**- Σexp [Z(i)] =** It is a normalization term that makes sure the output vector value sums up to 1 for the i-th class.

Here’s an example to understand how it works as per this formula.

It works for a batch of inputs with a 2D array where n rows = n samples and n columns = n nodes. It can be implemented with the following code.

```
import numpy as np
def Softmax(x):
'''
Performs the softmax activation on a given set of inputs
Input: x (N,k) ndarray (N: no. of samples, k: no. of nodes)
Returns:
Note: Works for 2D arrays only(rows for samples, columns for nodes/outputs)
'''
max_x = np.amax(x, 1).reshape(x.shape[0],1) # Get the row-wise maximum
e_x = np.exp(x - max_x ) # For stability
return e_x / e_x.sum(axis=1, keepdims=True)
```

This is the simplest implementation of softmax in Python. Another way is the Jacobian technique. An example code is given below.

```
import numpy as np
def Softmax_grad(x): # Best implementation (VERY FAST)
'''Returns the Jacobian of the softmax function for the given set of inputs.
Inputs:
x: should be a 2d array where the rows correspond to the samples
and the columns correspond to the nodes.
Returns: jacobian
'''
s = Softmax(x)
a = np.eye(s.shape[-1])
temp1 = np.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=np.float32)
temp2 = np.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=np.float32)
temp1 = np.einsum('ij,jk->ijk',s,a)
temp2 = np.einsum('ij,ik->ijk',s,s)
return temp1-temp2
```

The above-mentioned Python code implementations are only pitched and tested for a batch of inputs. This means that the expected input is a 2D array with rows. It presents different samples and columns defining different nodes.

If you wish to speed up these implementations, use Numba which is best known for translating the subset of Python and NumPy code into fast machine code. To install it, use:

`pip install numba`

**Note:** Make sure that your NumPy is compatible with Numba, though pip takes care of the same most of the time.

Here’s how you can implement softmax NUMBA.

```
from numba import njit
@njit(cache=True,fastmath=True) # Best implementation (VERY FAST)
def Softmax(x):
'''
Performs the softmax activation on a given set of inputs
Input: x (N,k) ndarray (N: no. of samples, k: no. of nodes)
Returns:
Note: Works for 2D arrays only (rows for samples, columns for nodes/outputs)
'''
max_x = np.zeros((x.shape[0],1),dtype=x.dtype)
for i in range(x.shape[0]):
max_x[i,0] = np.max(x[i,:])
e_x = np.exp(x - max_x)
return e_x / e_x.sum(axis=1).reshape((-1, 1)) # Alternative of keepdims=True for Numba compatibility
```

Here’s the code for the softmax derivative (Jacobian) NUMBA implementation.

```
from numba import njit
@njit(cache=True,fastmath=True)
def Softmax_grad(x): # Best implementation (VERY FAST)
'''Returns the Jacobian of the softmax function for the given set of inputs.
Inputs:
x: should be a 2d array where the rows correspond to the samples
and the columns correspond to the nodes.
Returns: jacobian
'''
s = Softmax(x)
a = np.eye(s.shape[-1])
temp1 = np.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=np.float32)
temp2 = np.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=np.float32)
# Einsum is unsupported with Numba (nopython mode)
# temp1 = np.einsum('ij,jk->ijk',s,a)
# temp2 = np.einsum('ij,ik->ijk',s,s)
for i in range(s.shape[0]):
for j in range(s.shape[1]):
for k in range(s.shape[1]):
temp1[i,j,k] = s[i,j]*a[j,k]
temp2[i,j,k] = s[i,j]*s[i,k]
return temp1-temp2
```

These methods are quite fast with TensorFlow and PyTorch. However, there is another method that can be used to accelerate the implementation of softmax in Python. It is with the help of Cupy (CUDA) which is an open-source array library that is used for GPU-accelerated computing with Python.

Here’s what the Cupy implementation looks like:

```
import cupy as cp
def Softmax_cupy(x):
'''
Performs the softmax activation on a given set of inputs
Input: x (N,k) ndarray (N: no. of samples, k: no. of nodes)
Returns:
Note: Works for 2D arrays only (rows for samples, columns for nodes/outputs)
'''
max_x = cp.amax(x, 1).reshape(x.shape[0],1)
e_x = cp.exp(x - max_x) # For stability as it is prone to overflow and underflow
# return e_x / e_x.sum(axis=1, keepdims=True) # Alternative 1
return e_x / e_x.sum(axis=1).reshape((-1, 1)) # Alternative 2
def Softmax_grad_cupy(x): # Best implementation (VERY FAST)
'''Returns the Jacobian of the softmax function for the given set of inputs.
Inputs:
x: should be a 2d array where the rows correspond to the samples
and the columns correspond to the nodes.
Returns: jacobian
'''
s = Softmax_cupy(x)
a = cp.eye(s.shape[-1])
temp1 = cp.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=cp.float32)
temp2 = cp.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=cp.float32)
temp1 = cp.einsum('ij,jk->ijk',s,a)
temp2 = cp.einsum('ij,ik->ijk',s,s)
return temp1-temp2
```

Using these different methods, you can efficiently implement the softmax activation function in Python.

**Example -**

Assume a neural network that classifies an input image, whether it is of a dog, cat, tiger, or none.

Consider the feature vector to be X = [x1, x2, x3, x4]). In a neural network, the layout takes place as follows.

For the above-given neural network, the matrix will be:

Where the weight matrix is:

Here,

m = Total number of nodes in layer L-1

n = Total number of nodes in output layer L.

For the above-given example, we have m=3 and n=4.

the following are the calculated values at layer L-1.

The bias matrix is:

If we calculate the exponential values of each element in the Z [L] matrix, we will have to calculate values for:

And also for:

Now, let’s consider the numeric values of each expression we listed above to drive output. Suppose we input the following values for the Z [L] matrix:

With this, the exponential values will be:

And,

Thus, if we calculate the probability distribution as per the following formula,

The probability distribution for the above example will result as:

It is clear from the above values that the input image was that of a dog since it has a greater probability.

**Implementing code for softmax function**

If we have to implement a code for the above softmax function example in Python, here’s how it will be done.

```
import numpy as np
def softmax(Z_data):
exp_Z_data = np.exp(Z_data)
#print("exp(Z_data) = ",exp_Z_data)
sum_of_exp_Z_data = np.sum(exp_Z_data)
#print("sum of exponentials = ", sum_of_exp_Z_data)
prob_dist = [exp_Zi/sum_of_exp_Z_data for exp_Zi in exp_Z_data]
return np.array(prob_dist, dtype=float)
Z_data = [1.25,2.44,0.78,0.12] #for cat, dog, tiger, none
p_cat = softmax(Z_data)[0]
print("probability of being cat = ",p_cat)
p_dog = softmax(Z_data)[1]
print("probability of being dog = ",p_dog)
p_tiger = softmax(Z_data)[2]
print("probability of being tiger = ",p_tiger)
p_none = softmax(Z_data)[3]
print("probability of being none = ",p_none)
```

**Output of the code:**

- Probability (Cat) = 0.19101770813831334
- Probability (Dog) = 0.627890718698843
- Probability (Tiger) = 0.11938650086860875
- Probability (None) = 0.061705072294234845

It clearly depicts that the input image is a dog as its output depicts the highest probability. This is the result we derived when we manually calculated the probability using the mathematical formula of the softmax activation function.

This article sums up everything to get you started with implementing the softmax activation function in Python. From the multiple methods to speeding up the implementation using practical mathematical expressions and how they work behind the code - you have a clear understanding of all the resulting layers.

What's up with Turing?
Get the latest news about us here.

Know more about remote work.

Checkout our blog here.

Checkout our blog here.

Have any questions?

We'd love to hear from you.

We'd love to hear from you.

Tell us the skills you need and we'll find the best developer for you in days, not weeks.