
Softmax Activation Function with Python


The softmax activation function is one of the most common functions you will come across when working on machine learning, and more specifically deep learning, problems. It is typically placed at the end of a neural network because it normalizes the network's output into a probability distribution over the predicted output classes.

We use it as the output activation for multiclass classification problems. It can also be seen as a generalization of the sigmoid function, which produces a probability distribution for a binary variable.
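
To make the connection with the sigmoid concrete, here is a minimal sketch (plain NumPy; the function names and the example logit are chosen just for illustration) showing that softmax over the two logits [z, 0] reproduces sigmoid(z):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v)) # subtract the max for numerical stability
    return e / e.sum()

z = 1.7 # an arbitrary logit
print(softmax(np.array([z, 0.0]))[0], sigmoid(z)) # both print ~0.8455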

Note: In the Keras deep learning library, a softmax output layer for a three-class classification task looks like the example given below.

model.add(Dense(3, activation='softmax'))

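For context, here is a minimal sketch of a complete model built with TensorFlow's Keras API; the hidden-layer size, input shape, optimizer, and loss are illustrative assumptions rather than part of the original snippet.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(4,))) # hidden layer (illustrative size)
model.add(Dense(3, activation='softmax')) # 3 outputs that form a probability distribution
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])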

To implement it in Python, it helps to first understand it from scratch. Let's start with its mathematical expression.

Mathematical representation of softmax in Python

The softmax function turns a vector of logits (raw scores) into probabilities. Its output is a vector containing the probability of each possible outcome. Mathematically, it is written as:

softmax(Z)(i) = exp[Z(i)] / Σ exp[Z(j)], where the sum in the denominator runs over all n classes (j = 1, …, n)

Where:

- Z = the input vector to the softmax function; it comprises n elements for n classes.

- Z(i) = the i-th element of the input vector; it can take any value between negative infinity and positive infinity.

- exp[Z(i)] = the standard exponential function applied to Z(i). The resulting value may be tiny, but it is always positive and never zero.

- n = the number of classes.

- Σ exp[Z(j)] = the normalization term; dividing by the sum of all exponentials guarantees that the output values sum to 1 and form a valid probability distribution.
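
As a quick sanity check of the formula, take a two-element input Z = [1.0, 2.0]. Then exp(1.0) ≈ 2.718 and exp(2.0) ≈ 7.389, the normalization term is ≈ 10.107, and softmax(Z) ≈ [2.718 / 10.107, 7.389 / 10.107] ≈ [0.269, 0.731], which sums to 1 as expected.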

Now, let's see how this formula can be implemented in Python.

How the softmax formula works

The implementation below works on a batch of inputs: a 2D array in which each row is a sample and each column is a node (class). It can be implemented with the following code.

import numpy as np
def Softmax(x):
    '''
    Performs the softmax activation on a given set of inputs
    Input: x (N,k) ndarray (N: no. of samples, k: no. of nodes)
    Returns: (N,k) ndarray of probabilities; each row sums to 1
    Note: Works for 2D arrays only (rows for samples, columns for nodes/outputs)
    '''
    max_x = np.amax(x, 1).reshape(x.shape[0], 1) # Get the row-wise maximum
    e_x = np.exp(x - max_x) # Subtract the max for numerical stability
    return e_x / e_x.sum(axis=1, keepdims=True) # Normalize each row to sum to 1

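A quick usage sketch for the function above (the input values are just illustrative):

x = np.array([[1.25, 2.44, 0.78, 0.12],
              [0.10, 0.20, 0.30, 0.40]])
probs = Softmax(x)
print(probs) # one probability distribution per row
print(probs.sum(axis=1)) # [1. 1.]: each row sums to 1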

This is the simplest implementation of softmax in Python. For backpropagation, you will also need the derivative of softmax, i.e., its Jacobian. An example implementation is given below.

import numpy as np
def Softmax_grad(x): # Best implementation (VERY FAST)
    '''Returns the Jacobian of the softmax function for the given set of inputs.
    Inputs:
    x: should be a 2d array where the rows correspond to the samples
        and the columns correspond to the nodes.
    Returns: jacobian
    '''
    s = Softmax(x)
    a = np.eye(s.shape[-1]) # Identity matrix (Kronecker delta)
    temp1 = np.einsum('ij,jk->ijk', s, a) # s_ij * delta_jk
    temp2 = np.einsum('ij,ik->ijk', s, s) # s_ij * s_ik
    return temp1 - temp2 # Jacobian[i,j,k] = s_ij * (delta_jk - s_ik)

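To sanity-check the Jacobian, you can compare it against a central finite-difference approximation. The sketch below assumes the Softmax and Softmax_grad functions defined above are in scope; the input values and tolerance are arbitrary choices:

x = np.array([[1.25, 2.44, 0.78, 0.12]])
jac = Softmax_grad(x)
print(jac.shape) # (1, 4, 4): one Jacobian matrix per sample

eps = 1e-6
numerical = np.zeros((4, 4))
for k in range(4):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[0, k] += eps
    x_minus[0, k] -= eps
    numerical[:, k] = (Softmax(x_plus) - Softmax(x_minus))[0] / (2 * eps)
print(np.allclose(jac[0], numerical, atol=1e-5)) # True if the analytic Jacobian matches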

The Python implementations above are written and tested for a batch of inputs: the expected input is a 2D array whose rows represent different samples and whose columns represent different nodes (outputs).

If you wish to speed up these implementations, use Numba, which translates a subset of Python and NumPy code into fast machine code. To install it, use:

pip install numba

Note: Make sure your NumPy version is compatible with Numba, although pip usually takes care of this.

Here's how you can implement softmax with Numba:

import numpy as np
from numba import njit
@njit(cache=True,fastmath=True)  # Best implementation (VERY FAST)
def Softmax(x):
    '''
    Performs the softmax activation on a given set of inputs
    Input: x (N,k) ndarray (N: no. of samples, k: no. of nodes)
    Returns: (N,k) ndarray of probabilities; each row sums to 1
    Note: Works for 2D arrays only (rows for samples, columns for nodes/outputs)
    '''
    max_x = np.zeros((x.shape[0],1),dtype=x.dtype)
    for i in range(x.shape[0]):
        max_x[i,0] = np.max(x[i,:])
    e_x = np.exp(x - max_x)
    return e_x / e_x.sum(axis=1).reshape((-1, 1)) # Alternative of keepdims=True for Numba compatibility

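To see the effect of Numba, a rough timing sketch like the one below can be used. The names softmax_numpy and softmax_numba are hypothetical: they stand for the plain-NumPy and @njit-decorated versions above, renamed so both can coexist, and the array size is arbitrary:

import time
import numpy as np

x = np.random.rand(10000, 100)

softmax_numba(x) # warm-up call: the first call triggers JIT compilation, so exclude it from timing

t0 = time.perf_counter(); softmax_numpy(x); t1 = time.perf_counter()
t2 = time.perf_counter(); softmax_numba(x); t3 = time.perf_counter()
print("NumPy:", t1 - t0, "s | Numba:", t3 - t2, "s")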

Here's the Numba implementation of the softmax derivative (Jacobian):

import numpy as np
from numba import njit
@njit(cache=True,fastmath=True)
def Softmax_grad(x): # Best implementation (VERY FAST)
    '''Returns the Jacobian of the softmax function for the given set of inputs.
    Inputs:
    x: should be a 2d array where the rows correspond to the samples
        and the columns correspond to the nodes.
    Returns: jacobian
    '''
    s = Softmax(x)
    a = np.eye(s.shape[-1])
    temp1 = np.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=np.float32)
    temp2 = np.zeros((s.shape[0], s.shape[1], s.shape[1]),dtype=np.float32)
    # Einsum is unsupported with Numba (nopython mode)
    # temp1 = np.einsum('ij,jk->ijk',s,a)
    # temp2 = np.einsum('ij,ik->ijk',s,s)
    for i in range(s.shape[0]):
        for j in range(s.shape[1]):
            for k in range(s.shape[1]):
                temp1[i,j,k] = s[i,j]*a[j,k]
                temp2[i,j,k] = s[i,j]*s[i,k]
    return temp1-temp2

These methods are quite fast, comparable to TensorFlow and PyTorch. Another way to accelerate the softmax implementation in Python is CuPy, an open-source array library for GPU-accelerated computing with CUDA.

Here's what the CuPy implementation looks like:

import cupy as cp
def Softmax_cupy(x):
    '''
    Performs the softmax activation on a given set of inputs
    Input: x (N,k) ndarray (N: no. of samples, k: no. of nodes)
    Returns: (N,k) ndarray of probabilities; each row sums to 1
    Note: Works for 2D arrays only (rows for samples, columns for nodes/outputs)
    '''
    max_x = cp.amax(x, 1).reshape(x.shape[0],1)
    e_x = cp.exp(x - max_x) # For stability as it is prone to overflow and underflow
# 	return e_x / e_x.sum(axis=1, keepdims=True) # Alternative 1
    return e_x / e_x.sum(axis=1).reshape((-1, 1)) # Alternative 2


def Softmax_grad_cupy(x): # Best implementation (VERY FAST)
    '''Returns the Jacobian of the softmax function for the given set of inputs.
    Inputs:
    x: should be a 2d array where the rows correspond to the samples
        and the columns correspond to the nodes.
    Returns: jacobian
    '''
    s = Softmax_cupy(x)
    a = cp.eye(s.shape[-1]) # Identity matrix (Kronecker delta)
    temp1 = cp.einsum('ij,jk->ijk', s, a) # s_ij * delta_jk
    temp2 = cp.einsum('ij,ik->ijk', s, s) # s_ij * s_ik
    return temp1 - temp2 # Jacobian[i,j,k] = s_ij * (delta_jk - s_ik)

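A minimal usage sketch, assuming a CUDA-capable GPU and the two CuPy functions above in scope; the array sizes are arbitrary:

import numpy as np
import cupy as cp

x_cpu = np.random.rand(10000, 100).astype(np.float32)
x_gpu = cp.asarray(x_cpu) # copy the host array to the GPU
probs_gpu = Softmax_cupy(x_gpu) # computed on the GPU
jac_gpu = Softmax_grad_cupy(x_gpu)
probs_cpu = cp.asnumpy(probs_gpu) # copy the result back to the host
print(probs_cpu.shape, jac_gpu.shape) # (10000, 100) (10000, 100, 100)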

Using these different methods, you can efficiently implement the softmax activation function in Python.

Example

Assume a neural network that classifies an input image as a dog, cat, tiger, or none of these.

Consider the feature vector to be X = [x1, x2, x3, x4]. In a neural network, the layout is as follows.

[Image: the internal layout of a neural network with a softmax output layer]

For the above neural network, the output-layer vector Z[L] will be:

Z[L] = W[L] · H[L-1] + b[L] = [z1[L], z2[L], z3[L], z4[L]]ᵀ

Where the weight matrix is:

W[L] =
[ w11  w12  w13 ]
[ w21  w22  w23 ]
[ w31  w32  w33 ]
[ w41  w42  w43 ]

Here,

m = total number of nodes in layer L-1
n = total number of nodes in the output layer L

For the above example, we have m = 3 and n = 4.

The following are the calculated activation values at layer L-1:

H[L-1] = [h1, h2, h3]ᵀ

The bias matrix is:

b[L] = [b1, b2, b3, b4]ᵀ
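
To make the shapes concrete, here is a small NumPy sketch. The numeric values are hypothetical placeholders; only the dimensions (m = 3, n = 4) follow the example:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)) # W[L]: n x m weight matrix
H = rng.standard_normal((3, 1)) # H[L-1]: activations of the m nodes in layer L-1
b = rng.standard_normal((4, 1)) # b[L]: one bias per output node

Z = W @ H + b # Z[L]: one logit per output class
print(Z.shape) # (4, 1)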

To apply softmax, we first need the exponential of each element of the Z[L] vector:

exp(z1[L]), exp(z2[L]), exp(z3[L]), exp(z4[L])

And also for:

exp(z1[L]) + exp(z2[L]) + exp(z3[L]) + exp(z4[L]) = Σ exp(zi[L])

Now, let's plug in numeric values to derive the output. Suppose the Z[L] vector takes the following values:

Z[L] = [1.25, 2.44, 0.78, 0.12]ᵀ

With this, the exponential values will be:

exp(1.25) ≈ 3.49, exp(2.44) ≈ 11.47, exp(0.78) ≈ 2.18, exp(0.12) ≈ 1.13

And,

Σ exp(zi[L]) ≈ 3.49 + 11.47 + 2.18 + 1.13 ≈ 18.27

Thus, if we calculate the probability distribution as per the following formula,

P(class i) = exp(zi[L]) / Σ exp(zj[L])

The probability distribution for the above example is then:

P(cat) = 3.49 / 18.27 ≈ 0.191
P(dog) = 11.47 / 18.27 ≈ 0.628
P(tiger) = 2.18 / 18.27 ≈ 0.119
P(none) = 1.13 / 18.27 ≈ 0.062

It is clear from the above values that the input image is that of a dog, since that class has the highest probability.

Implementing code for softmax function

Here's how to implement the above softmax example in Python.

import numpy as np

def softmax(Z_data):
    exp_Z_data = np.exp(Z_data)
    #print("exp(Z_data) = ",exp_Z_data)
    
    sum_of_exp_Z_data = np.sum(exp_Z_data)
    #print("sum of exponentials = ", sum_of_exp_Z_data)
    prob_dist = [exp_Zi/sum_of_exp_Z_data for exp_Zi in exp_Z_data]

    return np.array(prob_dist, dtype=float)
    

Z_data = [1.25,2.44,0.78,0.12] #for cat, dog, tiger, none

p_cat = softmax(Z_data)[0]
print("probability of being cat = ",p_cat)
p_dog = softmax(Z_data)[1]
print("probability of being dog = ",p_dog)
p_tiger = softmax(Z_data)[2]
print("probability of being tiger = ",p_tiger)
p_none = softmax(Z_data)[3]
print("probability of being none = ",p_none)


The code outputs the following probabilities:

  • Probability (Cat) = 0.19101770813831334
  • Probability (Dog) = 0.627890718698843
  • Probability (Tiger) = 0.11938650086860875
  • Probability (None) = 0.061705072294234845

The output clearly shows that the input image is a dog, as the dog class has the highest probability. This matches the result we derived when we manually calculated the probabilities using the mathematical formula for the softmax activation function.

It’s time to get started!

This article covers everything you need to get started with implementing the softmax activation function in Python: its mathematical formulation, several implementations, ways to speed them up, and a worked example showing how the pieces fit together.
