Using Pandas Groupby for Grouping Data in Python

Pandas groupby() is a method used to group data in Python according to categories and apply functions to each category. It summarizes and aggregates data quickly, which makes the data easier to interpret. When you need quick results from a data science project, the Pandas groupby function comes as a blessing.

Put simply, a groupby operation performs three steps on the original data: splitting the data into groups, applying a function to each group, and combining the results into a new data structure.

Let's walk through a real-life example to get a clear picture before looking at the technical details.

Real-life analogy for a basic understanding of Pandas groupby

Imagine that you are the principal of a school. Before offering raises to the teachers, you decide to check how students perform in different classes and lectures by comparing their grades. However, manually sorting the teachers by the subjects they teach and comparing the students' grades to reach a final result would take a lot of time.

If you have a CSV file that includes the grades of all the students for every lecture they attended, most of the problem is solved at the outset. If you then use the Pandas groupby() method, you can group the data by one column or by several (multilevel grouping) and get the result immediately.
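
As a rough sketch of the analogy, the snippet below averages grades per lecture. The column names and values are made up for illustration; in practice the frame would be loaded from the CSV file with pd.read_csv().

```python
import pandas as pd

# Hypothetical grade records standing in for the school's CSV file
grades = pd.DataFrame({
    "lecture": ["Math", "Math", "Physics", "Physics", "History", "History"],
    "student": ["Asha", "Ben", "Asha", "Ben", "Asha", "Ben"],
    "grade": [82, 74, 91, 65, 70, 88],
})

# Group the rows by lecture and compute the average grade in each group
avg_by_lecture = grades.groupby("lecture")["grade"].mean()
print(avg_by_lecture)
```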

This is exactly how the Pandas groupby function cuts down the manual work: it automates both the grouping and the comparison.

Before we head on to implementing groupby in Python with a practical dataset, here’s a more detailed look at what Python Pandas groupby() is.

Introduction to Pandas groupby

Groupby is a powerful Pandas method that enables you to split your data into distinct groups and perform computations on them for better analysis.

The name groupby describes what it does. A group-by operation combines three steps, often called split-apply-combine:

  • Splitting of data into groups

  • Applying the function to respective groups separately

  • Combining the results to form a data structure

Of these steps, groupby() itself performs only the first: splitting the data into groups. In the second step, we apply a function to each group; the function can be an aggregation, a transformation, or a filtration. Finally, the results are combined into a single data structure. We will discuss each step in detail in the next parts of the article.
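
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single groupby() call. The small data frame and the variable names (pieces, sums, manual) are illustrative assumptions, not part of the article's dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Devils", "Kings"],
    "Points": [876, 863, 789, 673, 741],
})

# Split: one sub-frame per distinct Team value
pieces = {team: df[df["Team"] == team] for team in df["Team"].unique()}

# Apply: compute a statistic for each piece separately
sums = {team: piece["Points"].sum() for team, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name="Points").sort_index()

# groupby() expresses the same three steps in one line
grouped = df.groupby("Team")["Points"].sum()

# Both approaches produce the same totals per team
print(manual.to_dict() == grouped.to_dict())
```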

Latest version alert

Before learning how to use Pandas groupby in Python, make sure a recent version of Pandas is installed. The commands below create a virtual environment, activate it, and install the latest Pandas release; pick the set that matches your OS.

1. For Windows PowerShell

PS> python -m venv venv
PS> venv\Scripts\Activate.ps1
(venv) PS> python -m pip install pandas

2. For Linux + macOS

$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ python -m pip install pandas
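
The commands above install Pandas but do not report which version you ended up with. One simple way to check, from a Python session inside the activated environment, is:

```python
import pandas as pd

# Print the version of pandas that is importable in the active environment
print(pd.__version__)
```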

Once you have downloaded the .zip file, unzip it to a folder called groupby-data/ in your current directory. Ensure that the directory tree matches the one given below.

./
│
└── groupby-data/
    │
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv

Once you have Pandas installed with the virtual environment activated and datasets downloaded, you are all set to start using the groupby function for grouping data in Python.

Note: Pandas version 0.20.1, released in May 2017, changed the aggregation and grouping APIs.

Now, let’s move to practical examples which can help in understanding how Pandas Groupby works and how you can group data in Python.

Creating a data frame object with groupby

Let's take an example to understand how the entire process works and how to execute operations with it!

# import the pandas library
import pandas as pd

# sample data: one row per team per season
# note: 'kings' (lowercase) is a different value from 'Kings'
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df)

With Python 3.7+ and a recent Pandas release, the columns appear in the order they were defined, and the output will be:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

Now that we have an organized dataset, we can start implementing the various operations included in the Python Pandas groupby function.

Implementing operations using Pandas groupby

1. Split into groups

Let’s begin with splitting the dataset into groups. You can do so in the following ways.

  • obj.groupby('key')

  • obj.groupby(['key1','key2'])

  • obj.groupby(key, axis=1) (groups columns instead of rows; the axis argument is deprecated since Pandas 2.1)

Here’s an example of the same.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team'))

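Printing the result of groupby() does not show the groups; it shows only the GroupBy object, similar to the line below. The module path and memory address vary by Pandas version and from run to run.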

<pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>

To view the groups into which the data has been split, use the groups attribute:

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team').groups)

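The above code prints a dictionary that maps each group name to the row labels belonging to that group. Note that 'kings' and 'Kings' are separate groups because grouping is case-sensitive. The output below comes from an older Pandas release; newer releases display the row labels differently.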

{'Kings': Int64Index([4, 6, 7], dtype='int64'),
'Devils': Int64Index([2, 3], dtype='int64'),
'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
'Royals': Int64Index([9, 10], dtype='int64'),
'kings' : Int64Index([5], dtype='int64')}
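
Beyond the groups attribute, it can help to look at the rows inside each group. The sketch below, using the same dataset, iterates over the GroupBy object and pulls out one group with get_group(); the variable names grouped and kings are illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')

# Iterating yields (group name, sub-DataFrame) pairs
for name, group in grouped:
    print(name, len(group))

# get_group() returns only the rows that belong to one group
kings = grouped.get_group('Kings')
print(kings)
```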

To group by more than one column, pass a list of column names to groupby():

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print(df.groupby(['Team', 'Year']).groups)

Here, Team and Year together form the grouping key, so each group is identified by a (Team, Year) tuple. Here’s what you will receive as an output:

{('Kings', 2014): Int64Index([4], dtype='int64'),
('Royals', 2014): Int64Index([9], dtype='int64'),
('Riders', 2014): Int64Index([0], dtype='int64'),
('Riders', 2015): Int64Index([1], dtype='int64'),
('Kings', 2016): Int64Index([6], dtype='int64'),
('Riders', 2016): Int64Index([8], dtype='int64'),
('Riders', 2017): Int64Index([11], dtype='int64'),
('Devils', 2014): Int64Index([2], dtype='int64'),
('Devils', 2015): Int64Index([3], dtype='int64'),
('kings', 2015): Int64Index([5], dtype='int64'),
('Royals', 2015): Int64Index([10], dtype='int64'),
('Kings', 2017): Int64Index([7], dtype='int64')}

2. Applying the function

  • Aggregation/agg(): This function computes a summary statistic, such as mean, count, or sum, for every group. It is also known as a reduction because it reduces each group to a single value.

  • Transformation/transform(): This function performs group-specific computations and returns a like-indexed object: the values change, but the result has the same index and shape as the input.

  • Filtration/filter(): This function discards entire groups that fail a condition, for example groups with only a few members or groups whose sum or mean falls below a threshold. It returns a subset of the original data frame.
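
The three kinds of functions can be sketched on the same dataset as follows. The specific statistics and the three-row threshold are arbitrary choices made for illustration.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')

# Aggregation: one row of summary statistics per team
summary = grouped['Points'].agg(['mean', 'sum', 'count'])
print(summary)

# Transformation: each row's points minus its team's mean;
# the result has the same index and length as df
centered = grouped['Points'].transform(lambda s: s - s.mean())
print(centered)

# Filtration: keep only rows from teams with at least three records
frequent = grouped.filter(lambda g: len(g) >= 3)
print(frequent)
```

Note that agg() shrinks the data to one row per group, transform() keeps every original row, and filter() keeps or drops whole groups.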

3. Output received

You will receive outputs based on the functions you applied to the groups in the second step. If you want to apply a function of your own choice to each group, the groupby apply() method lets you do that: it calls the function on each group and combines the results.
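
As a small illustration of apply(), the sketch below computes the spread between each team's best and worst points total. The spread calculation is a made-up example of a custom function, not something prescribed by Pandas.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Custom calculation: highest minus lowest points within each team
spread = df.groupby('Team')['Points'].apply(lambda s: s.max() - s.min())
print(spread)
```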

When you pass multiple group keys to the Pandas groupby function, only rows whose values match on every key are placed in the same group. This is a practical tip to note when using groupby. Apply what you’ve learned and practice this function on your own data.
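
To see how multiple keys behave, the sketch below groups by Team and Rank together. Rows land in the same group only when both values match, so the three Riders rows with rank 2 are summed together while the Riders row with rank 1 stays separate. The names totals and flat are illustrative.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# One group per distinct (Team, Rank) combination
totals = df.groupby(['Team', 'Rank'])['Points'].sum()
print(totals)

# The result is indexed by both keys; select one group with a tuple
print(totals.loc[('Riders', 2)])

# reset_index() turns the keys back into ordinary columns
flat = totals.reset_index()
print(flat)
```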

Summary:

With the points above in hand, the Pandas groupby function should now feel straightforward. It is widely used in data analysis for its ability to transform, aggregate, and filter the data in each group. So, what are you waiting for? Split large amounts of data into groups, perform operations on those groups, and then format and present the results.

FAQ

1. How do you use the Pandas groupby function?

Call groupby() on a data frame with the column or columns to group by. Here’s how you can get started with it.

Use groupby() and apply() to

  • Find max values

  • Find relative frequencies

  • Perform custom calculations and more.
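
The first two uses in the list above might look like the sketch below on the article's dataset. It uses the built-in max() and value_counts() group methods rather than apply(), which is one possible approach among several.

```python
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Max values: the highest points total recorded for each team
max_points = df.groupby('Team')['Points'].max()
print(max_points)

# Relative frequencies: the share of each rank within each team
rank_freq = df.groupby('Team')['Rank'].value_counts(normalize=True)
print(rank_freq)
```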

2. What is the groupby function in Python?

The groupby function in Pandas is a powerful method that allows you to split your data into separate groups and perform computations on each group for detailed analysis.

3. What does groupby return in Python?

Pandas groupby() returns a GroupBy object, not a computed result. Calling an aggregation on that object returns one aggregated value for each group, and several different aggregation operations can be performed on the grouped data once the GroupBy object is created.

4. How do you split data into groups in Python?

To split data into groups in Python, follow these steps.

Step 1: Split the data into groups by creating a groupby object from the raw data frame.
Step 2: Apply a function, such as an aggregate function that calculates a summary statistic.
Step 3: Combine the results in a new data frame.
