Pandas groupby() is a method used to group data in Python according to categories and apply functions to the grouped data. It summarizes and aggregates data quickly, which makes the data easier to interpret. When a data science project calls for quick results, the Pandas groupby function is a real help.
To put it simply, any groupby operation performs the following steps on the original data: splitting the data into groups, applying a function to each group, and combining the results.
Let's look at a real-world example to get a clear picture.
Imagine that you are the principal of a school. To decide on raises for the teachers, you want to check how students perform in different classes and lectures by comparing their grades. However, manually sorting the teachers by the subjects they teach and comparing their students' grades would take a lot of time.
If you have a CSV file that lists the grades of all students for every lecture they attended, the data-gathering part of the problem is already solved. If you then use the Pandas groupby function, you can group the data by a single key or by multiple levels and get the result immediately.
This is exactly how the Pandas groupby function cuts down on manual work.
Before we head on to implementing groupby in Python with a practical dataset, here’s a more detailed look at what Python Pandas groupby() is.
groupby() is a powerful Pandas method that enables you to split your data into distinct groups and perform computations on them for better analysis.
The name groupby describes what it does. A groupby operation combines three steps, a pattern often called split-apply-combine. The steps are as follows:
Splitting of data into groups
Applying the function to respective groups separately
Combining the results to form a data structure
Of these steps, the groupby() call itself handles the first one, splitting the data into groups. In the second step, we apply a function to each group; the function can be an aggregation, a transformation, or a filtration. Finally, the results are combined into a single data structure. We will discuss each step in detail in the following sections.
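The three steps can be sketched by hand and compared with the one-line groupby expression. This is a minimal illustration, assuming a four-row subset of the team data used later in the article; the variable names (pieces, means, manual, auto) are chosen here for illustration only.

```python
import pandas as pd

# A small subset of the article's team data, used only for this sketch.
df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Devils"],
    "Points": [876, 863, 789, 673],
})

# Step 1: split the rows into one sub-frame per team.
pieces = {team: group for team, group in df.groupby("Team")}

# Step 2: apply a function (here, the mean of Points) to each piece.
means = {team: group["Points"].mean() for team, group in pieces.items()}

# Step 3: combine the per-group results into a single Series.
manual = pd.Series(means, name="Points").rename_axis("Team")

# The one-line groupby expression performs the same three steps.
auto = df.groupby("Team")["Points"].mean()

print(manual)
print(auto)
```

Both printed Series should show the same per-team means, which is the point of the comparison: groupby bundles the split, apply, and combine steps into a single expression.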
Latest version alert
Before learning how to use Pandas groupby in Python, make sure you are working with a recent version of Pandas. The following commands, chosen according to your OS, create a virtual environment and install the latest Pandas release into it.
1. For Windows PowerShell
PS> python -m venv venv
PS> venv\Scripts\Activate.ps1
(venv) PS> python -m pip install pandas
2. For Linux + macOS
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ python -m pip install pandas
Once you have downloaded the .zip file of datasets, unzip it to a folder called groupby-data/ in your current directory. Ensure that the directory tree matches the one given below.
./
│
└── groupby-data/
    │
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv
Once you have Pandas installed with the virtual environment activated and datasets downloaded, you are all set to start using the groupby function for grouping data in Python.
Note: Pandas version 0.20.1, released in May 2017, changed the aggregation and grouping APIs.
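Since the note above names a specific release, it can help to confirm which version is actually installed. The following is a minimal sketch, assuming Pandas is importable in the active environment; the (0, 20) threshold simply mirrors the version mentioned in the note and is not an official requirement.

```python
import pandas as pd

# pd.__version__ holds the installed version as a string.
print(pd.__version__)

# Compare the major and minor parts numerically rather than as text,
# so that "0.9" is not mistaken for something newer than "0.20".
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
print((major, minor) >= (0, 20))
```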
Now, let's move to practical examples that show how Pandas groupby works and how you can group data in Python.
We will start by building an example dataset and then perform operations on it.
# import the pandas library
import pandas as pd
# Note: 'kings' (lowercase) is kept as in the original data; it forms its own group later on
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
print(df)
The output received from the above dataset will be as follows. With a recent Pandas release on Python 3, the columns appear in the order in which they were defined:
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
Now that we have an organized dataset, we can start implementing the various operations included in the Python Pandas groupby function.
1. Split into groups
Let’s begin with splitting the dataset into groups. You can do so in the following ways.
obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key, axis=1)  # column-wise grouping; the axis argument is deprecated in recent Pandas releases
Here’s an example of the same.
# import the pandas library
import pandas as pd
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team'))
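Printing the result does not show the grouped rows. groupby() returns a GroupBy object that defers the actual work until you apply a function, so the output shows only the object's type and memory address (the address will differ on your machine).
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa46a977e50>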
To view the group categories in which the data has been grouped, proceed with the following code.
# import the pandas library
import pandas as pd
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)
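The above code produces a mapping from each group name to the row labels that belong to it. Note that 'Kings' and 'kings' are treated as separate groups because group keys are case-sensitive. The output below comes from an older Pandas release; newer releases print the same mapping more compactly, with each value shown as a list of row labels such as [4, 6, 7].
{'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Devils': Int64Index([2, 3], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}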
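Beyond inspecting the .groups mapping, the groups themselves can be looked at directly. The sketch below is one way to do that, assuming a shortened version of the same team data; it iterates over the groups and pulls out a single group with get_group.

```python
import pandas as pd

# Shortened version of the article's data, for illustration.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings"],
    "Points": [876, 789, 863, 673, 741, 812],
})

grouped = df.groupby("Team")

# Iterating yields (group name, sub-DataFrame) pairs.
for name, group in grouped:
    print(name, len(group))

# get_group returns the rows of a single group as a DataFrame.
riders = grouped.get_group("Riders")
print(riders)

# Group names are case-sensitive: "Kings" and "kings" are separate groups.
print(grouped.ngroups)
```

If the lowercase 'kings' entry is a data-entry slip rather than a distinct team, normalizing the column before grouping (for example with df["Team"].str.title()) would merge the two groups.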
If you wish to group by multiple columns, here's how you can proceed with the Pandas groupby function.
# import the pandas library
import pandas as pd
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
print(df.groupby(['Team', 'Year']).groups)
Here, Team and Year are passed together as the grouping keys, so each group is identified by a (Team, Year) pair. Here's what you will receive as an output:
{('Kings', 2014): Int64Index([4], dtype='int64'),
('Royals', 2014): Int64Index([9], dtype='int64'),
('Riders', 2014): Int64Index([0], dtype='int64'),
('Riders', 2015): Int64Index([1], dtype='int64'),
('Kings', 2016): Int64Index([6], dtype='int64'),
('Riders', 2016): Int64Index([8], dtype='int64'),
('Riders', 2017): Int64Index([11], dtype='int64'),
('Devils', 2014): Int64Index([2], dtype='int64'),
('Devils', 2015): Int64Index([3], dtype='int64'),
('kings', 2015): Int64Index([5], dtype='int64'),
('Royals', 2015): Int64Index([10], dtype='int64'),
('Kings', 2017): Int64Index([7], dtype='int64')}
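Grouping by two keys becomes more useful once a function is applied to the groups. The sketch below assumes a small made-up dataset (the extra Riders 2014 row with 100 points is hypothetical, added so that one group has more than one row) and shows the resulting two-level index as well as the as_index=False alternative.

```python
import pandas as pd

# Illustrative data; the last row is invented so that the
# ("Riders", 2014) group contains two rows.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Riders"],
    "Year": [2014, 2015, 2014, 2015, 2014],
    "Points": [876, 789, 863, 673, 100],
})

# Grouping by two columns gives a result indexed by (Team, Year).
totals = df.groupby(["Team", "Year"])["Points"].sum()
print(totals)

# A single group is addressed with a tuple of key values.
print(totals.loc[("Riders", 2014)])

# as_index=False keeps the keys as ordinary columns instead.
flat = df.groupby(["Team", "Year"], as_index=False)["Points"].sum()
print(flat)
```

The flat form is often easier to export or merge, while the indexed form is convenient for lookups by key.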
2. Applying the function
Aggregation/agg(): This function computes a summary statistic for every group, such as the mean, count, or sum. It is also known as a reduction, because it reduces each group to a single value.
Transformation/transform(): This function performs group-specific computations and returns a like-indexed object. The result has the same index and shape as the input, but with transformed values.
Filtration/filter(): This function discards entire groups that fail a condition, for example groups with only a few members or groups whose sum or mean falls below a threshold. It returns a subset of the original data frame.
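The three kinds of function described above can be compared side by side. This is a minimal sketch, assuming a five-row subset of the team data; the specific choices (centering on the group mean, keeping groups with at least two rows) are illustrative, not the only way to use these methods.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Royals"],
    "Points": [876, 789, 863, 673, 701],
})
grouped = df.groupby("Team")["Points"]

# Aggregation: one row per group, one column per summary statistic.
summary = grouped.agg(["mean", "count", "sum"])
print(summary)

# Transformation: same length and index as df; here each value becomes
# its difference from the mean of its own group.
centered = grouped.transform(lambda s: s - s.mean())
print(centered)

# Filtration: keep only the rows of groups with at least two members.
kept = df.groupby("Team").filter(lambda g: len(g) >= 2)
print(kept)
```

Note the different shapes: summary has one row per team, centered has one value per original row, and kept has the original columns with the single-row Royals group removed.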
3. Output received
You will receive outputs based on the functions you applied to the groups in the second step. If you want to apply a function of your own choice to the groups, the apply() method of the GroupBy object lets you do that: it calls your function once per group and combines the results.
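As a rough illustration of apply() on grouped data, the sketch below computes a custom statistic per team. The helper points_range is a hypothetical function written for this example, and the data is a four-row subset of the article's dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Points": [876, 789, 863, 673],
})


def points_range(points):
    """Spread between the best and worst score in one group."""
    return points.max() - points.min()


# Selecting the Points column first means the function receives one
# Series per team rather than a whole sub-DataFrame.
spread = df.groupby("Team")["Points"].apply(points_range)
print(spread)
```

For simple statistics like this one, agg() with the same function would work as well; apply() is the more general option when the function needs to return something other than a single value.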
When you pass multiple group keys to the Pandas groupby function, only rows that share the same combination of key values are placed in the same group. This is a practical point to keep in mind when using groupby. Practice with this function on your own data to put what you have learned to use.
Summary:
With a clear understanding of the points above, the Pandas groupby function should now feel approachable. It is widely used in data analysis for its ability to transform, aggregate, and filter the data in each group. So, what are you waiting for? Group large amounts of data, perform operations on these groups, and then format and present the results.
FAQ
1. How do you use the Pandas groupby function?
You call it as groupby() on a data frame. Here's how you can get started with it.
Use groupby() and apply() to:
Find max values
Find relative frequencies
Perform custom calculations and more.
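Each of the three uses listed above can be sketched in a line or two. The example assumes a small subset of the team data; the "share of all points" calculation is an invented example of a custom computation.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Riders"],
    "Rank": [1, 2, 2, 3, 2],
    "Points": [876, 789, 863, 673, 694],
})

# Maximum Points per team.
best = df.groupby("Team")["Points"].max()
print(best)

# Relative frequency of each Rank within a team (sums to 1 per team).
rank_share = df.groupby("Team")["Rank"].value_counts(normalize=True)
print(rank_share)

# Custom calculation: each team's share of all points scored.
total_points = df["Points"].sum()
share = df.groupby("Team")["Points"].apply(lambda s: s.sum() / total_points)
print(share)
```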
2. What is the groupby function in Python?
The Pandas groupby function is a powerful tool that allows you to split your data into separate groups and perform computations on them for detailed analysis.
3. What does groupby return in Python?
Calling groupby() returns a GroupBy object, not the aggregated data itself. Applying an aggregation function to that object returns one aggregated value for each group, and several different aggregation operations can be performed on the same GroupBy object once it is created.
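The difference between the GroupBy object and its aggregated results can be checked by looking at the types involved. A minimal sketch, assuming a three-row subset of the team data:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils"],
    "Points": [876, 789, 863],
})

# groupby() alone returns a GroupBy object; nothing is computed yet.
grouped = df.groupby("Team")
print(type(grouped).__name__)

# A single aggregation on one column gives a Series, one value per group.
total = grouped["Points"].sum()
print(type(total).__name__)

# Several aggregations at once give a DataFrame, one column per statistic.
stats = grouped["Points"].agg(["min", "max"])
print(type(stats).__name__)
```

The same grouped object is reused for both aggregations, which is what the answer above means by performing several operations once the object is created.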
4. How do you split data into groups in Python?
To split data into groups in Python, follow these steps.
Step 1: Split the data into groups by creating a groupby object from the raw data frame.
Step 2: Apply a function, such as an aggregate function that calculates a summary statistic.
Step 3: Combine the results in a new data frame.