Groupby aggregation in pandas



Aggregation is one of the most common operations in data analysis. Different technologies offer us a bunch of ways to effectively group and aggregate the fields of interest to us (columns, attributes). This article will talk about the implementation of aggregation in pandas.
In my specialization, I work very little with python, but I often hear about the pros and power of this language, especially when it comes to working with data. Therefore, I will draw here a parallel operation with T-SQL and give some code examples. As the data I will use probably the most popular data set - Irises Fisher .

The first thing that comes to mind is to get the maximum, minimum or average value for any of the iris parameters and group by the species of this plant, which in python using pandas will look something like this:

import pandas as pd

df = pd.read_csv('iris.csv', delimiter = ',')
print(df.groupby('variety').max()[['sepalLength']].to_markdown())

Result:

| variety | sepal.length |
|: ----------- | ---------------: |
| Setosa | 5.8 |
| Versicolor | 7 |
| Virginica | 7.9 |

Or so:

import pandas as pd
df = pd.read_csv('iris.csv', delimiter = ',')

print(df.groupby('variety').sepalLength.agg(
    maxSepalLength  = 'max',
    minSepalLength  = 'min',
    ).to_markdown())

Result:

| variety | maxSepalLength | minSepalLength |
|: ----------- | -----------------: | ----------------- : |
| Setosa | 5.8 | 4.3 |
| Versicolor | 7 | 4.9 |
| Virginica | 7.9 | 4.9 |

Or using lambda expressions:

import pandas as pd
df = pd.read_csv('iris.csv', delimiter = ',')

print(df.groupby('variety').sepalLength.agg([
    lambda x: x.max(), 
    lambda x: x.min()
    ]).to_markdown())

Result:

| variety | <lambda_0> | <lambda_1> |
|: ----------- | -------------: | -------------: |
| Setosa | 5.8 | 4.3 |
| Versicolor | 7 | 4.9 |
| Virginica | 7.9 | 4.9 |

DataFrame Instance Function
to_markdown()
allows you to display a table (DataFrame) in the usual (console) form.

On T-SQL, this operation looks something like this:

select i.Variety, max(i.SepalLength) as maxSepalLength
    from Iris i
        group by i.Variety

Result:

Setosa 5.8
Versicolor 7.0
Virginica 7.9

But suppose now we want to get both the maximum and minimum (if you like the average) values ​​for all parameters of the iris, naturally for each type of plant, the T-SQL code has been generated here:

select
	i.Variety 
	,max(i.SepalLength) as maxSepalLength 
	,min(i.SepalLength) as minSepalLength
	,max(i.SepalWidth) as maxSepalWidth
	,min(i.SepalWidth) as minSepalWidth
	,max(i.PetalLength) as maxPetalLength
	,min(i.PetalLength) as mibPetalLength
	,max(i.PetalWidth) as maxPetalWidth
	,min(i.PetalWidth) as minPetalWidth
from Iris i
	group by i.Variety

Result:

Setosa 5.8 4.3 4.4 2.3 1.9 1.0 0.6 0.1
Versicolor 7.0 4.9 3.4 2.0 5.1 3.0 1.8 1.0
Virginica 7.9 4.9 3.8 2.2 6.9 4.5 2.5 1.4

In pandas, the possibility of group aggregation appeared only in version 0.25.0 of July 18, 2019 (what was done before) ?) and there are several variations, consider them:

import pandas as pd
df = pd.read_csv('iris.csv', delimiter = ',')

df.groupby('variety').agg(
    maxSepalLength = pd.NamedAgg(column = 'sepalLength', aggfunc = 'max'),
    minSepalLength = pd.NamedAgg(column = 'sepalLength', aggfunc = 'min'),
    maxSepalWidth = pd.NamedAgg(column = 'sepalWidth', aggfunc = 'max'),
    minSepalWidth = pd.NamedAgg(column = 'sepalWidth', aggfunc = 'min'),
    maxPetalLength = pd.NamedAgg(column = 'petalLength', aggfunc = 'max'),
    minPetalLength = pd.NamedAgg(column = 'petalLength', aggfunc = 'min'),
    maxPetalWidth = pd.NamedAgg(column = 'petalWidth', aggfunc = 'max'),
    minPetalWidth = pd.NamedAgg(column = 'petalWidth', aggfunc = 'min'),
    )

Result:

Setosa 5.8 4.3 4.4 2.3 1.9 1.0 0.6 0.1
Versicolor 7.0 4.9 3.4 2.0 5.1 3.0 1.8 1.0
Virginica 7.9 4.9 3.8 2.2 6.9 4.5 2.5 1.4

Function
DataFrame.agg(self, func, axis=0, *args, **kwargs)

allows aggregation of several operations on a given axis. As parameters, the function receives ** kwargs (named arguments, see the article on habr for details ), which are a column on which the operation is performed and the name of the aggregation function in single quotes. The recording looks quite voluminous. Move on.

The same solution using lambda expressions looks much more concise and simple:

import pandas as pd
df = pd.read_csv('iris.csv', delimiter = ',')

df.groupby('variety').agg([
    lambda x: x.max(),
    lambda x: x.min()
    ])

Result:

Setosa 5.8 4.3 4.4 2.3 1.9 1.0 0.6 0.1
Versicolor 7.0 4.9 3.4 2.0 5.1 3.0 1.8 1.0
Virginica 7.9 4.9 3.8 2.2 6.9 4.5 2.5 1.4

I often hear about much less written when Python when solving the same type of problem in comparison with other languages. Here, in comparison with T-SQL, one can agree with this, but the clarity and sequence of expressions of linguistic tools such as SQL or T-SQL is completely lost (personal opinion).

Data set and code from the article can be found here

What’s new in 0.25.0 (July 18, 2019)

pandas

All Articles