
Pandas - DataFrame Aggregations



The Pandas DataFrame aggregate() function performs aggregation using one or more operations over the specified axis. Its syntax is given below.

Note: The agg() function is an alias for the aggregate() function.

Syntax

DataFrame.aggregate(func=None, axis=0, *args, **kwargs)
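As a quick check of the alias noted above, the following sketch calls both names on a small DataFrame and compares the results. The column values are made up for illustration and are not taken from the examples on this page.

```python
import pandas as pd

# Hypothetical fixed values, chosen only so the result is easy to verify by hand.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

via_aggregate = df.aggregate('sum')
via_agg = df.agg('sum')

# Both calls produce the same Series of column sums.
print(via_agg.equals(via_aggregate))
```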

Parameters

func Required. Specify the function used for aggregating the data. A function must work either when passed a DataFrame or when passed to DataFrame.apply. Accepted combinations are:
  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dictionary of axis labels -> functions, function names or list of such.
axis Optional. Specify the axis along which the function is applied. Default is 0. If 0 or 'index', the function is applied to each column. If 1 or 'columns', the function is applied to each row.
*args, **kwargs Optional. Positional and keyword arguments passed on to func.
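The effect of the axis parameter can be seen in a minimal sketch. It assumes a small DataFrame with fixed, made-up values so that the sums are easy to check by hand.

```python
import pandas as pd

# Hypothetical fixed values used only for illustration.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# axis=0 (the default): the function is applied to each column.
by_column = df.aggregate('sum', axis=0)

# axis=1: the function is applied to each row.
by_row = df.aggregate('sum', axis=1)

print(by_column)   # one value per column label
print(by_row)      # one value per row label
```

With axis=0 the result is indexed by column labels; with axis=1 it is indexed by the row labels of the DataFrame.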

Return Value

Returns one of the following:

  • Scalar when Series.aggregate is called with a single function.
  • Series when DataFrame.aggregate is called with a single function.
  • DataFrame when DataFrame.aggregate is called with multiple functions.
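The three cases listed above can be checked with a short sketch. The DataFrame and its values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical fixed values used only for illustration.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# Series.aggregate with a single function returns a scalar.
scalar_result = df['col1'].aggregate('sum')

# DataFrame.aggregate with a single function returns a Series.
series_result = df.aggregate('sum')

# DataFrame.aggregate with a list of functions returns a DataFrame.
frame_result = df.aggregate(['sum', 'mean'])

print(type(scalar_result).__name__)
print(type(series_result).__name__)
print(type(frame_result).__name__)
```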

Example: using aggregate() on the whole DataFrame

In the example below, a DataFrame df is created. The aggregate() function is applied to the whole DataFrame to calculate the sum of each column.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index = pd.date_range('1/1/2018', periods=5),
   columns = ['col1', 'col2', 'col3'])

print("The DataFrame contains:")
print(df)

print("\nAggregation returns:")
# Sum the values in each column (axis=0 is the default)
print(df.aggregate(np.sum))

The output of the above code will be similar to the following (the values differ on each run because the data is random):

The DataFrame contains:
                col1      col2      col3
2018-01-01 -0.687624  0.831343  0.369147
2018-01-02 -0.196517  1.979898 -1.000479
2018-01-03  0.258959  1.040191  0.001425
2018-01-04  0.630665 -0.739803  0.875488
2018-01-05  0.082997 -0.826209  1.453134

Aggregation returns:
col1    0.088481
col2    2.285421
col3    1.698715
dtype: float64

Example: using multiple operations on the whole DataFrame

Multiple operations can be applied to a DataFrame at the same time. In the example below, three operations (sum, mean and average) are applied together. The string 'average' is resolved to numpy.average, which without weights gives the same values as mean.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index = pd.date_range('1/1/2018', periods=5),
   columns = ['col1', 'col2', 'col3'])

print("The DataFrame is:")
print(df)

print("\nAggregation returns:")
# Apply three aggregations at once; the result has one row per function
print(df.aggregate([np.sum, np.mean, 'average']))

The output of the above code will be similar to the following (the values differ on each run because the data is random):

The DataFrame is:
                col1      col2      col3
2018-01-01  0.535302 -0.791378 -0.858626
2018-01-02 -1.465922  0.375763  0.588740
2018-01-03 -0.407567  0.452181  0.687858
2018-01-04  0.327220  0.626945 -2.319354
2018-01-05  0.337624  0.041807  0.278022

Aggregation returns:
             col1      col2      col3
sum     -0.673343  0.705318 -1.623361
mean    -0.134669  0.141064 -0.324672
average -0.134669  0.141064 -0.324672
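The list passed to aggregate() may also mix a built-in function name with a user-defined function. The sketch below assumes a hypothetical helper named value_range (it is not part of pandas) and fixed, made-up data. The row label in the result is taken from the function's name.

```python
import pandas as pd

# Hypothetical fixed values used only for illustration.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

def value_range(series):
    """Hypothetical helper: difference between the largest and smallest value."""
    return series.max() - series.min()

# One row per function: 'sum' and 'value_range'.
result = df.aggregate(['sum', value_range])
print(result)
```

Giving the helper a descriptive name keeps the result readable, since a lambda would be labelled '<lambda>' instead.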

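Example: using aggregate() on selected columns

Instead of the whole DataFrame, the aggregate() function can be applied to selected columns. Selecting a single column gives a Series, so a single function returns a scalar; selecting several columns gives a DataFrame, so a single function returns a Series. Consider the following example.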

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index = pd.date_range('1/1/2018', periods=5),
   columns = ['col1', 'col2', 'col3'])

print("The DataFrame contains:")
print(df)

# Aggregation on a single column (a Series), which returns a scalar
print("\nAggregation on col2 returns:")
print(df['col2'].aggregate(np.sum))

# Aggregation on multiple columns, which returns a Series
print("\nAggregation on col2 and col3 returns:")
print(df[['col2', 'col3']].aggregate(np.sum))

The output of the above code will be similar to the following (the values differ on each run because the data is random):

The DataFrame contains:
                col1      col2      col3
2018-01-01 -0.495941  0.600591 -0.193495
2018-01-02  0.057907  1.990024  1.523120
2018-01-03  0.592138  0.260888 -0.547469
2018-01-04 -0.225838 -1.233463 -0.152349
2018-01-05  0.454969 -0.500580  0.703518

Aggregation on col2 returns:
1.11745945804

Aggregation on col2 and col3 returns:
col2    1.117459
col3    1.333325
dtype: float64
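A selected column can also be aggregated with several functions at once. In that case the result is a Series indexed by function name rather than a scalar. The following is a minimal sketch with made-up values.

```python
import pandas as pd

# Hypothetical fixed values used only for illustration.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# A list of functions applied to one column returns a Series
# whose index holds the function names.
col2_stats = df['col2'].aggregate(['sum', 'mean'])
print(col2_stats)
```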

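Example: using a different operation on each column

It is possible to apply a different operation to each column by passing a dictionary that maps column labels to functions. Columns that are not listed in the dictionary, such as col1 here, are left out of the result. Consider the following example.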

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index = pd.date_range('1/1/2018', periods=5),
   columns = ['col1', 'col2', 'col3'])

print("The DataFrame contains:")
print(df)

# Apply a different operation to each column: sum for col2, average for col3
print("\nAggregation on col2 and col3 returns:")
print(df.aggregate({'col2':np.sum, 'col3':'average'}))

The output of the above code will be similar to the following (the values differ on each run because the data is random):

The DataFrame contains:
                col1      col2      col3
2018-01-01  1.120440 -0.229896 -0.133962
2018-01-02  0.568975 -0.577267  1.605496
2018-01-03 -0.077285 -0.439441  0.763634
2018-01-04 -1.538413  2.900758 -0.848652
2018-01-05 -0.135597  0.477658 -0.108792

Aggregation on col2 and col3 returns:
col2    2.131813
col3    0.255545
dtype: float64
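The dictionary values may also be lists of functions, as mentioned in the description of the func parameter. The sketch below uses made-up data to show that the result is then a DataFrame, with NaN wherever a function was not requested for a column.

```python
import pandas as pd

# Hypothetical fixed values used only for illustration.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# Each column gets its own list of operations.
result = df.aggregate({'col1': ['sum', 'min'], 'col2': ['mean']})
print(result)

# Cells for combinations that were not requested are NaN.
print(pd.isna(result.loc['mean', 'col1']))
```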