
Pandas - DataFrame Statistical Functions



Pandas provides a number of statistical functions that help in understanding and analyzing the behavior of data. This section discusses a few of them.

Function       Description
pct_change()   Returns percentage change between the current and a prior element.
cov()          Computes pairwise covariance of columns, excluding NA/null values.
corr()         Computes pairwise correlation of columns, excluding NA/null values.
rank()         Computes numerical data ranks (1 through n) along axis.

Let's discuss these functions in detail:

Percentage Change

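The Pandas DataFrame pct_change() function computes the percentage change between each element and a prior element (by default, the immediately preceding one). The result is a fraction rather than a value multiplied by 100, so 0.5 means a 50% increase. This is useful for measuring relative change from one step to the next in a time series.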

Syntax

DataFrame.pct_change(periods=1, fill_method='pad', limit=None, freq=None)

Parameters

periods Optional. Specify the period to shift for calculating percent change. Default: 1
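fill_method Optional. Specify how to handle NAs before computing percent changes. Default: 'pad'. It can take values from {'backfill', 'bfill', 'pad', 'ffill', None}.
  • pad / ffill: use last valid observation to fill gap.
  • backfill / bfill: use next valid observation to fill gap.
  • None: do not fill; NAs propagate into the result.
Note that pandas 2.1 and later deprecate every value other than None for this parameter. In those versions, fill missing values explicitly (for example with ffill() or bfill()) before calling pct_change().
limit Optional. Specify the number of consecutive NAs to fill before stopping. Default is None. This parameter is deprecated together with fill_method in pandas 2.1 and later.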
freq Optional. A DateOffset, timedelta, or str to specify increment to use from time series API (e.g. 'M' or BDay()). Default is None.

Example:

In the example below, a DataFrame df is created. The pct_change() function is used to calculate the percentage change of elements of all numerical columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({
  "GDP": [1.5, 2.5, 3.5, 1.5, 2.5, -1],
  "GNP": [1, 2, 3, 3, 2, -1],
  "HPI": [2, 3, 2, np.nan, 2, 2]},
  index= ["2015", "2016", "2017", 
          "2018", "2019", "2020"]
)

print("The DataFrame is:")
print(df)

#percentage change of element with period = 1
print("\ndf.pct_change() returns:")
print(df.pct_change())

#percentage change of element with period = 2
print("\ndf.pct_change(periods=2) returns:")
print(df.pct_change(periods=2))

The output of the above code will be:

The DataFrame is:
      GDP  GNP  HPI
2015  1.5    1  2.0
2016  2.5    2  3.0
2017  3.5    3  2.0
2018  1.5    3  NaN
2019  2.5    2  2.0
2020 -1.0   -1  2.0

df.pct_change() returns:
           GDP       GNP       HPI
2015       NaN       NaN       NaN
2016  0.666667  1.000000  0.500000
2017  0.400000  0.500000 -0.333333
2018 -0.571429  0.000000  0.000000
2019  0.666667 -0.333333  0.000000
2020 -1.400000 -1.500000  0.000000

df.pct_change(periods=2) returns:
           GDP       GNP       HPI
2015       NaN       NaN       NaN
2016       NaN       NaN       NaN
2017  1.333333  2.000000  0.000000
2018 -0.400000  0.500000 -0.333333
2019 -0.285714 -0.333333  0.000000
2020 -1.666667 -1.333333  0.000000

Example: using axis=1

To calculate the percentage change row-wise (that is, across columns within each row), pass axis=1. Consider the example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({
  "2015": [1.5, 1, 2],
  "2016": [2.5, 2, 3],
  "2017": [3.5, 3, 2],
  "2018": [1.5, 3, np.nan],
  "2019": [2.5, 2, 2],
  "2020": [-1, -1, 2]},
  index= ["GDP", "GNP", "HDI"]
)

print("The DataFrame is:")
print(df)

#percentage change of element with period = 1
print("\ndf.pct_change(axis=1) returns:")
print(df.pct_change(axis=1))

#percentage change of element with period = 2
print("\ndf.pct_change(axis=1, periods=2) returns:")
print(df.pct_change(axis=1, periods=2))

The output of the above code will be:

The DataFrame is:
     2015  2016  2017  2018  2019  2020
GDP   1.5   2.5   3.5   1.5   2.5    -1
GNP   1.0   2.0   3.0   3.0   2.0    -1
HDI   2.0   3.0   2.0   NaN   2.0     2

df.pct_change(axis=1) returns:
     2015      2016      2017      2018      2019  2020
GDP   NaN  0.666667  0.400000 -0.571429  0.666667  -1.4
GNP   NaN  1.000000  0.500000  0.000000 -0.333333  -1.5
HDI   NaN  0.500000 -0.333333  0.000000  0.000000   0.0

df.pct_change(axis=1, periods=2) returns:
     2015  2016      2017      2018      2019      2020
GDP   NaN   NaN  1.333333 -0.400000 -0.285714 -1.666667
GNP   NaN   NaN  2.000000  0.500000 -0.333333 -1.333333
HDI   NaN   NaN  0.000000 -0.333333  0.000000  0.000000
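
As a rough illustration of what pct_change() computes, the sketch below rebuilds the result by hand as the current value divided by the shifted value, minus 1. The small DataFrame and the hpi Series are made up for this illustration. The last part assumes that forward-filling with ffill() before the call reproduces the 'pad' behavior described in the parameter list. It passes fill_method=None so that pct_change() itself does no filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration (no missing values)
df = pd.DataFrame(
    {"GDP": [1.5, 2.5, 3.5, 1.5], "GNP": [1.0, 2.0, 3.0, 3.0]},
    index=["2015", "2016", "2017", "2018"],
)

# Manual percentage change: current value divided by the value
# `periods` rows earlier, minus 1
manual_1 = df / df.shift(1) - 1
manual_2 = df / df.shift(2) - 1

print(np.allclose(manual_1, df.pct_change(), equal_nan=True))
print(np.allclose(manual_2, df.pct_change(periods=2), equal_nan=True))

# Row-wise change is the column-wise change of the transposed frame
row_wise = df.pct_change(axis=1)
print(np.allclose(row_wise, df.T.pct_change().T, equal_nan=True))

# Explicit handling of a missing value: forward-fill first, then
# compute the change without any implicit filling
hpi = pd.Series([2, 3, 2, np.nan, 2, 2],
                index=["2015", "2016", "2017", "2018", "2019", "2020"])
filled_change = hpi.ffill().pct_change(fill_method=None)
print(filled_change)
```

Filling explicitly keeps the treatment of missing values visible in the code instead of relying on a default that differs between pandas versions.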

Covariance

The Pandas DataFrame cov() function computes the pairwise covariance of columns and returns the covariance matrix as a DataFrame. NA/null values are excluded automatically: for each pair of columns, only the rows where both values are present are used.

Syntax

DataFrame.cov(min_periods=None, ddof=1)

Parameters

min_periods Optional. An int to specify minimum number of observations required per pair of columns to have a valid result. Default is None.
ddof Optional. Specify Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. Default is 1.

Example:

In the example below, a DataFrame report is created. The cov() function is used to create a covariance matrix using all numeric columns of the DataFrame.

import pandas as pd
import numpy as np

report = pd.DataFrame({
  "GDP": [1.02, 1.03, 1.04, 0.98],
  "GNP": [1.05, 0.99, np.nan, 1.04],
  "HDI": [1.02, 1.01, 1.02, 1.03]},
  index= ["Q1", "Q2", "Q3", "Q4"]
)

print(report,"\n")
print(report.cov())

The output of the above code will be:

     GDP   GNP   HDI
Q1  1.02  1.05  1.02
Q2  1.03  0.99  1.01
Q3  1.04   NaN  1.02
Q4  0.98  1.04  1.03 

          GDP       GNP       HDI
GDP  0.000692 -0.000450 -0.000167
GNP -0.000450  0.001033  0.000250
HDI -0.000167  0.000250  0.000067
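
To make the pairwise handling of missing values more concrete, the sketch below recomputes one entry of the covariance matrix by hand. It reuses the report DataFrame from the example above. The helper names pair, dev_gdp, dev_gnp, manual_cov and strict are introduced only for this illustration. The final call shows one way min_periods can be used, assuming a threshold of 4 observations per pair.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame(
    {
        "GDP": [1.02, 1.03, 1.04, 0.98],
        "GNP": [1.05, 0.99, np.nan, 1.04],
        "HDI": [1.02, 1.01, 1.02, 1.03],
    },
    index=["Q1", "Q2", "Q3", "Q4"],
)

# Keep only the rows where both GDP and GNP are present (Q3 is dropped)
pair = report[["GDP", "GNP"]].dropna()

# Sample covariance by hand:
# sum of products of deviations from the mean, divided by N - ddof
ddof = 1
dev_gdp = pair["GDP"] - pair["GDP"].mean()
dev_gnp = pair["GNP"] - pair["GNP"].mean()
manual_cov = (dev_gdp * dev_gnp).sum() / (len(pair) - ddof)

print(manual_cov)
print(report.cov().loc["GDP", "GNP"])

# Requiring 4 observations per pair: every pair that involves GNP
# has only 3 complete rows, so those entries become NaN
strict = report.cov(min_periods=4)
print(strict)
```

The hand-computed value agrees with the GDP/GNP entry (-0.000450) in the matrix printed above.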

Correlation

The Pandas DataFrame corr() function computes the pairwise correlation of columns and returns the correlation matrix as a DataFrame. NA/null values are excluded automatically: for each pair of columns, only the rows where both values are present are used.

Syntax

DataFrame.corr(method='pearson', min_periods=1)

Parameters

method Optional. Specify method of correlation. Default is 'pearson'. Possible values are:
  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation
  • callable : callable with two 1d ndarrays as input and returning a float. Please note that the returned correlation matrix will have 1 along the diagonals and will be symmetric regardless of the callable's behavior.
min_periods Optional. An int to specify minimum number of observations required per pair of columns to have a valid result. Default is 1.

Example:

In the example below, a DataFrame report is created. The corr() function is used to create a correlation matrix using all numeric columns of the DataFrame.

import pandas as pd
import numpy as np

report = pd.DataFrame({
  "GDP": [1.02, 1.03, 1.04, 0.98],
  "GNP": [1.05, 0.99, np.nan, 1.04],
  "HDI": [1.02, 1.01, 1.02, 1.03]},
  index= ["Q1", "Q2", "Q3", "Q4"]
)

print(report,"\n")
print(report.corr())

The output of the above code will be:

     GDP   GNP   HDI
Q1  1.02  1.05  1.02
Q2  1.03  0.99  1.01
Q3  1.04   NaN  1.02
Q4  0.98  1.04  1.03 

          GDP       GNP       HDI
GDP  1.000000 -0.529107 -0.776151
GNP -0.529107  1.000000  0.777714
HDI -0.776151  0.777714  1.000000
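
The relationship between correlation and covariance can be sketched as follows. The code reuses the report DataFrame from the example above and recomputes the GDP/GNP entry as covariance divided by the product of the standard deviations, using only the rows where both values are present. It also treats the Spearman coefficient as the Pearson correlation of the ranks. The names pair, manual_corr, ranks and spearman_by_hand are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame(
    {
        "GDP": [1.02, 1.03, 1.04, 0.98],
        "GNP": [1.05, 0.99, np.nan, 1.04],
        "HDI": [1.02, 1.01, 1.02, 1.03],
    },
    index=["Q1", "Q2", "Q3", "Q4"],
)

# Rows where both GDP and GNP are present (Q3 is dropped)
pair = report[["GDP", "GNP"]].dropna()

# Pearson correlation = covariance / (std of x * std of y)
manual_corr = pair["GDP"].cov(pair["GNP"]) / (
    pair["GDP"].std() * pair["GNP"].std()
)
print(manual_corr)
print(report.corr().loc["GDP", "GNP"])

# Spearman correlation viewed as the Pearson correlation of the ranks
ranks = pair.rank()
spearman_by_hand = ranks["GDP"].corr(ranks["GNP"])
print(spearman_by_hand)
print(report.corr(method="spearman").loc["GDP", "GNP"])
```

Because correlation divides out the scale of each column, every entry lies between -1 and 1, which makes the matrix easier to compare across columns than the covariance matrix.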

Data Ranking

The Pandas DataFrame rank() function computes numerical data ranks (1 through n) along the specified axis. By default, equal values are assigned the average of the ranks they would otherwise occupy.

Syntax

DataFrame.rank(axis=0, method='average', numeric_only=None, 
               na_option='keep', ascending=True, pct=False)

Parameters

axis Optional. Index to direct ranking. It can be {0 or 'index', 1 or 'columns'}. Default is 0.
method Optional. Specify how to rank the group of records in case of tie:
  • average: average rank of tied group
  • min: lowest rank in the group
  • max: highest rank in the group
  • first: ranks assigned in order they appear in the array
  • dense: like 'min', but rank always increases by 1 between groups
Default is 'average'.
numeric_only Optional. Specify True to rank only numeric columns.
na_option Optional. Specify how to rank NaN values:
  • keep: assign NaN rank to NaN values
  • top: assign lowest rank to NaN values
  • bottom: assign highest rank to NaN values
Default is 'keep'.
ascending Optional. Specify whether or not the elements should be ranked in ascending order. Default is True.
pct Optional. Specify whether or not to display the returned rankings in percentile form. Default is False.

Example:

The example below demonstrates how this function behaves with the above parameters:

  • default_rank: Default behavior, obtained without passing any parameter.
  • max_rank: Obtained with method='max'. Records with equal values all receive the highest rank of the tied group (for example, 'x2' and 'x3' tie for the first and second positions, so both are assigned rank 2).
  • NA_bottom: Obtained with na_option='bottom'. NaN values are placed at the bottom of the ranking, that is, they receive the highest rank.
  • pct_rank: Obtained with pct=True. The ranking is expressed as a percentile rank.

import pandas as pd
import numpy as np

df = pd.DataFrame({
  "values": [20, 10, 10, np.nan, 30]},
  index= ["x1", "x2", "x3", "x4", "x5"]
)

print(df,"\n")

df['default_rank'] = df['values'].rank()
df['max_rank'] = df['values'].rank(method='max')
df['NA_bottom'] = df['values'].rank(na_option='bottom')
df['pct_rank'] = df['values'].rank(pct=True)

print(df,"\n")

The output of the above code will be:

    values
x1    20.0
x2    10.0
x3    10.0
x4     NaN
x5    30.0 

    values  default_rank  max_rank  NA_bottom  pct_rank
x1    20.0           3.0       3.0        3.0     0.750
x2    10.0           1.5       2.0        1.5     0.375
x3    10.0           1.5       2.0        1.5     0.375
x4     NaN           NaN       NaN        5.0       NaN
x5    30.0           4.0       4.0        4.0     1.000
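
The remaining options from the parameter list can be compared in the same way. The sketch below applies the 'min', 'first' and 'dense' methods, na_option='top' and ascending=False to the same values. The column names of the ranks DataFrame and the variable pct_by_hand are chosen only for this illustration. The last lines assume the default 'average' method and show the percentile rank as the rank divided by the number of non-missing values.

```python
import numpy as np
import pandas as pd

values = pd.Series(
    [20, 10, 10, np.nan, 30],
    index=["x1", "x2", "x3", "x4", "x5"],
)

# One column per option, so the tie and NaN handling can be compared
ranks = pd.DataFrame({
    "values": values,
    "min": values.rank(method="min"),            # ties share the lowest rank
    "first": values.rank(method="first"),        # ties ranked by order of appearance
    "dense": values.rank(method="dense"),        # no gaps between rank groups
    "NA_top": values.rank(na_option="top"),      # NaN receives rank 1
    "descending": values.rank(ascending=False),  # largest value gets rank 1
})
print(ranks)

# Percentile rank with the default method: rank / number of valid values
pct_by_hand = values.rank() / values.count()
print(pct_by_hand.equals(values.rank(pct=True)))
```

With 'dense', the ranks for 10, 20 and 30 are 1, 2 and 3, whereas 'min' leaves a gap and yields 1, 3 and 4 because two values tie for the first position.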