
Pandas DataFrame - corr() function



The Pandas DataFrame corr() function computes the pairwise correlation of the DataFrame's columns and returns the result as a correlation matrix. Missing (NA/null) values are excluded pair by pair: for each pair of columns, only the rows in which both values are present are used in the calculation.

Syntax

DataFrame.corr(method='pearson', min_periods=1)

Parameters

method Optional. Specify method of correlation. Default is 'pearson'. Possible values are:
  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation
  • callable : callable with two 1d ndarrays as input and returning a float. Please note that the returned correlation matrix will have 1 along the diagonals and will be symmetric regardless of the callable's behavior.
min_periods Optional. An int specifying the minimum number of observations required per pair of columns to produce a valid result. Pairs with fewer observations yield NaN. Default is 1.
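The parameters above can be tried out with a short sketch. It assumes the same sample data used in the examples on this page. The helper abs_pearson is a hypothetical callable written for illustration only and is not part of pandas.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame({
    "GDP": [1.02, 1.03, 1.04, 0.98],
    "GNP": [1.05, 0.99, np.nan, 1.04],
    "HDI": [1.02, 1.01, 1.02, 1.03],
}, index=["Q1", "Q2", "Q3", "Q4"])

# Rank-based correlation instead of the default Pearson coefficient
spearman = report.corr(method="spearman")

# Hypothetical custom callable: receives two 1d ndarrays, returns a float
def abs_pearson(x, y):
    return abs(np.corrcoef(x, y)[0, 1])

custom = report.corr(method=abs_pearson)

# Require 4 observations per pair; pairs involving GNP only have 3
strict = report.corr(min_periods=4)

print(spearman, "\n")
print(custom, "\n")
print(strict)
```

With min_periods=4, every entry that involves GNP is expected to be NaN, because GNP has only three non-missing values, while the GDP/HDI entry is unchanged.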

Return Value

Returns a DataFrame containing the correlation matrix of the columns. Both its row index and its column labels are the column names of the original DataFrame.

Example: Creating a correlation matrix using whole DataFrame

In the example below, a DataFrame named report is created. The corr() function is then used to build a correlation matrix from all of its columns, which are all numeric.

import pandas as pd
import numpy as np

report = pd.DataFrame({
  "GDP": [1.02, 1.03, 1.04, 0.98],
  "GNP": [1.05, 0.99, np.nan, 1.04],
  "HDI": [1.02, 1.01, 1.02, 1.03]},
  index= ["Q1", "Q2", "Q3", "Q4"]
)

print(report,"\n")
print(report.corr())

The output of the above code will be:

     GDP   GNP   HDI
Q1  1.02  1.05  1.02
Q2  1.03  0.99  1.01
Q3  1.04   NaN  1.02
Q4  0.98  1.04  1.03 

          GDP       GNP       HDI
GDP  1.000000 -0.529107 -0.776151
GNP -0.529107  1.000000  0.777714
HDI -0.776151  0.777714  1.000000
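The GNP column has a missing value in Q3, so it is worth checking how that row is handled. The sketch below, which assumes the same report data, compares corr() with a manual calculation on the rows where both GDP and GNP are present, and with a matrix computed after dropping every incomplete row. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame({
    "GDP": [1.02, 1.03, 1.04, 0.98],
    "GNP": [1.05, 0.99, np.nan, 1.04],
    "HDI": [1.02, 1.01, 1.02, 1.03],
}, index=["Q1", "Q2", "Q3", "Q4"])

# corr() drops missing values separately for each pair of columns
pairwise = report.corr()

# Manual check for GDP/GNP: keep only rows where both values exist
both = report[["GDP", "GNP"]].dropna()
manual_gdp_gnp = np.corrcoef(both["GDP"], both["GNP"])[0, 1]

# Dropping incomplete rows first removes Q3 for every pair
listwise = report.dropna().corr()

print("GDP/GNP pairwise:", pairwise.loc["GDP", "GNP"])
print("GDP/GNP manual:  ", manual_gdp_gnp)
print("GDP/HDI pairwise:", pairwise.loc["GDP", "HDI"])
print("GDP/HDI listwise:", listwise.loc["GDP", "HDI"])
```

The GDP/GNP values should agree, while the GDP/HDI values should differ, since corr() still uses Q3 for pairs that do not involve GNP.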

Example: Creating a correlation matrix using selected columns

Instead of using the whole DataFrame, the corr() function can be applied to a subset of columns by selecting them first. Consider the following example.

import pandas as pd
import numpy as np

report = pd.DataFrame({
  "GDP": [1.02, 1.03, 1.04, 0.98],
  "GNP": [1.05, 0.99, np.nan, 1.04],
  "HDI": [1.02, 1.01, 1.02, 1.03],
  "Agriculture": [1.02, 1.02, 0.99, 0.98]},
  index= ["Q1", "Q2", "Q3", "Q4"]
)

# Display the DataFrame
print(report, "\n")

# Correlation matrix using two columns
print("report[['GDP', 'HDI']].corr() returns:")
print(report[['GDP', 'HDI']].corr(), "\n")

# Correlation matrix using three columns
print("report[['GDP', 'HDI', 'Agriculture']].corr() returns:")
print(report[['GDP', 'HDI', 'Agriculture']].corr(), "\n")

The output of the above code will be:

     GDP   GNP   HDI  Agriculture
Q1  1.02  1.05  1.02         1.02
Q2  1.03  0.99  1.01         1.02
Q3  1.04   NaN  1.02         0.99
Q4  0.98  1.04  1.03         0.98 

report[['GDP', 'HDI']].corr() returns:
          GDP       HDI
GDP  1.000000 -0.776151
HDI -0.776151  1.000000 

report[['GDP', 'HDI', 'Agriculture']].corr() returns:
                  GDP       HDI  Agriculture
GDP          1.000000 -0.776151     0.507212
HDI         -0.776151  1.000000    -0.792118
Agriculture  0.507212 -0.792118     1.000000 
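When only one coefficient is needed, it can be read from the matrix by label or computed directly from two columns. The following is a minimal sketch, assuming the same sample values as above. It uses Series.corr(), which computes the correlation between two Series.

```python
import pandas as pd

report = pd.DataFrame({
    "GDP": [1.02, 1.03, 1.04, 0.98],
    "HDI": [1.02, 1.01, 1.02, 1.03],
    "Agriculture": [1.02, 1.02, 0.99, 0.98],
}, index=["Q1", "Q2", "Q3", "Q4"])

matrix = report[["GDP", "HDI", "Agriculture"]].corr()

# Read a single coefficient from the matrix by row and column label
from_matrix = matrix.loc["GDP", "HDI"]

# Compute the same coefficient directly from the two columns
direct = report["GDP"].corr(report["HDI"])

print(from_matrix)
print(direct)
```

Both approaches should give the same GDP/HDI value shown in the matrices above.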
