
Pandas DataFrame - drop_duplicates() function



The Pandas DataFrame drop_duplicates() function returns a DataFrame with duplicate rows removed.

Syntax

DataFrame.drop_duplicates(subset=None, keep='first', 
                          inplace=False, ignore_index=False)

Parameters

subset Optional. A column label or sequence of labels used to identify duplicates. By default, all columns are used.
keep Optional. Determines which duplicates (if any) to keep. Possible values are:
  • first : (Default) Drop duplicates except for the first occurrence
  • last : Drop duplicates except for the last occurrence
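  • False : Drop every row that has a duplicate, keeping none of the occurrences
inplace Optional. If True, modify the DataFrame in place and return None. If False (default), return a new DataFrame and leave the original unchanged.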
ignore_index Optional. If True, the resulting axis will be labeled 0, 1, …, n - 1.
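
As a minimal sketch of the ignore_index parameter described above, the snippet below compares the index labels with and without it. The DataFrame is the same sample data used in the examples on this page; the variable names kept and renumbered are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# default (ignore_index=False): surviving rows keep their original labels
kept = df.drop_duplicates()

# ignore_index=True: the result is relabeled 0, 1, ..., n - 1
renumbered = df.drop_duplicates(ignore_index=True)

print(kept.index.tolist())        # [0, 2, 3, 4]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Relabeling is useful when later code relies on positional labels. Keeping the original labels is useful when the rows need to be traced back to the source DataFrame.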

Return Value

Returns a DataFrame with duplicates removed, or None if inplace=True.
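
The two return behaviours can be checked with a short sketch like the one below. It reuses the sample data from the examples on this page; the names result and returned are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# inplace=False (default): df is untouched and a new DataFrame is returned
result = df.drop_duplicates()
rows_before = len(df)   # still 5 rows in df at this point

# inplace=True: df itself is modified and the call returns None
returned = df.drop_duplicates(inplace=True)

print(len(result))   # 4
print(returned)      # None
print(len(df))       # 4
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None. Use either the assignment form or inplace=True, not both.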

Example: using drop_duplicates()

In the example below, a DataFrame df is created. The drop_duplicates() function is used to drop duplicate rows from this DataFrame.

import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

#displaying the dataframe
print(df,"\n")

#removes duplicate rows based on all columns
print("df.drop_duplicates() returns:")
print(df.drop_duplicates(),"\n")

The output of the above code will be:

   Name  Age Country
0  John   25      UK
1  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN 

df.drop_duplicates() returns:
   Name  Age Country
0  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN 

Example: using subset parameter

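By using the subset parameter, we can specify which columns are used to identify duplicate rows in the DataFrame. Rows that match on those columns are treated as duplicates even if they differ in other columns. Consider the example below: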

import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

#displaying the dataframe
print(df,"\n")

#using 'Name' and 'Age' columns
#to identify duplicate rows
print("df.drop_duplicates(subset=['Name', 'Age']) returns:")
print(df.drop_duplicates(subset=['Name', 'Age']),"\n")

The output of the above code will be:

   Name  Age Country
0  John   25      UK
1  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN 

df.drop_duplicates(subset=['Name', 'Age']) returns:
   Name  Age Country
0  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA 
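
To see which rows a given subset treats as duplicates before dropping them, one option is the related DataFrame.duplicated() method, which accepts the same subset and keep arguments and returns a boolean mask. This is a sketch for inspection only and is not part of drop_duplicates() itself; the variable name mask is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# True marks a row that repeats an earlier row on 'Name' and 'Age'
mask = df.duplicated(subset=["Name", "Age"])
print(mask.tolist())  # [False, True, False, False, True]

# filtering out the marked rows gives the same result as drop_duplicates()
manual = df[~mask]
print(manual.equals(df.drop_duplicates(subset=["Name", "Age"])))  # True
```

Rows 1 and 4 are marked because they repeat the Name and Age values of rows 0 and 3, even though row 4 has a different Country.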

Example: using keep parameter

By using the keep parameter, we can specify which of the duplicate rows to keep. Consider the example below:

import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

#displaying the dataframe
print(df,"\n")

#keeping the first occurrence of each duplicate
print("df.drop_duplicates(subset=['Name', 'Age'], keep='first') returns:")
print(df.drop_duplicates(subset=['Name', 'Age'], keep='first'),"\n")

#keeping the last occurrence of each duplicate
print("df.drop_duplicates(subset=['Name', 'Age'], keep='last') returns:")
print(df.drop_duplicates(subset=['Name', 'Age'], keep='last'),"\n")

The output of the above code will be:

   Name  Age Country
0  John   25      UK
1  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN 

df.drop_duplicates(subset=['Name', 'Age'], keep='first') returns:
   Name  Age Country
0  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA 

df.drop_duplicates(subset=['Name', 'Age'], keep='last') returns:
   Name  Age Country
1  John   25      UK
2   Kim   25     USA
4   Kim   30     JPN 
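
The third option, keep=False, is not covered by the example above. A minimal sketch with the same sample data follows; the variable name unique_only is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
  "Name": ["John", "John", "Kim", "Kim", "Kim"],
  "Age": [25, 25, 25, 30, 30],
  "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# keep=False: drop every row whose 'Name' and 'Age' values occur more than once
unique_only = df.drop_duplicates(subset=["Name", "Age"], keep=False)

print(unique_only.index.tolist())         # [2]
print(unique_only["Country"].tolist())    # ['USA']
```

Only the row at index 2 remains, because (Kim, 25) is the only Name and Age combination that appears exactly once. Rows 0 and 1 and rows 3 and 4 are dropped entirely.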
