Python

How to handle duplicated rows in pandas.DataFrame

How to handle duplicated row in pandas.DataFrame

There may be duplicated rows in pandas DataFrame. I'd like to find and delete them.


pandas.DataFrame is useful to handle table format data.

Sometimes DataFrame has duplicated data.

In some cases, duplicated data is useless. So we would like to find and delete them.

But how can we find duplicated rows ?

So today I will introduce "How to handle duplicated rows in pandas.DataFrame".

Author


Mid-carieer engineer (AI, system). Good at Python and SQL.

Advantage to read

You can understand "How to handle duplicated rows in pandas.DataFrame". After that you will be good at data handling.


How to handle duplicated rows in pandas.DataFrame

So how can we handle duplicated rows in pandas.DataFrame ?

There are mothods like following.

Handle duplication

  • Find duplication: duplicated()
  • Delete duplication: drop_duplicates()

I will explain about them with using following data.

DATA

import pandas as pd

data_list1 = [
["a",12,100],
["a",12,100],
["c",12,90],
["d",13,85],
["d",13,85],
["e",14,95]
]
col_list1 = ["id","age","score"]
df1 = pd.DataFrame(data=data_list1, columns=col_list1)
print(df1)

#   id  age  score
# 0  a   12    100
# 1  a   12    100
# 2  c   12     90
# 3  d   13     85
# 4  d   13     85
# 5  e   14     95



Find duplication: duplicated()

In order to find duplication, we can use duplicated().

duplicated() function returns boolean value of each row.

Among duplicated rows, it considers first row as original, and considers others as duplicated.

SAMPLE

print(df1.duplicated())

# 0    False
# 1     True
# 2    False
# 3    False
# 4     True
# 5    False
# dtype: bool


Masking by boolean values, we can extract duplicated rows.

print(df1[df1.duplicated()])

#   id  age  score
# 1  a   12    100
# 4  d   13     85



"keep" remaining rows

If we use duplicated() without parameters, it considers first row as False and other row as True(duplicated).

If we set keep="last", it considers last row as False(not duplicated).
(Its default is keep="first".)

SAMPLE

print(df1.duplicated(keep="last"))

# 0     True
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# dtype: bool

print(df1[df1.duplicated(keep="last")])

#   id  age  score
# 0  a   12    100
# 3  d   13     85


In addition, keep=False considers both first and last rows as `True(dupllicated)'.

print(df1.duplicated(keep=False))

# 0     True
# 1     True
# 2    False
# 3     True
# 4     True
# 5    False
# dtype: bool

print(df1[df1.duplicated(keep=False)])

#   id  age  score
# 0  a   12    100
# 1  a   12    100
# 3  d   13     85
# 4  d   13     85



"subset" considers certain columns to find duplication

duplicated() considers duplicated row if rows have same values in all columns.

With using subset=["column name"] parameter, it use certain columns to find duplication.

SAMPLE

print(df1.duplicated(subset=["age"]))

# 0    False
# 1     True
# 2     True
# 3    False
# 4     True
# 5    False
# dtype: bool

print(df1[df1.duplicated(subset=["age"])])

#   id  age  score
# 1  a   12    100
# 2  c   12     90
# 4  d   13     85



Delete duplication: drop_duplicates()

When you want to delete duplicated rows, you can use drop_duplicates().

SAMPLE

print(df1.drop_duplicates())

#   id  age  score
# 0  a   12    100
# 2  c   12     90
# 3  d   13     85
# 5  e   14     95


It works same with masking reversed boolean of duplicated().

print(df1[~df1.duplicated()])

#   id  age  score
# 0  a   12    100
# 2  c   12     90
# 3  d   13     85
# 5  e   14     95


In addition, drop_duplicates() can also use keep and subset parameters.



"inplace" updates original DataFrame

drop_duplicates() returns another DataFrame.

If you want to update source DataFrame, you can use inplace=True.

SAMPLE

df1.drop_duplicates()
print(df1)

#   id  age  score
# 0  a   12    100
# 1  a   12    100
# 2  c   12     90
# 3  d   13     85
# 4  d   13     85
# 5  e   14     95

df1.drop_duplicates(inplace=True)
print(df1)

#   id  age  score
# 0  a   12    100
# 2  c   12     90
# 3  d   13     85
# 5  e   14     95



Conclusion

Today I explained about "How to handle duplicated rows in pandas.DataFrame".

In rder to find and delete duplicated rows, there are mothods like following.

Handle duplication

  • Find duplication: duplicated()
  • Delete duplication: drop_duplicates()

Each of them can can consider original row by keep="first" or keep="last" parameters.

And with using subset=["column name"] parameter, it use certain columns to find duplication.


They are very useful for data handling.




There are more articles about pandas.DataFrame.

If you are interested in them, please read them.

If you felt this article is useful, please share.

にほんブログ村 IT技術ブログへ

-Python
-

© 2024 ITips