pandas.DataFrame is useful for handling tabular data.
Sometimes a DataFrame contains duplicated rows. In many cases these duplicates are useless, so we would like to find and delete them.
But how can we find duplicated rows?
Today I will introduce "How to handle duplicated rows in pandas.DataFrame".
Advantages of reading
You will understand how to handle duplicated rows in pandas.DataFrame, and become better at data handling.
How to handle duplicated rows in pandas.DataFrame
So how can we handle duplicated rows in pandas.DataFrame?
There are methods like the following.
Handle duplication
- Find duplication:
duplicated()
- Delete duplication:
drop_duplicates()
I will explain them using the following data.
DATA
import pandas as pd

data_list1 = [
    ["a", 12, 100],
    ["a", 12, 100],
    ["c", 12, 90],
    ["d", 13, 85],
    ["d", 13, 85],
    ["e", 14, 95],
]
col_list1 = ["id", "age", "score"]
df1 = pd.DataFrame(data=data_list1, columns=col_list1)
print(df1)
#   id  age  score
# 0  a   12    100
# 1  a   12    100
# 2  c   12     90
# 3  d   13     85
# 4  d   13     85
# 5  e   14     95
Find duplication: duplicated()
In order to find duplication, we can use duplicated().
The duplicated() method returns a boolean value for each row.
Among duplicated rows, it considers the first row as the original and the others as duplicates.
SAMPLE
print(df1.duplicated())
# 0    False
# 1     True
# 2    False
# 3    False
# 4     True
# 5    False
# dtype: bool
By masking with these boolean values, we can extract the duplicated rows.
print(df1[df1.duplicated()])
#   id  age  score
# 1  a   12    100
# 4  d   13     85
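Since duplicated() returns a boolean Series, you can also count the duplicated rows directly by summing it (True counts as 1). A minimal sketch, assuming the same df1 as above:

```python
import pandas as pd

# Same data as in the article
data_list1 = [["a", 12, 100], ["a", 12, 100], ["c", 12, 90],
              ["d", 13, 85], ["d", 13, 85], ["e", 14, 95]]
df1 = pd.DataFrame(data=data_list1, columns=["id", "age", "score"])

# Summing the boolean Series counts the duplicated rows
print(df1.duplicated().sum())  # 2
```

This is a quick way to check whether a DataFrame has any duplicates before deciding how to treat them.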
"keep" remaining rows
If we use duplicated()
without parameters, it considers first row as False
and other row as True(duplicated)
.
If we set keep="last"
, it considers last row as False(not duplicated)
.
(Its default is keep="first"
.)
SAMPLE
print(df1.duplicated(keep="last"))
# 0     True
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# dtype: bool

print(df1[df1.duplicated(keep="last")])
#   id  age  score
# 0  a   12    100
# 3  d   13     85
In addition, keep=False marks every row of a duplicate group as True (duplicated).
print(df1.duplicated(keep=False))
# 0     True
# 1     True
# 2    False
# 3     True
# 4     True
# 5    False
# dtype: bool

print(df1[df1.duplicated(keep=False)])
#   id  age  score
# 0  a   12    100
# 1  a   12    100
# 3  d   13     85
# 4  d   13     85
"subset" considers certain columns to find duplication
duplicated()
considers duplicated row if rows have same values in all columns.
With using subset=["column name"]
parameter, it use certain columns to find duplication.
SAMPLE
print(df1.duplicated(subset=["age"]))
# 0    False
# 1     True
# 2     True
# 3    False
# 4     True
# 5    False
# dtype: bool

print(df1[df1.duplicated(subset=["age"])])
#   id  age  score
# 1  a   12    100
# 2  c   12     90
# 4  d   13     85
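The sample above passes a single column, but subset also accepts a list of several columns; a row then counts as duplicated only when all of the listed columns match. A small sketch with the same df1 as above:

```python
import pandas as pd

# Same data as in the article
data_list1 = [["a", 12, 100], ["a", 12, 100], ["c", 12, 90],
              ["d", 13, 85], ["d", 13, 85], ["e", 14, 95]]
df1 = pd.DataFrame(data=data_list1, columns=["id", "age", "score"])

# Row 2 ("c", 12, 90) shares "age" with rows 0 and 1, but not "score",
# so it is no longer marked as duplicated
print(df1[df1.duplicated(subset=["age", "score"])])
#   id  age  score
# 1  a   12    100
# 4  d   13     85
```

Compare this with the single-column result above: row 2 drops out because its "score" differs.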
Delete duplication: drop_duplicates()
When you want to delete duplicated rows, you can use drop_duplicates().
SAMPLE
print(df1.drop_duplicates())
#   id  age  score
# 0  a   12    100
# 2  c   12     90
# 3  d   13     85
# 5  e   14     95
This is equivalent to masking with the inverted boolean values of duplicated().
print(df1[~df1.duplicated()])
#   id  age  score
# 0  a   12    100
# 2  c   12     90
# 3  d   13     85
# 5  e   14     95
In addition, drop_duplicates() also accepts the keep and subset parameters.
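For example, combining both parameters might look like this (a sketch using the same df1 as above):

```python
import pandas as pd

# Same data as in the article
data_list1 = [["a", 12, 100], ["a", 12, 100], ["c", 12, 90],
              ["d", 13, 85], ["d", 13, 85], ["e", 14, 95]]
df1 = pd.DataFrame(data=data_list1, columns=["id", "age", "score"])

# subset=["age"]: compare only the "age" column
# keep="last": keep the last row of each duplicate group
print(df1.drop_duplicates(subset=["age"], keep="last"))
#   id  age  score
# 2  c   12     90
# 4  d   13     85
# 5  e   14     95
```

Note that the surviving rows keep their original index labels (2, 4, 5 here).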
"inplace" updates original DataFrame
drop_duplicates()
returns another DataFrame.
If you want to update source DataFrame, you can use inplace=True
.
SAMPLE
df1.drop_duplicates()
print(df1)
#   id  age  score
# 0  a   12    100
# 1  a   12    100
# 2  c   12     90
# 3  d   13     85
# 4  d   13     85
# 5  e   14     95

df1.drop_duplicates(inplace=True)
print(df1)
#   id  age  score
# 0  a   12    100
# 2  c   12     90
# 3  d   13     85
# 5  e   14     95
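As a side note, instead of inplace=True you can also simply reassign the returned DataFrame, and pandas (1.0 and later) additionally offers ignore_index=True to renumber the resulting index from 0. A small sketch:

```python
import pandas as pd

# Same data as in the article
data_list1 = [["a", 12, 100], ["a", 12, 100], ["c", 12, 90],
              ["d", 13, 85], ["d", 13, 85], ["e", 14, 95]]
df1 = pd.DataFrame(data=data_list1, columns=["id", "age", "score"])

# Reassigning the result is an alternative to inplace=True
df2 = df1.drop_duplicates()

# ignore_index=True renumbers the surviving rows 0..n-1
df3 = df1.drop_duplicates(ignore_index=True)
print(df3)
#   id  age  score
# 0  a   12    100
# 1  c   12     90
# 2  d   13     85
# 3  e   14     95
```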
Conclusion
Today I explained how to handle duplicated rows in pandas.DataFrame.
In order to find and delete duplicated rows, there are methods like the following.
Handle duplication
- Find duplication:
duplicated()
- Delete duplication:
drop_duplicates()
Each of them can choose which row to treat as the original with the keep="first" or keep="last" parameter.
And by using the subset=["column name"] parameter, they use only the specified columns to find duplication.
There are more articles about pandas.DataFrame.
If you are interested, please check them out.