pyspark.pandas.DataFrame.drop_duplicates

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Parameters
subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates; by default, use all of the columns.

keep : {'first', 'last', False}, default 'first'

Determines which duplicates (if any) to keep.

- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy.

ignore_index : boolean, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns
DataFrame

DataFrame with duplicates removed or None if inplace=True.

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame(
...     {'a': [1, 2, 2, 2, 3], 'b': ['a', 'a', 'a', 'c', 'd']},
...     columns=['a', 'b'])
>>> df
   a  b
0  1  a
1  2  a
2  2  a
3  2  c
4  3  d
>>> df.drop_duplicates().sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d
>>> df.drop_duplicates(ignore_index=True).sort_index()
   a  b
0  1  a
1  2  a
2  2  c
3  3  d
>>> df.drop_duplicates('a').sort_index()
   a  b
0  1  a
1  2  a
4  3  d
>>> df.drop_duplicates(['a', 'b']).sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d
>>> df.drop_duplicates(keep='last').sort_index()
   a  b
0  1  a
2  2  a
3  2  c
4  3  d
>>> df.drop_duplicates(keep=False).sort_index()
   a  b
0  1  a
3  2  c
4  3  d
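
The subset and keep parameters can be combined, as in pandas. A sketch keeping the last occurrence for each value of column 'a' (expected output assumes pandas-style semantics):

>>> df.drop_duplicates('a', keep='last').sort_index()
   a  b
0  1  a
3  2  c
4  3  d

With inplace=True the method returns None and modifies the calling DataFrame itself (see Returns above); a minimal sketch, reusing the df defined earlier:

>>> df.drop_duplicates(inplace=True)
>>> df.sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d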