pyspark.pandas.DataFrame.drop_duplicates
- DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
- Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- Parameters
- subset : column label or sequence of labels, optional
- Only consider certain columns for identifying duplicates; by default, all columns are used.
- keep : {'first', 'last', False}, default 'first'
- Determines which duplicates (if any) to keep.
  - first: Drop duplicates except for the first occurrence.
  - last: Drop duplicates except for the last occurrence.
  - False: Drop all duplicates.
- inplace : boolean, default False
- Whether to drop duplicates in place or to return a copy. 
- ignore_index : boolean, default False
- If True, the resulting axis will be labeled 0, 1, …, n - 1. 
 
- Returns
- DataFrame
- DataFrame with duplicates removed, or None if inplace=True.
- Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame(
...     {'a': [1, 2, 2, 2, 3], 'b': ['a', 'a', 'a', 'c', 'd']}, columns=['a', 'b'])
>>> df
   a  b
0  1  a
1  2  a
2  2  a
3  2  c
4  3  d

>>> df.drop_duplicates().sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d

>>> df.drop_duplicates(ignore_index=True).sort_index()
   a  b
0  1  a
1  2  a
2  2  c
3  3  d

>>> df.drop_duplicates('a').sort_index()
   a  b
0  1  a
1  2  a
4  3  d

>>> df.drop_duplicates(['a', 'b']).sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d

>>> df.drop_duplicates(keep='last').sort_index()
   a  b
0  1  a
2  2  a
3  2  c
4  3  d

>>> df.drop_duplicates(keep=False).sort_index()
   a  b
0  1  a
3  2  c
4  3  d
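
The `subset` and `keep` rules described under Parameters can be checked without a Spark session using a small plain-Python model. This is only a sketch of the documented semantics, not how pyspark.pandas implements them: `drop_duplicates_model`, `labels`, and the `rows` layout (a list of `(index_label, row_dict)` pairs) are hypothetical names introduced for illustration.

```python
from collections import Counter


def drop_duplicates_model(rows, subset=None, keep="first"):
    """Model the documented `subset` / `keep` behavior in plain Python.

    `rows` is a list of (index_label, row_dict) pairs. This helper is
    hypothetical and is not part of pyspark.pandas.
    """
    if keep is not False and keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last' or False")
    if isinstance(subset, str):
        subset = [subset]  # a single column label is also accepted

    def key(row):
        # Compare on the chosen columns, or on every column by default.
        cols = sorted(row) if subset is None else subset
        return tuple(row[c] for c in cols)

    if keep is False:
        # Drop every row whose key occurs more than once.
        counts = Counter(key(row) for _, row in rows)
        return [(label, row) for label, row in rows if counts[key(row)] == 1]

    # For 'last', scan from the end so the last occurrence is seen first.
    ordered = rows if keep == "first" else list(reversed(rows))
    seen, kept = set(), []
    for label, row in ordered:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append((label, row))
    return kept if keep == "first" else kept[::-1]


def labels(result):
    """Return only the index labels of the surviving rows."""
    return [label for label, _ in result]


# Same data as the `df` used in the examples above.
a = [1, 2, 2, 2, 3]
b = ["a", "a", "a", "c", "d"]
rows = [(i, {"a": x, "b": y}) for i, (x, y) in enumerate(zip(a, b))]

print(labels(drop_duplicates_model(rows)))               # [0, 1, 3, 4]
print(labels(drop_duplicates_model(rows, subset="a")))   # [0, 1, 4]
print(labels(drop_duplicates_model(rows, keep="last")))  # [0, 2, 3, 4]
print(labels(drop_duplicates_model(rows, keep=False)))   # [0, 3, 4]
```

The surviving index labels match the outputs of the corresponding `df.drop_duplicates(...)` calls shown above.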