pyspark.pandas.DataFrame.rank

DataFrame.rank(method='average', ascending=True, numeric_only=False, axis=0)

Compute numerical data ranks (1 through n) along the given axis. By default, equal values are assigned a rank that is the average of the ranks of those values.

Note

The current implementation of rank uses Spark’s Window without a partition specification. This moves all data into a single partition on a single machine and could cause serious performance degradation. Avoid this method with very large datasets.

Parameters:

method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’

average: average rank of group
min: lowest rank in group
max: highest rank in group
first: ranks assigned in the order the values appear in the array
dense: like ‘min’, but rank always increases by 1 between groups
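
The five tie-breaking rules above can be illustrated with a small plain-Python sketch. This is not how pyspark.pandas computes ranks (that goes through Spark's Window); `rank_values` is a hypothetical helper written only to show what each `method` value means for a single column, ranked in ascending order.

```python
def rank_values(values, method="average"):
    """Rank values ascending (1 through n), resolving ties by `method`.

    Hypothetical helper for illustration; not part of pyspark.pandas.
    """
    # Indices of the values in sorted order. sorted() is stable, so tied
    # values keep their original order of appearance.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    dense = 0  # counts distinct groups seen so far
    start = 0
    while start < len(order):
        # Extend `end` to the last position holding the same value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        dense += 1
        for offset, idx in enumerate(order[start:end + 1]):
            if method == "average":
                # Mean of the 1-based positions the group occupies.
                ranks[idx] = (start + 1 + end + 1) / 2
            elif method == "min":
                ranks[idx] = float(start + 1)
            elif method == "max":
                ranks[idx] = float(end + 1)
            elif method == "first":
                ranks[idx] = float(start + 1 + offset)
            elif method == "dense":
                ranks[idx] = float(dense)
            else:
                raise ValueError(f"unknown method: {method!r}")
        start = end + 1
    return ranks


column_a = [1, 2, 2, 3]
for name in ("average", "min", "max", "first", "dense"):
    print(name, rank_values(column_a, name))
```

For the column `[1, 2, 2, 3]`, the two tied values occupy positions 2 and 3, so only their ranks differ between methods, plus the rank of the value that follows them under ‘dense’.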

ascending : bool, default True

If False, rank from high to low: the highest value receives rank 1 and the lowest receives rank N.
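
As a rough illustration of the ranking direction, the plain-Python sketch below computes average-method ranks in either order. `rank_column` is a hypothetical helper, not a pyspark.pandas function; it only mirrors the semantics described for `ascending`.

```python
def rank_column(values, ascending=True):
    """Average-method ranks for one column.

    Hypothetical helper for illustration; not part of pyspark.pandas.
    """
    ranks = []
    for v in values:
        # Count the values that sort strictly ahead of v in the chosen direction.
        if ascending:
            before = sum(1 for other in values if other < v)
        else:
            before = sum(1 for other in values if other > v)
        ties = sum(1 for other in values if other == v)
        # The tied group occupies positions before+1 .. before+ties;
        # the mean of those positions is before + (ties + 1) / 2.
        ranks.append(before + (ties + 1) / 2)
    return ranks


print(rank_column([1, 2, 2, 3], ascending=True))   # low to high
print(rank_column([1, 2, 2, 3], ascending=False))  # high to low
```

Tied values keep the same averaged rank in both directions; only the untied values swap ends.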

numeric_only : bool, default False

For DataFrame objects, rank only numeric columns if set to True.

Changed in version 4.0.0: The default value of numeric_only is now False.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Axis along which to rank:

0 or ‘index’: rank each column independently
1 or ‘columns’: rank each row independently (across columns)

Note

For axis=1, a pandas UDF is used, which may add performance overhead for very wide DataFrames (100+ columns).
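
The difference between the two axes can be sketched in plain Python on a small column-oriented table. This says nothing about the Spark execution path; `average_ranks` and `rank_table` are hypothetical helpers used only to show which values are compared with each other under each `axis` setting.

```python
def average_ranks(values):
    """Ascending ranks with ties averaged (hypothetical helper)."""
    ranks = []
    for v in values:
        smaller = sum(1 for other in values if other < v)
        ties = sum(1 for other in values if other == v)
        ranks.append(smaller + (ties + 1) / 2)
    return ranks


def rank_table(data, axis=0):
    """Rank a dict of equal-length columns along the given axis (hypothetical helper)."""
    names = list(data)
    if axis == 0:
        # Each column is ranked on its own, top to bottom.
        return {name: average_ranks(data[name]) for name in names}
    # axis=1: each row is ranked on its own, across the columns.
    n_rows = len(data[names[0]])
    result = {name: [] for name in names}
    for i in range(n_rows):
        row_ranks = average_ranks([data[name][i] for name in names])
        for name, rank in zip(names, row_ranks):
            result[name].append(rank)
    return result


data = {"A": [1, 2, 2, 3], "B": [4, 3, 2, 1]}
print(rank_table(data, axis=0))
print(rank_table(data, axis=1))
```

With `axis=1`, every row holds only two values here, so the ranks are 1.0 and 2.0, or 1.5 and 1.5 where the row's values are equal.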

Returns:

ranks : same type as caller

Examples

>>> df = ps.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 3, 2, 1]}, columns=['A', 'B'])
>>> df
   A  B
0  1  4
1  2  3
2  2  2
3  3  1

>>> df.rank().sort_index()
     A    B
0  1.0  4.0
1  2.5  3.0
2  2.5  2.0
3  4.0  1.0

If method is set to ‘min’, tied values receive the lowest rank in their group.

>>> df.rank(method='min').sort_index()
     A    B
0  1.0  4.0
1  2.0  3.0
2  2.0  2.0
3  4.0  1.0

If method is set to ‘max’, tied values receive the highest rank in their group.

>>> df.rank(method='max').sort_index()
     A    B
0  1.0  4.0
1  3.0  3.0
2  3.0  2.0
3  4.0  1.0

If method is set to ‘dense’, no gaps are left between groups.

>>> df.rank(method='dense').sort_index()
     A    B
0  1.0  4.0
1  2.0  3.0
2  2.0  2.0
3  3.0  1.0

If numeric_only is set to True, only numeric columns are ranked.

>>> df = ps.DataFrame({'A': [1, 2, 2, 3], 'B': ['a', 'b', 'd', 'c']}, columns=['A', 'B'])
>>> df
   A  B
0  1  a
1  2  b
2  2  d
3  3  c
>>> df.rank(numeric_only=True)
     A
0  1.0
1  2.5
2  2.5
3  4.0

Rank across columns with axis=1:

>>> df = ps.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 3, 2, 1]}, columns=['A', 'B'])
>>> df.rank(axis=1).sort_index()
     A    B
0  1.0  2.0
1  1.0  2.0
2  1.5  1.5
3  2.0  1.0