pyspark.pandas.DataFrame.loc

property DataFrame.loc

Access a group of rows and columns by label(s) or a boolean Series.

.loc[] is primarily label based, but may also be used with a conditional boolean Series derived from the DataFrame or Series.

Allowed inputs are:

  • A single label, e.g. 5 or 'a', for column selection. Note that 5 is interpreted as a label of the index, never as an integer position along the index.

  • A list or array of labels, e.g. ['a', 'b', 'c'].

  • A slice object with labels, e.g. 'a':'f'.

  • A conditional boolean Series derived from the DataFrame or Series.

  • A boolean array of the same length as the column axis being sliced, e.g. [True, False, True].

  • A boolean pandas Series that can be aligned to the column axis being sliced. The index of the key is aligned before masking.

Inputs that pandas allows but pandas-on-Spark does not:

  • A boolean array of the same length as the row axis being sliced, e.g. [True, False, True].

  • A callable that takes one argument (the calling Series, DataFrame or Panel) and returns valid output for indexing (one of the above).
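Both excluded forms can usually be rewritten as a conditional boolean Series, which is one of the allowed inputs listed earlier. The sketch below uses plain pandas only, to show that the rewritten form selects the same rows. It assumes pandas is installed, and the names `pdf` and `strong_shield` are illustrative. It does not exercise pandas-on-Spark itself.

```python
import pandas as pd

pdf = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                   index=['cobra', 'viper', 'sidewinder'],
                   columns=['max_speed', 'shield'])

# Hypothetical predicate: any function returning a boolean Series would do.
def strong_shield(frame):
    return frame['shield'] > 6

# pandas accepts the callable directly ...
by_callable = pdf.loc[strong_shield]

# ... but calling it first yields a conditional boolean Series,
# which is the form listed among the allowed inputs.
by_condition = pdf.loc[strong_shield(pdf)]

# A row-length boolean list selects by position in pandas; the same rows
# can be expressed as a condition on the data instead.
by_list = pdf.loc[[False, False, True]]

print(by_callable.equals(by_condition))  # True
print(by_list.equals(by_condition))      # True
```

In pandas-on-Spark code, the `pdf.loc[strong_shield(pdf)]` style is the one to reach for, since it passes a boolean Series derived from the DataFrame.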

Note

MultiIndex is not supported yet.

Note

Contrary to usual Python slices, both the start and the stop are included, and a slice step is not allowed.

Note

With a list or array of labels for row selection, pandas-on-Spark behaves as a filter without reordering by the labels.
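To make the difference concrete, the following sketch shows the pandas side of the comparison. Plain pandas returns rows in the order of the requested labels, whereas a membership filter keeps the original row order, which is the behaviour this note describes for pandas-on-Spark. It assumes pandas is installed; `pdf` and `labels` are illustrative names, and the filter is only an analogy, not the pandas-on-Spark implementation.

```python
import pandas as pd

pdf = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                   index=['cobra', 'viper', 'sidewinder'],
                   columns=['max_speed', 'shield'])
labels = ['sidewinder', 'viper']

# pandas: rows come back in the order of the requested labels.
reordered = pdf.loc[labels]
print(list(reordered.index))  # ['sidewinder', 'viper']

# Filter semantics: matching rows keep their original order.
filtered = pdf[pdf.index.isin(labels)]
print(list(filtered.index))   # ['viper', 'sidewinder']
```

When the label order matters and the result is small, one option is to convert it with `to_pandas()` and apply `.loc[labels]` on the pandas side.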

See also

Series.loc

Access group of values using labels.

Examples

Getting values

>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note that using [[]] returns a DataFrame. Also note that pandas-on-Spark behaves just as a filter, without reordering by the labels.

>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8
>>> df.loc[['sidewinder', 'viper']]
            max_speed  shield
viper               4       5
sidewinder          7       8

Single label for column.

>>> int(df.loc['cobra', 'shield'])
2

List of labels for row.

>>> df.loc[['cobra'], 'shield']
cobra    2
Name: shield, dtype: int64

List of labels for column.

>>> df.loc['cobra', ['shield']]
shield    2
Name: cobra, dtype: int64

List of labels for both row and column.

>>> df.loc[['cobra'], ['shield']]
       shield
cobra       2

Slice with labels for row and single label for column. Note that both the start and stop of the slice are included.

>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64

Conditional that returns a boolean Series.

>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series, with column labels specified.

>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7

A boolean array of the same length as the column axis being sliced.

>>> df.loc[:, [False, True]]
            shield
cobra            2
viper            5
sidewinder       8

A boolean Series that can be aligned to the column axis being sliced.

>>> df.loc[:, pd.Series([False, True], index=['max_speed', 'shield'])]
            shield
cobra            2
viper            5
sidewinder       8

Setting values

Setting value for all items matching the list of labels.

>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50

Setting value for an entire row.

>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50

Setting value for an entire column.

>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50

Setting value for an entire list of columns.

>>> df.loc[:, ['max_speed', 'shield']] = 100
>>> df
            max_speed  shield
cobra             100     100
viper             100     100
sidewinder        100     100

Setting value with a Series.

>>> df.loc[:, 'shield'] = df['shield'] * 2
>>> df
            max_speed  shield
cobra             100     200
viper             100     200
sidewinder        100     200

Getting values on a DataFrame with an index that has integer labels

Another example, using integers for the index.

>>> df = ps.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=[7, 8, 9],
...                   columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8

Slice with integer labels for rows. Note that both the start and stop of the slice are included.

>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8