pyspark.pandas.Series.corr
- Series.corr(other, method='pearson', min_periods=None)
Compute correlation with other Series, excluding missing values.
New in version 3.3.0.
- Parameters
- other : Series
- method : {'pearson', 'spearman', 'kendall'}
pearson : standard correlation coefficient
spearman : Spearman rank correlation
kendall : Kendall Tau correlation coefficient
Changed in version 3.4.0: added support for 'kendall' as a method.
- min_periods : int, optional
Minimum number of observations needed to have a valid result.
New in version 3.4.0.
- Returns
- correlation : float
Notes
The complexity of the Kendall correlation is O(#rows * #rows); if the dataset is too large, sampling ahead of the correlation computation is recommended.
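To make the quadratic cost concrete, here is a minimal pure-Python sketch of the pairwise Kendall Tau computation (tau-a, ignoring ties for simplicity) together with the sampling idea the note recommends. The helper names `kendall_tau` and `sampled_kendall_tau` are hypothetical illustrations, not part of the pyspark.pandas API:

```python
import random

def kendall_tau(x, y):
    """Tau-a Kendall correlation: compares every pair of rows,
    hence O(n^2) work in the number of rows. Ties are ignored."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def sampled_kendall_tau(x, y, k, seed=0):
    """Estimate tau on a random subset of k rows, shrinking the
    O(n^2) cost to O(k^2) at the price of some estimation error."""
    idx = random.Random(seed).sample(range(len(x)), k)
    return kendall_tau([x[i] for i in idx], [y[i] for i in idx])
```

In pyspark.pandas, the analogous pre-sampling step would be something like `df.sample(frac=0.01)` before calling `corr(method='kendall')`.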
Examples
>>> df = ps.DataFrame({'s1': [.2, .0, .6, .2],
...                    's2': [.3, .6, .0, .1]})
>>> s1 = df.s1
>>> s2 = df.s2

>>> s1.corr(s2, method='pearson')
-0.85106...

>>> s1.corr(s2, method='spearman')
-0.94868...

>>> s1.corr(s2, method='kendall')
-0.91287...

>>> s1 = ps.Series([1, np.nan, 2, 1, 1, 2, 3])
>>> s2 = ps.Series([3, 4, 1, 1, 5])

>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="pearson")
-0.52223...

>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="spearman")
-0.54433...

>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="kendall")
-0.51639...

>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="kendall", min_periods=5)
nan