KMeansModel

class pyspark.mllib.clustering.KMeansModel(centers)

A clustering model derived from the k-means method.

New in version 0.9.0.

Examples

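>>> from numpy import array
>>> from pyspark.mllib.clustering import KMeans, KMeansModel
>>> from pyspark.mllib.linalg import SparseVector
>>> # `sc` is assumed to be an existing SparkContext.
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)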
>>> model = KMeans.train(
...     sc.parallelize(data), 2, maxIterations=10, initializationMode="random",
...                    seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
True
>>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
True
>>> model.k
2
>>> model.computeCost(sc.parallelize(data))
2.0
>>> model = KMeans.train(sc.parallelize(data), 2)
>>> sparse_data = [
...     SparseVector(3, {1: 1.0}),
...     SparseVector(3, {1: 1.1}),
...     SparseVector(3, {2: 1.0}),
...     SparseVector(3, {2: 1.1})
... ]
>>> model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode="k-means||",
...                                     seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
True
>>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
True
>>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
True
>>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
True
>>> isinstance(model.clusterCenters, list)
True
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = KMeansModel.load(sc, path)
>>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
True
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass

>>> data = array([-383.1,-382.9, 28.7,31.2, 366.2,367.3]).reshape(3, 2)
>>> model = KMeans.train(sc.parallelize(data), 3, maxIterations=0,
...     initialModel=KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))
>>> model.clusterCenters
[array([-1000., -1000.]), array([ 5.,  5.]), array([ 1000.,  1000.])]

Methods

`computeCost`(rdd)	Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
`load`(sc, path)	Load a model from the given path.
`predict`(x)	Find the cluster that each of the points belongs to in this model.
`save`(sc, path)	Save this model to the given path.

Attributes

`clusterCenters`	Get the cluster centers, represented as a list of NumPy arrays.
`k`	Total number of clusters.

Methods Documentation

computeCost(rdd)

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

New in version 1.4.0.

Parameters

rdd (pyspark.RDD): The RDD of points to compute the cost on.
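To make the cost definition concrete, the following is a minimal pure-Python sketch of the same quantity, computed locally without Spark. The helper name `kmeans_cost` and the center values are assumptions chosen for illustration (the centers are the means of the two obvious groups in the Examples data); this is not how the library computes the cost internally.

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    # For every point, keep only the distance to the closest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# The four points used in the Examples section.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
# Assumed centers: the means of the two groups of points.
centers = [(0.5, 0.5), (8.5, 8.5)]

cost = kmeans_cost(points, centers)
print(cost)
```

With these assumed centers each point contributes 0.5, so the total matches the 2.0 reported by `model.computeCost` in the Examples section.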

classmethod load(sc, path)

Load a model from the given path.

New in version 1.4.0.

predict(x)

Find the cluster that each of the points belongs to in this model.

New in version 0.9.0.

Parameters

x (pyspark.mllib.linalg.Vector or pyspark.RDD): A data point (or RDD of points) whose cluster index is to be determined. Equivalent objects (list, tuple, numpy.ndarray) can be used in place of pyspark.mllib.linalg.Vector.

Returns

int or pyspark.RDD of int: The predicted cluster index, or an RDD of predicted cluster indices if the input is an RDD.
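As a rough illustration of what a cluster index means, the sketch below assigns each point to its nearest center by squared Euclidean distance, using plain Python rather than Spark. The helper `nearest_center` and the center values are hypothetical; in a trained model the numbering of clusters depends on initialization, which is why the Examples section compares predictions for equality instead of checking specific indices.

```python
def nearest_center(point, centers):
    """Index of the center closest to `point` by squared Euclidean distance."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


# Assumed centers for illustration; a real model exposes its own via clusterCenters.
centers = [(0.5, 0.5), (8.5, 8.5)]
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]

# A single point yields a single index ...
single = nearest_center(points[0], centers)
# ... and a collection of points yields one index per point.
many = [nearest_center(p, centers) for p in points]
print(single, many)
```

The single-point and collection cases mirror the two return shapes documented above: one `int` for one point, and one index per element for a collection.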

save(sc, path)

Save this model to the given path.

New in version 1.4.0.

Attributes Documentation

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

New in version 1.0.0.

k

Total number of clusters.

New in version 1.4.0.