KMeansModel
- class pyspark.mllib.clustering.KMeansModel(centers)
A clustering model derived from the k-means method.
New in version 0.9.0.
Examples
>>> # sc is assumed to be an existing SparkContext
>>> from numpy import array
>>> from pyspark.mllib.clustering import KMeans, KMeansModel
>>> from pyspark.mllib.linalg import SparseVector
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
>>> model = KMeans.train(
...     sc.parallelize(data), 2, maxIterations=10, initializationMode="random",
...     seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
True
>>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
True
>>> model.k
2
>>> model.computeCost(sc.parallelize(data))
2.0
>>> model = KMeans.train(sc.parallelize(data), 2)
>>> sparse_data = [
...     SparseVector(3, {1: 1.0}),
...     SparseVector(3, {1: 1.1}),
...     SparseVector(3, {2: 1.0}),
...     SparseVector(3, {2: 1.1})
... ]
>>> model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode="k-means||",
...                      seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
True
>>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
True
>>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
True
>>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
True
>>> isinstance(model.clusterCenters, list)
True
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = KMeansModel.load(sc, path)
>>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
True
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass
>>> data = array([-383.1,-382.9, 28.7,31.2, 366.2,367.3]).reshape(3, 2)
>>> model = KMeans.train(sc.parallelize(data), 3, maxIterations=0,
...     initialModel = KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))
>>> model.clusterCenters
[array([-1000., -1000.]), array([ 5., 5.]), array([ 1000., 1000.])]
Methods
- computeCost(rdd): Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
- load(sc, path): Load a model from the given path.
- predict(x): Find the cluster that each of the points belongs to in this model.
- save(sc, path): Save this model to the given path.
Attributes
- clusterCenters: Get the cluster centers, represented as a list of NumPy arrays.
- k: Total number of clusters.
Methods Documentation
- computeCost(rdd)
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
New in version 1.4.0.
- Parameters
  - rdd: pyspark.RDD
    The RDD of points to compute the cost on.
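As a rough illustration of the quantity described above, the following pure-Python sketch sums the squared Euclidean distance from each point to its nearest center. It is not Spark's implementation: `kmeans_cost` is a hypothetical helper, plain lists stand in for the RDD, and the two centers are assumed values (the means of the two obvious groups in the example data) rather than output from a trained model.

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    total = 0.0
    for p in points:
        # Squared distance from p to every center; keep the smallest.
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total


# Same four points as the class example, held in a plain list instead of an RDD.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
# Assumed centers: the mean of each pair of nearby points.
centers = [(0.5, 0.5), (8.5, 8.5)]

cost = kmeans_cost(points, centers)
print(cost)
```

With these assumed centers, every point lies at squared distance 0.5 from its nearest center, so the total is 2.0, which agrees with the `computeCost` value shown in the class example.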
- predict(x)
Find the cluster that each of the points belongs to in this model.
New in version 0.9.0.
- Parameters
  - x: pyspark.mllib.linalg.Vector or pyspark.RDD
    A data point (or RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).
- Returns
  - int or pyspark.RDD of int
    Predicted cluster index, or an RDD of predicted cluster indices if the input is an RDD.
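The assignment rule behind `predict` can be sketched in plain Python as "return the index of the nearest center". This is only a sketch under stated assumptions: `squared_distance` and `predict_one` are hypothetical helpers, the centers are assumed values, and a list comprehension stands in for mapping over an RDD. Tie-breaking here (lowest index wins) is a property of this sketch and is not claimed for Spark.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def predict_one(x, centers):
    """Index of the center nearest to x (hypothetical helper, not pyspark)."""
    return min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))


# Assumed centers for illustration; a real model exposes them as clusterCenters.
centers = [(0.5, 0.5), (8.5, 8.5)]

# Single point in, single cluster index out.
single = predict_one((0.0, 0.0), centers)

# Many points in, one index per point out (a list stands in for an RDD here).
batch = [predict_one(p, centers) for p in [(1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]]

print(single, batch)
```

The two call shapes mirror the documented behavior: a single point yields one int, while a collection of points yields one index per point.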
Attributes Documentation
- clusterCenters
Get the cluster centers, represented as a list of NumPy arrays.
New in version 1.0.0.
- k
Total number of clusters.
New in version 1.4.0.