MLlib (RDD-based)

Classification

`LogisticRegressionModel`(weights, intercept, ...)	Classification model trained using Multinomial/Binary Logistic Regression.
`LogisticRegressionWithSGD`()	Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent.
`LogisticRegressionWithLBFGS`()	Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS.
`SVMModel`(weights, intercept)	Model for Support Vector Machines (SVMs).
`SVMWithSGD`()	Train a Support Vector Machine (SVM) using Stochastic Gradient Descent.
`NaiveBayesModel`(labels, pi, theta)	Model for Naive Bayes classifiers.
`NaiveBayes`()	Train a Multinomial Naive Bayes model.
`StreamingLogisticRegressionWithSGD`([...])	Train or predict a logistic regression model on streaming data.

Clustering

`BisectingKMeansModel`(java_model)	A clustering model derived from the bisecting k-means method.
`BisectingKMeans`()	A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
`KMeansModel`(centers)	A clustering model derived from the k-means method.
`KMeans`()	K-means clustering.
`GaussianMixtureModel`(java_model)	A clustering model derived from the Gaussian Mixture Model method.
`GaussianMixture`()	Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm.
`PowerIterationClusteringModel`(java_model)	Model produced by `PowerIterationClustering`.
`PowerIterationClustering`()	Power Iteration Clustering (PIC), a scalable graph clustering algorithm.
`StreamingKMeans`([k, decayFactor, timeUnit])	Provides methods to set k, decayFactor, and timeUnit, which configure the KMeans algorithm for fitting and predicting on incoming DStreams.
`StreamingKMeansModel`(clusterCenters, ...)	Clustering model which can perform an online update of the centroids.
`LDA`()	Train Latent Dirichlet Allocation (LDA) model.
`LDAModel`(java_model)	A clustering model derived from the LDA method.

Evaluation

`BinaryClassificationMetrics`(scoreAndLabels)	Evaluator for binary classification.
`RegressionMetrics`(predictionAndObservations)	Evaluator for regression.
`MulticlassMetrics`(predictionAndLabels)	Evaluator for multiclass classification.
`RankingMetrics`(predictionAndLabels)	Evaluator for ranking algorithms.

Feature

`Normalizer`([p])	Normalizes samples individually to unit L^p norm.
`StandardScalerModel`(java_model)	Represents a StandardScaler model that can transform vectors.
`StandardScaler`([withMean, withStd])	Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
`HashingTF`([numFeatures])	Maps a sequence of terms to their term frequencies using the hashing trick.
`IDFModel`(java_model)	Represents an IDF model that can transform term frequency vectors.
`IDF`([minDocFreq])	Inverse document frequency (IDF).
`Word2Vec`()	Word2Vec creates vector representation of words in a text corpus.
`Word2VecModel`(java_model)	Class for the Word2Vec model.
`ChiSqSelector`([numTopFeatures, ...])	Creates a ChiSquared feature selector.
`ChiSqSelectorModel`(java_model)	Represents a Chi Squared selector model.
`ElementwiseProduct`(scalingVector)	Scales each column of the vector, with the supplied weight vector.

Frequency Pattern Mining

`FPGrowth`()	A Parallel FP-growth algorithm to mine frequent itemsets.
`FPGrowthModel`(java_model)	An FP-Growth model for mining frequent itemsets using the Parallel FP-Growth algorithm.
`PrefixSpan`()	A parallel PrefixSpan algorithm to mine frequent sequential patterns.
`PrefixSpanModel`(java_model)	Model fitted by PrefixSpan.

Vector and Matrix

`Vector`()
`DenseVector`(ar)	A dense vector represented by a value array.
`SparseVector`(size, *args)	A simple sparse vector class for passing data to MLlib.
`Vectors`()	Factory methods for working with vectors.
`Matrix`(numRows, numCols[, isTransposed])
`DenseMatrix`(numRows, numCols, values[, ...])	Column-major dense matrix.
`SparseMatrix`(numRows, numCols, colPtrs, ...)	Sparse Matrix stored in CSC format.
`Matrices`()
`QRDecomposition`(Q, R)	Represents QR factors.

Distributed Representation

`BlockMatrix`(blocks, rowsPerBlock, colsPerBlock)	Represents a distributed matrix in blocks of local matrices.
`CoordinateMatrix`(entries[, numRows, numCols])	Represents a matrix in coordinate format.
`DistributedMatrix`()	Represents a distributively stored matrix backed by one or more RDDs.
`IndexedRow`(index, vector)	Represents a row of an IndexedRowMatrix.
`IndexedRowMatrix`(rows[, numRows, numCols])	Represents a row-oriented distributed Matrix with indexed rows.
`MatrixEntry`(i, j, value)	Represents an entry of a CoordinateMatrix.
`RowMatrix`(rows[, numRows, numCols])	Represents a row-oriented distributed Matrix with no meaningful row indices.
`SingularValueDecomposition`(java_model)	Represents singular value decomposition (SVD) factors.

Random

Generator methods for creating RDDs composed of i.i.d. samples from some distribution.

Recommendation

`MatrixFactorizationModel`(java_model)	A matrix factorization model trained by regularized alternating least squares.
`ALS`()	Alternating Least Squares matrix factorization.
`Rating`(user, product, rating)	Represents a (user, product, rating) tuple.

Regression

`LabeledPoint`(label, features)	Class that represents the features and labels of a data point.
`LinearModel`(weights, intercept)	A linear model that has a vector of coefficients and an intercept.
`LinearRegressionModel`(weights, intercept)	A linear regression model derived from a least-squares fit.
`LinearRegressionWithSGD`()	Train a linear regression model with no regularization using Stochastic Gradient Descent.
`RidgeRegressionModel`(weights, intercept)	A linear regression model derived from a least-squares fit with an l_2 penalty term.
`RidgeRegressionWithSGD`()	Train a regression model with L2-regularization using Stochastic Gradient Descent.
`LassoModel`(weights, intercept)	A linear regression model derived from a least-squares fit with an l_1 penalty term.
`LassoWithSGD`()	Train a regression model with L1-regularization using Stochastic Gradient Descent.
`IsotonicRegressionModel`(boundaries, ...)	Regression model for isotonic regression.
`IsotonicRegression`()	Isotonic regression.
`StreamingLinearAlgorithm`(model)	Base class that has to be inherited by any StreamingLinearAlgorithm.
`StreamingLinearRegressionWithSGD`([stepSize, ...])	Train or predict a linear regression model on streaming data.

Statistics

`Statistics`()
`MultivariateStatisticalSummary`(java_model)	Trait for multivariate statistical summary of a data matrix.
`ChiSqTestResult`(java_model)	Contains test results for the chi-squared hypothesis test.
`MultivariateGaussian`(mu, sigma)	Represents a (mu, sigma) tuple.
`KernelDensity`()	Estimate probability density at required points given an RDD of samples from the population.
`KolmogorovSmirnovTestResult`(java_model)	Contains test results for the Kolmogorov-Smirnov test.

Tree

`DecisionTreeModel`(java_model)	A decision tree model for classification or regression.
`DecisionTree`()	Learning algorithm for a decision tree model for classification or regression.
`RandomForestModel`(java_model)	Represents a random forest model.
`RandomForest`()	Learning algorithm for a random forest model for classification or regression.
`GradientBoostedTreesModel`(java_model)	Represents a gradient-boosted tree model.
`GradientBoostedTrees`()	Learning algorithm for a gradient boosted trees model for classification or regression.

Utilities

`JavaLoader`()	Mixin for classes which can load saved models using their Scala implementation.
`JavaSaveable`()	Mixin for models that provide save() through their Scala implementation.
`LinearDataGenerator`()	Utils for generating linear data.
`Loader`()	Mixin for classes which can load saved models from files.
`MLUtils`()	Helper methods to load, save and pre-process data used in MLlib.
`Saveable`()	Mixin for models and transformers which may be saved as files.