MLlib (RDD-based)#
Classification#
|
Classification model trained using Multinomial/Binary Logistic Regression. |
Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent. |
|
Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS. |
|
|
Model for Support Vector Machines (SVMs). |
Train a Support Vector Machine (SVM) using Stochastic Gradient Descent. |
|
|
Model for Naive Bayes classifiers. |
Train a Multinomial Naive Bayes model. |
|
Train or predict a logistic regression model on streaming data. |
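To make the "Binary Logistic Regression using Stochastic Gradient Descent" entry concrete, here is a minimal single-machine sketch of the idea in plain Python. It is not MLlib's implementation: the function names, step size, iteration count, and toy data are all assumptions chosen for illustration, and there is no regularization or mini-batching.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(points, step=0.1, iterations=200):
    """Fit weights and an intercept with per-example gradient steps.

    `points` is a list of (label, features) pairs with labels 0 or 1.
    """
    dim = len(points[0][1])
    weights = [0.0] * dim
    intercept = 0.0
    for _ in range(iterations):
        for label, features in points:
            score = intercept + sum(w * x for w, x in zip(weights, features))
            error = label - sigmoid(score)  # gradient of the log-likelihood w.r.t. the score
            weights = [w + step * error * x for w, x in zip(weights, features)]
            intercept += step * error
    return weights, intercept


def predict(weights, intercept, features, threshold=0.5):
    """Return 1 when the predicted probability exceeds `threshold`, else 0."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) > threshold else 0


# Toy, linearly separable data (hypothetical).
data = [(0, [-2.0]), (0, [-1.0]), (1, [1.0]), (1, [2.0])]
weights, intercept = train_logistic_sgd(data)
print([predict(weights, intercept, x) for _, x in data])
```

The distributed trainers listed above work on an RDD of labeled points rather than a Python list, but the per-example update has the same shape.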
Clustering#
|
A clustering model derived from the bisecting k-means method. |
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. |
|
|
A clustering model derived from the k-means method. |
|
K-means clustering. |
|
A clustering model derived from the Gaussian Mixture Model method. |
Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm. |
|
|
Model produced by PowerIterationClustering. |
Power Iteration Clustering (PIC), a scalable graph clustering algorithm. |
|
|
Provides methods to set k, decayFactor, and timeUnit to configure the KMeans algorithm for fitting and predicting on incoming DStreams. |
|
Clustering model which can perform an online update of the centroids. |
|
Train Latent Dirichlet Allocation (LDA) model. |
|
A clustering model derived from the LDA method. |
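As an illustration of what the k-means entries above compute, the following plain-Python sketch runs Lloyd's algorithm (assign each point to its nearest center, then move each center to the mean of its points). It is a sketch under stated assumptions, not MLlib's distributed implementation: the helper names, the fixed initial centers, and the toy points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_center(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, max_iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centers = [list(c) for c in centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                dim = len(old)
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
centers = kmeans(points, centers=[[0.0, 0.0], [9.0, 8.0]])
print(centers)
```

A fitted clustering model then amounts to the list of centers plus a `closest_center`-style lookup for prediction.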
Evaluation#
|
Evaluator for binary classification. |
|
Evaluator for regression. |
|
Evaluator for multiclass classification. |
|
Evaluator for ranking algorithms. |
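The regression evaluator listed above summarizes how far predictions fall from observations. A minimal sketch of three common summary metrics is shown below; the function name, the dictionary keys, and the sample pairs are assumptions for illustration and do not mirror the evaluator's actual attribute names.

```python
import math


def regression_metrics(prediction_and_observations):
    """Compute MSE, RMSE, and MAE from (prediction, observation) pairs."""
    pairs = list(prediction_and_observations)
    n = len(pairs)
    squared = sum((p - o) ** 2 for p, o in pairs)
    absolute = sum(abs(p - o) for p, o in pairs)
    mse = squared / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": absolute / n}


# Hypothetical (prediction, observation) pairs.
pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
metrics = regression_metrics(pairs)
print(metrics)
```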
Feature#
|
Normalizes samples individually to unit Lp norm. |
|
Represents a StandardScaler model that can transform vectors. |
|
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. |
|
Maps a sequence of terms to their term frequencies using the hashing trick. |
|
Represents an IDF model that can transform term frequency vectors. |
|
Inverse document frequency (IDF). |
|
Word2Vec creates vector representation of words in a text corpus. |
|
Class for the Word2Vec model. |
|
Creates a ChiSquared feature selector. |
|
Represents a Chi Squared selector model. |
|
Scales each column of the vector, with the supplied weight vector. |
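Two of the transformations listed in this section are easy to sketch in plain Python: scaling a sample to unit Lp norm, and mapping terms to term frequencies with the hashing trick. The code below is illustrative only. The function names and the default of 16 features are hypothetical, and `zlib.crc32` is used purely so the output is reproducible; it is an assumption, not the hash function MLlib uses.

```python
import zlib


def normalize(vector, p=2.0):
    """Scale `vector` so its Lp norm equals 1 (zero vectors are returned unchanged)."""
    if p == float("inf"):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return list(vector) if norm == 0 else [x / norm for x in vector]


def hashing_tf(terms, num_features=16):
    """Map terms to a sparse {index: count} dict using the hashing trick."""
    counts = {}
    for term in terms:
        # crc32 is an assumption made for reproducibility, not MLlib's hash function.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


print(normalize([3.0, 4.0]))         # the L2 norm of the input is 5
print(normalize([3.0, 4.0], p=1.0))  # the L1 norm of the input is 7
print(hashing_tf(["spark", "mllib", "spark"]))
```

Because indices come from a hash modulo `num_features`, distinct terms can collide in the same slot; a larger feature dimension makes that less likely.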
Frequency Pattern Mining#
|
A parallel FP-Growth algorithm to mine frequent itemsets. |
|
An FP-Growth model for mining frequent itemsets using the Parallel FP-Growth algorithm. |
A parallel PrefixSpan algorithm to mine frequent sequential patterns. |
|
|
Model fitted by PrefixSpan. |
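To show what "mining frequent itemsets" produces, the sketch below counts itemsets by brute-force enumeration. This is deliberately not FP-Growth, which avoids enumerating every subset; it only illustrates the result such an algorithm returns. The function name, the fractional `min_support` convention, and the sample transactions are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Count every itemset and keep those that appear in at least
    `min_support` (a fraction) of the transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    min_count = min_support * len(transactions)
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Hypothetical transactions.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
result = frequent_itemsets(transactions, min_support=0.5)
print(result)
```

Enumeration is exponential in transaction length, which is the cost that tree-based and parallel approaches are designed to avoid.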
Vector and Matrix#
|
|
|
A dense vector represented by a value array. |
|
A simple sparse vector class for passing data to MLlib. |
|
Factory methods for working with vectors. |
|
|
|
Column-major dense matrix. |
|
Sparse Matrix stored in CSC format. |
|
|
|
Represents QR factors. |
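The two local matrix layouts named above, column-major dense storage and CSC (compressed sparse column), can be related with a short sketch. The helper names below are hypothetical; the argument order (row count, column count, column pointers, row indices, values) is chosen for this example.

```python
def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC-encoded matrix into a list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # Entries col_ptrs[col] .. col_ptrs[col + 1] - 1 belong to this column.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = values[k]
    return dense


def column_major(dense):
    """Flatten rows into the column-major value array a dense matrix stores."""
    return [row[col] for col in range(len(dense[0])) for row in dense]


# A 3 x 2 matrix with three non-zero entries.
dense = csc_to_dense(3, 2, [0, 1, 3], [0, 2, 1], [9.0, 6.0, 8.0])
print(dense)
print(column_major(dense))
```

Here column 0 holds one entry (row 0) and column 1 holds two (rows 2 and 1), which is exactly what the column pointers `[0, 1, 3]` encode.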
Distributed Representation#
|
Represents a distributed matrix in blocks of local matrices. |
|
Represents a matrix in coordinate format. |
Represents a distributively stored matrix backed by one or more RDDs. |
|
|
Represents a row of an IndexedRowMatrix. |
|
Represents a row-oriented distributed Matrix with indexed rows. |
|
Represents an entry of a CoordinateMatrix. |
|
Represents a row-oriented distributed Matrix with no meaningful row indices. |
|
Represents singular value decomposition (SVD) factors. |
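The distributed representations above differ mainly in how entries are keyed. As a single-machine sketch, the code below groups coordinate-format entries `(row, column, value)` into indexed rows; dropping the index from each pair would give rows with no meaningful row indices. The function name and sample entries are hypothetical, and a real distributed matrix would hold these records in an RDD rather than a list.

```python
from collections import defaultdict


def entries_to_indexed_rows(entries, num_cols):
    """Group (row, col, value) entries into (row_index, dense_row) pairs."""
    rows = defaultdict(lambda: [0.0] * num_cols)
    for i, j, value in entries:
        rows[i][j] = value
    return sorted(rows.items())


# Hypothetical coordinate-format entries; row 1 has no entries.
entries = [(0, 0, 1.2), (2, 1, 3.7), (0, 2, 4.5)]
indexed_rows = entries_to_indexed_rows(entries, num_cols=3)
print(indexed_rows)
```

Rows with no entries simply do not appear, which is why the index has to travel with each row in this layout.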
Random#
Generator methods for creating RDDs composed of i.i.d. samples from some distribution. |
Recommendation#
|
A matrix factorization model trained by regularized alternating least-squares. |
|
Alternating Least Squares matrix factorization. |
|
Represents a (user, product, rating) tuple. |
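A matrix factorization model represents each user and each product by a latent factor vector, and a predicted rating is the dot product of the two. The sketch below shows only that prediction step, not the alternating least-squares training. The `Rating` namedtuple is a hypothetical stand-in for the (user, product, rating) tuple, and the factor values are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a (user, product, rating) tuple.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def predict(user_factors, product_factors, user, product):
    """Predicted rating: dot product of the user's and product's factor vectors."""
    u = user_factors[user]
    v = product_factors[product]
    return sum(a * b for a, b in zip(u, v))


# Made-up rank-2 factors for two users and two products.
user_factors = {1: [1.0, 0.5], 2: [0.0, 2.0]}
product_factors = {10: [2.0, 2.0], 20: [4.0, 0.0]}

predictions = [
    Rating(u, p, predict(user_factors, product_factors, u, p))
    for u in user_factors
    for p in product_factors
]
print(predictions)
```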
Regression#
|
Class that represents the features and labels of a data point. |
|
A linear model that has a vector of coefficients and an intercept. |
|
A linear regression model derived from a least-squares fit. |
Train a linear regression model with no regularization using Stochastic Gradient Descent. |
|
|
A linear regression model derived from a least-squares fit with an L2 penalty term. |
Train a regression model with L2-regularization using Stochastic Gradient Descent. |
|
|
A linear regression model derived from a least-squares fit with an L1 penalty term. |
Train a regression model with L1-regularization using Stochastic Gradient Descent. |
|
|
Regression model for isotonic regression. |
Isotonic regression. |
|
|
Base class that has to be inherited by any StreamingLinearAlgorithm. |
|
Train or predict a linear regression model on streaming data. |
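Isotonic regression, listed above, fits the closest non-decreasing sequence to the observed labels. One standard way to compute it is the pool adjacent violators algorithm, sketched here for unit weights and inputs already sorted by feature value. The function name and data are assumptions, and this is not presented as MLlib's parallel implementation.

```python
def isotonic_fit(values):
    """Pool adjacent violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Labels ordered by feature value (hypothetical data).
fitted = isotonic_fit([1.0, 3.0, 2.0, 4.0, 3.0, 5.0])
print(fitted)
```

Each out-of-order pair is replaced by its average, so the result is piecewise constant wherever the input violated monotonicity.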
Statistics#
|
Trait for multivariate statistical summary of a data matrix. |
|
Contains test results for the chi-squared hypothesis test. |
|
Represents a (mu, sigma) tuple. |
Estimate probability density at required points given an RDD of samples from the population. |
|
|
Contains test results for the Kolmogorov-Smirnov test. |
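The density-estimation entry above can be illustrated with a Gaussian kernel density estimate: the estimated density at a point is the average of Gaussian bumps centered on each sample. The function name, parameter names, and sample values below are assumptions for this sketch.

```python
import math


def gaussian_kde(samples, bandwidth, points):
    """Average of Gaussian kernels centered on each sample, evaluated at `points`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in points
    ]


# With one sample at 0 and bandwidth 1, this reduces to the standard normal density.
densities = gaussian_kde([0.0], bandwidth=1.0, points=[0.0, 1.0])
print(densities)
```

The bandwidth plays the role of the kernel's standard deviation: larger values smooth the estimate, smaller values make it follow individual samples more closely.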
Tree#
|
A decision tree model for classification or regression. |
Learning algorithm for a decision tree model for classification or regression. |
|
|
Represents a random forest model. |
Learning algorithm for a random forest model for classification or regression. |
|
|
Represents a gradient-boosted tree model. |
Learning algorithm for a gradient boosted trees model for classification or regression. |
Utilities#
Mixin for classes which can load saved models using their Scala implementation. |
|
Mixin for models that provide save() through their Scala implementation. |
|
Utils for generating linear data. |
|
|
Mixin for classes which can load saved models from files. |
|
Helper methods to load, save and pre-process data used in MLlib. |
|
Mixin for models and transformers which may be saved as files. |