Class NaiveBayes

Object
org.apache.spark.mllib.classification.NaiveBayes
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging

public class NaiveBayes extends Object implements Serializable, org.apache.spark.internal.Logging
Trains a Naive Bayes model given an RDD of (label, features) pairs.

This is the Multinomial NB, which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector a 0-1 vector, it can also be used as Bernoulli NB. The input feature values must be nonnegative.
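The two input constraints mentioned above (nonnegative values, and 0-1 vectors for Bernoulli use) can be sketched in plain Java. This is a minimal illustration on raw double[] arrays, not Spark code: the class FeaturePrep and its methods are hypothetical names, and in practice the resulting values would still have to be wrapped in the feature vectors of LabeledPoint entries.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: prepares raw count arrays
// so that they satisfy the constraints described for NaiveBayes input.
final class FeaturePrep {

    // Rejects negative entries, since input feature values must be nonnegative.
    static void requireNonnegative(double[] features) {
        for (double v : features) {
            if (v < 0.0) {
                throw new IllegalArgumentException("Negative feature value: " + v);
            }
        }
    }

    // Maps every positive count to 1.0 and keeps zeros as 0.0,
    // producing the 0-1 vector expected for Bernoulli-style use.
    static double[] binarize(double[] counts) {
        requireNonnegative(counts);
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            out[i] = counts[i] > 0.0 ? 1.0 : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] wordCounts = {0.0, 3.0, 1.0, 0.0};
        System.out.println(Arrays.toString(binarize(wordCounts)));
    }
}
```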

  • Nested Class Summary

    Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging

    org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
  • Constructor Summary

    Constructors
    Constructor                  Description
    NaiveBayes()
    NaiveBayes(double lambda)
  • Method Summary

    Modifier and Type         Method                                                            Description
    double                    getLambda()                                                       Get the smoothing parameter.
    String                    getModelType()                                                    Get the model type.
    NaiveBayesModel           run(RDD<LabeledPoint> data)                                       Run the algorithm with the configured parameters on an input RDD of LabeledPoint entries.
    NaiveBayes                setLambda(double lambda)                                          Set the smoothing parameter.
    NaiveBayes                setModelType(String modelType)                                    Set the model type using a string (case-sensitive).
    static NaiveBayesModel    train(RDD<LabeledPoint> input)                                    Trains a Naive Bayes model given an RDD of (label, features) pairs.
    static NaiveBayesModel    train(RDD<LabeledPoint> input, double lambda)                     Trains a Naive Bayes model given an RDD of (label, features) pairs.
    static NaiveBayesModel    train(RDD<LabeledPoint> input, double lambda, String modelType)   Trains a Naive Bayes model given an RDD of (label, features) pairs.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface org.apache.spark.internal.Logging

    initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
  • Constructor Details

    • NaiveBayes

      public NaiveBayes(double lambda)
    • NaiveBayes

      public NaiveBayes()
  • Method Details

    • train

      public static NaiveBayesModel train(RDD<LabeledPoint> input)
      Trains a Naive Bayes model given an RDD of (label, features) pairs.

      This is the default Multinomial NB, which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification.

      This version of the method uses a default smoothing parameter of 1.0.

      Parameters:
      input - RDD of (label, array of features) pairs. Every vector should be a frequency vector or a count vector.
      Returns:
      The trained NaiveBayesModel.
    • train

      public static NaiveBayesModel train(RDD<LabeledPoint> input, double lambda)
      Trains a Naive Bayes model given an RDD of (label, features) pairs.

      This is the default Multinomial NB, which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification.

      Parameters:
      input - RDD of (label, array of features) pairs. Every vector should be a frequency vector or a count vector.
      lambda - The smoothing parameter
      Returns:
      The trained NaiveBayesModel.
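      The role of the smoothing parameter can be illustrated with a small plain-Java sketch. It assumes standard additive (Laplace) smoothing for a multinomial model, which this page does not spell out; the class MultinomialSmoothing and the method logTheta are hypothetical names, not Spark APIs.

```java
// Hypothetical illustration, not Spark's code: additive smoothing of the
// per-feature log-probabilities for a single class of a multinomial model.
final class MultinomialSmoothing {

    // featureSums[j] is the total of feature j over all training points of the class.
    // Returns log((featureSums[j] + lambda) / (total + lambda * numFeatures)) for each j.
    static double[] logTheta(double[] featureSums, double lambda) {
        double total = 0.0;
        for (double s : featureSums) {
            total += s;
        }
        double logDenom = Math.log(total + lambda * featureSums.length);
        double[] theta = new double[featureSums.length];
        for (int j = 0; j < featureSums.length; j++) {
            theta[j] = Math.log(featureSums[j] + lambda) - logDenom;
        }
        return theta;
    }

    public static void main(String[] args) {
        // The second feature never occurs in this class; with lambda = 1.0
        // it receives probability 1/7 rather than 0.
        double[] theta = logTheta(new double[]{3.0, 0.0, 1.0}, 1.0);
        for (double t : theta) {
            System.out.println(Math.exp(t));
        }
    }
}
```

      Under this assumption, a larger lambda pulls the estimates toward a uniform distribution, while a lambda close to 0 leaves unseen features with probabilities close to 0.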
    • train

      public static NaiveBayesModel train(RDD<LabeledPoint> input, double lambda, String modelType)
      Trains a Naive Bayes model given an RDD of (label, features) pairs.

      The model type can be set to either Multinomial NB or Bernoulli NB. Multinomial NB handles discrete count data and is selected by setting the model type to "multinomial". For example, it can be used with word counts or TF-IDF vectors of documents. The Bernoulli model fits presence or absence (0-1) counts. By making every vector a 0-1 vector and setting the model type to "bernoulli", the model fits and predicts as Bernoulli NB.

      Parameters:
      input - RDD of (label, array of features) pairs. Every vector should be a frequency vector or a count vector.
      lambda - The smoothing parameter
      modelType - The type of NB model to fit, taken from the enumeration NaiveBayesModels; can be "multinomial" or "bernoulli"
      Returns:
      The trained NaiveBayesModel.
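      To make the difference between the two model types concrete, the following plain-Java sketch scores a 0-1 vector under a Bernoulli model. It assumes the textbook Bernoulli formulation with additive smoothing over the two outcomes (present, absent); the class BernoulliSketch and its methods are hypothetical names and are not taken from Spark.

```java
// Hypothetical illustration, not Spark's code: a Bernoulli model scores both
// the presence and the absence of every feature.
final class BernoulliSketch {

    // docCounts[j] is the number of class documents in which feature j is present.
    // Returns the smoothed probability that feature j is present in the class.
    static double[] presenceProb(double[] docCounts, double numDocs, double lambda) {
        double[] p = new double[docCounts.length];
        for (int j = 0; j < docCounts.length; j++) {
            p[j] = (docCounts[j] + lambda) / (numDocs + 2.0 * lambda);
        }
        return p;
    }

    // Log-likelihood of a 0-1 vector x: log p[j] where x[j] == 1, log(1 - p[j]) where x[j] == 0.
    static double logLikelihood(double[] x, double[] p) {
        double sum = 0.0;
        for (int j = 0; j < x.length; j++) {
            if (x[j] != 0.0 && x[j] != 1.0) {
                throw new IllegalArgumentException("Bernoulli input must be 0 or 1, got " + x[j]);
            }
            sum += (x[j] == 1.0) ? Math.log(p[j]) : Math.log(1.0 - p[j]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two documents in the class; feature 0 appears in both, feature 1 in neither.
        double[] p = presenceProb(new double[]{2.0, 0.0}, 2.0, 1.0);
        System.out.println(logLikelihood(new double[]{1.0, 0.0}, p));
    }
}
```

      In contrast to a multinomial model, an absent feature contributes to the score here, which is why the input has to be restricted to 0-1 values.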
    • setLambda

      public NaiveBayes setLambda(double lambda)
      Set the smoothing parameter. Default: 1.0.
    • getLambda

      public double getLambda()
      Get the smoothing parameter.
    • setModelType

      public NaiveBayes setModelType(String modelType)
      Set the model type using a string (case-sensitive). Supported options: "multinomial" (default) and "bernoulli".
      Parameters:
      modelType - The model type, either "multinomial" or "bernoulli".
      Returns:
      This NaiveBayes instance.
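      Because the model type string is case-sensitive, a value such as "Bernoulli" does not match a supported option. The sketch below shows one way to check a string before passing it on. It is plain Java with hypothetical names (ModelTypeCheck, requireSupported), and rejecting an unsupported value with an IllegalArgumentException is an assumption of the sketch, since this page does not say how the setter reacts.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-check, not part of Spark: validates a model type string
// with an exact, case-sensitive comparison.
final class ModelTypeCheck {

    // The two options documented for setModelType.
    static final List<String> SUPPORTED = Arrays.asList("multinomial", "bernoulli");

    // Returns the string unchanged if it is supported, otherwise throws.
    static String requireSupported(String modelType) {
        if (!SUPPORTED.contains(modelType)) {
            throw new IllegalArgumentException(
                "Unsupported model type: " + modelType + ". Supported options: " + SUPPORTED);
        }
        return modelType;
    }

    public static void main(String[] args) {
        System.out.println(requireSupported("bernoulli"));
        try {
            requireSupported("Bernoulli"); // differs only in case, so it is rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```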
    • getModelType

      public String getModelType()
      Get the model type.
    • run

      public NaiveBayesModel run(RDD<LabeledPoint> data)
      Run the algorithm with the configured parameters on an input RDD of LabeledPoint entries.

      Parameters:
      data - RDD of LabeledPoint.
      Returns:
      The trained NaiveBayesModel.