Package org.apache.spark.ml.feature
Class RobustScaler
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, RobustScalerParams, Params, HasInputCol, HasOutputCol, HasRelativeError, DefaultParamsWritable, Identifiable, MLWritable
public class RobustScaler
extends Estimator<RobustScalerModel>
implements RobustScalerParams, DefaultParamsWritable
Scale features using statistics that are robust to outliers.
RobustScaler removes the median and scales the data according to the quantile range.
The quantile range defaults to the IQR (interquartile range: the range between the
1st quartile, i.e. the 25th percentile, and the 3rd quartile, i.e. the 75th percentile), but it can be
configured through the lower and upper params.
Centering and scaling happen independently on each feature by computing the relevant
statistics on the samples in the training set. Median and quantile range are then
stored to be used on later data using the transform method.
Standardization of a dataset is a common requirement for many machine learning estimators.
Typically this is done by removing the mean and scaling to unit variance. However,
outliers can distort the sample mean and variance.
In such cases, the median and the quantile range often give better results.
Note that NaN values are ignored in the computation of medians and ranges.
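The centering and scaling described above can be sketched for a single feature column in plain Java. This is an illustration only, not Spark's implementation: the class RobustScalingSketch and its methods are hypothetical names, the quantiles are computed exactly with the nearest-rank rule (Spark uses an approximate quantile algorithm controlled by relativeError), and mapping a zero quantile range to a factor of 0.0 is an assumption made here to avoid division by zero.

```java
import java.util.Arrays;

// Illustrative sketch only; not Spark's implementation.
final class RobustScalingSketch {

    // Nearest-rank quantile of an ascending-sorted, non-empty array.
    // Assumption: exact quantiles, whereas Spark uses an approximate algorithm.
    static double quantile(double[] sorted, double q) {
        int rank = (int) Math.ceil(q * sorted.length);
        int index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
        return sorted[index];
    }

    // Scales one feature column: (x - median) / (upperQuantile - lowerQuantile).
    static double[] scale(double[] feature, double lower, double upper,
                          boolean withCentering, boolean withScaling) {
        // NaN values are left out of the statistics.
        double[] sorted = Arrays.stream(feature)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (sorted.length == 0) {
            throw new IllegalArgumentException("feature has no non-NaN values");
        }
        double median = quantile(sorted, 0.5);
        double range = quantile(sorted, upper) - quantile(sorted, lower);
        // Assumption: a zero range maps to a factor of 0.0 to avoid dividing by zero.
        double factor = range == 0.0 ? 0.0 : 1.0 / range;

        double[] out = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            double v = feature[i];
            if (withCentering) v -= median;
            if (withScaling) v *= factor;
            out[i] = v;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] feature = {1.0, 2.0, 3.0, 4.0, 100.0};
        // median = 3, quantile range = 4 - 2 = 2
        System.out.println(Arrays.toString(scale(feature, 0.25, 0.75, true, true)));
        // prints [-1.0, -0.5, 0.0, 0.5, 48.5]
    }
}
```

The outlier 100.0 shifts neither the median nor the quantile range, so the other four values stay on a comparable scale; with mean and variance standardization the same outlier would dominate both statistics.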
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
RobustScaler | copy(ParamMap extra) | Creates a copy of this instance with the same UID and some extra params.
RobustScalerModel | fit(Dataset<?> dataset) | Fits a model to the input data.
final Param<String> | inputCol() | Param for input column name.
static RobustScaler | load(String path) |
DoubleParam | lower() | Lower quantile to calculate quantile range, shared by all features. Default: 0.25
final Param<String> | outputCol() | Param for output column name.
static MLReader<T> | read() |
final DoubleParam | relativeError() | Param for the relative target precision for the approximate quantile algorithm.
RobustScaler | setInputCol(String value) |
RobustScaler | setLower(double value) |
RobustScaler | setOutputCol(String value) |
RobustScaler | setRelativeError(double value) |
RobustScaler | setUpper(double value) |
RobustScaler | setWithCentering(boolean value) |
RobustScaler | setWithScaling(boolean value) |
StructType | transformSchema(StructType schema) | Check transform validity and derive the output schema from the input schema.
String | uid() | An immutable unique ID for the object and its derivatives.
DoubleParam | upper() | Upper quantile to calculate quantile range, shared by all features. Default: 0.75
BooleanParam | withCentering() | Whether to center the data with median before scaling.
BooleanParam | withScaling() | Whether to scale the data to quantile range.
Methods inherited from class org.apache.spark.ml.PipelineStage
params
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write
Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol
getInputCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasRelativeError
getRelativeError
Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable
save
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
Methods inherited from interface org.apache.spark.ml.feature.RobustScalerParams
getLower, getUpper, getWithCentering, getWithScaling, validateAndTransformSchema
-
Constructor Details
-
RobustScaler
-
RobustScaler
public RobustScaler()
-
-
Method Details
-
load
-
read
-
lower
Description copied from interface: RobustScalerParams
Lower quantile to calculate quantile range, shared by all features. Default: 0.25
- Specified by:
lower in interface RobustScalerParams
- Returns:
- (undocumented)
-
upper
Description copied from interface: RobustScalerParams
Upper quantile to calculate quantile range, shared by all features. Default: 0.75
- Specified by:
upper in interface RobustScalerParams
- Returns:
- (undocumented)
-
withCentering
Description copied from interface: RobustScalerParams
Whether to center the data with the median before scaling. Centering produces a dense output, so take care when applying it to sparse input. Default: false
- Specified by:
withCentering in interface RobustScalerParams
- Returns:
- (undocumented)
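A small plain-Java sketch of why centering yields a dense result. It does not use Spark's vector classes; CenteringDensitySketch, center, and countNonZero are hypothetical names, and the sparse vector is represented here simply as a size plus parallel index and value arrays.

```java
// Illustrative sketch only; not Spark's implementation.
final class CenteringDensitySketch {

    // Centers a sparse vector given as (size, indices, values) by subtracting
    // the per-feature median from every position, including implicit zeros.
    static double[] center(int size, int[] indices, double[] values, double[] medians) {
        double[] dense = new double[size]; // implicit zeros become explicit here
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        for (int i = 0; i < size; i++) {
            dense[i] -= medians[i];
        }
        return dense;
    }

    // Counts the entries that would have to be stored explicitly.
    static int countNonZero(double[] v) {
        int count = 0;
        for (double x : v) {
            if (x != 0.0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        double[] medians = {0.5, 0.5, 0.5, 0.5, 0.5};
        // One stored value before centering: position 1 holds 2.0.
        double[] centered = center(5, new int[] {1}, new double[] {2.0}, medians);
        System.out.println(countNonZero(centered)); // prints 5
    }
}
```

Any feature whose median is non-zero turns every implicit zero into a non-zero value, which is why memory use can grow sharply on wide sparse data.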
-
withScaling
Description copied from interface: RobustScalerParams
Whether to scale the data to quantile range. Default: true
- Specified by:
withScaling in interface RobustScalerParams
- Returns:
- (undocumented)
-
relativeError
Description copied from interface: HasRelativeError
Param for the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1].
- Specified by:
relativeError in interface HasRelativeError
- Returns:
- (undocumented)
-
outputCol
Description copied from interface: HasOutputCol
Param for output column name.
- Specified by:
outputCol in interface HasOutputCol
- Returns:
- (undocumented)
-
inputCol
Description copied from interface: HasInputCol
Param for input column name.
- Specified by:
inputCol in interface HasInputCol
- Returns:
- (undocumented)
-
uid
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Specified by:
uid in interface Identifiable
- Returns:
- (undocumented)
-
setInputCol
-
setOutputCol
-
setLower
-
setUpper
-
setWithCentering
-
setWithScaling
-
setRelativeError
-
fit
Description copied from class: Estimator
Fits a model to the input data.
- Specified by:
fit in class Estimator<RobustScalerModel>
- Parameters:
dataset - (undocumented)
- Returns:
- (undocumented)
-
transformSchema
Description copied from class: PipelineStage
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
A typical implementation should first conduct verification on schema change and parameter validity, including complex parameter interaction checks.
- Specified by:
transformSchema in class PipelineStage
- Parameters:
schema - (undocumented)
- Returns:
- (undocumented)
-
copy
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
- Specified by:
copy in interface Params
- Specified by:
copy in class Estimator<RobustScalerModel>
- Parameters:
extra - (undocumented)
- Returns:
- (undocumented)
-