Pipeline#
- class pyspark.ml.connect.Pipeline(*, stages=None)[source]#
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If stages is an empty list, the pipeline acts as an identity transformer.
New in version 3.5.0.
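Conceptually, Pipeline.fit() reduces to a short loop over the stages. The sketch below is a simplified illustration of that behavior, not Spark's actual implementation; it assumes Estimator is importable from pyspark.ml.connect and ignores details such as param-map handling:

from pyspark.ml.connect import Estimator

def fit_stages(stages, dataset):
    # Simplified sketch: fit each Estimator in order, feeding every
    # stage's transformed output into the next stage.
    fitted = []
    for stage in stages:
        if isinstance(stage, Estimator):
            model = stage.fit(dataset)          # fitting yields a Transformer
            fitted.append(model)
            dataset = model.transform(dataset)  # its output feeds the next stage
        else:
            fitted.append(stage)                # a plain Transformer is kept as-is
            dataset = stage.transform(dataset)
    return fitted  # the resulting PipelineModel wraps these stages in order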
Examples
>>> from pyspark.ml.connect import Pipeline
>>> from pyspark.ml.connect.classification import LogisticRegression
>>> from pyspark.ml.connect.feature import StandardScaler
>>> scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
>>> lor = LogisticRegression(maxIter=20, learningRate=0.01)
>>> pipeline = Pipeline(stages=[scaler, lor])
>>> dataset = spark.createDataFrame([
...     ([1.0, 2.0], 1),
...     ([2.0, -1.0], 1),
...     ([-3.0, -2.0], 0),
...     ([-1.0, -2.0], 0),
... ], schema=['features', 'label'])
>>> pipeline_model = pipeline.fit(dataset)
>>> transformed_dataset = pipeline_model.transform(dataset)
>>> transformed_dataset.show()
+------------+-----+--------------------+----------+--------------------+
|    features|label|     scaled_features|prediction|         probability|
+------------+-----+--------------------+----------+--------------------+
|  [1.0, 2.0]|    1|[0.56373452100212...|         1|[0.02423273026943...|
| [2.0, -1.0]|    1|[1.01472213780381...|         1|[0.09334788471460...|
|[-3.0, -2.0]|    0|[-1.2402159462046...|         0|[0.99808156490325...|
|[-1.0, -2.0]|    0|[-0.3382407126012...|         0|[0.96210002899169...|
+------------+-----+--------------------+----------+--------------------+
>>> pipeline_model.saveToLocal("/tmp/pipeline")
>>> loaded_pipeline_model = PipelineModel.loadFromLocal("/tmp/pipeline")
Methods
clear(param)
Clears a param from the param map if it has been explicitly set.
copy([extra])
Creates a copy of this instance.
explainParam(param)
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
explainParams()
Returns the documentation of all params with their optional default values and user-supplied values.
extractParamMap([extra])
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
fit(dataset[, params])
Fits a model to the input dataset with optional parameters.
getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value.
getParam(paramName)
Gets a param by its name.
getStages()
Get pipeline stages.
get_uid_map(instance)
hasDefault(param)
Checks whether a param has a default value.
hasParam(paramName)
Tests whether this instance contains a param with a given (string) name.
isDefined(param)
Checks whether a param is explicitly set by user or has a default value.
isSet(param)
Checks whether a param is explicitly set by user.
load(path)
Load Estimator / Transformer / Model / Evaluator from the provided cloud storage path.
loadFromLocal(path)
Load Estimator / Transformer / Model / Evaluator from the provided local path.
save(path, *[, overwrite])
Save Estimator / Transformer / Model / Evaluator to the provided cloud storage path.
saveToLocal(path, *[, overwrite])
Save Estimator / Transformer / Model / Evaluator to the provided local path.
set(param, value)
Sets a parameter in the embedded param map.
setParams(self, *[, stages])
Sets params for Pipeline.
setStages(value)
Set pipeline stages.
Attributes
params
Returns all params ordered by name.
Methods Documentation
- clear(param)#
Clears a param from the param map if it has been explicitly set.
- copy(extra=None)[source]#
Creates a copy of this instance.
New in version 3.5.0.
- Parameters
- extra : dict, optional
extra parameters
- Returns
Pipeline
new instance
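A brief usage sketch, reusing the scaler and lor stages from the example above; the extra dict, when given, maps Param objects to override values (illustrative):

>>> pipeline = Pipeline(stages=[scaler, lor])
>>> pipeline_copy = pipeline.copy({lor.maxIter: 50})  # copy with an illustrative param override
>>> pipeline_copy is pipeline
False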
- explainParam(param)#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()#
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra=None)#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters
- extra : dict, optional
extra param values
- Returns
- dict
merged param map
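The precedence ordering is easy to observe. A sketch using the lor stage from the example above, where maxIter was user-supplied as 20:

>>> lor.extractParamMap()[lor.maxIter]                  # user-supplied value beats the default
20
>>> lor.extractParamMap({lor.maxIter: 5})[lor.maxIter]  # extra beats both
5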
- fit(dataset, params=None)#
Fits a model to the input dataset with optional parameters.
New in version 3.5.0.
- Parameters
- dataset : pyspark.sql.DataFrame or pandas.DataFrame
input dataset; it can be either a pandas DataFrame or a Spark DataFrame.
- params : dict of param values, optional
an optional param map that overrides embedded params.
- Returns
Transformer
fitted model
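As an illustrative sketch, an embedded param can be overridden for a single fit call by passing a param map, without mutating the pipeline itself:

>>> pipeline_model = pipeline.fit(dataset, {lor.maxIter: 5})  # one-off maxIter override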
- getOrDefault(param)#
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getParam(paramName)#
Gets a param by its name.
- static get_uid_map(instance)#
- hasDefault(param)#
Checks whether a param has a default value.
- hasParam(paramName)#
Tests whether this instance contains a param with a given (string) name.
- isDefined(param)#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param)#
Checks whether a param is explicitly set by user.
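Taken together, these helpers distinguish user-set values from defaults. A sketch using the lor stage from the example above; the default printed at the end is illustrative, not guaranteed:

>>> lor.isSet(lor.maxIter)         # maxIter was set explicitly in the constructor
True
>>> lor.hasDefault(lor.maxIter)    # and it also ships with a default
True
>>> lor.clear(lor.maxIter)         # drop the user-supplied value
>>> lor.isSet(lor.maxIter)
False
>>> lor.getOrDefault(lor.maxIter)  # now falls back to the default (illustrative value)
100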
- classmethod load(path)#
Load Estimator / Transformer / Model / Evaluator from the provided cloud storage path.
New in version 3.5.0.
- classmethod loadFromLocal(path)#
Load Estimator / Transformer / Model / Evaluator from the provided local path.
New in version 3.5.0.
- save(path, *, overwrite=False)#
Save Estimator / Transformer / Model / Evaluator to the provided cloud storage path.
New in version 3.5.0.
- saveToLocal(path, *, overwrite=False)#
Save Estimator / Transformer / Model / Evaluator to the provided local path.
New in version 3.5.0.
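A local save/load roundtrip sketch for the pipeline itself (paths illustrative):

>>> pipeline.saveToLocal("/tmp/pipeline", overwrite=True)   # overwrite any earlier save
>>> same_pipeline = Pipeline.loadFromLocal("/tmp/pipeline")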
- set(param, value)#
Sets a parameter in the embedded param map.
- setStages(value)[source]#
Set pipeline stages.
New in version 3.5.0.
- Parameters
- value : list of pyspark.ml.connect.Transformer or pyspark.ml.connect.Estimator
- Returns
Pipeline
the pipeline instance
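A brief sketch, reusing the stages from the example above; since setStages returns the pipeline itself, construction can chain:

>>> pipeline = Pipeline().setStages([scaler, lor])
>>> len(pipeline.getStages())
2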
Attributes Documentation
- params#
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- stages = Param(parent='undefined', name='stages', doc='a list of pipeline stages')#
- uid#
A unique id for the object.