Spark Core#

Public Classes#

`SparkContext`([master, appName, sparkHome, ...])	Main entry point for Spark functionality.
`RDD`(jrdd, ctx[, jrdd_deserializer])	A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
`Broadcast`([sc, value, pickle_registry, ...])	A broadcast variable created with `SparkContext.broadcast()`.
`Accumulator`(aid, value, accum_param)	A shared variable that can be accumulated, i.e., has a commutative and associative "add" operation.
`AccumulatorParam`()	Helper object that defines how to accumulate values of a given type.
`SparkConf`([loadDefaults, _jvm, _jconf])	Configuration for a Spark application.
`SparkFiles`()	Resolves paths to files added through `SparkContext.addFile()`.
`StorageLevel`(useDisk, useMemory, useOffHeap, ...)	Flags for controlling the storage of an RDD.
`TaskContext`(**kwargs)	Contextual information about a task which can be read or mutated during execution.
`RDDBarrier`(rdd)	Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together.
`BarrierTaskContext`(**kwargs)	A `TaskContext` with extra contextual info and tooling for tasks in a barrier stage.
`BarrierTaskInfo`(address)	Carries all task infos of a barrier task.
`InheritableThread`(target, *args[, session])	Thread that is recommended to be used in PySpark when the pinned thread mode is enabled.
`util.VersionUtils`()	Provides utility method to determine Spark versions with given input string.

Spark Context APIs#

`SparkContext.PACKAGE_EXTENSIONS`
`SparkContext.accumulator`(value[, accum_param])	Create an `Accumulator` with the given initial value, using a given `AccumulatorParam` helper object to define how to add values of the data type if provided.
`SparkContext.addArchive`(path)	Add an archive to be downloaded with this Spark job on every node.
`SparkContext.addFile`(path[, recursive])	Add a file to be downloaded with this Spark job on every node.
`SparkContext.addJobTag`(tag)	Add a tag to be assigned to all the jobs started by this thread.
`SparkContext.addPyFile`(path)	Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future.
`SparkContext.applicationId`	A unique identifier for the Spark application.
`SparkContext.binaryFiles`(path[, minPartitions])	Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array.
`SparkContext.binaryRecords`(path, recordLength)	Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
`SparkContext.broadcast`(value)	Broadcast a read-only variable to the cluster, returning a `Broadcast` object for reading it in distributed functions.
`SparkContext.cancelAllJobs`()	Cancel all jobs that have been scheduled or are running.
`SparkContext.cancelJobGroup`(groupId)	Cancel active jobs for the specified group.
`SparkContext.cancelJobsWithTag`(tag)	Cancel active jobs that have the specified tag.
`SparkContext.clearJobTags`()	Clear the current thread's job tags.
`SparkContext.defaultMinPartitions`	Default min number of partitions for Hadoop RDDs when not given by user
`SparkContext.defaultParallelism`	Default level of parallelism to use when not given by user (e.g.
`SparkContext.dump_profiles`(path)	Dump the profile stats into directory path
`SparkContext.emptyRDD`()	Create an `RDD` that has no partitions or elements.
`SparkContext.getCheckpointDir`()	Return the directory where RDDs are checkpointed.
`SparkContext.getConf`()	Return a copy of this SparkContext's configuration `SparkConf`.
`SparkContext.getJobTags`()	Get the tags that are currently set to be assigned to all the jobs started by this thread.
`SparkContext.getLocalProperty`(key)	Get a local property set in this thread, or null if it is missing.
`SparkContext.getOrCreate`([conf])	Get or instantiate a `SparkContext` and register it as a singleton object.
`SparkContext.getSystemProperty`(key)	Get a Java system property, such as java.home.
`SparkContext.hadoopFile`(path, ...[, ...])	Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
`SparkContext.hadoopRDD`(inputFormatClass, ...)	Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
`SparkContext.listArchives`	Returns a list of archive paths that are added to resources.
`SparkContext.listFiles`	Returns a list of file paths that are added to resources.
`SparkContext.newAPIHadoopFile`(path, ...[, ...])	Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
`SparkContext.newAPIHadoopRDD`(...[, ...])	Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
`SparkContext.parallelize`(c[, numSlices])	Distribute a local Python collection to form an RDD.
`SparkContext.pickleFile`(name[, minPartitions])	Load an RDD previously saved using `RDD.saveAsPickleFile()` method.
`SparkContext.range`(start[, end, step, numSlices])	Create a new RDD of int containing elements from start to end (exclusive), increased by step every element.
`SparkContext.resources`	Return the resource information of this `SparkContext`.
`SparkContext.removeJobTag`(tag)	Remove a tag previously added to be assigned to all the jobs started by this thread.
`SparkContext.runJob`(rdd, partitionFunc[, ...])	Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
`SparkContext.sequenceFile`(path[, keyClass, ...])	Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
`SparkContext.setCheckpointDir`(dirName)	Set the directory under which RDDs are going to be checkpointed.
`SparkContext.setInterruptOnCancel`(...)	Set the behavior of job cancellation from jobs started in this thread.
`SparkContext.setJobDescription`(value)	Set a human readable description of the current job.
`SparkContext.setJobGroup`(groupId, description)	Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
`SparkContext.setLocalProperty`(key, value)	Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
`SparkContext.setLogLevel`(logLevel)	Control our logLevel.
`SparkContext.setSystemProperty`(key, value)	Set a Java system property, such as spark.executor.memory.
`SparkContext.show_profiles`()	Print the profile stats to stdout
`SparkContext.sparkUser`()	Get SPARK_USER for user who is running SparkContext.
`SparkContext.startTime`	Return the epoch time when the `SparkContext` was started.
`SparkContext.statusTracker`()	Return `StatusTracker` object
`SparkContext.stop`()	Shut down the `SparkContext`.
`SparkContext.textFile`(name[, minPartitions, ...])	Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
`SparkContext.uiWebUrl`	Return the URL of the SparkUI instance started by this `SparkContext`
`SparkContext.union`(rdds)	Build the union of a list of RDDs.
`SparkContext.version`	The version of Spark on which this application is running.
`SparkContext.wholeTextFiles`(path[, ...])	Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.

RDD APIs#

`RDD.aggregate`(zeroValue, seqOp, combOp)	Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral "zero value."
`RDD.aggregateByKey`(zeroValue, seqFunc, combFunc)	Aggregate the values of each key, using given combine functions and a neutral "zero value".
`RDD.barrier`()	Marks the current stage as a barrier stage, where Spark must launch all tasks together.
`RDD.cache`()	Persist this RDD with the default storage level (MEMORY_ONLY).
`RDD.cartesian`(other)	Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements `(a, b)` where `a` is in self and `b` is in other.
`RDD.checkpoint`()	Mark this RDD for checkpointing.
`RDD.cleanShuffleDependencies`([blocking])	Removes an RDD's shuffles and it's non-persisted ancestors.
`RDD.coalesce`(numPartitions[, shuffle])	Return a new RDD that is reduced into numPartitions partitions.
`RDD.cogroup`(other[, numPartitions])	For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
`RDD.collect`()	Return a list that contains all the elements in this RDD.
`RDD.collectAsMap`()	Return the key-value pairs in this RDD to the master as a dictionary.
`RDD.collectWithJobGroup`(groupId, description)	When collect rdd, use this method to specify job group.
`RDD.combineByKey`(createCombiner, mergeValue, ...)	Generic function to combine the elements for each key using a custom set of aggregation functions.
`RDD.context`	The `SparkContext` that this RDD was created on.
`RDD.count`()	Return the number of elements in this RDD.
`RDD.countApprox`(timeout[, confidence])	Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
`RDD.countApproxDistinct`([relativeSD])	Return approximate number of distinct elements in the RDD.
`RDD.countByKey`()	Count the number of elements for each key, and return the result to the master as a dictionary.
`RDD.countByValue`()	Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
`RDD.distinct`([numPartitions])	Return a new RDD containing the distinct elements in this RDD.
`RDD.filter`(f)	Return a new RDD containing only the elements that satisfy a predicate.
`RDD.first`()	Return the first element in this RDD.
`RDD.flatMap`(f[, preservesPartitioning])	Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
`RDD.flatMapValues`(f)	Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.
`RDD.fold`(zeroValue, op)	Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value."
`RDD.foldByKey`(zeroValue, func[, ...])	Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).
`RDD.foreach`(f)	Applies a function to all elements of this RDD.
`RDD.foreachPartition`(f)	Applies a function to each partition of this RDD.
`RDD.fullOuterJoin`(other[, numPartitions])	Perform a right outer join of self and other.
`RDD.getCheckpointFile`()	Gets the name of the file to which this RDD was checkpointed
`RDD.getNumPartitions`()	Returns the number of partitions in RDD
`RDD.getResourceProfile`()	Get the `pyspark.resource.ResourceProfile` specified with this RDD or None if it wasn't specified.
`RDD.getStorageLevel`()	Get the RDD's current storage level.
`RDD.glom`()	Return an RDD created by coalescing all elements within each partition into a list.
`RDD.groupBy`(f[, numPartitions, partitionFunc])	Return an RDD of grouped items.
`RDD.groupByKey`([numPartitions, partitionFunc])	Group the values for each key in the RDD into a single sequence.
`RDD.groupWith`(other, *others)	Alias for cogroup but with support for multiple RDDs.
`RDD.histogram`(buckets)	Compute a histogram using the provided buckets.
`RDD.id`()	A unique ID for this RDD (within its SparkContext).
`RDD.intersection`(other)	Return the intersection of this RDD and another one.
`RDD.isCheckpointed`()	Return whether this RDD is checkpointed and materialized, either reliably or locally.
`RDD.isEmpty`()	Returns true if and only if the RDD contains no elements at all.
`RDD.isLocallyCheckpointed`()	Return whether this RDD is marked for local checkpointing.
`RDD.join`(other[, numPartitions])	Return an RDD containing all pairs of elements with matching keys in self and other.
`RDD.keyBy`(f)	Creates tuples of the elements in this RDD by applying f.
`RDD.keys`()	Return an RDD with the keys of each tuple.
`RDD.leftOuterJoin`(other[, numPartitions])	Perform a left outer join of self and other.
`RDD.localCheckpoint`()	Mark this RDD for local checkpointing using Spark's existing caching layer.
`RDD.lookup`(key)	Return the list of values in the RDD for key key.
`RDD.map`(f[, preservesPartitioning])	Return a new RDD by applying a function to each element of this RDD.
`RDD.mapPartitions`(f[, preservesPartitioning])	Return a new RDD by applying a function to each partition of this RDD.
`RDD.mapPartitionsWithIndex`(f[, ...])	Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
`RDD.mapPartitionsWithSplit`(f[, ...])	Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
`RDD.mapValues`(f)	Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
`RDD.max`([key])	Find the maximum item in this RDD.
`RDD.mean`()	Compute the mean of this RDD's elements.
`RDD.meanApprox`(timeout[, confidence])	Approximate operation to return the mean within a timeout or meet the confidence.
`RDD.min`([key])	Find the minimum item in this RDD.
`RDD.name`()	Return the name of this RDD.
`RDD.partitionBy`(numPartitions[, partitionFunc])	Return a copy of the RDD partitioned using the specified partitioner.
`RDD.persist`([storageLevel])	Set this RDD's storage level to persist its values across operations after the first time it is computed.
`RDD.pipe`(command[, env, checkCode])	Return an RDD created by piping elements to a forked external process.
`RDD.randomSplit`(weights[, seed])	Randomly splits this RDD with the provided weights.
`RDD.reduce`(f)	Reduces the elements of this RDD using the specified commutative and associative binary operator.
`RDD.reduceByKey`(func[, numPartitions, ...])	Merge the values for each key using an associative and commutative reduce function.
`RDD.reduceByKeyLocally`(func)	Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
`RDD.repartition`(numPartitions)	Return a new RDD that has exactly numPartitions partitions.
`RDD.repartitionAndSortWithinPartitions`([...])	Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
`RDD.rightOuterJoin`(other[, numPartitions])	Perform a right outer join of self and other.
`RDD.sample`(withReplacement, fraction[, seed])	Return a sampled subset of this RDD.
`RDD.sampleByKey`(withReplacement, fractions)	Return a subset of this RDD sampled by key (via stratified sampling).
`RDD.sampleStdev`()	Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
`RDD.sampleVariance`()	Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
`RDD.saveAsHadoopDataset`(conf[, ...])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
`RDD.saveAsHadoopFile`(path, outputFormatClass)	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
`RDD.saveAsNewAPIHadoopDataset`(conf[, ...])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
`RDD.saveAsNewAPIHadoopFile`(path, ...[, ...])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
`RDD.saveAsPickleFile`(path[, batchSize])	Save this RDD as a SequenceFile of serialized objects.
`RDD.saveAsSequenceFile`(path[, ...])	Output a Python RDD of key-value pairs (of form `RDD[(K, V)]`) to any Hadoop file system, using the "org.apache.hadoop.io.Writable" types that we convert from the RDD's key and value types.
`RDD.saveAsTextFile`(path[, compressionCodecClass])	Save this RDD as a text file, using string representations of elements.
`RDD.setName`(name)	Assign a name to this RDD.
`RDD.sortBy`(keyfunc[, ascending, numPartitions])	Sorts this RDD by the given keyfunc
`RDD.sortByKey`([ascending, numPartitions, ...])	Sorts this RDD, which is assumed to consist of (key, value) pairs.
`RDD.stats`()	Return a `StatCounter` object that captures the mean, variance and count of the RDD's elements in one operation.
`RDD.stdev`()	Compute the standard deviation of this RDD's elements.
`RDD.subtract`(other[, numPartitions])	Return each value in self that is not contained in other.
`RDD.subtractByKey`(other[, numPartitions])	Return each (key, value) pair in self that has no pair with matching key in other.
`RDD.sum`()	Add up the elements in this RDD.
`RDD.sumApprox`(timeout[, confidence])	Approximate operation to return the sum within a timeout or meet the confidence.
`RDD.take`(num)	Take the first num elements of the RDD.
`RDD.takeOrdered`(num[, key])	Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
`RDD.takeSample`(withReplacement, num[, seed])	Return a fixed-size sampled subset of this RDD.
`RDD.toDebugString`()	A description of this RDD and its recursive dependencies for debugging.
`RDD.toLocalIterator`([prefetchPartitions])	Return an iterator that contains all of the elements in this RDD.
`RDD.top`(num[, key])	Get the top N elements from an RDD.
`RDD.treeAggregate`(zeroValue, seqOp, combOp)	Aggregates the elements of this RDD in a multi-level tree pattern.
`RDD.treeReduce`(f[, depth])	Reduces the elements of this RDD in a multi-level tree pattern.
`RDD.union`(other)	Return the union of this RDD and another one.
`RDD.unpersist`([blocking])	Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
`RDD.values`()	Return an RDD with the values of each tuple.
`RDD.variance`()	Compute the variance of this RDD's elements.
`RDD.withResources`(profile)	Specify a `pyspark.resource.ResourceProfile` to use when calculating this RDD.
`RDD.zip`(other)	Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc.
`RDD.zipWithIndex`()	Zips this RDD with its element indices.
`RDD.zipWithUniqueId`()	Zips this RDD with generated unique Long ids.

Broadcast and Accumulator#

`Broadcast.destroy`([blocking])	Destroy all data and metadata related to this broadcast variable.
`Broadcast.dump`(value, f)	Write a pickled representation of value to the open file or socket.
`Broadcast.load`(file)	Read a pickled representation of value from the open file or socket.
`Broadcast.load_from_path`(path)	Read the pickled representation of an object from the open file and return the reconstituted object hierarchy specified therein.
`Broadcast.unpersist`([blocking])	Delete cached copies of this broadcast on the executors.
`Broadcast.value`	Return the broadcasted value
`Accumulator.add`(term)	Adds a term to this accumulator's value
`Accumulator.value`	Get the accumulator's value; only usable in driver program
`AccumulatorParam.addInPlace`(value1, value2)	Add two values of the accumulator's data type, returning a new value; for efficiency, can also update value1 in place and return it.
`AccumulatorParam.zero`(value)	Provide a "zero value" for the type, compatible in dimensions with the provided value (e.g., a zero vector)

Management#

`inheritable_thread_target`([f])	Return thread target wrapper which is recommended to be used in PySpark when the pinned thread mode is enabled.
`SparkConf.contains`(key)	Does this configuration contain a given key?
`SparkConf.get`(key[, defaultValue])	Get the configured value for some key, or return a default otherwise.
`SparkConf.getAll`()	Get all values as a list of key-value pairs.
`SparkConf.set`(key, value)	Set a configuration property.
`SparkConf.setAll`(pairs)	Set multiple parameters, passed as a list of key-value pairs.
`SparkConf.setAppName`(value)	Set application name.
`SparkConf.setExecutorEnv`([key, value, pairs])	Set an environment variable to be passed to executors.
`SparkConf.setIfMissing`(key, value)	Set a configuration property, if not already set.
`SparkConf.setMaster`(value)	Set master URL to connect to.
`SparkConf.setSparkHome`(value)	Set path where Spark is installed on worker nodes.
`SparkConf.toDebugString`()	Returns a printable version of the configuration, as a list of key=value pairs, one per line.
`SparkFiles.get`(filename)	Get the absolute path of a file added through `SparkContext.addFile()` or `SparkContext.addPyFile()`.
`SparkFiles.getRootDirectory`()	Get the root directory that contains files added through `SparkContext.addFile()` or `SparkContext.addPyFile()`.
`StorageLevel.DISK_ONLY`
`StorageLevel.DISK_ONLY_2`
`StorageLevel.DISK_ONLY_3`
`StorageLevel.MEMORY_AND_DISK`
`StorageLevel.MEMORY_AND_DISK_2`
`StorageLevel.MEMORY_AND_DISK_DESER`
`StorageLevel.MEMORY_ONLY`
`StorageLevel.MEMORY_ONLY_2`
`StorageLevel.OFF_HEAP`
`TaskContext.attemptNumber`()	How many times this task has been attempted.
`TaskContext.cpus`()	CPUs allocated to the task.
`TaskContext.get`()	Return the currently active `TaskContext`.
`TaskContext.getLocalProperty`(key)	Get a local property set upstream in the driver, or None if it is missing.
`TaskContext.partitionId`()	The ID of the RDD partition that is computed by this task.
`TaskContext.resources`()	Resources allocated to the task.
`TaskContext.stageId`()	The ID of the stage that this task belong to.
`TaskContext.taskAttemptId`()	An ID that is unique to this task attempt (within the same `SparkContext`, no two task attempts will share the same attempt ID).
`RDDBarrier.mapPartitions`(f[, ...])	Returns a new RDD by applying a function to each partition of the wrapped RDD, where tasks are launched together in a barrier stage.
`RDDBarrier.mapPartitionsWithIndex`(f[, ...])	Returns a new RDD by applying a function to each partition of the wrapped RDD, while tracking the index of the original partition.
`BarrierTaskContext.allGather`([message])	This function blocks until all tasks in the same stage have reached this routine.
`BarrierTaskContext.attemptNumber`()	How many times this task has been attempted.
`BarrierTaskContext.barrier`()	Sets a global barrier and waits until all tasks in this stage hit this barrier.
`BarrierTaskContext.cpus`()	CPUs allocated to the task.
`BarrierTaskContext.get`()	Return the currently active `BarrierTaskContext`.
`BarrierTaskContext.getLocalProperty`(key)	Get a local property set upstream in the driver, or None if it is missing.
`BarrierTaskContext.getTaskInfos`()	Returns `BarrierTaskInfo` for all tasks in this barrier stage, ordered by partition ID.
`BarrierTaskContext.partitionId`()	The ID of the RDD partition that is computed by this task.
`BarrierTaskContext.resources`()	Resources allocated to the task.
`BarrierTaskContext.stageId`()	The ID of the stage that this task belong to.
`BarrierTaskContext.taskAttemptId`()	An ID that is unique to this task attempt (within the same `SparkContext`, no two task attempts will share the same attempt ID).
`util.VersionUtils.majorMinorVersion`(sparkVersion)	Given a Spark version string, return the (major version number, minor version number).