pyspark.RDD.localCheckpoint#

RDD.localCheckpoint()[source]#

Mark this RDD for local checkpointing using Spark’s existing caching layer.

This method is for users who wish to truncate RDD lineages while skipping the expensive step of replicating the materialized data in a reliable distributed file system. This is useful for RDDs with long lineages that need to be truncated periodically (e.g. GraphX).

Local checkpointing sacrifices fault-tolerance for performance. In particular, checkpointed data is written to ephemeral local storage in the executors instead of to a reliable, fault-tolerant storage. The effect is that if an executor fails during the computation, the checkpointed data may no longer be accessible, causing an irrecoverable job failure.

This is NOT safe to use with dynamic allocation, which removes executors along with their cached blocks. If you must use both features, you are advised to set spark.dynamicAllocation.cachedExecutorIdleTimeout to a high value.

The checkpoint directory set through SparkContext.setCheckpointDir() is not used.

New in version 2.2.0.

Examples

>>> rdd = sc.range(5)
>>> rdd.isLocallyCheckpointed()
False
>>> rdd.localCheckpoint()
>>> rdd.isLocallyCheckpointed()
True