pyspark.RDD.subtract

RDD.subtract(other, numPartitions=None)

Return each value in self that is not contained in other.

Added in version 0.9.1.

Parameters:

other (RDD): another RDD
numPartitions (int, optional): the number of partitions in the new RDD

Returns:

RDD: an RDD with the elements from this RDD that are not in other

See also

RDD.subtractByKey()

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> rdd2 = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(rdd1.subtract(rdd2).collect())
[('a', 1), ('b', 4), ('b', 5)]
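
The example above compares whole elements, so ("a", 1) survives even though the key "a" also appears in rdd2. The following is a minimal local sketch of that behaviour using plain Python lists, not Spark code and not the library's implementation. The helper names `local_subtract` and `local_subtract_by_key` are hypothetical. The sketch also assumes that duplicates in the left-hand collection are kept rather than collapsed as in a set difference, and that `RDD.subtractByKey()` filters on the key alone.

```python
def local_subtract(left, right):
    """Keep every element of left that does not appear in right.

    Hypothetical local model of RDD.subtract: whole elements are compared,
    and duplicates in left are preserved (assumption stated in the lead-in).
    """
    excluded = set(right)
    return [x for x in left if x not in excluded]


def local_subtract_by_key(left, right):
    """Keep every (key, value) pair of left whose key does not appear in right.

    Hypothetical local model of RDD.subtractByKey: only keys are compared.
    """
    excluded_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in excluded_keys]


# Same data as the doctest above, plus one duplicate ("b", 4).
left = [("a", 1), ("b", 4), ("b", 5), ("a", 3), ("b", 4)]
right = [("a", 3), ("c", None)]

subtracted = sorted(local_subtract(left, right))
print(subtracted)         # [('a', 1), ('b', 4), ('b', 4), ('b', 5)]

subtracted_by_key = sorted(local_subtract_by_key(left, right))
print(subtracted_by_key)  # [('b', 4), ('b', 4), ('b', 5)]
```

In this model, use the whole-element form when a pair should be dropped only on an exact match, and the by-key form when any pair sharing a key with the other collection should be dropped.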