pyspark.SparkContext.newAPIHadoopFile#
- SparkContext.newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)[source]#
Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for meth:SparkContext.sequenceFile.
A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java
New in version 1.1.0.
- Parameters
- pathstr
path to Hadoop file
- inputFormatClassstr
fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
- keyClassstr
fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
- valueClassstr
fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
- keyConverterstr, optional
fully qualified name of a function returning key WritableConverter None by default
- valueConverterstr, optional
fully qualified name of a function returning value WritableConverter None by default
- confdict, optional
Hadoop configuration, passed in as a dict None by default
- batchSizeint, optional, default 0
The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
- Returns
RDD
RDD of tuples of key and corresponding value
See also
Examples
>>> import os >>> import tempfile
Set the related classes
>>> output_format_class = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat" >>> input_format_class = "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat" >>> key_class = "org.apache.hadoop.io.IntWritable" >>> value_class = "org.apache.hadoop.io.Text"
>>> with tempfile.TemporaryDirectory(prefix="newAPIHadoopFile") as d: ... path = os.path.join(d, "new_hadoop_file") ... ... # Write a temporary Hadoop file ... rdd = sc.parallelize([(1, ""), (1, "a"), (3, "x")]) ... rdd.saveAsNewAPIHadoopFile(path, output_format_class, key_class, value_class) ... ... loaded = sc.newAPIHadoopFile(path, input_format_class, key_class, value_class) ... collected = sorted(loaded.collect())
>>> collected [(1, ''), (1, 'a'), (3, 'x')]