Class FileInputFormat<K,V>

java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
Direct Known Subclasses:
CombineFileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat

@Public @Stable public abstract class FileInputFormat<K,V> extends InputFormat<K,V>
A base class for file-based InputFormats.

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext). Implementations of FileInputFormat can also override the isSplitable(JobContext, Path) method to prevent input files from being split up in certain situations. Implementations that deal with non-splittable files must override this method, since the default implementation assumes splitting is always possible.

  • Field Details

  • Constructor Details

    • FileInputFormat

      public FileInputFormat()
  • Method Details

    • setInputDirRecursive

      public static void setInputDirRecursive(Job job, boolean inputDirRecursive)
      Parameters:
      job - the job to modify
inputDirRecursive - whether input directories should be read recursively
    • getInputDirRecursive

      public static boolean getInputDirRecursive(JobContext job)
      Parameters:
      job - the job to look at.
      Returns:
whether the input files should be read recursively
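The two methods above toggle and query recursive traversal of input directories. A minimal job-setup sketch (the job name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInputExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recursive-input");
        // Descend into subdirectories of each input path instead of
        // treating nested directories as an error.
        FileInputFormat.setInputDirRecursive(job, true);
        // The setting can be read back later, e.g. inside getSplits():
        boolean recursive = FileInputFormat.getInputDirRecursive(job);
    }
}
```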
    • getFormatMinSplitSize

      protected long getFormatMinSplitSize()
      Get the lower bound on split size imposed by the format.
      Returns:
      the number of bytes of the minimal split for this format
    • isSplitable

      protected boolean isSplitable(JobContext context, Path filename)
Is the given filename splittable? Usually true, but if the file is stream-compressed, it will not be. The default implementation in FileInputFormat always returns true. Implementations that deal with non-splittable files must override this method and return false, ensuring that individual input files are never split up and that Mappers process entire files.
      Parameters:
      context - the job context
      filename - the file name to check
      Returns:
whether this file is splittable
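A common reason to override isSplitable is a record format that spans whole files. A minimal sketch, assuming a hypothetical subclass name (WholeFileTextInputFormat is not part of Hadoop):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical subclass: forces each input file into a single split,
// so one Mapper sees the entire file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size or codec
    }
}
```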
    • setInputPathFilter

      public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
      Set a PathFilter to be applied to the input paths for the map-reduce job.
      Parameters:
      job - the job to modify
filter - the PathFilter class used for filtering the input paths.
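The filter class registered this way is instantiated by the framework and consulted for each candidate input path. A sketch of a filter that skips files beginning with an underscore (the class name is illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Illustrative filter: accept everything except files whose names
// start with an underscore (e.g. _SUCCESS markers).
public class NoUnderscoreFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return !path.getName().startsWith("_");
    }
}
// Registered during job setup:
//   FileInputFormat.setInputPathFilter(job, NoUnderscoreFilter.class);
```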
    • setMinInputSplitSize

      public static void setMinInputSplitSize(Job job, long size)
Set the minimum input split size.
      Parameters:
      job - the job to modify
      size - the minimum size
    • getMinSplitSize

      public static long getMinSplitSize(JobContext job)
Get the minimum split size.
      Parameters:
      job - the job
      Returns:
      the minimum number of bytes that can be in a split
    • setMaxInputSplitSize

      public static void setMaxInputSplitSize(Job job, long size)
Set the maximum split size.
      Parameters:
      job - the job to modify
      size - the maximum split size
    • getMaxSplitSize

      public static long getMaxSplitSize(JobContext context)
      Get the maximum split size.
      Parameters:
      context - the job to look at.
      Returns:
      the maximum number of bytes a split can include
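The minimum and maximum split sizes above bound the split size that getSplits(JobContext) computes for each file. A configuration sketch (the 64 MB and 256 MB values are arbitrary examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Example bounds: no split smaller than 64 MB
        // or larger than 256 MB.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```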
    • getInputPathFilter

      public static PathFilter getInputPathFilter(JobContext context)
      Get a PathFilter instance of the filter set for the input paths.
      Returns:
the PathFilter instance set for the job, or null if none has been set.
    • listStatus

      protected List<FileStatus> listStatus(JobContext job) throws IOException
      List input directories. Subclasses may override to, e.g., select only files matching a regular expression. If security is enabled, this method collects delegation tokens from the input paths and adds them to the job's credentials.
      Parameters:
      job - the job to list input paths for and attach tokens to.
      Returns:
a List of FileStatus objects
      Throws:
IOException - if no input files are listed.
    • addInputPathRecursively

      protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
      Add files in the input path recursively into the results.
      Parameters:
      result - The List to store all files.
      fs - The FileSystem.
      path - The input path.
      inputFilter - The input filter that can be used to filter files/dirs.
      Throws:
      IOException
    • shrinkStatus

      public static FileStatus shrinkStatus(FileStatus origStat)
The HdfsBlockLocation includes a LocatedBlock, which carries information for issuing more detailed queries to datanodes about a block; this information is currently not needed during job submission. This method excludes the LocatedBlock from the HdfsBlockLocation by creating a new BlockLocation from the original and reshaping the LocatedFileStatus, allowing listStatus(JobContext) to scan more files with a smaller memory footprint.
      Parameters:
      origStat - The fat FileStatus.
      Returns:
      The FileStatus that has been shrunk.
    • makeSplit

      protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
A factory that makes the split for this class. It can be overridden by subclasses to make subtypes.
    • makeSplit

      protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
A factory that makes the split for this class. It can be overridden by subclasses to make subtypes.
    • getSplits

      public List<InputSplit> getSplits(JobContext job) throws IOException
      Generate the list of files and make them into FileSplits.
      Specified by:
      getSplits in class InputFormat<K,V>
      Parameters:
      job - the job context
      Returns:
a list of InputSplits for the job.
      Throws:
      IOException
    • computeSplitSize

      protected long computeSplitSize(long blockSize, long minSize, long maxSize)
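In the stock implementation, the split size is the filesystem block size clamped between the configured minimum and maximum. A self-contained sketch of that formula (the body mirrors the Hadoop source, but verify against your version):

```java
public class SplitSizeFormula {
    // Clamp the filesystem block size between the configured
    // minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 128 MB block within bounds [1 B, 256 MB]: the block size wins.
        System.out.println(computeSplitSize(128 * mb, 1, 256 * mb)); // 134217728
    }
}
```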
    • getBlockIndex

      protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
    • setInputPaths

      public static void setInputPaths(Job job, String commaSeparatedPaths) throws IOException
      Sets the given comma separated paths as the list of inputs for the map-reduce job.
      Parameters:
      job - the job
      commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
      Throws:
      IOException
    • addInputPaths

      public static void addInputPaths(Job job, String commaSeparatedPaths) throws IOException
      Add the given comma separated paths to the list of inputs for the map-reduce job.
      Parameters:
      job - The job to modify
      commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
      Throws:
      IOException
    • setInputPaths

      public static void setInputPaths(Job job, Path... inputPaths) throws IOException
      Set the array of Paths as the list of inputs for the map-reduce job.
      Parameters:
      job - The job to modify
      inputPaths - the Paths of the input directories/files for the map-reduce job.
      Throws:
      IOException
    • addInputPath

      public static void addInputPath(Job job, Path path) throws IOException
      Add a Path to the list of inputs for the map-reduce job.
      Parameters:
      job - The Job to modify
      path - Path to be added to the list of inputs for the map-reduce job.
      Throws:
      IOException
    • getInputPaths

      public static Path[] getInputPaths(JobContext context)
      Get the list of input Paths for the map-reduce job.
      Parameters:
      context - The job
      Returns:
      the list of input Paths for the map-reduce job.
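The input-path setters and adders above are typically among the first FileInputFormat calls in a job driver. A sketch of their use (the paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-path-demo");
        // Replace any existing inputs with these two paths.
        FileInputFormat.setInputPaths(job, new Path("/data/a"), new Path("/data/b"));
        // Append one more directory to the list.
        FileInputFormat.addInputPath(job, new Path("/data/c"));
        // The comma-separated form is equivalent:
        //   FileInputFormat.addInputPaths(job, "/data/d,/data/e");
        for (Path p : FileInputFormat.getInputPaths(job)) {
            System.out.println(p);
        }
    }
}
```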