Class FileInputFormat<K,V>
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
- Direct Known Subclasses:
CombineFileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat
FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext).
Implementations of FileInputFormat can also override the isSplitable(JobContext, Path) method to prevent input files from being split up in certain situations. Implementations that may deal with non-splittable files must override this method, since the default implementation assumes splitting is always possible.
-
Method Summary
Modifier and Type, Method, Description
static void addInputPath(Job job, Path path)
    Add a Path to the list of inputs for the map-reduce job.
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter)
    Add files in the input path recursively into the results.
static void addInputPaths(Job job, String commaSeparatedPaths)
    Add the given comma-separated paths to the list of inputs for the map-reduce job.
protected long computeSplitSize(long blockSize, long minSize, long maxSize)
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
protected long getFormatMinSplitSize()
    Get the lower bound on split size imposed by the format.
static boolean getInputDirRecursive(JobContext job)
static PathFilter getInputPathFilter(JobContext context)
    Get a PathFilter instance of the filter set for the input paths.
static Path[] getInputPaths(JobContext context)
    Get the list of input Paths for the map-reduce job.
static long getMaxSplitSize(JobContext context)
    Get the maximum split size.
static long getMinSplitSize(JobContext job)
    Get the minimum split size.
List<InputSplit> getSplits(JobContext job)
    Generate the list of files and make them into FileSplits.
protected boolean isSplitable(JobContext context, Path filename)
    Is the given filename splittable?
protected List<FileStatus> listStatus(JobContext job)
    List input directories.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
    A factory that makes the split for this class.
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
    A factory that makes the split for this class.
static void setInputDirRecursive(Job job, boolean inputDirRecursive)
static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
    Set a PathFilter to be applied to the input paths for the map-reduce job.
static void setInputPaths(Job job, String commaSeparatedPaths)
    Sets the given comma-separated paths as the list of inputs for the map-reduce job.
static void setInputPaths(Job job, Path... inputPaths)
    Set the array of Paths as the list of inputs for the map-reduce job.
static void setMaxInputSplitSize(Job job, long size)
    Set the maximum split size.
static void setMinInputSplitSize(Job job, long size)
    Set the minimum input split size.
static FileStatus shrinkStatus(FileStatus origStat)
    The HdfsBlockLocation includes a LocatedBlock which contains messages for issuing more detailed queries to datanodes about a block, but these messages are useless during job submission currently.
Methods inherited from class org.apache.hadoop.mapreduce.InputFormat
createRecordReader
-
Field Details
-
INPUT_DIR
-
SPLIT_MAXSIZE
-
SPLIT_MINSIZE
-
PATHFILTER_CLASS
-
NUM_INPUT_FILES
-
INPUT_DIR_RECURSIVE
-
INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS
-
LIST_STATUS_NUM_THREADS
-
DEFAULT_LIST_STATUS_NUM_THREADS
public static final int DEFAULT_LIST_STATUS_NUM_THREADS
-
-
Constructor Details
-
FileInputFormat
public FileInputFormat()
-
-
Method Details
-
setInputDirRecursive
- Parameters:
job - the job to modify
inputDirRecursive - whether input directories should be read recursively
-
getInputDirRecursive
- Parameters:
job - the job to look at
- Returns:
whether the input files should be read recursively
-
getFormatMinSplitSize
protected long getFormatMinSplitSize()
Get the lower bound on split size imposed by the format.
- Returns:
the number of bytes of the minimal split for this format
-
isSplitable
Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be. The default implementation in FileInputFormat always returns true. Implementations that may deal with non-splittable files must override this method. FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that Mappers process entire files.
- Parameters:
context - the job context
filename - the file name to check
- Returns:
whether this file is splittable
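As an illustration of the decision an override might make, here is a minimal stand-alone sketch. It is not Hadoop code: the class name SplitableSketch, the String-based signature, and the suffix list are all hypothetical, and a real subclass would override isSplitable(JobContext, Path) instead. It assumes that files ending in .gz or .deflate are stream compressed and therefore must not be split.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of the Hadoop API.
class SplitableSketch {

    // Assumed suffixes of stream-compressed files that cannot be split.
    private static final Set<String> NON_SPLITTABLE_SUFFIXES =
            new HashSet<>(Arrays.asList(".gz", ".deflate"));

    // Returns false when the file name carries one of the assumed
    // non-splittable suffixes, true otherwise.
    static boolean isSplitable(String filename) {
        String lower = filename.toLowerCase(Locale.ROOT);
        for (String suffix : NON_SPLITTABLE_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("part-00000.txt"));
        System.out.println(isSplitable("part-00000.gz"));
    }
}
```

Returning false unconditionally is the simpler choice when every Mapper must see a whole file, at the cost of losing parallelism on large inputs.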
-
setInputPathFilter
Set a PathFilter to be applied to the input paths for the map-reduce job.
- Parameters:
job - the job to modify
filter - the PathFilter class to use for filtering the input paths
-
setMinInputSplitSize
Set the minimum input split size.
- Parameters:
job - the job to modify
size - the minimum size
-
getMinSplitSize
Get the minimum split size.
- Parameters:
job - the job
- Returns:
the minimum number of bytes that can be in a split
-
setMaxInputSplitSize
Set the maximum split size.
- Parameters:
job - the job to modify
size - the maximum split size
-
getMaxSplitSize
Get the maximum split size.
- Parameters:
context - the job to look at
- Returns:
the maximum number of bytes a split can include
-
getInputPathFilter
Get a PathFilter instance of the filter set for the input paths.
- Returns:
the PathFilter instance set for the job, or null if none has been set
-
listStatus
List input directories. Subclasses may override to, e.g., select only files matching a regular expression. If security is enabled, this method collects delegation tokens from the input paths and adds them to the job's credentials.
- Parameters:
job - the job to list input paths for and attach tokens to
- Returns:
list of FileStatus objects
- Throws:
IOException - if there are zero items
-
addInputPathRecursively
protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
Add files in the input path recursively into the results.
- Parameters:
result - The List to store all files.
fs - The FileSystem.
path - The input path.
inputFilter - The input filter that can be used to filter files/dirs.
- Throws:
IOException
-
shrinkStatus
The HdfsBlockLocation includes a LocatedBlock which contains messages for issuing more detailed queries to datanodes about a block, but these messages are useless during job submission currently. This method tries to exclude the LocatedBlock from HdfsBlockLocation by creating a new BlockLocation from the original, reshaping the LocatedFileStatus and allowing listStatus(JobContext) to scan more files with a smaller memory footprint.
- Parameters:
origStat - The fat FileStatus.
- Returns:
The FileStatus that has been shrunk.
- See Also:
BlockLocation, HdfsBlockLocation
-
makeSplit
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
A factory that makes the split for this class. It can be overridden by subclasses to make sub-types.
-
makeSplit
protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
A factory that makes the split for this class. It can be overridden by subclasses to make sub-types.
-
getSplits
Generate the list of files and make them into FileSplits.
- Specified by:
getSplits in class InputFormat<K,V>
- Parameters:
job - the job context
- Returns:
a list of InputSplits for the job
- Throws:
IOException
-
computeSplitSize
protected long computeSplitSize(long blockSize, long minSize, long maxSize) -
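This entry carries no description. As a minimal sketch, assuming the split size is the block size clamped between the minimum and maximum split sizes (an assumption made here for illustration, not something this page states), the computation could look like the following. The class name SplitSizeSketch is hypothetical.

```java
// Hypothetical stand-alone sketch; not part of the Hadoop API.
class SplitSizeSketch {

    // Assumption: clamp the block size between minSize and maxSize.
    // If minSize exceeds maxSize, minSize wins in this formulation.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Block size within bounds: the block size is used as-is.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE));
        // Maximum below the block size: splits shrink to the maximum.
        System.out.println(computeSplitSize(128 * mb, 1, 64 * mb));
        // Minimum above the block size: splits grow to the minimum.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE));
    }
}
```

Under this reading, setMaxInputSplitSize can only make splits smaller than a block, and setMinInputSplitSize can only make them larger.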
getBlockIndex
protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
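This entry carries no description either. One plausible reading of the parameters, sketched below under stated assumptions, is a lookup of the block whose byte range contains a given file offset. The Block class is a hypothetical stand-in for BlockLocation holding only an offset and a length, and throwing IllegalArgumentException for an offset outside every block is likewise an assumption.

```java
// Hypothetical stand-alone sketch; Block stands in for Hadoop's BlockLocation.
class BlockIndexSketch {

    static final class Block {
        final long offset;
        final long length;

        Block(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Returns the index of the first block whose half-open byte range
    // [offset, offset + length) contains the requested offset.
    static int getBlockIndex(Block[] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            Block b = blocks[i];
            if (b.offset <= offset && offset < b.offset + b.length) {
                return i;
            }
        }
        // Assumption: an offset outside every block is a caller error.
        throw new IllegalArgumentException(
                "Offset " + offset + " is outside of all blocks");
    }

    public static void main(String[] args) {
        Block[] blocks = {
            new Block(0, 100), new Block(100, 100), new Block(200, 50)
        };
        System.out.println(getBlockIndex(blocks, 150));
    }
}
```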
-
setInputPaths
Sets the given comma-separated paths as the list of inputs for the map-reduce job.
- Parameters:
job - the job
commaSeparatedPaths - Comma-separated paths to be set as the list of inputs for the map-reduce job.
- Throws:
IOException
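To make the comma-separated format concrete, the sketch below breaks such a string into individual path strings. It is a stand-alone illustration, not the library's code: the class PathListSketch and the method splitPaths are hypothetical, and it assumes that commas inside curly-brace glob groups such as {2023,2024} belong to the path and do not act as separators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Hadoop API.
class PathListSketch {

    // Splits on commas that are not nested inside '{' ... '}' groups.
    static List<String> splitPaths(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        int depth = 0; // current nesting depth of curly-brace groups
        int start = 0; // start index of the path currently being scanned
        for (int i = 0; i < commaSeparatedPaths.length(); i++) {
            char ch = commaSeparatedPaths.charAt(i);
            if (ch == '{') {
                depth++;
            } else if (ch == '}' && depth > 0) {
                depth--;
            } else if (ch == ',' && depth == 0) {
                paths.add(commaSeparatedPaths.substring(start, i));
                start = i + 1;
            }
        }
        // The text after the last separator is the final path.
        paths.add(commaSeparatedPaths.substring(start));
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(splitPaths("/logs/{2023,2024}/part,/in/c"));
    }
}
```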
-
addInputPaths
Add the given comma-separated paths to the list of inputs for the map-reduce job.
- Parameters:
job - The job to modify
commaSeparatedPaths - Comma-separated paths to be added to the list of inputs for the map-reduce job.
- Throws:
IOException
-
setInputPaths
Set the array of Paths as the list of inputs for the map-reduce job.
- Parameters:
job - The job to modify
inputPaths - the Paths of the input directories/files for the map-reduce job.
- Throws:
IOException
-
addInputPath
Add a Path to the list of inputs for the map-reduce job.
- Parameters:
job - The Job to modify
path - Path to be added to the list of inputs for the map-reduce job.
- Throws:
IOException
-
getInputPaths
Get the list of input Paths for the map-reduce job.
- Parameters:
context - The job
- Returns:
the list of input Paths for the map-reduce job.
-