Class CombineFileInputFormat<K,V>

java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<K,V>
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
      org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat<K,V>

Direct Known Subclasses:
CombineFileInputFormat, CombineSequenceFileInputFormat, CombineTextInputFormat
An abstract InputFormat that returns CombineFileSplit's in the
InputFormat.getSplits(JobContext) method.

Splits are constructed from the files under the input paths. A split
cannot have files from different pools. Each split returned may contain
blocks from different files.

If a maxSplitSize is specified, then blocks on the same node are combined
to form a single split. Blocks that are left over are then combined with
other blocks in the same rack. If maxSplitSize is not specified, then
blocks from the same rack are combined in a single split; no attempt is
made to create node-local splits. If maxSplitSize is equal to the block
size, then this class is similar to the default splitting behavior in
Hadoop: each block is a locally processed split.

Subclasses implement InputFormat.createRecordReader(InputSplit,
TaskAttemptContext) to construct RecordReader's for CombineFileSplit's.
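The node-first, then rack-level combining described above can be illustrated with a small self-contained simulation. This is a hedged sketch, not the Hadoop implementation: the class and method names here (CombineSketch, combine) are hypothetical, and the real getSplits(JobContext) additionally honors pools and the per-node/per-rack minimum split sizes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the combining strategy when maxSplitSize is set:
// pack each node's blocks into splits of about maxSplitSize bytes, then
// pool all leftovers and pack them the same way at rack level.
public class CombineSketch {
    static List<Long> combine(Map<String, long[]> blocksPerNode, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();
        long rackLeftover = 0;
        for (long[] blocks : blocksPerNode.values()) {
            long current = 0;
            for (long b : blocks) {
                current += b;
                if (current >= maxSplitSize) {   // node-local split is full
                    splits.add(current);
                    current = 0;
                }
            }
            rackLeftover += current;             // defer leftover to rack level
        }
        // Rack-level pass over the pooled leftovers.
        while (rackLeftover >= maxSplitSize) {
            splits.add(maxSplitSize);
            rackLeftover -= maxSplitSize;
        }
        if (rackLeftover > 0) {
            splits.add(rackLeftover);
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, long[]> cluster = new LinkedHashMap<>();
        cluster.put("node1", new long[] {128, 128, 128});  // block sizes in MB for brevity
        cluster.put("node2", new long[] {128, 64});
        // With a 256 MB cap: node1 yields one 256 MB split plus a 128 MB
        // leftover, node2 yields a 192 MB leftover; the pooled 320 MB of
        // leftovers then yield a 256 MB rack split and a final 64 MB split.
        System.out.println(combine(cluster, 256));  // → [256, 256, 64]
    }
}
```

Note how the rack-level pass only ever sees bytes that could not fill a node-local split, which is why node-local splits dominate when maxSplitSize is close to the block size.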
Nested Class Summary

Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
FileInputFormat.Counter

Field Summary

Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE

Constructor Summary

CombineFileInputFormat()

Method Summary

protected void	createPool(List<PathFilter> filters)
    Create a new pool and add the filters to it.
protected void	createPool(PathFilter... filters)
    Create a new pool and add the filters to it.
abstract RecordReader<K,V>	createRecordReader(InputSplit split, TaskAttemptContext context)
    This is not implemented yet.
protected BlockLocation[]	getFileBlockLocations(FileSystem fs, FileStatus stat)
List<InputSplit>	getSplits(JobContext job)
    Generate the list of files and make them into FileSplits.
protected boolean	isSplitable(JobContext context, Path file)
    Is the given filename splittable?
protected void	setMaxSplitSize(long maxSplitSize)
    Specify the maximum size (in bytes) of each split.
protected void	setMinSplitSizeNode(long minSplitSizeNode)
    Specify the minimum size (in bytes) of each split per node.
protected void	setMinSplitSizeRack(long minSplitSizeRack)
    Specify the minimum size (in bytes) of each split per rack.

Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize, shrinkStatus
Field Details

SPLIT_MINSIZE_PERNODE

public static final String SPLIT_MINSIZE_PERNODE

SPLIT_MINSIZE_PERRACK

public static final String SPLIT_MINSIZE_PERRACK
Constructor Details

CombineFileInputFormat

public CombineFileInputFormat()

Default constructor.
Method Details
-
setMaxSplitSize

protected void setMaxSplitSize(long maxSplitSize)

Specify the maximum size (in bytes) of each split. Each split is approximately equal to the specified size.

setMinSplitSizeNode

protected void setMinSplitSizeNode(long minSplitSizeNode)

Specify the minimum size (in bytes) of each split per node. This applies to data that is left over after combining data on a single node into splits of at most maxSplitSize bytes. The leftover data is combined into its own split if its size exceeds minSplitSizeNode; otherwise it is carried forward to rack-level combining.

setMinSplitSizeRack

protected void setMinSplitSizeRack(long minSplitSizeRack)

Specify the minimum size (in bytes) of each split per rack. This applies to data that is left over after combining data on a single rack into splits of at most maxSplitSize bytes. The leftover data is combined into its own split if its size exceeds minSplitSizeRack.
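The per-node minimum can be illustrated with a small standalone sketch. This is a hypothetical simplification (the names MinSplitSketch and applyNodeMinimum are illustrative, not Hadoop internals): after node-local packing, each node's leftover becomes its own split only if it reaches minSplitSizeNode; smaller leftovers are deferred to the rack-level pass, where minSplitSizeRack plays the same role.

```java
// Hypothetical sketch of the per-node minimum rule applied to the
// leftovers that remain after node-local combining.
public class MinSplitSketch {
    // Partition node leftovers: those >= minSplitSizeNode become their own
    // splits; the rest are summed and handed to the rack-level pass.
    // Returns {splitCount, splitBytes, deferredBytes}.
    static long[] applyNodeMinimum(long[] leftovers, long minSplitSizeNode) {
        long splitBytes = 0, deferredBytes = 0;
        long splitCount = 0;
        for (long left : leftovers) {
            if (left >= minSplitSizeNode) {
                splitBytes += left;      // big enough: emit as a node-local split
                splitCount++;
            } else {
                deferredBytes += left;   // too small: defer to rack-level combining
            }
        }
        return new long[] {splitCount, splitBytes, deferredBytes};
    }

    public static void main(String[] args) {
        // Leftovers of 96, 32 and 80 MB with a 64 MB per-node minimum:
        long[] r = applyNodeMinimum(new long[] {96, 32, 80}, 64);
        System.out.println(r[0] + " node splits, " + r[2] + " MB deferred to rack");
        // → "2 node splits, 32 MB deferred to rack"
    }
}
```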
createPool

protected void createPool(List<PathFilter> filters)

Create a new pool and add the filters to it. A split cannot have files from different pools.

createPool

protected void createPool(PathFilter... filters)

Create a new pool and add the filters to it. A pathname can satisfy any one of the specified filters. A split cannot have files from different pools.
isSplitable

protected boolean isSplitable(JobContext context, Path file)

Description copied from class: FileInputFormat

Is the given filename splittable? Usually true, but if the file is stream compressed, it will not be. The default implementation in FileInputFormat always returns true. Implementations that deal with non-splittable files must override this method and return false, ensuring that individual input files are never split up and that a single Mapper processes each entire file.

Overrides:
isSplitable in class FileInputFormat<K,V>
Parameters:
context - the job context
file - the file name to check
Returns:
is this file splittable?
-
getSplits

public List<InputSplit> getSplits(JobContext job) throws IOException

Description copied from class: FileInputFormat

Generate the list of files and make them into FileSplits.

Overrides:
getSplits in class FileInputFormat<K,V>
Parameters:
job - the job context
Returns:
the InputSplits for the job
Throws:
IOException
-
createRecordReader

public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException

This is not implemented yet.

Specified by:
createRecordReader in class InputFormat<K,V>
Parameters:
split - the split to be read
context - the information about the task
Returns:
a new record reader
Throws:
IOException
-
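A typical concrete subclass wires createRecordReader to CombineFileRecordReader, which instantiates one wrapped reader per file chunk in the CombineFileSplit. The sketch below shows that pattern for text input; the class names MyCombineTextInputFormat and TextRecordReaderWrapper are hypothetical, and this is an illustrative compile-only sketch rather than the library's own code (it requires the Hadoop client libraries on the classpath).

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical subclass: combines many small text files into large splits
// and reads each underlying chunk with TextInputFormat's record reader.
public class MyCombineTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

    public MyCombineTextInputFormat() {
        setMaxSplitSize(256L * 1024 * 1024);   // cap combined splits at 256 MB
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader creates one wrapper per file chunk in the
        // CombineFileSplit and iterates over them in order.
        return new CombineFileRecordReader<>(
                (CombineFileSplit) split, context, TextRecordReaderWrapper.class);
    }

    // Wrapper adapting TextInputFormat's reader to a single chunk. The
    // (split, context, index) constructor signature is required because
    // CombineFileRecordReader locates it reflectively.
    private static class TextRecordReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextRecordReaderWrapper(CombineFileSplit split,
                TaskAttemptContext context, Integer index)
                throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, index);
        }
    }
}
```

Setting the input format on a Job is then a one-liner: job.setInputFormatClass(MyCombineTextInputFormat.class).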
getFileBlockLocations

protected BlockLocation[] getFileBlockLocations(FileSystem fs, FileStatus stat) throws IOException

Throws:
IOException