org.apache.hadoop.mapreduce.InputFormat<K,V>

org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>

org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat<K,V>

Direct Known Subclasses: CombineFileInputFormat, CombineSequenceFileInputFormat, CombineTextInputFormat

@Public @Stable public abstract class CombineFileInputFormat<K,V> extends FileInputFormat<K,V>

An abstract InputFormat that returns CombineFileSplits from the InputFormat.getSplits(JobContext) method. Splits are constructed from the files under the input paths. Each split may contain blocks from different files, but a split cannot contain files from different pools.

If maxSplitSize is specified, blocks on the same node are combined to form a single split, and leftover blocks are then combined with other blocks in the same rack. If maxSplitSize is not specified, blocks from the same rack are combined into a single split and no attempt is made to create node-local splits. If maxSplitSize equals the block size, this class behaves much like the default splitting in Hadoop: each block is a locally processed split.

Subclasses implement InputFormat.createRecordReader(InputSplit, TaskAttemptContext) to construct RecordReaders for CombineFileSplits.
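
The node-first, rack-second combining order described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not Hadoop's implementation: the names `CombineSketch`, `Block`, `pack`, `combine` and `sampleBlocks` are hypothetical, each block is assumed to live on exactly one node, sizes are in arbitrary units, and pools, the minimum split sizes and the case where maxSplitSize is not specified are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of node-first, rack-second combining (not Hadoop code). */
class CombineSketch {

    /** Hypothetical block descriptor: one hosting node, its rack, and a size. */
    static final class Block {
        final String node;
        final String rack;
        final long length;

        Block(String node, String rack, long length) {
            this.node = node;
            this.rack = rack;
            this.length = length;
        }
    }

    /**
     * Adds blocks to a pending split in order and records the split's size
     * once it reaches maxSplitSize. Returns the blocks that were left over.
     */
    static List<Block> pack(List<Block> group, long maxSplitSize, List<Long> splits) {
        List<Block> pending = new ArrayList<>();
        long size = 0;
        for (Block b : group) {
            pending.add(b);
            size += b.length;
            if (size >= maxSplitSize) {
                splits.add(size);
                pending = new ArrayList<>();
                size = 0;
            }
        }
        return pending;
    }

    /** Returns the size of each split: node-local splits first, then rack-local ones. */
    static List<Long> combine(List<Block> blocks, long maxSplitSize) {
        List<Long> splits = new ArrayList<>();

        // Pass 1: combine blocks that live on the same node.
        Map<String, List<Block>> byNode = new LinkedHashMap<>();
        for (Block b : blocks) {
            byNode.computeIfAbsent(b.node, k -> new ArrayList<>()).add(b);
        }
        Map<String, List<Block>> leftoverByRack = new LinkedHashMap<>();
        for (List<Block> nodeBlocks : byNode.values()) {
            for (Block left : pack(nodeBlocks, maxSplitSize, splits)) {
                leftoverByRack.computeIfAbsent(left.rack, k -> new ArrayList<>()).add(left);
            }
        }

        // Pass 2: combine node leftovers with other leftovers in the same rack.
        for (List<Block> rackBlocks : leftoverByRack.values()) {
            long rest = 0;
            for (Block b : pack(rackBlocks, maxSplitSize, splits)) {
                rest += b.length;
            }
            if (rest > 0) {
                splits.add(rest); // assumption: this model puts any remainder in a final split
            }
        }
        return splits;
    }

    /** Five 64-unit blocks: three on n1 and one on n2 (rack r1), one on n3 (rack r2). */
    static List<Block> sampleBlocks() {
        return Arrays.asList(
            new Block("n1", "r1", 64), new Block("n1", "r1", 64), new Block("n1", "r1", 64),
            new Block("n2", "r1", 64), new Block("n3", "r2", 64));
    }

    public static void main(String[] args) {
        System.out.println(combine(sampleBlocks(), 128)); // [128, 128, 64]
        System.out.println(combine(sampleBlocks(), 64));  // [64, 64, 64, 64, 64]
    }
}
```

With a maximum of 128, two blocks on n1 form a node-local split, the third n1 block joins the n2 block in a rack-level split, and the lone block in rack r2 ends up alone. With a maximum equal to the block size (64), every block becomes its own split, which mirrors the last case in the description.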

See Also:

CombineFileSplit

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
FileInputFormat.Counter
Field Summary

Fields

- static final String SPLIT_MINSIZE_PERNODE
- static final String SPLIT_MINSIZE_PERRACK

Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Constructor Summary

Constructors

- CombineFileInputFormat()

  Default constructor.
Method Summary

- protected void createPool(List<PathFilter> filters)

  Create a new pool and add the filters to it.

- protected void createPool(PathFilter... filters)

  Create a new pool and add the filters to it.

- abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context)

  Abstract; subclasses implement this to construct a RecordReader for a CombineFileSplit.

- protected BlockLocation[] getFileBlockLocations(FileSystem fs, FileStatus stat)

- List<InputSplit> getSplits(JobContext job)

  Generate the list of files and make them into FileSplits.

- protected boolean isSplitable(JobContext context, Path file)

  Is the given filename splittable?

- protected void setMaxSplitSize(long maxSplitSize)

  Specify the maximum size (in bytes) of each split.

- protected void setMinSplitSizeNode(long minSplitSizeNode)

  Specify the minimum size (in bytes) of each split per node.

- protected void setMinSplitSizeRack(long minSplitSizeRack)

  Specify the minimum size (in bytes) of each split per rack.

Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, listStatus, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize, shrinkStatus

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- SPLIT_MINSIZE_PERNODE
  
  public static final String SPLIT_MINSIZE_PERNODE
  See Also:
  
  Constant Field Values
- SPLIT_MINSIZE_PERRACK
  
  public static final String SPLIT_MINSIZE_PERRACK
  See Also:
  
  Constant Field Values
Constructor Details
- CombineFileInputFormat
  
  public CombineFileInputFormat()
  
  Default constructor.
Method Details
- setMaxSplitSize
  
  protected void setMaxSplitSize(long maxSplitSize)
  
  Specify the maximum size (in bytes) of each split. Each split is approximately equal to the specified size.
- setMinSplitSizeNode
  
  protected void setMinSplitSizeNode(long minSplitSizeNode)
  
  Specify the minimum size (in bytes) of each split per node. This applies to data that is left over after combining data on a single node into splits that are of maximum size specified by maxSplitSize. This leftover data will be combined into its own split if its size exceeds minSplitSizeNode.
- setMinSplitSizeRack
  
  protected void setMinSplitSizeRack(long minSplitSizeRack)
  
  Specify the minimum size (in bytes) of each split per rack. This applies to data that is left over after combining data on a single rack into splits that are of maximum size specified by maxSplitSize. This leftover data will be combined into its own split if its size exceeds minSplitSizeRack.
- createPool
  
  protected void createPool(List<PathFilter> filters)
  
  Create a new pool and add the filters to it. A split cannot have files from different pools.
- createPool
  
  protected void createPool(PathFilter... filters)
  
  Create a new pool and add the filters to it. A pathname can satisfy any one of the specified filters. A split cannot have files from different pools.
- isSplitable
  
  protected boolean isSplitable(JobContext context, Path file)
  
  Description copied from class: FileInputFormat
  
  Is the given filename splittable? Usually true, but not if the file is stream compressed. The default implementation in FileInputFormat always returns true. Implementations that may deal with non-splittable files must override this method. FileInputFormat implementations can override this and return false to ensure that individual input files are never split up, so that Mappers process entire files.
  
  Overrides:
  
  isSplitable in class FileInputFormat<K,V>
  
  Parameters:
  
  context - the job context
  
  file - the file name to check
  
  Returns:
  
  true if this file is splittable
- getSplits
  
  public List<InputSplit> getSplits(JobContext job) throws IOException
  
  Description copied from class: FileInputFormat
  
  Generate the list of files and make them into FileSplits.
  
  Overrides:
  
  getSplits in class FileInputFormat<K,V>
  
  Parameters:
  
  job - the job context
  
  Returns:
  
  the list of InputSplits for the job.
  
  Throws:
  
  IOException
- createRecordReader
  
  public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException
  
  Abstract in this class. Subclasses implement this method to construct a RecordReader for a CombineFileSplit.
  
  Specified by:
  
  createRecordReader in class InputFormat<K,V>
  
  Parameters:
  
  split - the split to be read
  
  context - the information about the task
  
  Returns:
  
  a new record reader
  
  Throws:
  
  IOException
- getFileBlockLocations
  
  protected BlockLocation[] getFileBlockLocations(FileSystem fs, FileStatus stat) throws IOException
  
  Throws:
  
  IOException
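
Two rules from the method details above lend themselves to a short standalone sketch: the leftover rule described for setMinSplitSizeNode and setMinSplitSizeRack, and the "any one of the specified filters" rule described for createPool. The code below is a hedged illustration, not Hadoop code: `SplitRulesSketch`, `formsOwnSplit`, `poolIndex` and `examplePools` are hypothetical names, plain `Predicate<String>` objects stand in for PathFilter, the strict comparison is an assumption taken from the word "exceeds", and assigning a path to the first matching pool is likewise an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Standalone sketch of two rules from the method details (not Hadoop code). */
class SplitRulesSketch {

    /**
     * Leftover rule: data left after cutting maximum-size splits becomes its
     * own split only if its size exceeds the configured minimum (per node or
     * per rack). The strict ">" at the boundary is an assumption.
     */
    static boolean formsOwnSplit(long leftoverBytes, long minSplitSize) {
        return leftoverBytes > minSplitSize;
    }

    /**
     * Pool rule: a path belongs to a pool if it satisfies any one of that
     * pool's filters. Returns the index of the first matching pool, or -1
     * when no pool matches (handling of unmatched files is out of scope here).
     */
    static int poolIndex(String path, List<List<Predicate<String>>> pools) {
        for (int i = 0; i < pools.size(); i++) {
            for (Predicate<String> filter : pools.get(i)) {
                if (filter.test(path)) {
                    return i;
                }
            }
        }
        return -1;
    }

    /** Hypothetical pools: pool 0 takes .log or .txt files, pool 1 takes .seq files. */
    static List<List<Predicate<String>>> examplePools() {
        return Arrays.asList(
            Arrays.<Predicate<String>>asList(p -> p.endsWith(".log"), p -> p.endsWith(".txt")),
            Arrays.<Predicate<String>>asList(p -> p.endsWith(".seq")));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 40 MB left on a node with a 64 MB per-node minimum: not its own split.
        System.out.println(formsOwnSplit(40 * mb, 64 * mb));
        // 100 MB left on a rack with a 64 MB per-rack minimum: its own split.
        System.out.println(formsOwnSplit(100 * mb, 64 * mb));

        List<List<Predicate<String>>> pools = examplePools();
        // Different pool indexes mean the two files cannot share a split.
        System.out.println(poolIndex("/in/a.txt", pools));
        System.out.println(poolIndex("/in/b.seq", pools));
    }
}
```

In this model, a path matching either filter of pool 0 lands in pool 0, and two paths with different pool indexes would never be placed in the same split.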
