Class FileInputFormat<K,V>

java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<K,V>
Direct Known Subclasses:
CombineFileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, TextInputFormat

@Public @Stable public abstract class FileInputFormat<K,V> extends InputFormat<K,V>
A base class for file-based InputFormats.

FileInputFormat is the base class for all file-based InputFormats. It provides a generic implementation of getSplits(JobContext). Implementations of FileInputFormat can also override the isSplitable(JobContext, Path) method to prevent input files from being split up in certain situations. Implementations that deal with non-splittable files must override this method, since the default implementation assumes splitting is always possible.

  • Field Details

  • Constructor Details

    • FileInputFormat

      public FileInputFormat()
  • Method Details

    • setInputDirRecursive

      public static void setInputDirRecursive(Job job, boolean inputDirRecursive)
      Parameters:
      job - the job to modify
inputDirRecursive - whether input directories should be read recursively
    • getInputDirRecursive

      public static boolean getInputDirRecursive(JobContext job)
      Parameters:
      job - the job to look at.
      Returns:
whether the input files should be read recursively
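The two methods above toggle and query recursive traversal of input directories. A minimal job-setup sketch (the job name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInputExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recursive-input");
        // Descend into subdirectories of each input path instead of
        // treating nested directories as an error.
        FileInputFormat.setInputDirRecursive(job, true);
        // The setting can be read back later, e.g. inside getSplits():
        boolean recursive = FileInputFormat.getInputDirRecursive(job);
    }
}
```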
    • getFormatMinSplitSize

      protected long getFormatMinSplitSize()
      Get the lower bound on split size imposed by the format.
      Returns:
      the number of bytes of the minimal split for this format
    • isSplitable

      protected boolean isSplitable(JobContext context, Path filename)
Is the given filename splittable? Usually true, but if the file is stream-compressed, it will not be. The default implementation in FileInputFormat always returns true. Implementations that deal with non-splittable files must override this method and return false, ensuring that individual input files are never split up and that Mappers process entire files.
      Parameters:
      context - the job context
      filename - the file name to check
      Returns:
whether this file is splittable
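A common reason to override isSplitable is a record format that spans whole files. A minimal sketch, assuming a hypothetical subclass name (WholeFileTextInputFormat is not part of Hadoop):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical subclass: forces each input file into a single split,
// so one Mapper sees the entire file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size or codec
    }
}
```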
    • setInputPathFilter

      public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
      Set a PathFilter to be applied to the input paths for the map-reduce job.
      Parameters:
      job - the job to modify
filter - the PathFilter class used for filtering the input paths.
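The filter class registered this way is instantiated by the framework and consulted for each candidate input path. A sketch of a filter that skips files beginning with an underscore (the class name is illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Illustrative filter: accept everything except files whose names
// start with an underscore (e.g. _SUCCESS markers).
public class NoUnderscoreFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return !path.getName().startsWith("_");
    }
}
// Registered during job setup:
//   FileInputFormat.setInputPathFilter(job, NoUnderscoreFilter.class);
```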
    • setMinInputSplitSize

      public static void setMinInputSplitSize(Job job, long size)
Set the minimum input split size.
      Parameters:
      job - the job to modify
      size - the minimum size
    • getMinSplitSize

      public static long getMinSplitSize(JobContext job)
Get the minimum split size.
      Parameters:
      job - the job
      Returns:
      the minimum number of bytes that can be in a split
    • setMaxInputSplitSize

      public static void setMaxInputSplitSize(Job job, long size)
Set the maximum split size.
      Parameters:
      job - the job to modify
      size - the maximum split size
    • getMaxSplitSize

      public static long getMaxSplitSize(JobContext context)
      Get the maximum split size.
      Parameters:
      context - the job to look at.
      Returns:
      the maximum number of bytes a split can include
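The minimum and maximum split sizes above bound the split size that getSplits(JobContext) computes for each file. A configuration sketch (the 64 MB and 256 MB values are arbitrary examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Example bounds: no split smaller than 64 MB
        // or larger than 256 MB.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```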
    • getInputPathFilter

      public static PathFilter getInputPathFilter(JobContext context)
      Get a PathFilter instance of the filter set for the input paths.
      Returns:
the PathFilter instance set for the job, or null if none has been set.
    • listStatus

      protected List<FileStatus> listStatus(JobContext job) throws IOException
      List input directories. Subclasses may override to, e.g., select only files matching a regular expression. If security is enabled, this method collects delegation tokens from the input paths and adds them to the job's credentials.
      Parameters:
      job - the job to list input paths for and attach tokens to.
      Returns:
a List of FileStatus objects
      Throws:
IOException - if no input files are listed.
    • addInputPathRecursively

      protected void addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) throws IOException
      Add files in the input path recursively into the results.
      Parameters:
      result - The List to store all files.
      fs - The FileSystem.
      path - The input path.
      inputFilter - The input filter that can be used to filter files/dirs.
      Throws:
      IOException
    • shrinkStatus

      public static FileStatus shrinkStatus(FileStatus origStat)
The HdfsBlockLocation includes a LocatedBlock, which carries information for issuing more detailed queries to datanodes about a block; this information is currently not needed during job submission. This method excludes the LocatedBlock from the HdfsBlockLocation by creating a new BlockLocation from the original and reshaping the LocatedFileStatus, allowing listStatus(JobContext) to scan more files with a smaller memory footprint.
      Parameters:
      origStat - The fat FileStatus.
      Returns:
      The FileStatus that has been shrunk.
    • makeSplit

      protected FileSplit makeSplit(Path file, long start, long length, String[] hosts)
A factory that makes the split for this class. It can be overridden by subclasses to make subtypes.
    • makeSplit

      protected FileSplit makeSplit(Path file, long start, long length, String[] hosts, String[] inMemoryHosts)
A factory that makes the split for this class. It can be overridden by subclasses to make subtypes.
    • getSplits

      public List<InputSplit> getSplits(JobContext job) throws IOException
      Generate the list of files and make them into FileSplits.
      Specified by:
      getSplits in class InputFormat<K,V>
      Parameters:
      job - the job context
      Returns:
a list of InputSplits for the job.
      Throws:
      IOException
    • computeSplitSize

      protected long computeSplitSize(long blockSize, long minSize, long maxSize)
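In the stock implementation, the split size is the filesystem block size clamped between the configured minimum and maximum. A self-contained sketch of that formula (the body mirrors the Hadoop source, but verify against your version):

```java
public class SplitSizeFormula {
    // Clamp the filesystem block size between the configured
    // minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 128 MB block within bounds [1 B, 256 MB]: the block size wins.
        System.out.println(computeSplitSize(128 * mb, 1, 256 * mb)); // 134217728
    }
}
```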
    • getBlockIndex

      protected int getBlockIndex(BlockLocation[] blkLocations, long offset)
    • setInputPaths

      public static void setInputPaths(Job job, String commaSeparatedPaths) throws IOException
      Sets the given comma separated paths as the list of inputs for the map-reduce job.
      Parameters:
      job - the job
      commaSeparatedPaths - Comma separated paths to be set as the list of inputs for the map-reduce job.
      Throws:
      IOException
    • addInputPaths

      public static void addInputPaths(Job job, String commaSeparatedPaths) throws IOException
      Add the given comma separated paths to the list of inputs for the map-reduce job.
      Parameters:
      job - The job to modify
      commaSeparatedPaths - Comma separated paths to be added to the list of inputs for the map-reduce job.
      Throws:
      IOException
    • setInputPaths

      public static void setInputPaths(Job job, Path... inputPaths) throws IOException
      Set the array of Paths as the list of inputs for the map-reduce job.
      Parameters:
      job - The job to modify
      inputPaths - the Paths of the input directories/files for the map-reduce job.
      Throws:
      IOException
    • addInputPath

      public static void addInputPath(Job job, Path path) throws IOException
      Add a Path to the list of inputs for the map-reduce job.
      Parameters:
      job - The Job to modify
      path - Path to be added to the list of inputs for the map-reduce job.
      Throws:
      IOException
    • getInputPaths

      public static Path[] getInputPaths(JobContext context)
      Get the list of input Paths for the map-reduce job.
      Parameters:
      context - The job
      Returns:
      the list of input Paths for the map-reduce job.
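The input-path setters and adders above are typically among the first FileInputFormat calls in a job driver. A sketch of their use (the paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-path-demo");
        // Replace any existing inputs with these two paths.
        FileInputFormat.setInputPaths(job, new Path("/data/a"), new Path("/data/b"));
        // Append one more directory to the list.
        FileInputFormat.addInputPath(job, new Path("/data/c"));
        // The comma-separated form is equivalent:
        //   FileInputFormat.addInputPaths(job, "/data/d,/data/e");
        for (Path p : FileInputFormat.getInputPaths(job)) {
            System.out.println(p);
        }
    }
}
```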