Class BindingPathOutputCommitter

All Implemented Interfaces:
org.apache.hadoop.fs.statistics.IOStatisticsSource, StreamCapabilities

@Public @Unstable public class BindingPathOutputCommitter extends PathOutputCommitter implements org.apache.hadoop.fs.statistics.IOStatisticsSource, StreamCapabilities
This is a special committer which creates the factory for the committer and runs off that. Why does it exist? So that you can explicitly instantiate a committer by classname and yet still have the actual implementation driven dynamically by the factory options and destination filesystem. This simplifies integration with existing code which takes the classname of a committer. There's no factory for this, as that would lead to a loop. All commit protocol methods and accessors are delegated to the wrapped committer. How to use:
  1. In applications which take a classname of committer in a configuration option, set it to the canonical name of this class (see NAME). When this class is instantiated, it will use the factory mechanism to locate the configured committer for the destination.
  2. In code, explicitly create an instance of this committer through its constructor, then invoke commit lifecycle operations on it. The dynamically configured committer will be created in the constructor and have the lifecycle operations relayed to it.
  • Field Details

    • NAME

      public static final String NAME
      The classname for use in configurations.
  • Constructor Details

    • BindingPathOutputCommitter

      public BindingPathOutputCommitter(Path outputPath, TaskAttemptContext context) throws IOException
      Instantiate.
      Parameters:
      outputPath - output path (may be null)
      context - task context
      Throws:
      IOException - on any failure.
  • Method Details

    • getOutputPath

      public Path getOutputPath()
      Description copied from class: PathOutputCommitter
      Get the final directory where work will be placed once the job is committed. This may be null, in which case, there is no output path to write data to.
      Specified by:
      getOutputPath in class PathOutputCommitter
      Returns:
      the path where final output of the job should be placed.
    • getWorkPath

      public Path getWorkPath() throws IOException
      Description copied from class: PathOutputCommitter
      Get the directory that the task should write results into. Warning: there's no guarantee that this work path is on the same FS as the final output, or that it's visible across machines. May be null.
      Specified by:
      getWorkPath in class PathOutputCommitter
      Returns:
      the work directory
      Throws:
      IOException - IO problem
    • setupJob

      public void setupJob(JobContext jobContext) throws IOException
      Description copied from class: OutputCommitter
      For the framework to setup the job output during initialization. This is called from the application master process for the entire job. This will be called multiple times, once per job attempt.
      Specified by:
      setupJob in class OutputCommitter
      Parameters:
      jobContext - Context of the job whose output is being written.
      Throws:
      IOException - if temporary output could not be created
    • setupTask

      public void setupTask(TaskAttemptContext taskContext) throws IOException
      Description copied from class: OutputCommitter
      Sets up output for the task. This is called from each individual task's process that will output to HDFS, and it is called just for that task. This may be called multiple times for the same task, but for different task attempts.
      Specified by:
      setupTask in class OutputCommitter
      Parameters:
      taskContext - Context of the task whose output is being written.
      Throws:
      IOException
    • needsTaskCommit

      public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException
      Description copied from class: OutputCommitter
      Check whether task needs a commit. This is called from each individual task's process that will output to HDFS, and it is called just for that task.
      Specified by:
      needsTaskCommit in class OutputCommitter
      Returns:
      true/false
      Throws:
      IOException
    • commitTask

      public void commitTask(TaskAttemptContext taskContext) throws IOException
      Description copied from class: OutputCommitter
      To promote the task's temporary output to final output location. If OutputCommitter.needsTaskCommit(TaskAttemptContext) returns true and this task is the task that the AM determines finished first, this method is called to commit an individual task's output. This is to mark that tasks output as complete, as OutputCommitter.commitJob(JobContext) will also be called later on if the entire job finished successfully. This is called from a task's process. This may be called multiple times for the same task, but different task attempts. It should be very rare for this to be called multiple times and requires odd networking failures to make this happen. In the future the Hadoop framework may eliminate this race.
      Specified by:
      commitTask in class OutputCommitter
      Parameters:
      taskContext - Context of the task whose output is being written.
      Throws:
      IOException - if commit is not successful.
    • abortTask

      public void abortTask(TaskAttemptContext taskContext) throws IOException
      Description copied from class: OutputCommitter
      Discard the task output. This is called from a task's process to clean up a single task's output that can not yet been committed. This may be called multiple times for the same task, but for different task attempts.
      Specified by:
      abortTask in class OutputCommitter
      Throws:
      IOException
    • cleanupJob

      public void cleanupJob(JobContext jobContext) throws IOException
      Description copied from class: OutputCommitter
      For cleaning up the job's output after job completion. This is called from the application master process for the entire job. This may be called multiple times.
      Overrides:
      cleanupJob in class OutputCommitter
      Parameters:
      jobContext - Context of the job whose output is being written.
      Throws:
      IOException
    • commitJob

      public void commitJob(JobContext jobContext) throws IOException
      Description copied from class: OutputCommitter
      For committing job's output after successful job completion. Note that this is invoked for jobs with final runstate as SUCCESSFUL. This is called from the application master process for the entire job. This is guaranteed to only be called once. If it throws an exception the entire job will fail.
      Overrides:
      commitJob in class OutputCommitter
      Parameters:
      jobContext - Context of the job whose output is being written.
      Throws:
      IOException
    • abortJob

      public void abortJob(JobContext jobContext, JobStatus.State state) throws IOException
      Description copied from class: OutputCommitter
      For aborting an unsuccessful job's output. Note that this is invoked for jobs with final runstate as JobStatus.State.FAILED or JobStatus.State.KILLED. This is called from the application master process for the entire job. This may be called multiple times.
      Overrides:
      abortJob in class OutputCommitter
      Parameters:
      jobContext - Context of the job whose output is being written.
      state - final runstate of the job
      Throws:
      IOException
    • isRecoverySupported

      public boolean isRecoverySupported()
      Description copied from class: OutputCommitter
      Is task output recovery supported for restarting jobs? If task output recovery is supported, job restart can be done more efficiently.
      Overrides:
      isRecoverySupported in class OutputCommitter
      Returns:
      true if task output recovery is supported, false otherwise
      See Also:
    • isCommitJobRepeatable

      public boolean isCommitJobRepeatable(JobContext jobContext) throws IOException
      Description copied from class: OutputCommitter
      Returns true if an in-progress job commit can be retried. If the MR AM is re-run then it will check this value to determine if it can retry an in-progress commit that was started by a previous version. Note that in rare scenarios, the previous AM version might still be running at that time, due to system anomalies. Hence if this method returns true then the retry commit operation should be able to run concurrently with the previous operation. If repeatable job commit is supported, job restart can tolerate previous AM failures during job commit. By default, it is not supported. Extended classes (like: FileOutputCommitter) should explicitly override it if provide support.
      Overrides:
      isCommitJobRepeatable in class OutputCommitter
      Parameters:
      jobContext - Context of the job whose output is being written.
      Returns:
      true repeatable job commit is supported, false otherwise
      Throws:
      IOException
    • isRecoverySupported

      public boolean isRecoverySupported(JobContext jobContext) throws IOException
      Description copied from class: OutputCommitter
      Is task output recovery supported for restarting jobs? If task output recovery is supported, job restart can be done more efficiently.
      Overrides:
      isRecoverySupported in class OutputCommitter
      Parameters:
      jobContext - Context of the job whose output is being written.
      Returns:
      true if task output recovery is supported, false otherwise
      Throws:
      IOException
      See Also:
    • recoverTask

      public void recoverTask(TaskAttemptContext taskContext) throws IOException
      Description copied from class: OutputCommitter
      Recover the task output. The retry-count for the job will be passed via the MRJobConfig.APPLICATION_ATTEMPT_ID key in JobContext.getConfiguration() for the OutputCommitter. This is called from the application master process, but it is called individually for each task. If an exception is thrown the task will be attempted again. This may be called multiple times for the same task. But from different application attempts.
      Overrides:
      recoverTask in class OutputCommitter
      Parameters:
      taskContext - Context of the task whose output is being recovered
      Throws:
      IOException
    • hasOutputPath

      public boolean hasOutputPath()
      Description copied from class: PathOutputCommitter
      Predicate: is there an output path?
      Overrides:
      hasOutputPath in class PathOutputCommitter
      Returns:
      true if we have an output path set, else false.
    • toString

      public String toString()
      Overrides:
      toString in class PathOutputCommitter
    • getCommitter

      public PathOutputCommitter getCommitter()
      Get the inner committer.
      Returns:
      the bonded committer.
    • hasCapability

      public boolean hasCapability(String capability)
      Pass through if the inner committer supports StreamCapabilities. Query the stream for a specific capability.
      Specified by:
      hasCapability in interface StreamCapabilities
      Parameters:
      capability - string to query the stream support for.
      Returns:
      True if the stream supports capability.
    • getIOStatistics

      public IOStatistics getIOStatistics()
      Description copied from interface: org.apache.hadoop.fs.statistics.IOStatisticsSource
      Return a statistics instance.

      It is not a requirement that the same instance is returned every time. IOStatisticsSource.

      If the object implementing this is Closeable, this method may return null if invoked on a closed object, even if it returns a valid instance when called earlier.

      Specified by:
      getIOStatistics in interface org.apache.hadoop.fs.statistics.IOStatisticsSource
      Returns:
      an IOStatistics instance or null