Class DataDrivenDBInputFormat<T extends DBWritable>
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<LongWritable,T>
org.apache.hadoop.mapreduce.lib.db.DBInputFormat<T>
org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat<T>
- All Implemented Interfaces:
Configurable
- Direct Known Subclasses:
OracleDataDrivenDBInputFormat
@Public
@Evolving
public class DataDrivenDBInputFormat<T extends DBWritable>
extends DBInputFormat<T>
implements Configurable
An InputFormat that reads input data from an SQL table.
Operates like DBInputFormat, but instead of using LIMIT and OFFSET to demarcate
splits, it tries to generate WHERE clauses that separate the data into roughly
equivalent shards.
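The split-generation idea can be sketched in plain, self-contained Java. This is a simplified illustration only: the class name `SplitSketch`, the column name, and the shard arithmetic are assumptions for the example; in Hadoop the actual boundaries are computed by DBSplitter implementations.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Divide the closed range [min, max] of the split-by column into
    // roughly equal shards and emit one WHERE clause per shard.
    // Illustrative sketch; not the actual Hadoop implementation.
    public static List<String> whereClauses(String col, long min, long max, int numSplits) {
        List<String> clauses = new ArrayList<>();
        long size = (max - min) / numSplits + 1;   // shard width (rounded up)
        for (long lo = min; lo <= max; lo += size) {
            long hi = Math.min(lo + size - 1, max);
            clauses.add("(" + col + " >= " + lo + ") AND (" + col + " <= " + hi + ")");
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Four shards over ids 1..100 -> four non-overlapping WHERE clauses.
        for (String c : whereClauses("id", 1, 100, 4)) {
            System.out.println(c);
        }
    }
}
```

Each clause is attached to one InputSplit, so the splits partition the table without LIMIT/OFFSET scans.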
-
Nested Class Summary
Nested Classes:
- static class DataDrivenDBInputFormat.DataDrivenDBInputSplit: An InputSplit that spans a set of rows.

Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.lib.db.DBInputFormat:
DBInputFormat.DBInputSplit, DBInputFormat.NullDBWritable -
Field Summary
Fields:
- static final String SUBSTITUTE_TOKEN: If users are providing their own query, the following string is expected to appear in the WHERE clause, which will be substituted with a pair of conditions on the input to allow input splits to parallelise the import.

Fields inherited from class org.apache.hadoop.mapreduce.lib.db.DBInputFormat:
conditions, connection, dbConf, dbProductName, fieldNames, tableName -
Constructor Summary
Constructors -
Method Summary
Methods:
- protected RecordReader<LongWritable,T> createDBRecordReader(DBInputFormat.DBInputSplit split, Configuration conf)
- protected String getBoundingValsQuery(): Returns a query which returns the minimum and maximum values for the order-by column.
- List<InputSplit> getSplits(JobContext job): Logically split the set of input files for the job.
- protected DBSplitter getSplitter(int sqlDataType)
- static void setBoundingQuery(Configuration conf, String query): Set the user-defined bounding query to use with a user-defined query.
- static void setInput(Job job, Class<? extends DBWritable> inputClass, String inputQuery, String inputBoundingQuery): setInput() takes a custom query and a separate "bounding query" to use instead of the custom "count query" used by DBInputFormat.
- static void setInput(Job job, Class<? extends DBWritable> inputClass, String tableName, String conditions, String splitBy, String... fieldNames): Note that the "orderBy" column is called the "splitBy" in this version.

Methods inherited from class org.apache.hadoop.mapreduce.lib.db.DBInputFormat:
closeConnection, createConnection, createRecordReader, getConf, getConnection, getCountQuery, getDBConf, getDBProductName, setConf

Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable:
getConf, setConf
-
Field Details
-
SUBSTITUTE_TOKEN
public static final String SUBSTITUTE_TOKEN
If users are providing their own query, this string ("$CONDITIONS") is expected to appear in the WHERE clause; it will be substituted with a pair of conditions on the input to allow input splits to parallelise the import.
-
-
Constructor Details
-
DataDrivenDBInputFormat
public DataDrivenDBInputFormat()
-
-
Method Details
-
getSplitter
protected DBSplitter getSplitter(int sqlDataType)
- Returns:
- the DBSplitter implementation to use to divide the table/query into InputSplits.
-
getSplits
public List<InputSplit> getSplits(JobContext job) throws IOException
Logically split the set of input files for the job. Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For example, a split could be a <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader to read the InputSplit.
- Overrides:
- getSplits in class DBInputFormat<T extends DBWritable>
- Parameters:
- job - job configuration.
- Returns:
- an array of InputSplits for the job.
- Throws:
- IOException
-
getBoundingValsQuery
protected String getBoundingValsQuery()
- Returns:
- a query which returns the minimum and maximum values for the order-by column. The min value should be in the first column, and the max value should be in the second column of the results.
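The shape of such a bounding-values query can be sketched as follows. This is a hedged illustration: the helper name `boundingValsQuery` and its argument handling are assumptions for the example, not the actual Hadoop code.

```java
public class BoundingQuerySketch {
    // Build a query returning MIN and MAX of the order-by column, so the
    // min value lands in the first result column and the max in the second.
    // Illustrative sketch of the query shape only.
    public static String boundingValsQuery(String col, String table, String conditions) {
        StringBuilder q = new StringBuilder();
        q.append("SELECT MIN(").append(col).append("), MAX(").append(col)
         .append(") FROM ").append(table);
        if (conditions != null && !conditions.isEmpty()) {
            q.append(" WHERE ").append(conditions);
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(boundingValsQuery("id", "mytable", null));
        // SELECT MIN(id), MAX(id) FROM mytable
    }
}
```

The two returned values become the endpoints of the range that the DBSplitter divides into per-split conditions.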
-
setBoundingQuery
public static void setBoundingQuery(Configuration conf, String query)
Set the user-defined bounding query to use with a user-defined query. This *must* include the substring "$CONDITIONS" (DataDrivenDBInputFormat.SUBSTITUTE_TOKEN) inside the WHERE clause, so that DataDrivenDBInputFormat knows where to insert split clauses. E.g., "SELECT foo FROM mytable WHERE $CONDITIONS" will be expanded to something like "SELECT foo FROM mytable WHERE (id > 100) AND (id < 250)" inside each split. -
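The $CONDITIONS substitution described above can be sketched in isolation. This is illustrative only; the real substitution happens inside DataDrivenDBInputFormat when it builds each split's query, and the class and method names here are assumptions for the example.

```java
public class ConditionsSketch {
    // Replace the $CONDITIONS placeholder (SUBSTITUTE_TOKEN) in a
    // user-supplied query with one split's range condition.
    public static String expand(String query, String splitCondition) {
        return query.replace("$CONDITIONS", splitCondition);
    }

    public static void main(String[] args) {
        String q = "SELECT foo FROM mytable WHERE $CONDITIONS";
        System.out.println(expand(q, "(id > 100) AND (id < 250)"));
        // SELECT foo FROM mytable WHERE (id > 100) AND (id < 250)
    }
}
```

Each InputSplit carries a different range condition, so the same user query yields disjoint result sets across mappers.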
createDBRecordReader
protected RecordReader<LongWritable,T> createDBRecordReader(DBInputFormat.DBInputSplit split, Configuration conf) throws IOException
- Overrides:
- createDBRecordReader in class DBInputFormat<T extends DBWritable>
- Throws:
- IOException
-
setInput
public static void setInput(Job job, Class<? extends DBWritable> inputClass, String tableName, String conditions, String splitBy, String... fieldNames)
Note that the "orderBy" column is called the "splitBy" in this version. We reuse the same field, but it's not strictly ordering it -- just partitioning the results. -
setInput
public static void setInput(Job job, Class<? extends DBWritable> inputClass, String inputQuery, String inputBoundingQuery)
setInput() takes a custom query and a separate "bounding query" to use instead of the custom "count query" used by DBInputFormat.
-