public class SplitInput extends AbstractJob
When executed via the splitDirectory(Path) or splitFile(Path) methods, the lines read from one or more input files are written to files of the same name in the directories specified by the setTestOutputDirectory(Path) and setTrainingOutputDirectory(Path) methods.
The composition of the test set is determined using one of the following approaches:

- A contiguous set of items can be chosen using the setTestSplitSize(int) or setTestSplitPct(int) methods. setTestSplitSize(int) allocates a fixed number of items, while setTestSplitPct(int) allocates a percentage of the original input, rounded up to the nearest integer. setSplitLocation(int) controls the position in the input from which the test data is extracted and is described further below.
- A random sample of items can be chosen using the setTestRandomSelectionSize(int) or setTestRandomSelectionPct(int) methods, each choosing a fixed test set size or a percentage of the input set size as described above. The RandomSampler class from mahout-math is used to create a sample of the appropriate size.

The setSplitLocation(int) method is passed an integer from 0 to 100 (inclusive) which is translated into the position of the start of the test data within the input file.
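The size and offset arithmetic described above can be sketched in plain Java. The rounding and offset formulas below are illustrative assumptions that match the prose ("rounded up to the nearest integer", a 0 to 100 split location), not the class's verbatim implementation:

```java
// Illustrative sketch of the test-split arithmetic described above.
// The exact formulas used by SplitInput are assumptions here; this only
// demonstrates "percentage rounded up" and the 0-100 split location.
public class SplitMath {

    // A percentage of the input, rounded up to the nearest integer,
    // as described for setTestSplitPct(int).
    static int testSizeFromPct(int totalLines, int testSplitPct) {
        return (int) Math.ceil(totalLines * testSplitPct / 100.0);
    }

    // Translate a 0-100 split location into the line at which the test
    // data starts, so the test block always fits inside the input.
    static int testStartLine(int totalLines, int testSize, int splitLocation) {
        return (int) Math.round((totalLines - testSize) * splitLocation / 100.0);
    }

    public static void main(String[] args) {
        int total = 1500;
        int testSize = testSizeFromPct(total, 10);       // 150 lines
        int start = testStartLine(total, testSize, 50);  // test block starts mid-file
        System.out.println(testSize + " test lines starting at line " + start);
    }
}
```

With 1500 input lines and a 10 percent test split, a split location of 0 takes the first 150 lines, 100 takes the last 150, and intermediate values slide the block proportionally between the two.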
Modifier and Type | Class and Description |
---|---|
static interface | SplitInput.SplitCallback Used to pass information back to a caller once a file has been split without the need for a data object |
Fields inherited from class org.apache.mahout.common.AbstractJob:
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
Constructor and Description |
---|
SplitInput() |
Modifier and Type | Method and Description |
---|---|
static int | countLines(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path inputFile, Charset charset) Count the lines in the specified file, as returned by BufferedReader.readLine() |
SplitInput.SplitCallback | getCallback() |
Charset | getCharset() |
org.apache.hadoop.fs.Path | getInputDirectory() |
int | getSplitLocation() |
org.apache.hadoop.fs.Path | getTestOutputDirectory() |
int | getTestRandomSelectionPct() |
int | getTestRandomSelectionSize() |
int | getTestSplitPct() |
int | getTestSplitSize() |
org.apache.hadoop.fs.Path | getTrainingOutputDirectory() |
static void | main(String[] args) |
int | run(String[] args) |
void | setCallback(SplitInput.SplitCallback callback) Sets the callback used to inform the caller that an input file has been successfully split |
void | setCharset(Charset charset) Set the charset used to read and write files |
void | setInputDirectory(org.apache.hadoop.fs.Path inputDir) Set the directory from which input data will be read when the splitDirectory() method is invoked |
void | setKeepPct(int keepPct) Sets the percentage of the input data to keep in a map reduce split input job |
void | setMapRedOutputDirectory(org.apache.hadoop.fs.Path mapRedOutputDirectory) |
void | setSplitLocation(int splitLocation) Set the location of the start of the test/training data split |
void | setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir) Set the directory to which test data will be written |
void | setTestRandomSelectionPct(int randomSelectionPct) Sets the number of random input samples that will be saved to the test set as a percentage of the size of the input set |
void | setTestRandomSelectionSize(int testRandomSelectionSize) Sets the number of random input samples that will be saved to the test set |
void | setTestSplitPct(int testSplitPct) Sets the percentage of the input data to allocate to the test split |
void | setTestSplitSize(int testSplitSize) |
void | setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir) Set the directory to which training data will be written |
void | setUseMapRed(boolean useMapRed) Set to true to use map reduce to split the input |
void | splitDirectory() Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory |
void | splitDirectory(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputDir) |
void | splitDirectory(org.apache.hadoop.fs.Path inputDir) Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory |
void | splitFile(org.apache.hadoop.fs.Path inputFile) Perform a split on the specified input file |
void | validate() Validates that the current instance is in a consistent state |
Methods inherited from class org.apache.mahout.common.AbstractJob:
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
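Drawing on the setters and split methods summarized above, a percentage-based split might be driven as in the sketch below. The paths are placeholders, the import reflects the Mahout utils module this class is assumed to live in, and the Hadoop and Mahout jars must be on the classpath:

```java
import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.SplitInput; // package assumed; adjust to your Mahout version

public class SplitDriver {
    public static void main(String[] args) throws Exception {
        SplitInput splitter = new SplitInput();
        splitter.setInputDirectory(new Path("/data/input"));          // placeholder paths
        splitter.setTrainingOutputDirectory(new Path("/data/train"));
        splitter.setTestOutputDirectory(new Path("/data/test"));
        splitter.setTestSplitPct(20);   // 20% of each file, rounded up, goes to the test set
        splitter.setSplitLocation(50);  // take the test block from the middle of each file
        splitter.validate();            // fails fast if the settings are inconsistent
        splitter.splitDirectory();      // calls splitFile(Path) on each file found
    }
}
```

Note that the size of the test set should be configured through only one of the contiguous or random-selection setters, since they represent alternative approaches to choosing the test data.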
public void splitDirectory() throws IOException, ClassNotFoundException, InterruptedException

Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory.

public void splitDirectory(org.apache.hadoop.fs.Path inputDir) throws IOException, ClassNotFoundException, InterruptedException

Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory.

public void splitDirectory(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputDir) throws IOException, ClassNotFoundException, InterruptedException

public void splitFile(org.apache.hadoop.fs.Path inputFile) throws IOException

Perform a split on the specified input file. The validate() method is called prior to executing the split.

Throws:
IOException
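As a self-contained illustration of what a single-file split produces, the sketch below writes a file's first block of lines to a test directory and the rest to a training directory, using the same file name in each. This is plain Java over the local filesystem, not the Hadoop-backed implementation, and the fixed block position stands in for the configurable split location:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LocalSplit {

    // Split one file: the first testSize lines go to testDir, the rest to
    // trainDir, each written to a file of the same name as the input.
    // (SplitInput itself works on Hadoop Paths; this is a local analogue.)
    static void splitFile(Path inputFile, Path trainDir, Path testDir, int testSize)
            throws IOException {
        List<String> lines = Files.readAllLines(inputFile);
        Files.createDirectories(trainDir);
        Files.createDirectories(testDir);
        int cut = Math.min(testSize, lines.size());
        String name = inputFile.getFileName().toString();
        Files.write(testDir.resolve(name), lines.subList(0, cut));
        Files.write(trainDir.resolve(name), lines.subList(cut, lines.size()));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("split");
        Path in = tmp.resolve("data.txt");
        Files.write(in, List.of("a", "b", "c", "d", "e"));
        splitFile(in, tmp.resolve("train"), tmp.resolve("test"), 2);
        System.out.println(Files.readAllLines(tmp.resolve("test").resolve("data.txt")));  // [a, b]
        System.out.println(Files.readAllLines(tmp.resolve("train").resolve("data.txt"))); // [c, d, e]
    }
}
```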
public int getTestSplitSize()

public void setTestSplitSize(int testSplitSize)

public int getTestSplitPct()

public void setTestSplitPct(int testSplitPct)
Parameters:
testSplitPct - a value between 0 and 100 inclusive.

public void setKeepPct(int keepPct)
Parameters:
keepPct - a value between 0 and 100 inclusive.

public void setUseMapRed(boolean useMapRed)
Parameters:
useMapRed - a boolean to indicate whether map reduce should be used

public void setMapRedOutputDirectory(org.apache.hadoop.fs.Path mapRedOutputDirectory)

public int getSplitLocation()

public void setSplitLocation(int splitLocation)
Parameters:
splitLocation - a value between 0 and 100 inclusive.

public Charset getCharset()

public void setCharset(Charset charset)

public org.apache.hadoop.fs.Path getInputDirectory()

public void setInputDirectory(org.apache.hadoop.fs.Path inputDir)
Set the directory from which input data will be read when the splitDirectory() method is invoked.

public org.apache.hadoop.fs.Path getTrainingOutputDirectory()

public void setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir)

public org.apache.hadoop.fs.Path getTestOutputDirectory()

public void setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir)

public SplitInput.SplitCallback getCallback()

public void setCallback(SplitInput.SplitCallback callback)

public int getTestRandomSelectionSize()

public void setTestRandomSelectionSize(int testRandomSelectionSize)

public int getTestRandomSelectionPct()

public void setTestRandomSelectionPct(int randomSelectionPct)
Parameters:
randomSelectionPct - a value between 0 and 100 inclusive.

public void validate() throws IOException
Validates that the current instance is in a consistent state.
Throws:
IllegalArgumentException - if settings violate class invariants.
IOException - if output directories do not exist or are not directories.

public static int countLines(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path inputFile, Charset charset) throws IOException
Count the lines in the specified file, as returned by BufferedReader.readLine().
Parameters:
inputFile - the file whose lines will be counted
charset - the charset of the file to read
Throws:
IOException - if there is a problem opening or reading the file.

Copyright © 2008–2015 The Apache Software Foundation. All rights reserved.
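The countLines contract above simply tallies BufferedReader.readLine() results. A local-filesystem analogue, without the Hadoop FileSystem parameter, can be sketched as:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineCounter {

    // Count lines exactly as BufferedReader.readLine() reports them,
    // mirroring the contract of SplitInput.countLines (which additionally
    // takes a Hadoop FileSystem to open the file).
    static int countLines(Path inputFile, Charset charset) throws IOException {
        int count = 0;
        try (BufferedReader reader = Files.newBufferedReader(inputFile, charset)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("lines", ".txt");
        Files.writeString(f, "one\ntwo\nthree\n");
        System.out.println(countLines(f, Charset.forName("UTF-8"))); // prints 3
    }
}
```

Because readLine() does not require a trailing newline to report the final line, a file ending without one still has its last line counted.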