public class CachingCVB0Mapper extends org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Loads a ModelTrainer with two TopicModel instances: one from the previous iteration, the
other empty. Inference is done on the first, and the learning updates are stored in the
second and emitted only at cleanup().
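The read-one-model, write-another pattern described above can be sketched in plain Java. This is a minimal illustration, not the Mahout implementation: the class `TwoModelSketch`, its dense `double[][]` models, and the simple proportional update are all assumptions made for the example, and no Hadoop or Mahout types are used.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a mapper that reads one model and writes updates to another. */
class TwoModelSketch {
    private final double[][] readModel;   // topic x term counts from the previous iteration (read-only)
    private final double[][] writeModel;  // starts empty; accumulates this iteration's updates

    TwoModelSketch(double[][] previousIterationModel) {
        this.readModel = previousIterationModel;
        this.writeModel =
            new double[previousIterationModel.length][previousIterationModel[0].length];
    }

    /** Analogue of map(): infer against readModel, accumulate into writeModel, emit nothing. */
    void map(int[] termCounts) {
        int numTopics = readModel.length;
        for (int term = 0; term < termCounts.length; term++) {
            if (termCounts[term] == 0) {
                continue;
            }
            double norm = 0.0;
            for (int topic = 0; topic < numTopics; topic++) {
                norm += readModel[topic][term];
            }
            if (norm == 0.0) {
                continue;  // term unseen by the previous model
            }
            for (int topic = 0; topic < numTopics; topic++) {
                // Soft-assign the term's count to topics in proportion to the old model.
                writeModel[topic][term] += termCounts[term] * readModel[topic][term] / norm;
            }
        }
    }

    /** Analogue of cleanup(): only now are the accumulated updates handed downstream. */
    double[][] cleanup() {
        return writeModel;
    }

    public static void main(String[] args) {
        double[][] previous = {{3.0, 1.0}, {1.0, 1.0}};
        TwoModelSketch mapper = new TwoModelSketch(previous);
        mapper.map(new int[] {4, 2});
        mapper.map(new int[] {0, 2});
        System.out.println(Arrays.deepToString(mapper.cleanup()));
    }
}
```

Because nothing is written to the model used for inference, every document in the split sees the same previous-iteration model regardless of processing order.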
In terms of obvious performance improvements still available, the memory footprint in this
Mapper could be halved if we accumulated model updates onto the model we're using for
inference. That might also speed up convergence, as we'd be able to take advantage of
learning during each iteration, not just after it is done. We most likely don't need to
accumulate double values in the model either; floats would probably be sufficient.
Between these two changes, we could gain another factor of 4 in memory efficiency.
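The factor of 4 follows from two independent halvings: one model copy instead of two, and 4-byte floats instead of 8-byte doubles. A small sketch of the arithmetic, assuming dense topic-by-term storage (the helper `modelBytes` and the sizes below are hypothetical, chosen only for illustration):

```java
class ModelMemorySketch {
    /** Bytes needed for `copies` dense topic-by-term matrices with the given element size. */
    static long modelBytes(int numTopics, int numTerms, int copies, int bytesPerValue) {
        return (long) numTopics * numTerms * copies * bytesPerValue;
    }

    public static void main(String[] args) {
        int numTopics = 200;      // hypothetical topic count
        int numTerms = 100_000;   // hypothetical vocabulary size
        // Current layout: two models (read and write), double precision.
        long current = modelBytes(numTopics, numTerms, 2, Double.BYTES);
        // Proposed layout: one shared model, single precision.
        long proposed = modelBytes(numTopics, numTerms, 1, Float.BYTES);
        System.out.println(current + " bytes -> " + proposed + " bytes, ratio "
            + (current / proposed));
    }
}
```

The ratio is independent of the topic and vocabulary sizes; only the number of copies and the element width matter.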
In terms of CPU, we're re-learning the p(topic|doc) distribution on every iteration, starting
from scratch. This is usually only 10 fixed-point iterations per doc, but that's 10x more than
only 1. To avoid having to do this, we would need to do a map-side join of the unchanging
corpus with the continually-improving p(topic|doc) matrix, and then emit multiple outputs
from the mappers to make sure we can do the reduce model averaging as well. Tricky, but
possibly worth it.
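To make the per-document cost concrete, here is a simplified sketch of re-estimating p(topic|doc) from a uniform start with a fixed number of fixed-point iterations. It is an assumption-laden illustration, not the update Mahout performs: smoothing hyperparameters are omitted, the model is a dense p(term|topic) array, and the names `DocTopicInferenceSketch` and `inferDocTopics` are hypothetical.

```java
import java.util.Arrays;

class DocTopicInferenceSketch {
    /**
     * Re-estimates p(topic|doc) starting from a uniform distribution.
     * termGivenTopic[topic][term] holds p(term|topic); termCounts[term] is the
     * term's count in the document.
     */
    static double[] inferDocTopics(double[][] termGivenTopic, int[] termCounts, int maxIters) {
        int numTopics = termGivenTopic.length;
        double[] docTopics = new double[numTopics];
        Arrays.fill(docTopics, 1.0 / numTopics);  // "starting from scratch"
        for (int iter = 0; iter < maxIters; iter++) {
            double[] next = new double[numTopics];
            for (int term = 0; term < termCounts.length; term++) {
                if (termCounts[term] == 0) {
                    continue;
                }
                // Unnormalized p(topic|term,doc) = p(topic|doc) * p(term|topic).
                double[] resp = new double[numTopics];
                double norm = 0.0;
                for (int topic = 0; topic < numTopics; topic++) {
                    resp[topic] = docTopics[topic] * termGivenTopic[topic][term];
                    norm += resp[topic];
                }
                if (norm == 0.0) {
                    continue;
                }
                for (int topic = 0; topic < numTopics; topic++) {
                    next[topic] += termCounts[term] * resp[topic] / norm;
                }
            }
            double total = 0.0;
            for (double v : next) {
                total += v;
            }
            if (total == 0.0) {
                break;  // empty document or no overlap with the model
            }
            for (int topic = 0; topic < numTopics; topic++) {
                docTopics[topic] = next[topic] / total;
            }
        }
        return docTopics;
    }

    public static void main(String[] args) {
        double[][] termGivenTopic = {{0.9, 0.1}, {0.1, 0.9}};
        int[] doc = {5, 1};
        System.out.println(Arrays.toString(inferDocTopics(termGivenTopic, doc, 10)));
    }
}
```

If the previous iteration's p(topic|doc) were available to the mapper, the uniform initialization could be replaced by that distribution and `maxIters` reduced, which is the saving the map-side join would buy.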
ModelTrainer already takes advantage (in maybe the not-nice way) of multi-core
availability by doing multithreaded learning; see that class for details.

Constructor and Description |
---|
CachingCVB0Mapper() |
Modifier and Type | Method and Description |
---|---|
protected void | cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) |
protected int | getMaxIters() |
protected ModelTrainer | getModelTrainer() |
protected int | getNumTopics() |
void | map(org.apache.hadoop.io.IntWritable docId, VectorWritable document, org.apache.hadoop.mapreduce.Mapper.Context context) |
protected void | setup(org.apache.hadoop.mapreduce.Mapper.Context context) |
protected ModelTrainer getModelTrainer()
protected int getMaxIters()
protected int getNumTopics()
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException
Overrides: setup in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Throws: IOException, InterruptedException
public void map(org.apache.hadoop.io.IntWritable docId, VectorWritable document, org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException
Overrides: map in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Throws: IOException, InterruptedException
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException
Overrides: cleanup in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Throws: IOException, InterruptedException
Copyright © 2008–2015 The Apache Software Foundation. All rights reserved.