Subject: Mahout: NB Model for Text Classification - In Sample Error


This test plan is pretty reasonable.  There is inherently going to be some
form of bias due to the time shift, but the bias is real and will affect
your test results the same way it will affect your operational accuracy.
It might be somewhat interesting to estimate the size of that effect by also
testing on a held-out sample from the same time period as your training
data, but that is really mostly of academic interest.

What I would recommend is that you use some additional techniques to
increase your training set size.  Active learning is a classic technique
which can help you build a relatively small training set that gives
performance comparable to the performance you would get without
down-sampling.  Transduction would let you use the untagged data to improve
your model without increasing the number of tagged samples.

One simple approach for active learning is to repeatedly take new training
samples of untagged messages that are stratified on your first model's
score.  In addition, it makes sense to also sample messages that have
significant numbers of terms that do not appear in your positive training
examples.  These methods are much simpler than doing active learning by the
book, but give similar results.
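
The two sampling ideas above can be sketched roughly as follows.  This is
only an illustration, not Mahout code: it assumes the first model emits a
score in [0, 1] for each untagged message, and the function names, the
number of bins and the per-bin quota are all made up for the example.

```python
import random
from collections import defaultdict


def stratified_sample_by_score(scores, n_bins=10, per_bin=5, seed=0):
    """Pick up to per_bin message indices from each equal-width score bin.

    scores: one model score per untagged message, assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        bins[b].append(idx)
    picked = []
    for b in sorted(bins):
        members = bins[b]
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked


def unseen_term_fraction(tokens, positive_vocab):
    """Fraction of a message's distinct terms missing from positive_vocab."""
    terms = set(tokens)
    if not terms:
        return 0.0
    return len(terms - positive_vocab) / len(terms)
```

The messages returned by stratified_sample_by_score, plus those with a high
unseen_term_fraction, would then be tagged by hand and added to the training
set before the next round.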

For transduction, a very simple method is to use your first model to tag the
rest of your untagged data and then train a model using this larger training
set.  This helps because the new model effectively extends to terms that
co-occur with terms it already knows.  Again, this is less effective than
more formally defined transduction methods, but it can be surprisingly
effective.
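
A minimal sketch of that tag-then-retrain loop, again as an illustration
rather than anything Mahout provides: the train callable, the
(tokens, label) layout and the min_confidence cutoff are assumptions made
for the example.  With the default cutoff of 0.0 every untagged message is
kept, which matches the simple method described above.

```python
def self_train(tagged, untagged, train, min_confidence=0.0):
    """One round of tagging the untagged data with a model and retraining.

    tagged: list of (tokens, label) pairs that were tagged by hand.
    untagged: list of token lists without labels.
    train: hypothetical callable that takes (tokens, label) pairs and
        returns a predict(tokens) -> (label, confidence) function.
    """
    predict = train(tagged)
    machine_tagged = []
    for tokens in untagged:
        label, confidence = predict(tokens)
        if confidence >= min_confidence:
            machine_tagged.append((tokens, label))
    # Retrain on the hand-tagged plus the machine-tagged examples.
    return train(tagged + machine_tagged), machine_tagged
```

Any trainer with that shape can be plugged in.  Raising min_confidence keeps
only the messages the first model is fairly sure about, which is a common
variation and not part of the simple method above.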

Finally, I would recommend that you consider algorithms other than
Naive Bayes for your basic model.  This is based on the fact that you only
have a small training set and Naive Bayes depends in part on having a
relatively large number of training examples in order to get a good model.

On Fri, Nov 18, 2011 at 10:43 PM, Night Wolf <[EMAIL PROTECTED]> wrote: