Subject: Machine Learning Question


I think I understand your question.  To make sure, here it is in my terms:

- you have documents with tag tokens in the fid field

- you have a bunch of rules for defining which documents appear where in
your hierarchy.  These rules are defined as Lucene queries.

- when you get a new document, it is slow to run every one of these queries
against it.

- you would like to run these queries very quickly in order to update your
hierarchy quickly and to provide author feedback.  Using ML would be a
spiffy way to do this and might provide hints for updating your hierarchy
rules.

My first suggestion for you would be to consider building a one-document
index for the author feedback situation.  Running all of your rules against
that index should be pretty darned fast.  That doesn't help with some of the
other issues and might be hard to do with Solr, but it would be easy with
raw Lucene.  You should be able to run several thousand rules per second
this way.
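
Concretely, Lucene's contrib MemoryIndex class does exactly this
one-document trick: you index the single new document in memory and run
each rule's query against it.  A minimal sketch against the Lucene 3.0
API (the tags and the sample rule are made up for illustration):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class RuleChecker {
  public static void main(String[] args) throws Exception {
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();

    // Index the one new document in memory.
    MemoryIndex index = new MemoryIndex();
    index.addField("fid", "sports baseball playoffs", analyzer);

    // Each rule is just a Lucene query; loop over all of them like this.
    QueryParser parser = new QueryParser(Version.LUCENE_30, "fid", analyzer);
    Query rule = parser.parse("sports AND baseball");
    if (index.search(rule) > 0) {
      // a non-zero score means the document matches this rule
      System.out.println("matches: " + rule);
    }
  }
}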

That doesn't answer the question you asked, though.  The answer there is
yes.  Definitely.  There are a number of machine learning approaches that
could reverse engineer your rules to give you new rules that can be
evaluated very quickly.  Some learning techniques and some configurations
would not replicate your rules exactly, but others would likely give you
perfect replication.  Random forests will probably give you accurate
results, as would logistic regression (referred to as SGD in Mahout),
especially if you use interaction variables (features that depend on the
presence of tag combinations).  You will probably need to do a topological
sort because it is common for hierarchical structures to have rules that
exclude a node from a child if it appears in the parent (or vice versa).
Thus, you would want to evaluate rules in dependency order and augment the
document with any category assignments as you go down the rule list.
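
To make that ordering concrete, here is a sketch of the evaluation loop
using Kahn's topological sort.  The Rule class is hypothetical, and
matches() stands in for running the rule's actual query:

import java.util.*;

class Rule {
  String category;              // the category this rule assigns
  Set<String> dependsOn;        // categories whose assignments it reads

  boolean matches(Set<String> tags, Set<String> assigned) {
    return false;               // stand-in for evaluating the real query
  }
}

class RuleEvaluator {
  // Kahn's algorithm: rules with no unmet dependencies come first.
  static List<Rule> dependencyOrder(List<Rule> rules) {
    Map<String, Rule> producer = new HashMap<String, Rule>();
    for (Rule r : rules) {
      producer.put(r.category, r);
    }
    Map<Rule, Integer> inDegree = new HashMap<Rule, Integer>();
    Map<Rule, List<Rule>> dependents = new HashMap<Rule, List<Rule>>();
    for (Rule r : rules) {
      inDegree.put(r, 0);
      dependents.put(r, new ArrayList<Rule>());
    }
    for (Rule r : rules) {
      for (String category : r.dependsOn) {
        Rule p = producer.get(category);
        if (p != null) {
          dependents.get(p).add(r);
          inDegree.put(r, inDegree.get(r) + 1);
        }
      }
    }
    Deque<Rule> ready = new ArrayDeque<Rule>();
    for (Rule r : rules) {
      if (inDegree.get(r) == 0) {
        ready.add(r);
      }
    }
    List<Rule> order = new ArrayList<Rule>();
    while (!ready.isEmpty()) {
      Rule r = ready.remove();
      order.add(r);
      for (Rule d : dependents.get(r)) {
        inDegree.put(d, inDegree.get(d) - 1);
        if (inDegree.get(d) == 0) {
          ready.add(d);
        }
      }
    }
    return order;  // fewer entries than rules means there was a cycle
  }

  // Evaluate in dependency order, augmenting the document as we go.
  static Set<String> categorize(List<Rule> rules, Set<String> tags) {
    Set<String> assigned = new HashSet<String>();
    for (Rule r : dependencyOrder(rules)) {
      if (r.matches(tags, assigned)) {
        assigned.add(r.category);
      }
    }
    return assigned;
  }
}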

Operationally, you would need to do some coding, and not all of the pieces
you need are fully baked yet.  The first step is vectorization of your tag
list for many documents.  Robin has recently checked in some good code for
that, and Drew has a more elaborate document model right behind it.  You
can also vectorize directly from a Lucene index, which is probably very
convenient for you.  That gives you training data.
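
In case it helps to see what the vectorization amounts to, here is a
hand-rolled sketch (Robin's code is the thing to actually use; the
dictionary handling here is naive, and the interaction features are the
tag-pair trick mentioned above):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

class TagVectorizer {
  private final Map<String, Integer> dictionary = new HashMap<String, Integer>();
  private final int cardinality;

  TagVectorizer(int cardinality) {
    this.cardinality = cardinality;
  }

  // Naive dictionary: each new token gets the next free index.  A real
  // implementation has to cap or hash this to stay under cardinality.
  private int indexOf(String token) {
    Integer i = dictionary.get(token);
    if (i == null) {
      i = dictionary.size();
      dictionary.put(token, i);
    }
    return i;
  }

  Vector vectorize(List<String> tags) {
    Vector v = new RandomAccessSparseVector(cardinality);
    for (String tag : tags) {
      v.set(indexOf(tag), 1.0);  // one binary feature per tag
    }
    // Interaction features: one feature per tag pair, so a linear model
    // can learn rules that fire only on tag combinations.
    for (int i = 0; i < tags.size(); i++) {
      for (int j = i + 1; j < tags.size(); j++) {
        v.set(indexOf(tags.get(i) + "&" + tags.get(j)), 1.0);
      }
    }
    return v;
  }
}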

Training the classifiers will take a bit, since you need to train pretty
much one classifier per category (unless you know that a document can have
only one category).  That shouldn't be hard, however, and with lots of
examples the training should converge to perfect performance pretty
quickly.  The command-line form for running training is evolving a bit
right now, and your feedback would be invaluable.
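
In code, the per-category training loop with the SGD classifier looks
roughly like this on current trunk (the TrainingExample holder and the
pass count are made up, and the API may shift under you):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

class Trainer {
  // Hypothetical holder for one vectorized document and its categories.
  static class TrainingExample {
    Vector vector;
    Set<String> categories;
  }

  static Map<String, OnlineLogisticRegression> train(List<String> categories,
      List<TrainingExample> examples, int numFeatures) {
    // One binary (2-category) logistic model per category.
    Map<String, OnlineLogisticRegression> models =
        new HashMap<String, OnlineLogisticRegression>();
    for (String category : categories) {
      models.put(category,
          new OnlineLogisticRegression(2, numFeatures, new L1()));
    }
    // A few passes over the data; labels generated by deterministic rules
    // are separable, so convergence should be quick.
    for (int pass = 0; pass < 10; pass++) {
      for (TrainingExample ex : examples) {
        for (String category : categories) {
          int target = ex.categories.contains(category) ? 1 : 0;
          models.get(category).train(target, ex.vector);
        }
      }
    }
    return models;
  }
}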

Deploying the classifiers should not be too difficult, but you would be in
slightly new territory there since I don't think that many (any) people have
deployed Mahout-trained classifiers in anger just yet.
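
For what it's worth, the deploy-time scoring is only a few lines once you
have the models and the vectorizer from above (the 0.5 threshold is an
arbitrary choice):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

class Scorer {
  static Set<String> categorize(Map<String, OnlineLogisticRegression> models,
      Vector doc) {
    Set<String> assigned = new HashSet<String>();
    for (Map.Entry<String, OnlineLogisticRegression> e : models.entrySet()) {
      // classifyScalar returns the probability of the positive class
      // for a 2-category model.
      if (e.getValue().classifyScalar(doc) > 0.5) {
        assigned.add(e.getKey());
      }
    }
    return assigned;
  }
}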

Does this help?

Ted Dunning, CTO
DeepDyve