Subject: seq2sparse and lsi fold-in


There are two dictionary-like systems in Mahout.  Neither is quite right.

The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.  It
doesn't do the frequency counting you want.

The more complex one is in DictionaryVectorizer.  Unfortunately, it is a
mass of static functions that depend on statically named files rather than
being a real API.

There is a third choice as well, in
org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder.  It does
on-line IDF weighting and can be used underneath a text encoder to get
on-line TF-IDF weighting of the sort you desire.  You can preset counts
using the getDictionary accessor.
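
To make the on-line IDF idea concrete, here is a minimal sketch in plain
Java.  It is not Mahout's AdaptiveWordValueEncoder and does not use any
Mahout API; the class name OnlineIdf, its methods, and the smoothed
log((N + 1) / (df + 1)) weight are all assumptions chosen for illustration.
Counts are updated as each document is seen, so a weight is available at
any point in the stream, and counts can be seeded up front.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical sketch; NOT Mahout's AdaptiveWordValueEncoder.
class OnlineIdf {
    // Number of documents seen so far that contain each word.
    private final Map<String, Integer> docCounts = new HashMap<>();
    private int totalDocs = 0;

    // Seed counts from an earlier pass over some corpus.
    void preset(Map<String, Integer> counts, int docs) {
        counts.forEach((w, c) -> docCounts.merge(w, c, Integer::sum));
        totalDocs += docs;
    }

    // Update counts with one document; each distinct word counts once.
    void observe(Collection<String> words) {
        totalDocs++;
        for (String w : new HashSet<>(words)) {
            docCounts.merge(w, 1, Integer::sum);
        }
    }

    // Smoothed IDF from the counts seen so far (an assumed formula).
    double weight(String word) {
        int df = docCounts.getOrDefault(word, 0);
        return Math.log((totalDocs + 1.0) / (df + 1.0));
    }

    public static void main(String[] args) {
        OnlineIdf idf = new OnlineIdf();
        idf.observe(Arrays.asList("lsi", "fold", "in"));
        idf.observe(Arrays.asList("lsi", "sparse"));
        System.out.println("lsi:    " + idf.weight("lsi"));
        System.out.println("sparse: " + idf.weight("sparse"));
    }
}
```

A word that appears in every document seen so far gets weight zero, and
rarer words get larger weights.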

A fourth choice is to simply use a static word encoder with hashed vectors
and apply the IDF weighting as an element-wise vector multiplication.  That
way you only need to keep around a vector of weights and no dictionary.
That should be much cheaper in memory.
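
A rough sketch of that fourth option, again in plain Java with no Mahout
classes.  The names HashedTfIdf, DIM, slot, encode, idfWeights and times
are hypothetical, the hashing is just String.hashCode modulo the vector
size, and the smoothed IDF formula is an assumption.  The point is only
that the sole state kept between documents is one array of weights.

```java
// Hypothetical sketch; plain arrays stand in for hashed vectors.
class HashedTfIdf {
    static final int DIM = 1 << 10; // size of the hashed feature space (assumed)

    // Map a word to a slot; colliding words simply share a slot.
    static int slot(String word) {
        return Math.floorMod(word.hashCode(), DIM);
    }

    // Hashed term-frequency vector for one document.
    static double[] encode(String[] words) {
        double[] v = new double[DIM];
        for (String w : words) {
            v[slot(w)] += 1.0;
        }
        return v;
    }

    // One IDF weight per slot, computed from already hashed documents.
    static double[] idfWeights(double[][] docs) {
        double[] df = new double[DIM];
        for (double[] d : docs) {
            for (int i = 0; i < DIM; i++) {
                if (d[i] > 0) {
                    df[i] += 1.0;
                }
            }
        }
        double[] w = new double[DIM];
        for (int i = 0; i < DIM; i++) {
            w[i] = Math.log((docs.length + 1.0) / (df[i] + 1.0));
        }
        return w;
    }

    // Element-wise product: TF vector times IDF weight vector.
    static double[] times(double[] v, double[] w) {
        double[] out = new double[DIM];
        for (int i = 0; i < DIM; i++) {
            out[i] = v[i] * w[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] docs = {
            encode(new String[] {"lsi", "fold", "in"}),
            encode(new String[] {"lsi", "sparse"})
        };
        double[] weights = idfWeights(docs);
        double[] tfidf = times(docs[1], weights);
        System.out.println("lsi:    " + tfidf[slot("lsi")]);
        System.out.println("sparse: " + tfidf[slot("sparse")]);
    }
}
```

Memory use is fixed at DIM doubles no matter how large the vocabulary
grows, at the cost of hash collisions blending the weights of unrelated
words.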

On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote: