Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!). We are starting February with a fresh Mahout Digest.
When covering Mahout, it seems logical to organize topics around Mahout's own grouping of core algorithms, so that is the structure we'll follow in this post, too:
- Recommendation Engine (Taste)
- Clustering
- Classification
There are, of course, some concepts common to all three areas, some overlap, such as n-grams. Let's talk about n-grams for a bit.
N-grams
There has been a lot of talk about n-gram usage across all of the major subject areas on the Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing. An n-gram is a subsequence of n items from a given sequence of "items". The "items" in question can be anything, though most commonly n-grams are made up of character or word/token sequences. Lucene's n-gram support, provided through NGramTokenizer, tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and a character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream. Word n-grams are sometimes referred to as "shingles", and Lucene helps there, too: when word n-gram statistics or models are needed, ShingleFilter or ShingleMatrixFilter can be used.
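To make this concrete, here is a minimal sketch of both flavors using the Lucene 3.0-era analysis API (constructor signatures and attribute classes changed in later Lucene versions, so treat the exact calls as version-dependent):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NGramExamples {
  public static void main(String[] args) throws Exception {
    // Character n-grams: bigrams and trigrams straight from raw text
    TokenStream chars = new NGramTokenizer(new StringReader("mahout"), 2, 3);
    printTokens(chars); // ma, ah, ho, ou, ut, mah, aho, hou, out

    // Word n-grams ("shingles"): wrap ShingleFilter around a tokenizer
    TokenStream words = new WhitespaceTokenizer(new StringReader("machine learning with mahout"));
    TokenStream shingles = new ShingleFilter(words, 2); // emits unigrams + bigrams
    printTokens(shingles); // machine, "machine learning", learning, ...
  }

  private static void printTokens(TokenStream stream) throws Exception {
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.term());
    }
    stream.close();
  }
}
```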
Classification
The use of character n-grams in the context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams, is discussed here. Since the Naive Bayes classifier is a probabilistic classifier that can treat features of any type, there is no reason it could not be applied to character n-grams, too. Using a character n-gram model instead of a word model in text classification could result in more accurate classification of shorter texts. Our language identifier is a good example of such a classifier (though it doesn't use Mahout), and it provides good results even on short texts.
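To illustrate why nothing about Naive Bayes ties it to word features, here is a toy character n-gram Naive Bayes classifier in plain Java. It is a didactic sketch only (uniform class priors, crude per-class add-one smoothing), not Mahout's implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy Naive Bayes over character n-gram features; illustrative only. */
public class CharNGramBayes {
  private final int n;
  // per-class n-gram counts and per-class total gram counts
  private final Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
  private final Map<String, Integer> totals = new HashMap<String, Integer>();

  public CharNGramBayes(int n) { this.n = n; }

  private List<String> grams(String text) {
    List<String> grams = new ArrayList<String>();
    for (int i = 0; i + n <= text.length(); i++) {
      grams.add(text.substring(i, i + n));
    }
    return grams;
  }

  public void train(String label, String text) {
    Map<String, Integer> classCounts = counts.get(label);
    if (classCounts == null) {
      classCounts = new HashMap<String, Integer>();
      counts.put(label, classCounts);
      totals.put(label, 0);
    }
    for (String gram : grams(text)) {
      Integer c = classCounts.get(gram);
      classCounts.put(gram, c == null ? 1 : c + 1);
      totals.put(label, totals.get(label) + 1);
    }
  }

  /** Picks the class maximizing the sum of log P(gram | class).
   *  Class priors are ignored here, i.e. assumed uniform. */
  public String classify(String text) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String label : counts.keySet()) {
      Map<String, Integer> classCounts = counts.get(label);
      int total = totals.get(label);
      double score = 0.0;
      for (String gram : grams(text)) {
        Integer c = classCounts.get(gram);
        // add-one smoothing with the per-class vocabulary, for simplicity
        score += Math.log((c == null ? 0 : c) + 1.0) - Math.log(total + classCounts.size());
      }
      if (score > bestScore) { bestScore = score; best = label; }
    }
    return best;
  }
}
```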
Clustering
Non-trivial word n-grams (a.k.a. shingles) extracted from a document can be useful for document clustering. Similar to the usual approach based on a document's term vector, this thread proposes using non-trivial word n-grams as a foundation for clustering. For extracting word n-grams or shingles from a document, Lucene's ShingleAnalyzerWrapper is suggested; it wraps the previously mentioned ShingleFilter around another Analyzer. Since clustering (grouping similar items) is an unsupervised type of machine learning, it is always interesting to validate clustering results. In clustering there are no reference training or, more importantly, reference test data, so evaluating how well a clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easy to evaluate visually, they are hard to check with an automated test. Here is an older thread about validating Mahout's clustering output, which resulted in an open JIRA issue.
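One crude automated check is to compare average intra-cluster to inter-cluster similarity: a sane clustering of term vectors should score documents within a cluster as noticeably more similar than documents across clusters. A minimal sketch (cosine similarity over dense vectors, purely illustrative and no substitute for proper evaluation measures):

```java
/** Crude clustering sanity check: mean intra- vs. inter-cluster cosine similarity. */
public class ClusterCheck {
  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  /** vectors[i] is a document's term vector, labels[i] its assigned cluster. */
  static void report(double[][] vectors, int[] labels) {
    double intra = 0, inter = 0;
    int intraN = 0, interN = 0;
    for (int i = 0; i < vectors.length; i++) {
      for (int j = i + 1; j < vectors.length; j++) {
        double sim = cosine(vectors[i], vectors[j]);
        if (labels[i] == labels[j]) { intra += sim; intraN++; }
        else { inter += sim; interN++; }
      }
    }
    // a sane clustering shows intra-cluster similarity well above inter-cluster
    System.out.printf("intra=%.3f inter=%.3f%n", intra / intraN, inter / interN);
  }
}
```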
Recommendation Engine
There is an interesting thread about content-based recommendation: what content-based recommendation really is, and how it should be defined. So far, Mahout has only a Collaborative Filtering-based recommendation engine, called Taste. Two different approaches are presented in that thread. One approach treats content-based recommendation as a Collaborative Filtering problem, or a generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied to an item's attributes or related users' attributes (the usual Collaborative Filtering treats an item as a black box). The other approach is to treat content-based recommendation as a "generalized search engine" problem, where matching the same or similar queries makes two items similar. Think of those queries as being composed of, say, keywords extracted from a user's reading or search history, and this will start making sense. If items have enough textual content, then content-based analysis (similar items are those with similar term vectors) seems like a good approach for implementing content-based recommendation. This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as "recommendation engines" for a while), but content-based recommendation is a recently discussed direction for expanding Mahout.
Algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in the current release Taste does not have a MapReduce version. You can read more about the MapReduce Collaborative Filtering implementation in Mahout's trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at Sematext's recommendation engine.
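For reference, here is roughly what a standard (non-distributed) user-based Taste recommender looks like, wired from a file-backed data model. This sketch uses the long-based ID API of newer Mahout releases (early Taste versions used object-based IDs, so exact signatures vary), and the CSV path is a placeholder:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv lines look like: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // neighborhood of the 10 most similar users
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 5 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```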
In addition to Mahout's basic machine learning algorithms, there are discussions and development in directions that don't fall under any of the above categories, such as collocation extraction. Phrase extractors often use a word n-gram model for co-occurrence frequency counts. Check the thread about collocations, which resulted in a JIRA issue and the first implementation. You can also find more details on how the log-likelihood ratio can be used in the context of collocation extraction in this thread.
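For the curious, the log-likelihood ratio test for a candidate bigram boils down to a computation over the 2x2 contingency table of co-occurrence counts. Here is a self-contained sketch following Ted Dunning's formulation (the same idea underlying the Mahout work, though not its exact code):

```java
/** Log-likelihood ratio for a 2x2 contingency table, after Ted Dunning.
 *  For a candidate bigram "A B":
 *    k11 = count(A followed by B)       k12 = count(A followed by not-B)
 *    k21 = count(not-A followed by B)   k22 = count(neither)
 *  Higher scores indicate stronger collocations. */
public class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // the unnormalized "entropy" used in Dunning's derivation
  private static double entropy(long... counts) {
    long sum = 0;
    double xLogXSum = 0.0;
    for (long c : counts) {
      sum += c;
      xLogXSum += xLogX(c);
    }
    return xLogX(sum) - xLogXSum;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // hypothetical counts for a bigram like "new york" in a ~2.4M-bigram corpus
    System.out.println(logLikelihoodRatio(110, 2442, 111, 2400000));
  }
}
```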
Of course, anyone interested in Mahout should definitely read Mahout in Action (we recently got ourselves a MEAP copy) and keep an eye on the features planned for the upcoming 0.3 release.