Locating Mountains and More with Mahout and Public Weather Dataset

Recently I was playing with Mahout and a public weather dataset. In this post I will describe how I used the Mahout library and weather statistics to fill gaps in weather measurements, and how I managed to locate steep mountains in the US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was just to play and learn something, so the decisions and approaches chosen along the way should not be considered research or serious analysis by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!

Tools & Data

The data and tools used during this effort are the Apache Mahout project and a public weather statistics dataset. Mahout is a machine learning library which provides a handful of machine learning tools; during this effort I used just a small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

Artificial Problems

The problems I decided to attack (I made them up myself, just to set some goals; no serious research was intended) were:

  • using Mahout’s recommendation tools to fill the missing data points in weather statistics data
  • using Mahout’s “similarity tools” to locate nature’s relief influencers on the weather conditions, like mountains, based on weather statistics data

The first would allow filling in gaps in measurements at some of the weather stations. In the second problem the system would locate physical barriers and other natural influencers on the weather. By comparing the findings with a physical map one could judge, e.g., which mountains isolate weather from spreading to certain areas. Please forgive me if I use the wrong terms, but I really find the outcome quite interesting. Read on!

Using Recommendation Engine to Fill Missing Weather Measurements

The core (simple) idea is to use a recommendation engine to fill in missing measurements. I used User-based recommendation specifically, treating a station as a user, a date as an item and the temperature as a preference. I know, many would say this isn’t a good way to approach the problem, but for my purpose of learning it worked well.
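With this mapping, the input is easy to feed to Mahout’s FileDataModel, which expects simple userID,itemID,value lines. A made-up fragment (station IDs, dates and temperatures below are invented purely for illustration):

```
72494,20100301,54.0
72494,20100302,56.5
72386,20100301,63.2
```

Each line reads: station 72494 (the “user”) has a “preference” of 54.0 degrees for the “item” that is March 1, 2010.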

The Data

The dataset contains 18 surface meteorological elements, from which I selected just a few to be used. I chose to work only with temperature to simplify things, though I understood it might not be enough to reach any interesting result. Moreover, one could argue that it makes much more sense to use precipitation to locate objects such as mountains, which affect it a lot. I chose a simpler path, though with quite a big risk of getting nothing. In order to be able to iterate fast I also used just 2 years of data from the stations located in the US. I actually tried to use only California’s stations, but this didn’t work well (at least in the beginning, before I got to tuning the system logic). For this first problem I didn’t use any of the stations’ location and altitude information, to make things more interesting.

To evaluate the system I simply divided the dataset into two pieces, using one as the training sample and the other as the evaluation sample. Please see the code below.

Data Cleaning

I had to clean some of the data before I could use it for training the recommender. It turned out there were stations which have different IDs but the same location in the dataset. I cleaned these up to keep only one station per location and avoid giving the same station increased weight. I also removed Alaska’s stations and most of those not on the continental area, as they usually stand alone, their weather conditions are unrelated to the others’, and they only bring noise to the recommender.

Preparing Input Data for Mahout

So, in our interpretation stations are users and days (dates) with temperature are items with a user preference, but there’s more to it. It makes sense to use the connection between dates, which is simply dropped by this interpretation: dates may stand close to each other or far apart, and comparing how the temperature changed over some period helps to judge the “weather closeness” of two stations. To make use of that when calculating similarity I also added <date, diff with the N days ago> pairs. E.g. for input:


I got these preferences:


I divided the difference with the “N days ago” value by N, so that diffs against days further back are weighted less. Otherwise the difference between days far apart would tend to be bigger than that between close days, while it is the close-days difference which is actually more interesting. For the same purpose I tested with up to 5 extra pairs (diffs with up to 5 previous days).

Moreover, when it comes to comparing the change, it may not really matter by how much the temperature changed; it is enough to know that it changed by at least some value d (onlyDirectionOfChange=true, changeWeight=d in the results below). So, e.g. given value d=2.0 and comparing the change with the previous 2 days (prevDays=2 in the results below), the example data above is going to look like this:


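The transformation described above can be sketched in code. This is a toy illustration of the idea only (method and variable names are mine, not from the original implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class DiffPrefs {

  // For day i and lookback n, the extra "preference" is (t[i] - t[i-n]) / n,
  // so differences against days further back are weighted less.
  // (The association of each diff with its date is omitted for brevity.)
  public static List<Double> diffPrefs(double[] temps, int prevDays) {
    List<Double> prefs = new ArrayList<>();
    for (int i = 0; i < temps.length; i++) {
      for (int n = 1; n <= prevDays && i - n >= 0; n++) {
        prefs.add((temps[i] - temps[i - n]) / n);
      }
    }
    return prefs;
  }

  // Thresholded variant (onlyDirectionOfChange=true): only the direction of a
  // change of at least d matters, so the value collapses to -d, 0 or +d.
  public static double thresholded(double diff, double d) {
    if (diff >= d) return d;
    if (diff <= -d) return -d;
    return 0.0;
  }
}
```

For temperatures 10, 12, 11 and prevDays=2 this yields the diffs 2.0 (day 2 vs. day 1), -1.0 (day 3 vs. day 2) and 0.5 (day 3 vs. day 1, divided by 2).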
Running the Recommender & Evaluating Results

Once the data is prepared and you know which recommender parameters to use (including similarities and their parameters and such), training the recommender and evaluating the results is very simple. Please find below the code for non-distributed recommendation calculation.

DataModel model = new FileDataModel(statsFile);
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) throws TasteException {
    UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(neighbors, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};
return evaluator.evaluate(builder, null, model, 0.95, 1.0);

The code above shows just one of many possible recommender configurations (and it doesn’t show the fact that we use more data for calculating user similarity, as explained in the previous section). Since we really care about the numbers, we throw away similarities that don’t take the preference value into account, like TanimotoCoefficientSimilarity. During the (limited) tests I ran, it appeared that the simplest EuclideanDistanceSimilarity worked best.


My (limited) tests showed the following best results with configurations:

SCORE: 2.03936; neighbors: 2, similarity: {onlyDirectionOfChange=true, changeWeight=2.0, prevDays=1}
SCORE: 2.04010; neighbors: 5, similarity: {onlyDirectionOfChange=true, changeWeight=5.0, prevDays=2}
SCORE: 2.04159; neighbors: 2, similarity: {onlyDirectionOfChange=false, prevDays=1}


  • “score” is the recommendation score, i.e. in our case the average absolute difference in recommended value and actual value
  • “neighbors” is the number of neighbors used (in NearestNUserNeighborhood) to provide recommendation
  • “onlyDirectionOfChange” is whether we use the absolute change value when comparing with previous days (value is false) or we compare with a certain threshold as explained above (value is true)
  • “changeWeight” is the threshold to compare the change with
  • “prevDays” is the number of previous days to compare with

Using Statistics-based Similarity to Locate Weather Influencers

The core (simple) idea is to calculate the similarity between stations based on weather statistics and compare it with the physical distance between the stations: if there’s a great difference, then assume that there’s something in between those stations that makes the weather noticeably different. Then, the only things we need to define are:

  • how to calculate similarity
  • how to calculate physical distance
  • what is a *noticeable* difference between the weather-stats-based similarity and the physical distance
  • where this “in between” is located

The plan was to use Mahout’s user similarities to calculate the distance between stations based on weather statistics data (treating it as user preferences, as in the first part) and compare it with the physical distance. The evaluation of the results is not as well automated as in the previous part, though it could have been, had I more time for this. To evaluate, I just plotted the results and compared the image with a physical map of the US.

The Data

The same data was used as in the first part, plus the physical location of each station, which was used to calculate the physical distance between stations.

Data Cleaning

Data cleaning followed the same logic as in the first part, though I did more severe cleaning of stations not on the continent and near the shore: we all know the ocean influences the weather a lot, so if I hadn’t done that, all shore points would have been considered “outliers”. Actually, not only ocean-shore stations, but most of those near the US border were removed, for the sake of the removal algorithm’s simplicity.

Preparing Input Data for Mahout

The input data were prepared the same way as in the first part: the “user preferences” contained <station, temperature> pairs plus the added <date, diff with the N days ago> pairs. Note that there are many ways to calculate distance between stations using weather stats; I simply chose the one which would allow me to reuse the same prepared data files from the first part of the experiment.

Running the Implementation & Evaluating Results

Let’s have a closer look at how each item from the list in the idea overview section was defined.

How to calculate similarity?

As mentioned above, I chose a simple similarity calculation (similar to the first part). Simple EuclideanDistanceSimilarity worked well.

How to calculate physical distance?

Physical distance was calculated as the Euclidean distance between stations using latitude, longitude and altitude. The latitude was given a much greater weight, because temperature is affected a lot by the latitude coordinate: the further north you go (in the Northern Hemisphere, where the US is) the lower the temperature, even without any physical relief influencers. Altitude was also given a stronger weight, because it has a similarly strong effect on the temperature.
And, of course, to make distance comparable with the calculated similarity we need it to be in the 0..1 range (with a value close to 1 indicating the smallest distance). Hence distance was calculated as 1 / (1 + phys_dist(station1, station2)).
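A minimal sketch of this weighted distance and the 1 / (1 + d) mapping. The actual weights used in the experiment are not given in the post, so here they are passed in as parameters; all names are mine:

```java
public class StationDistance {

  // a, b = {latitude, longitude, altitude} for two stations; w = per-coordinate
  // weights, boosting the coordinates that influence temperature the most.
  public static double physDist(double[] a, double[] b, double[] w) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = w[i] * (a[i] - b[i]);
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Map distance into (0, 1]: identical locations give 1.0,
  // far-apart stations approach 0.
  public static double distSimilarity(double d) {
    return 1.0 / (1.0 + d);
  }
}
```

This makes the physical distance directly comparable with Mahout’s similarity scores, which also live in a bounded range.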

What is a *noticeable* difference between weather stats based similarity and physical distance?

I assumed an exponential dependency between the calculated physical distance of two stations and the similarity calculated from weather stats. I found an average growth rate (of the similarity given the physical distance) using all station pairs (remember that we don’t have a huge number of stations, so we can afford that) and used it to detect those pairs that had a much greater difference between their physical distance and the similarity calculated from weather stats.
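One way to sketch this outlier detection: model the stats-based similarity as roughly exp(-k · physDist), estimate the average rate k over all pairs, and flag pairs whose measured similarity falls far below the prediction. This is my own illustration of the idea, not the original code; the model and names are assumptions:

```java
public class OutlierPairs {

  // Assume statsSim ≈ exp(-k * physDist); estimate the average rate k
  // over all station pairs.
  public static double averageRate(double[] physDist, double[] statsSim) {
    double sum = 0;
    for (int i = 0; i < physDist.length; i++) {
      sum += -Math.log(statsSim[i]) / physDist[i];
    }
    return sum / physDist.length;
  }

  // A pair is an outlier if its measured similarity is much lower than the
  // exponential model predicts: likely something "in between" disturbs the weather.
  public static boolean isOutlier(double physDist, double statsSim,
                                  double k, double factor) {
    double predicted = Math.exp(-k * physDist);
    return statsSim < predicted / factor;
  }
}
```

For each flagged pair, a “physical influencer” point would then be placed at the midpoint between the two stations, as described below.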

Where is this “in between” located?

For each “outlier” pair detected, I put a single “physical influencer” point on the map, whose location is exactly the middle point between the two stations (unweighted latitude and longitude were used to calculate it).


The result is best presented and evaluated by comparing the following two images: the first created by our system, and the second being a physical map of the US.

red – cleaned “far standing” stations
green – stations which stats were used
blue – found weather “influencers”
circles – some of the cities (Seattle, San-Francisco, Los-Angeles, San-Antonio, Salt Lake City, Miami, New York City, Boston, Chicago)

Compare with:

Note how the steep mountains stand out in the first image. Of course, there’s noise from lakes, gulfs and other sources. But look how well it drew the relief of California!

What Else?

There are a number of things that can be done to obtain even more interesting results. Just to name a few:
* use not only temperature but other measures, e.g. precipitation as noted in the beginning of the post
* show influencers with different color (heat map) depending on how they affect the weather
* and more


I touched only a very small piece of the great collection of machine learning tools offered by Mahout and still managed to get very interesting (and useful?) results. I also used just a small part of the public weather dataset available to everyone, and yet it was enough to get a meaningful outcome. And there are so many different public datasets available online! There are so many exciting things one can do with a powerful tool like Mahout and a pile of data.


Hiring: Data Mining, Analytics, Machine Learning Hackers

If you want to work with search, big data mining, analytics, and machine learning, and you are a positive, proactive, independent creature, please keep reading.  We are looking for devops to hack on Sematext’s new products and services, as well as provide services to our growing list of clients.  Working knowledge of Mahout or a statistics/machine learning/data mining background would be a major plus.

Skills & experience (the more of these you have under your belt the better):

  • Data mining and/or machine learning (Mahout or …)
  • Big data (HBase or Cassandra or Hive or …)
  • Search (Solr or Lucene or Elastic Search or …)

More about an ideal you:

  • You are well organized, disciplined, and efficient
  • You don’t wait to be told what to do and don’t need hand-holding
  • You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
  • You have an eye for detail, don’t like sloppy code, poor spelelling and typous
  • You are able to communicate complex ideas in a clear fashion in English, clean and well designed code, or pretty diagrams

Optional bonus points:

  • You like to write or speak publicly about technologies relevant to what we do
  • You are an open-source software contributor

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis and we present at conferences.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, Hive, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team spanning 3 continents and 6 countries.  We communicate via email, Skype voice/IM, BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to Far East.

Interested? Please send your resume to jobs @ sematext.com and feel free to check out our other positions.

Google Summer of Code and Intern Sponsoring

Are you a student and looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has identified initial potential projects, but this doesn’t mean you can’t also pick your own.  If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why); just be sure to discuss with the community first (send an email to dev@lucene.apache.org).

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in work on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual and geographically distributed organization whose members are spread over several countries and continents and we welcome students from all across the globe.  For more information please inquire within.

Lucene and Solr: 2010 in Review

Lucene has been around for 10+ years and Solr for 4+ years.  It’s amazing that, mature as these tools are, there is still very rapid development and improvement going on.  We are not talking about polishing of the APIs or minor tweaks here and there, but serious development in the heart of both of these tools.  When you know this, it’s even more amazing to hear commercial search vendors spread FUD about tools like Lucene or Solr not being ready for serious business, large scale, high performance, etc.  Those 5000-6000 daily downloads of Lucene/Solr/Nutch/etc. (see the graph, scroll down on the page) must be from people who simply don’t know better than to download this free stuff…

But let’s not go down that path.  Below are some of the Lucene & Solr highlights from 2010.

The Merge

Lucene and Solr code bases were merged early in 2010.   Development mailing lists merged, but user lists remained separate, as did release artifacts.  The code repository went through major reorganization resulting in the addition of the “modules” section that currently hosts only the analysis package (this contains numerous analyzers, tokenizers, stemmers – over 400 Java classes so far.  Why is this good?  Because tools like our Key Phrase Extractor can now use just the jar from the analysis package instead of having to use the whole Lucene jar if all they really want is access to Lucene’s tokenizers, for example.).  In short, things are working out well after the merge.

Code, Releases…

In 2010 Lucene saw 3 releases: 3.0.1, 3.0.2, and 3.0.3, as well as 2.9.2, 2.9.3, and 2.9.4.  Solr 1.4.1 was released, too.  Subversion repository got some new branches which essentially means parallel development at increased pace, more experimentation, more freedom to change the code, etc.  Ultimately it’s the users of Lucene and Solr who reap major benefits from this.  In 2011 we’ll most likely see Lucene 4.0, as well as SolrCloud version of Solr, both of which will bring speed improvements, lower memory footprint, flexible indexing, and a bunch of other good stuff that we’ll write about in our Lucene Digests and Solr Digests on this blog in 2011.

Top Level Projects, Incubator, New Sub-Project

Three former Lucene sub-projects became Top Level Projects: Mahout, Nutch, and Tika.  Mahout 0.3 and 0.4 were released.  Nutch 1.1 and 1.2 were released and work is under way to get Nutch 2.0 out in 2011.  This new Nutch 2.0 includes some major improvements, such as great use of HBase.  After some semi-stagnation, it feels like Nutch is getting some more love from contributors and developers.  Tika is developing rapidly and also releasing rapidly with releases 0.6, 0.7, and 0.8 happening in 2010 and 0.9 being mentioned on the mailing list already.

Lucene ecosystem got a new sub-project in 2010: ManifoldCF (previously known as Lucene Connectors Framework). The code was donated by MetaCarta and it includes connectors for various enterprise data sources, such as Microsoft Sharepoint or EMC Documentum, as well as the file system, Web, or RSS and Atom feeds.  Importantly, ManifoldCF includes a Security Model and has the ability to index documents with Solr.

At the same time, Lucy (the Lucene C port) went to the Incubator.  Lucene.Net is on its way to the Incubator, too.  In short, both projects need to work on building a more active development community.


Conferences

Lucene Eurocon, held last May in Prague, was the first Lucene-focused conference.  It was followed by Lucene Revolution in October in Boston, where we presented how we built search-lucene.com and search-hadoop.com.


Books

Lucene in Action, 2nd edition was published by Manning, and a book on Solr was published by Packt.  Mahout in Action is nearly done, and Tika in Action is in the works.  A book on Nutch is also getting started.


We built a Lucene/Solr-powered search-lucene.com and the sister search-hadoop.com sites, where one can search all mailing list archives, JIRA issues, source code, javadoc, wiki, and web site for all (sub-) projects at once, facet on sub-projects, data sources, and authors, get short links for any mailing list message handy for sharing, view mailing list messages in threaded or non-threaded view, see search term highlighted not only on search results page, but also in mailing list messages themselves (click on that “book on Nutch” link above for an example), etc.

If you’d like to keep up with Lucene and Solr news in 2011, as well as keep an eye on Nutch, Mahout, and Tika, you can follow @lucene on Twitter – a low volume source of key developments in these projects.

Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout during the last few months.  We are pleased with what’s been keeping us busy, but are not happy about our irregular Mahout Digests.  We had covered the last (0.3) release with all of its features, and we are not going to miss covering a very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout since the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people).  See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings overall changes regarding model refactoring and the command line interface, aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command line interface is pretty much standardized for working with all the various options now, which makes Mahout easier to run and use. Interfaces are better and more consistent across algorithms, and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes, and the download is available from the Apache Mirrors.

Now let’s add some context to various changes and new features.

GSoC projects

Mahout’s Google Summer of Code projects have wrapped up, and two of them completed successfully:

  • EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328), proposal and implementation details can be found in MAHOUT-363
  • Hidden Markov Models based sequence classification (proposal for a summer-term university project), proposal and implementation details in  MAHOUT-396

Two projects did not complete due to lack of student participation and one remains in progress.


Clustering

The biggest additions in the clustering department are the EigenCuts clustering algorithm (a GSoC project) and MinHash-based clustering, which we covered as a possible GSoC suggestion in one of our previous digests. MinHash clustering was implemented, but not as a GSoC project. In the first digest from the Mahout series we covered problems related to the evaluation of clustering results (an unsupervised learning issue), so a big addition to Mahout’s clustering are the Cluster Evaluation Tools, featuring the new ClusterEvaluator (which uses Mahout in Action code for inter-cluster density and similar code for intra-cluster density over a set of representative points, not the entire clustered data set) and CDbwEvaluator, which offers new ways to evaluate clustering effectiveness.

Logistic Regression

Online learning capabilities such as a Stochastic Gradient Descent (SGD) implementation are now part of Mahout. Logistic regression is a model used for predicting the probability of the occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer’s propensity to purchase a product or cease a subscription. The Mahout implementation uses Stochastic Gradient Descent (SGD); see the initial request and development in MAHOUT-228. The new sequential logistic regression training framework supports a feature vector encoding framework for high-speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.
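The core technique behind this (one SGD update per training example) is easy to illustrate without the library. Below is a minimal, self-contained sketch of logistic regression trained by stochastic gradient descent; it is an illustration of the algorithm only, not Mahout’s implementation, and all names are mine:

```java
public class SgdLogReg {

  public double[] w;     // model weights, one per feature
  public double rate;    // learning rate

  public SgdLogReg(int numFeatures, double rate) {
    this.w = new double[numFeatures];
    this.rate = rate;
  }

  static double sigmoid(double z) {
    return 1.0 / (1.0 + Math.exp(-z));
  }

  // Predicted probability of the positive class for feature vector x.
  public double predict(double[] x) {
    double z = 0;
    for (int i = 0; i < x.length; i++) z += w[i] * x[i];
    return sigmoid(z);
  }

  // One SGD step on a single example (y is 0 or 1): w += rate * (y - p) * x.
  public void train(double[] x, int y) {
    double err = y - predict(x);
    for (int i = 0; i < x.length; i++) w[i] += rate * err * x[i];
  }
}
```

Because each update touches only one example, the model learns online, which is exactly what makes SGD attractive for large data sets.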


Math

There has been a lot of cleanup done in the math module (you can check the details in the Cleanup Math discussion on the ML), lots of it related to the untested Colt framework integration (and deprecated code in the Colt framework). The discussion resulted in several pieces of the Colt framework getting promoted to tested status (QRDecomposition, in particular).


Classification

In addition to speedups and bug fixes, the main new features in classification are new classifiers (new classification algorithms) and more open/uniform input data formats (vectors). The most important changes are:

  • New SGD classifier
  • Experimental new type of Naive Bayes classifier (using vectors) and feature reduction options for the existing Naive Bayes classifier (variable-length coding of vectors)
  • New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
  • Now random forest can be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations, which can be used in Collaborative Filtering (or other areas, like clustering). An implementation of a Map-Reduce job which computes item-item similarities for item-based Collaborative Filtering, based on the algorithm suggested in Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce, can be found in MAHOUT-362. A generalization of the algorithm, based on the mailing list discussion, led to an implementation of a Map-Reduce job which computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for Cooccurrence, Euclidean Distance, Loglikelihood, Pearson Correlation, Tanimoto coefficient, and Cosine). More on the distributed version of any item similarity function (previously available only in a non-distributed implementation) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has evolved into a fully distributed item-based recommender (the implementation depends on how the pairwise similarities are computed). You can read more on the distributed item-based recommender in MAHOUT-420.
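The pairwise row-similarity idea is easy to see in its plain, non-distributed form. Here is a toy cosine-similarity sketch between two matrix rows (illustration only; the Mahout job distributes this computation across all row pairs and supports pluggable similarity measures):

```java
public class RowSimilarity {

  // Cosine similarity between two rows of a matrix:
  // dot(a, b) / (|a| * |b|), in [-1, 1]; 1 means identical direction.
  public static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}
```

In item-based Collaborative Filtering each row would hold one item’s ratings across all users, so this value estimates how similarly two items are rated.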

Implementations of distributed operations on very large matrices are very important for a scalable machine learning library which supports large data sets. For example, when term vectors are built from textual documents/content, they tend to have high dimension. Now, if we consider a term-document matrix where each row represents terms from document(s) while each column represents a document, we obviously end up with a high-dimensional matrix. The same thing occurs in Collaborative Filtering: it uses a user-item matrix containing ratings as matrix values, where a row corresponds to a user and each column represents an item. Again, we have a large-dimension matrix that is sparse.

Now, in both cases (term-document matrix and user-item matrix) we are dealing with high matrix dimensionality which needs to be reduced, while most of the information needs to be preserved (in the best way possible). Obviously we need some sort of matrix operation which will provide a lower-dimension matrix with the important information preserved. For example, a large-dimensional matrix may be approximated in lower dimensions using Singular Value Decomposition (SVD).

It’s obvious that we need some (Java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of a widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA, for example, uses JAMA for matrix operations). Operations on high-dimensional matrices always require heavy computation, and this requirement places high hardware demands on any ML production system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA which, although great, cannot distribute its operations.

In a typical recommendation setup users have often used (interacted with) only a few items from the whole item set (which can be very large), which leads to user-item matrices being sparse. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of very large sparse matrices.

News and Roadmap

All of the new distributed similarity/recommender implementations we analyzed in the previous paragraphs were contributed by Sebastian Schelter, and in recognition of this important work he was elected a new Mahout committer.

The book “Mahout in Action”, published by Manning, has reached 15/16 chapters complete and will soon enter final review.

This is all from us for now.  Any comments/questions/suggestions are more than welcome, and until the next Mahout digest keep an eye on Mahout’s road map for 0.5 or the discussion about what Mahout is missing to become a production-stable (1.0) framework.  We’ll see you next month – @sematext.

Hiring Search and Data Analytics Engineers

We are growing and looking for smart people to join us either in an “elastic”, on-demand, per-project, or more permanent role:

Lucene/Solr expert who…

  • Has built non-trivial applications with Lucene or Solr or Elastic Search, knows how to tune them, and can design systems for large volume of data and queries
  • Is familiar with (some of the) internals of Lucene or Solr or Elastic Search, at least on the high level (yeah, a bit of an oxymoron)
  • Has a systems/ops bent or knows how to use performance-related UNIX and JVM tools for analyzing disk IO, CPU, GC, etc.

Data Analytics expert who…

  • Has used or built tools to process and analyze large volumes of data
  • Has experience using HDFS and MapReduce, and has ideally also worked with HBase, or Pig, or Hive, or Cassandra, or Voldemort, or Cascading or…
  • Has experience using Mahout or other similar tools
  • Has interest or background in Statistics, or Machine Learning, or Data Mining, or Text Analytics or…
  • Has interest in growing into a Lead role for the Data Analytics team

We like to dream that we can find a person who gets both Search and Data Analytics, and ideally wants or knows how to marry them.

Ideal candidates also have the ability to:

  • Write articles on interesting technical topics (that may or may not relate to Lucene/Solr) on Sematext Blog or elsewhere
  • Create and give technical talks/presentations (at conferences, local user groups, etc.)

Additional personal and professional traits we really like:

  • Proactive and analytical: takes initiative, doesn’t wait to be asked or told what to do and how to do it
  • Self-improving and motivated: acquires new knowledge and skills, reads books, follows relevant projects, keeps up with changes in the industry…
  • Self-managing and organized: knows how to parcel work into digestible tasks, organizes them into Sprints, updates and closes them, keeps team members in the loop…
  • Realistic: good estimator of time and effort (i.e. knows how to multiply by 2)
  • Active in OSS projects: participates in open source community (e.g. mailing list participation, patch contribution…) or at least keeps up with relevant projects via mailing list or some other means
  • Follows good development practices: from code style to code design to architecture
  • Productive, gets stuff done: minimal philosophizing and over-designing

Here are some of the Search things we do (i.e. that you will do if you join us):

  • Work with external clients on their Lucene/Solr projects.  This may involve anything from performance troubleshooting to development of custom components, to designing highly scalable, high performance, fault-tolerant architectures.  See our services page for common requests.
  • Provide Lucene/Solr technical support to our tech support customers
  • Work on search-related products and services

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team that communicates via email, Skype voice/IM, and BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to the Far East.

Interested? Please send your resume to jobs @ sematext.com.

Mahout Digest, May 2010

As we reported in our previous Digest, Mahout was approved by the board to become Top Level Project (TLP) and TLP-related preparations kept Mahout developers busy in May. Mahout mailing lists (user- and dev-) were moved to their new addresses and all subscribers were automatically moved to the new lists: user@mahout.apache.org and dev@mahout.apache.org. Regarding other TLP-related changes and their progress, check the list of items to complete the move.

We’ll start a quick walk-through of May’s new features and improvements with an effort to reduce the on-disk size of Mahout’s Vector representation, which resulted in an 11% size reduction on a test data set.

A discussion of how UncenteredCosineSimilarity, an implementation of cosine similarity that does not “center” its data, should behave in its distributed vs. non-distributed versions, along with a patch for the distributed version, can be found in MAHOUT-389. Furthermore, an implementation of a distributed version of any ItemSimilarity currently available in non-distributed form was committed in MAHOUT-393.
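For readers unfamiliar with the centered/uncentered distinction, here is a minimal sketch in plain Python (illustrative only, not Mahout’s implementation) of uncentered cosine similarity next to its centered variant; on two users’ ratings over the same items, the centered variant amounts to the Pearson correlation:

```python
import math

def cosine(a, b):
    # Uncentered cosine: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centered_cosine(a, b):
    # Subtract each vector's mean first; on co-rated items this
    # is the Pearson correlation.
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

# Hypothetical ratings from two users over the same four items
u1 = [5.0, 4.0, 1.0, 2.0]
u2 = [4.0, 5.0, 2.0, 1.0]
```

Centering removes each user’s overall rating bias (e.g. one user rating everything high), which is why the uncentered value for `u1` vs. `u2` comes out higher than the centered one.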

Mahout has utilities to generate Vectors from a directory of text documents, and one improvement in terms of consistency and ease of understanding was made on the tf/tfidf-vectors outputs. When using the bin/mahout seq2sparse command to generate vectors from text (actually, from the sequence file generated from the text), the output was created in different directories depending on the weighting method (term frequency or term frequency–inverse document frequency). Now, with the fix from MAHOUT-398, the tf-vectors and tfidf-vectors output directories are at the same level.
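To recap what those two outputs contain, here is a toy sketch of the two weighting schemes (illustrative only; Mahout’s seq2sparse operates on sequence files, not Python lists, and its exact weighting options differ):

```python
import math
from collections import Counter

# A tiny hypothetical corpus of tokenized documents
docs = [
    ["sunny", "warm", "sunny"],
    ["rain", "cold"],
    ["sunny", "rain"],
]

def tf_vector(doc):
    # Raw term frequencies for one document (the "tf-vectors" idea).
    return dict(Counter(doc))

def idf(term, docs):
    # Standard idf: log(N / df), df = number of docs containing the term.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf_vector(doc, docs):
    # tf * idf weighting (the "tfidf-vectors" idea): frequent-everywhere
    # terms are downweighted relative to rare, discriminative ones.
    return {t: f * idf(t, docs) for t, f in tf_vector(doc).items()}
```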

We’ll end with the Hidden Markov Model and its integration into Mahout. In MAHOUT-396 you’ll find a patch and more information regarding where and how it is used.
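As a refresher on what an HMM computes, here is a small sketch of the forward algorithm for a discrete HMM, which sums the probability of an observation sequence over all hidden state paths (the states and probabilities below are made-up toy values, not anything from the MAHOUT-396 patch):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    # Forward algorithm: P(observation sequence) under a discrete HMM.
    # alpha[t][s] = P(obs[0..t], hidden state at t == s)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            s: sum(prev[r] * trans_p[r][s] for r in states) * emit_p[s][o]
            for s in states
        })
    return sum(alpha[-1].values())

# Toy model: hidden weather states, observed activities
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
```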

That’s all from Mahout’s world for May, please feel free to leave any questions or comments and don’t forget you can follow @sematext on Twitter.

Mahout Digest, April 2010

April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!

The Board has approved Mahout to become a Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on the change of Mahout mailing lists, svn layout, etc. Several discussions on the mailing list, all describing what needs to be done for Mahout to become a Top Level Project, resulted in the Mahout TLP to-do list.

As we reported in the previous digest, there was a lot of talk on the mailing list about Google Summer of Code (GSoC), and here is a follow-up on this subject. The GSoC announcements are up and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and the Mahout community’s reactions on the mailing list.

In the past we have reported on the idea of Mahout/Solr integration, and it seems this is now being realized. Read more on the features and implementation progress of this integration here.

At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After a transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections was extracted as an independent component.  The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with Mahout 0.3 only by the removed dependency on slf4j.

Mahout parts that use Lucene were updated to use the latest Lucene release, 3.0.1; the code changes for this migration can be found in this patch.

There was a question that generated a good discussion about cosine similarity between two items and how it is implemented in Mahout. More on cosine similarity between two items, which in Mahout’s code is implemented as PearsonCorrelationSimilarity (source), can be found in MAHOUT-387.

The goal of clustering is to determine the grouping in a set of data (e.g. a corpus of items, a (search) result set, etc.).  Often, the problem in clustering implementations is that the data to be clustered has a high number of dimensions, which tend to need reducing (to a smaller number of dimensions) in order to make clustering computationally more feasible (read: faster).  The simplest way of reducing those dimensions is to use some sort of a ‘mapping function’ that takes data presented in n-dimensional space and transforms it to data presented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve the variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is the use of several independent hash functions for which the probability of collision is higher for similar items. This approach, called Minhash-based clustering, was proposed back in March as part of GSoC (see the proposal).  You’ll find more on the theory behind it and a Mahout-applicable implementation in MAHOUT-344.
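To make the Minhash idea concrete, here is a toy sketch (not the MAHOUT-344 implementation): each item set gets a signature of per-hash-function minima, and the fraction of matching signature positions estimates the Jaccard similarity of the underlying sets, so similar items "collide" in many positions:

```python
import random

def make_hash_funcs(n, prime=2_147_483_647, seed=42):
    # A simple universal-style hash family h(x) = (a*x + b) mod prime.
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]

def minhash_signature(item_set, hash_funcs):
    # One signature entry per hash function: the minimum hash
    # value over the set's elements.
    return tuple(min(h(x) for x in item_set) for h in hash_funcs)

def est_similarity(s1, s2):
    # Fraction of matching positions estimates Jaccard similarity.
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

hashes = make_hash_funcs(50)
a = {1, 2, 3, 4, 5, 6, 7, 8}
b = {1, 2, 3, 4, 5, 6, 7, 9}   # true Jaccard similarity with a: 7/9
c = {100, 200, 300}            # disjoint from a

sig_a = minhash_signature(a, hashes)
sig_b = minhash_signature(b, hashes)
sig_c = minhash_signature(c, hashes)
```

Note the dimensionality reduction: however large the sets, each is reduced to a fixed-length signature (50 numbers here).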

Those interested in Neural Networks should keep an eye on MAHOUT-383, where Neuroph (a lightweight neural net library) and Mahout will get integrated.

This concludes our short April Mahout Digest.  Once Mahout completes its transition to TLP, we expect the project to flourish even more.  We’ll be sure to report on the progress and new developments later this month.  If you are a Twitter user, you can follow @sematext on Twitter.

Mahout Digest, March 2010

In this Mahout Digest we’ll summarize what went on in the Mahout world since our last post back in February.

There has been some talk on the mailing list about Mahout becoming a top level project (TLP). Indeed, the decision to go TLP has been made (see Lucene March Digest to find out about other Lucene subprojects going for TLP) and this will probably happen soon, now that Mahout 0.3 is released. Check the discussion thread on Mahout as TLP and follow the discussion on what the PMC will look like. Also, Sean Owen is nominated as Mahout PMC Chair.  There’s been some logo tweaking.

There has been a lot of talk on the Mahout mailing list about Google Summer of Code and project ideas related to Mahout. Check the full list of Mahout’s GSoC project ideas or take up the invitation to write up your own GSoC idea for Mahout!

Since Sematext is such a big Solr shop, we find the proposal to integrate Mahout clustering or classification with Solr quite interesting.  Check more details in MAHOUT-343 JIRA issue.  One example of classification integrated with Solr or actually, classifier as a Solr component, is Sematext’s Multilingual Indexer.  Among other things, Multilingual Indexer uses our Language Identifier to classify documents and individual fields based on language.

When talking about classification we should point out a few more interesting developments. There is an interesting thread on the implementation of new classification algorithms and the overall classifier architecture. In the same context of classifier architecture, there is the MAHOUT-286 JIRA issue on how (and when) Mahout’s Bayes classifier will be able to support classification of non-text (non-binary) data like numeric features. If you are interested in using decision forests to classify new data, check this wiki page and this JIRA issue and patch.

In the previous post we discussed the application of n-grams to collocation identification, and now there is a wiki page where you can read more on how Mahout handles collocations. Also, check the memory management improvements in collocation identification here. Of course, if you think you need more features in key phrase identification and extraction, check Sematext’s Key Phrase Extractor demo – it does more than collocations, can be extended, etc.

Finally, two new committers, Drew Farris and Benson Margulies, have been added to the list of Mahout committers.

That’s all for now from the Mahout world.  Please feel free to leave comments or if you have any questions – just ask, we are listening!

Mahout Digest, February 2010

Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!).  We are starting February with a fresh Mahout Digest.

When covering Mahout, it seems logical to group topics following Mahout’s own groups of core algorithms.  Thus, we’ll follow that grouping in this post, too:

  • Recommendation Engine (Taste)
  • Clustering
  • Classification

There are, of course, some common concepts, some overlap, like n-grams.  Let’s talk n-grams for a bit.


There has been a lot of talk about n-gram usage through all of the major subject areas on the Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing.  An n-gram is a subsequence of n items from a given sequence of “items”. The “items” in question can be anything, though most commonly n-grams are made up of character or word/token sequences.  Lucene’s n-gram support, provided through NGramTokenizer, tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and a character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream.  Word n-grams are sometimes referred to as “shingles”.  Lucene helps there, too.  When word n-gram statistics or a model are needed, ShingleFilter or ShingleMatrixFilter can be used.
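The extraction itself is easy to show in a few lines; this toy sketch (plain Python, not Lucene’s NGramTokenizer or ShingleFilter) produces character n-grams and word shingles:

```python
def char_ngrams(text, n):
    # All contiguous character subsequences of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def shingles(tokens, n):
    # Word n-grams ("shingles"): contiguous token windows of length n.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```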


Usage of character n-grams in the context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams is discussed here. Since the Naive Bayes classifier, being a probabilistic classifier, can treat features of any type, there is no reason it could not be applied to character n-grams, too. Use of a character n-gram model instead of a word model in text classification could result in more accurate classification of shorter texts.  Our language identifier is a good example of such a classifier (though it doesn’t use Mahout) and it provides good results even on short texts – try it.
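To illustrate the idea, here is a toy multinomial Naive Bayes over character trigrams with add-one (Laplace) smoothing. This is a from-scratch sketch, unrelated to Mahout’s Bayes classifier or our language identifier, and the two one-sentence “training sets” are obviously far too small for real use:

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramNB:
    # Multinomial Naive Bayes over character n-gram features.
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.totals = Counter()              # label -> total n-grams seen
        self.docs = Counter()                # label -> training doc count

    def train(self, text, label):
        grams = char_ngrams(text, self.n)
        self.counts[label].update(grams)
        self.totals[label] += len(grams)
        self.docs[label] += 1

    def classify(self, text):
        vocab = {g for c in self.counts.values() for g in c}
        n_docs = sum(self.docs.values())
        def log_prob(label):
            # log prior + sum of smoothed log likelihoods
            prior = math.log(self.docs[label] / n_docs)
            like = sum(
                math.log((self.counts[label][g] + 1) /
                         (self.totals[label] + len(vocab)))
                for g in char_ngrams(text, self.n))
            return prior + like
        return max(self.counts, key=log_prob)

nb = CharNgramNB(n=3)
nb.train("the quick brown fox jumps over the lazy dog", "en")
nb.train("der schnelle braune fuchs springt ueber den faulen hund", "de")
```

Even on a very short input, enough character trigrams overlap with one class’s training text to pick the right label, which is the intuition behind using character n-grams for short texts.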


Non-trivial word n-grams (aka shingles) extracted from a document can be useful for document clustering. Similar to the usage of a document’s term vectors, this thread proposes the usage of non-trivial word n-grams as a foundation for clustering. For the extraction of word n-grams or shingles from a document, Lucene’s ShingleAnalyzerWrapper is suggested. ShingleAnalyzerWrapper wraps the previously mentioned ShingleFilter around another Analyzer.  Since clustering (grouping similar items) is an example of an unsupervised type of machine learning, it is always interesting to validate clustering results. In clustering there are no reference training or, more importantly, reference test data, so evaluating how well some clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easy to evaluate visually, it is hard to implement an automated test. Here is an older thread about validating Mahout’s clustering output, which resulted in an open JIRA issue.

Recommendation Engine

There is an interesting thread about content-based recommendation – what content-based recommendation really is, and how it should be defined. So far, Mahout has only a Collaborative Filtering based recommendation engine, called Taste.  Two different approaches are presented in that thread. One approach treats content-based recommendation as a Collaborative Filtering problem or a generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied to items’ attributes or ‘related’ users’ attributes (usual Collaborative Filtering treats an item as a black box).  The other approach is to treat content-based recommendation as a “generalized search engine” problem. Here, matching the same or similar queries makes two items similar. Just think of queries composed of, say, keywords extracted from a user’s reading or search history and this will start making sense.  If items have enough textual content then content-based analysis (similar items are those that have similar term vectors) seems like a good approach for implementing content-based recommendation.  This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as “recommendation engines” for a while), but content-based recommendation is a recently discussed topic of possible Mahout expansion.  All algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in the current release Taste does not have a MapReduce version. You can read more about the MapReduce Collaborative Filtering implementation in Mahout’s trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at Sematext’s recommendation engine.
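As a toy illustration of the “similar term vectors” idea (not code from Taste or any Sematext product), here items are plain hypothetical strings, and the recommendation for an item is simply the other item with the most similar term vector:

```python
import math
from collections import Counter

def term_vector(text):
    # Bag-of-words term frequencies for one item's text.
    return Counter(text.lower().split())

def cosine(v1, v2):
    # Cosine similarity of two sparse term-frequency vectors.
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical item catalog: id -> textual content
items = {
    "a": "machine learning with apache mahout",
    "b": "apache mahout machine learning library",
    "c": "cooking pasta with tomato sauce",
}

def most_similar(item_id):
    # Content-based recommendation: the item whose term vector is
    # closest (by cosine) to the given item's term vector.
    target = term_vector(items[item_id])
    others = [(cosine(target, term_vector(t)), i)
              for i, t in items.items() if i != item_id]
    return max(others)[1]
```

Real systems would use tf-idf weights and an inverted index rather than comparing every pair, which is exactly why search engines get (ab)used for this.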

In addition to Mahout’s basic machine learning algorithms, there are discussions and development in directions which don’t fall under any of the above categories, such as collocation extraction. Phrase extractors often use a word n-gram model for co-occurrence frequency counts. Check the thread about collocations, which resulted in a JIRA issue and the first implementation. Also, you can find more details on how the log-likelihood ratio can be used in the context of collocation extraction in this thread.
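For the curious, here is a compact sketch of the standard entropy-based formulation of the log-likelihood ratio (Dunning’s LLR) computed from a 2x2 contingency table of bigram counts; the counts in the tests are hypothetical. An LLR near zero means the two words co-occur about as often as independence predicts, while a large LLR flags a likely collocation:

```python
import math

def xlogx(x):
    # x * log(x), with the convention 0 * log(0) = 0.
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy of a count vector.
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio from a 2x2 contingency table:
    #   k11 = count of bigram "A B",   k12 = count of "A not-B",
    #   k21 = count of "not-A B",      k22 = count of "not-A not-B".
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```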

Of course, anyone interested in Mahout should definitely read Mahout in Action (we’ve got ourselves a MEAP copy recently) and keep an eye on the features for the upcoming 0.3 release.