
Elastic Search: Distributed, Lucene-based Search Engine

sematext sematext on

Here at Sematext we are constantly looking to expand our horizons and acquire new knowledge (as well as our search team – see Sematext jobs page – we are always on the lookout for people with passion for search and, yes, we are heavy users of ElasticSearch!), especially in and around our main domains of expertise – search and analytics.  That is why today we are talking with Shay Banon about his latest project: ElasticSearch.  If his name sounds familiar to you, that’s likely because Shay is also known for his work on GigaSpaces and Compass.
  • What is the history of Elastic Search in terms of:
  • When did you get the idea?
  • How long did you have it brewing in your head?
  • When did you first start cutting code for it?
I have been thinking about something along the lines of what elasticsearch has turned out to be for a few years now. As you may know, I am the creator of Compass (http://www.compass-project.org), which I started more than 7 years ago, and the aim of Compass was to ease the integration of search into any Java application.
When I developed Compass, I slowly started to add more and more features to it. For example, Compass, from the get go, supported mapping of Objects to a Search Engine (OSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-osem.html). But, it also added a JSON to Search Engine mapping layer (JSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-jsem.html) as JSON was slowly becoming a de-facto standard wire protocol.
Another aspect that I started to tackle was Compass support for a distributed mode. GigaSpaces (http://www.compass-project.org/docs/2.2.0/reference/html/needle-gigaspaces.html), Coherence (http://www.compass-project.org/docs/2.2.0/reference/html/needle-coherence.html), and Terracotta (http://www.compass-project.org/docs/2.2.0/reference/html/needle-terracotta.html) are attempts at solving that. All support using a distributed Lucene Directory implementation (scaling the storage), but, as you know with Lucene, sometimes this is not enough. With GigaSpaces, the integration took another step with sharding the index itself and using “map/reduce” to search on nodes.
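To make the "shard the index, then map/reduce the search" idea concrete, here is a minimal Python sketch. It is not how Compass, GigaSpaces, or elasticsearch implement it; the `Shard` and `ShardedIndex` classes, the CRC32 routing, and the term-frequency scoring are all assumptions chosen for illustration. Each shard answers with its own top hits (the "map" step), and the coordinator merges them into a global top-k (the "reduce" step).

```python
import heapq
import zlib
from collections import defaultdict


class Shard:
    """One slice of the index: a tiny inverted index from term to document ids."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, term, k):
        # "Map" step: this shard returns only its own top-k (score, doc_id) hits.
        hits = self.postings.get(term.lower(), {})
        return heapq.nlargest(k, ((score, doc_id) for doc_id, score in hits.items()))


class ShardedIndex:
    """Hypothetical coordinator that routes documents and merges shard results."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # Stable hash, so a given document id always lands on the same shard.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def index(self, doc_id, text):
        self._shard_for(doc_id).index(doc_id, text)

    def search(self, term, k=10):
        # "Reduce" step: merge the per-shard hits into one global top-k list.
        partial = [hit for shard in self.shards for hit in shard.search(term, k)]
        return [doc_id for _score, doc_id in heapq.nlargest(k, partial)]
```

In a real cluster each `Shard` would live on a different node and the per-shard searches would run in parallel over the network; here they run in one process so the merge logic stays visible.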
The last important aspect of Compass is its integration with different mechanisms to handle content and make it searchable. For example, it has very nice integration with JPA (Hibernate, TopLink, OpenJPA), which means any change you make to the database through JPA is automatically indexed. Another integration point was with data grids such as GigaSpaces and Coherence: any change done to them gets applied to the index.
But still, Compass is a library that gets embedded in your Java app, and its solutions for distributed search are far from ideal. So, I started to play around with the idea of creating a Compass Search server that would use its mapping support (JSON) and expose itself through a RESTful API.
Also, I really wanted to try and tackle the distributed search problem. I wanted to create a distributed search solution whose distributed model is inspired by how current state of the art data grids work.
So, about 7 months ago I wrote the first line of elasticsearch and have been hacking on it ever since.
  • The inevitable: How is Elastic Search different from Solr?
To be honest, I never used Solr. When I was looking around for current distributed search solutions, I took a brief look at Solr’s distributed model, and was shocked that this is what people need to deal with in order to build a scalable search solution (that was 7 months ago, so maybe things have changed). While looking at Solr’s distributed model I also noticed the very problematic “REST” API it exposes. I am a strong believer in having the product talk the domain model, and not the other way around. ElasticSearch is very much a domain driven search engine, and I explain it more here: http://www.elasticsearch.com/blog/2010/02/12/yourdatayoursearch.html. You will find this attitude throughout elasticsearch APIs.
  • Is there a feature-for-feature comparison to Solr that would make it easier for developers of new search applications to understand the differences and choose the right tool for the job?
There isn’t one, and frankly, I am not enough of a Solr expert to create such a list. What I hope is that people who work with both will create one, hopefully with input from both projects.
  • When would one want (or need) to use Elastic Search instead of Solr and vice versa?
As far as I am concerned, elasticsearch is being built to be a complete distributed, RESTful, search solution, with all the features anyone would ever want from a search server. So, I really don’t see a reason why someone would choose Solr over ElasticSearch. To be honest, with today’s data scale, and the slow move to the cloud (or “cloud architectures”), you *need* a search engine that you can scale, and I really don’t understand how one would work with Solr’s distributed model. But that might just be me: I am spoiled by what I expect from distributed solutions because of my data grid background.
  • What was the reason for simply not working with the Solr community and enhancing Solr? (discussion, patches…)  Are some of Elastic Search’s features simply not implementable in Solr?
First, the challenge. Writing a data grid level distributed search engine is quite challenging to say the least (note, I am not talking about data grid features, such as transactions and so on, just the data grid distributed model).
Second, building something on top of an existing codebase will never be as good as building something from scratch. For example, elasticsearch has a highly optimized, asynchronous transport layer to communicate between nodes (which the native Java client uses), and a highly modular core where almost anything is pluggable. These are things that are very hard to introduce or change with an existing codebase and existing developers. It’s much simpler to write it from scratch.
  • We see more and more projects using Github.  What is your reason for choosing Git for SCM and Github for Elastic Search’s home?
Well, there is no doubt that Git is much nicer to work with than SVN thanks to its distributed nature (and I like things that are distributed 🙂 ). As for GitHub, I think that it’s currently the best project hosting service out there. You really feel like the people behind it know developers and what developers need. As a side note, I am a sucker for eye candy, and it simply looks good.
  • We see the community already created a Perl client.  Are there other client libraries in the works?
Yeah, so there is an excellent Perl client (http://github.com/clintongormley/ElasticSearch.pm) which Clinton Gormley has been developing (he has also been an invaluable source of suggestions and bug reports for the development of elasticsearch). There are several more, including Erlang, Ruby, Python and PHP (all listed here http://www.elasticsearch.com/products/). Note, thanks to the fact that elasticsearch has a rich, domain driven, JSON API, writing clients for it is very simple since most times there is no need to perform any mappings, especially with dynamic languages.
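As a rough illustration of why a JSON API needs little client-side mapping in a dynamic language, the sketch below serializes a plain Python dict straight into a request body. The helper name `build_index_request`, the host and port, the sample document, and the `/index/type/id` URL layout are assumptions made for this example and are not taken from the interview. The request is only built, never sent.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) an HTTP PUT that carries one JSON document."""
    # Assumed URL layout for illustration: <base>/<index>/<type>/<id>
    url = "{}/{}/{}/{}".format(base_url.rstrip("/"), index, doc_type, doc_id)
    # The domain object is already a dict, so no mapping layer is needed:
    # it is serialized to JSON exactly as it is.
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


# Hypothetical document and endpoint, purely for demonstration.
tweet = {"user": "alice", "message": "trying out a search server", "tags": ["search", "json"]}
request = build_index_request("http://localhost:9200", "twitter", "tweet", "1", tweet)
```

Passing `request` to `urllib.request.urlopen` would actually send it, which of course requires a server listening at that address.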
  • We realize it is super early for Elastic Search, but what is the biggest known deployment to date?
Yes, it is quite early. But, I can tell you that some not that small sites (10s of millions of documents) are already playing with elasticsearch successfully. Not sure if I can disclose any names, but once they go live, I will try and get them to post something about it.
  • What are Elastic Search’s future plans? Is there a roadmap?
The first things to get into ElasticSearch are more search engine features. These are derived from the features that are already exposed in Compass, plus some new aspects such as geo/local search.
Another interesting aspect is making elasticsearch more cloud provider friendly. For example, elasticsearch’s persistent store is designed in a write behind fashion, and I would love to get one that persists the index to Amazon S3 or Rackspace CloudFiles (for more information on how persistency works with elasticsearch, see here: http://www.elasticsearch.com/blog/2010/02/16/searchengine_time_machine.html).
NoSQL is also an avenue that I would love to explore. Similar in concept to how Compass works with JPA / Data Grids, I would like to make the integration of search with NoSQL solutions simpler. It should be as simple as this: you do something against the NoSQL solution, and it automatically gets applied to elasticsearch as well. Thanks to the fact that the elasticsearch model is very much domain driven, and very similar to what NoSQL uses, the integration should be simple. As an example, TerraStore already comes with an integration module that applies to elasticsearch any changes done to TerraStore. I blogged my thoughts about search and NoSQL here http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html.
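For readers unfamiliar with the term, "write behind" generally means that writes are acknowledged from a fast local store and persisted to slower storage later. The Python sketch below shows that general pattern only; it is not elasticsearch's persistence code. The `WriteBehindStore` class is hypothetical, and the `backend` dict merely stands in for a slow remote store such as a blob storage service.

```python
from collections import deque


class WriteBehindStore:
    """Serve reads and writes from memory; persist changes later in batches."""

    def __init__(self, backend):
        self.backend = backend   # hypothetical slow store (a plain dict here)
        self.memory = {}         # fast local copy that answers all reads
        self.pending = deque()   # changes not yet persisted, in write order

    def put(self, key, value):
        self.memory[key] = value            # visible to readers immediately
        self.pending.append((key, value))   # persisted later by flush()

    def get(self, key):
        return self.memory.get(key)

    def flush(self):
        """Persist all pending changes and return how many were written."""
        count = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.backend[key] = value
            count += 1
        return count
```

In practice `flush()` would typically be called from a background thread or timer, so the cost of the slow store stays off the write path. The trade-off is that changes made since the last flush can be lost if the process dies before they are persisted.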
If you have additional questions for Shay about Elastic Search, please feel free to leave them in comments, and we will ask Shay to use comments to answer them.