Skip to main content

The State of Solandra – Summer 2011

sematext sematext on

A little over 18 months ago we talked to Jake Luciani about Lucandra – a Cassandra-based Lucene backend.  Since then Jake has moved away from raw Lucene and married Cassandra with Solr, which is why Lucandra now goes by Solandra.  Let’s see what Jake and Solandra are up to these days.

What is the current status of Solandra in terms of features and stability?

Solandra has gone through a few iterations. First as Lucandra which partitioned data by terms and used thrift to communicate with Cassandra.  This worked for a few big use cases, mainly how to manage a index per user, and garnered a number of adopters.  But it performed poorly when you had very large indexes with many dense terms, due to the number and size of remote calls needed to fulfill a query.Last summer I started off on a new approach based on Solr that would address Lucandra’s shortcomings: Solandra.  The core idea of Solandra is to use Cassandra as a foundation for scaling Solr.  It achieves this by embedding Solr in the Cassandra runtime and uses the Cassandra routing layer to auto shard a index across the ring (by document).  This means good random distribution of data for writes (using Cassandra’s RandomParitioner) and good search performance since individual shards can be searched in parallel across nodes (using SolrDistributedSearch).  Cassandra is responsible for sharding, replication, failover and compaction.  The end user now gets a single scalable component for search without changing API’s which will scale in the background for them.  Since search functionality is performed by Solr so it will support anything Solr does.

I gave a talk recently on Solandra and how it works:

Are you still the sole developer of Solandra?  How much time do you spend on Solandra?
Have there been any external contributions to Solandra?

I still am responsible for the majority of the code, however the Solandra community is quite large with over 500 github followers and 60 forks.  I receive many useful bug reports and patches through the community.  Late last year I started working at DataStax (formerly Riptano) to focus on Apache Cassandra.   DataStax is building a suite of products and services to help customers use Cassandra in production and incorporate Cassandra into existing enterprise infrastructure.  Solandra is a great example of this. We currently have a number of customers using Solandra and we encourage people interested in using Solandra to reach out to us for support.

What are the most notable differences with Solandra and Solr?

The primary difference is the ability to grow your Solr infrastructure seamlessly using Cassandra. I purposely want to avoid altering the Solr functionality since the primary goal here is to make it easy for users to migrate to and from Solandra and Solr.   That being said Solandra does offer some unique features regarding managing millions of indexes.  One is different Solr schemas can be injected at runtime using a RESTful interface and Solandra supports the concept of virtual Solr Cores which share the same core but are treated as different indexes. For example, if you have a core called “inbox” you can create an index per user like “inbox.otis” or “inbox.jake” just by changing the endpoint URL.

Finally, Solandra has a bulk loading interface that makes it easy to index large amounts of data at a time (one known cluster indexes at ~4-5MB of text per second).

What are the most notable differences with Solandra and Elastic Search?

ElasticSearch is more mature and offers a similar architecture for scaling search though not based on Cassandra or Solr.  I think ElasticSearch’s main weakness is it requires users to scrap their existing code and tools to use it.  On the other hand, Solandra provides a scalable platform built on Solr and lets you grow with it.

Solandra doesn’t use the Lucene index file format so it will grow to support millions of indexes. Systems like Solr and ElasticSearch create a directory per index which makes managing millions of indexes very hard. The flipside is there are a lot of performance tweaks lost by not using the native file format most of the current work on Solandra relates to improving single node performance.

Solandra is a single component that gives you search AND NoSQL database, and is therefore much easier to manage from the operations perspective IMO.

What do you plan on adding to Solandra that will make it clearly stand out from Solr or Elastic Search?

Solandra will continue to grow with Solr (4.0 will be out in the future), as well as with Cassandra. Right now Solandra’s real-time search is limited by the performance of Solr’s field cache implementation. By incorporating Cassandra triggers I think we can remove this bottleneck and get really impressive real-time performance at scale, due to how Solandra pre-allocates shards.

Also, since the Solr index is stored in the Cassandra datamodel, you can now apply some really interesting features of Cassandra to Solr, such as expiring indexes and triggered searches.

When should one use Solandra?

If you say yes to any of the following you should use Solandra:

  • I have more data than can fit on a single box
  • I have potentially millions of indexes
  • I need improved indexing with multi-master writes
  • I need multi-datacenter search
  • I am already using Cassandra and Solr
  • I am having trouble managing my Solr cluster

When should one not use Solandra?

If you are happy with your Solr system today and you have enough capacity to scale the size and number of indexes comfortably then there is no need to use Solandra.  Also, Solandra is under active development so you should be prepared to help diagnose unknown issues.  Also note that if you require search features that are currently not supported by Solr distributed search, you should not use Solandra.

Are there known problems with Solandra that users should be aware of?

Yes, currently the index sizes can be much larger in Solandra than Solr (in some cases 10x) this is due to how Solandra indexes data as well as Cassandra’s file format. Cassandra 1.0 includes compression so that will help quite a bit.Also, since consistency in Solandra is tunable it requires your application to consider the implications of writing data at lower consistencies.Finally, one thing that keeps coming up quite often is users assuming Solandra auto indexes the data you already have in Cassandra, since Solandra builds on Cassandra.  This is not the case.  Data must be indexed and searched through the traditional Solr APIs.

Is anyone using Solandra in production? What is the biggest production deployment in terms of # docs, data size on filesystem, query rate?

Solandra is now in production with a handful of users I know of.  Some others are in the testing/pre-production stage. But it’s still a small number AFAIK.The largest Solandra cluster I know of is in the order of ~5 nodes, ~10TB of data with ~100k indexes and ~2B documents.

If you had to do it all over, what would you do differently?

I’m really excited with the way Lucandra/Solandra has evolved over the past year. It’s been a great learning experience and has allowed me to work with technologies and people I’m really, really excited about. I don’t think I’d change a thing, great software takes time.

When is Solandra 1.0 coming out and what is the functionality/issues that remain to be implemented before 1.0?

I don’t really use the 1.0 moniker as people tend to assume too much when they read that. I think once Solandra is fully documented, supports things like Cassandra based triggers for indexing and search, and has an improved on disk format, I’d be comfortable calling Solandra 0.9 🙂

Thank you Jake.  We are looking forward to Solandra 0.9 then.