One way to create a better search experience is to understand the user intent. One of the phases in that process is query understanding, and one simple step in that direction is query segmentation. In this post, we’ll cover what query segmentation is and when it is useful. We will also introduce to you Solr Query Segmenter, a open-sourced Solr component that we developed to make search experience better.
Nowadays there are more and more organizations searching for fault-tolerant and highly available solutions for various parts of their infrastructure, including search, which evolved from merely a “nice to have” feature to the first class citizen and a “must have” element.
Apache Solr is a mature search solution that has been available for over a decade now. Its traditional master-slave deployment has been available since 2006, while the fully distributed deployment known as SolrCloud has been available for only a few years now. Thus, naturally, many organizations are in the process of migrating from Solr master-slave to SolrCloud, or are at least thinking about the move. In this article, we will give you an overview of what’s needed to be done for the migration to SolrCloud to be as smooth as it can be.
Running on Elasticsearch on Docker sounds like a natural fit – both technologies promise elasticity. However, running a truly elastic Elasticsearch cluster on Docker Swarm became somewhat difficult with Docker 1.12 in Swarm mode. Why? Since Elasticsearch gave up on multicast discovery (by moving multicast node discovery into a plugin and not including it by default) one has to specify IP addresses of all master nodes to join the cluster. Unfortunately, this creates the chicken or the egg problem in the sense that these IP addresses are not actually known in advance when you start Elasticsearch as a Swarm service! It would be easy if we could use the shared Docker bridge or host network and simply specify the Docker host IP addresses, as we are used to it with the “docker run” command. However, “docker service create” rejects the usage of bridge or host network. Thus, the question remains: How can we deploy Elasticsearch in a Docker Swarm cluster?
Docker is all the rage these days, but one doesn’t hear about running Solr on Docker very much.
Last month, we gave a talk on the topic of running containerized Solr at the Lucene Revolution conference in Boston, the biggest open source conference dedicated to Apache Lucene/Solr. The 40-minute talk included a live demo that shows how to actually do it, while addressing a number of important bits if you want to run Solr on Docker in production.
Curious to check the presentation? You may find it below.
Or, interested in listening to the 40-minute talk? Check it below.
Indeed, a rapidly growing number of organizations are using Solr and Docker in production. If you also run Solr in Docker be sure to check out Docker + Solr How-to: Monitoring the Official Solr Docker Image.
Needless to say, monitoring Solr is essential in production and Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning. For instance, one can create, deploy, and run applications by using containers and this gives a significant performance boost and reduces the size of the applications.
Many of our clients use AWS EC2. In the context of Elasticsearch consulting or support, one question we often get is: should we use AWS Elasticsearch Service instead of deploying Elasticsearch ourselves? The question is valid whether “self hosted” means in EC2, some other cloud or your own datacenter. As always, the answer is “it depends”, but in this post we’ll show how the advantages of AWS Elasticsearch compared to those of deploying your own Elasticsearch cluster. This way, you’ll be able to decide what fits your use-case and knowledge.
Why AWS Elasticsearch?
- It automatically replaces failed nodes: you don’t need to get paged in the middle of the night, spin a new node and add it to the cluster
- You can add/remove nodes through an API – otherwise you’ll have to make sure you have all the automation in place so that when you spin a node you don’t spend extra time manually installing and configuring Elasticsearch
- You can manage access rights via IAM: this is easier than setting up a reverse proxy or a security addon (cheaper, too, if the addon is paid)
- Daily snapshots to S3 are included. This saves you the time and money to set it up (and the storage cost) for what is a mandatory step in most use-cases
- CloudWatch monitoring included. You will want to monitor your Elasticsearch cluster anyway (whether you build or buy)
Why to install your own Elasticsearch?
- On demand equivalent instances are cheaper by ~29%. The delta differs from instance to instance (we checked m3.2xl and i2.2xl ones). You get even more discount for your own cluster if you use reserved instances
- More instance types and sizes are available. You can use bigger i2 instances than AWS Elasticsearch, and you have access to the latest generation of c4 and m4 instances. This way, you are likely to scale further and get more bang per buck, especially with logs and metrics (more specific hardware recommendations and Elasticsearch tuning here)
- You can change more index settings, beyond analysis and number of shards/replicas. For example, delayed allocation, which is useful when you have a lot of data per node. You can also change the settings of all indices at once by hitting the /_settings endpoint. By properly utilizing various settings Elasticsearch makes available you can better optimize your setup for your particular use case, make better use of underlying resources, and thus drive the cost down further.
- You can change more cluster-wide settings, such as number of shards to rebalance at once
- You get access to all other APIs, such as Hot Threads, which is useful for debugging
- You can use a more comprehensive Elasticsearch monitoring solution. Currently, CloudWatch only collects a few metrics, such as cluster status, number of nodes and documents, heap pressure and disk space. For most use-cases, you’ll need more info, such as the query latency and indexing throughput. And when something goes wrong, you’ll need more insight on JVM pool sizes, cache sizes, Garbage Collection or you may need to profile Elasticsearch
- You can have clusters of more than 20 nodes
— Sematext Group, Inc. (@sematext) November 28, 2016
You may see a pattern emerging from the bullets above: AWS Elasticsearch is easy to set up and comes with a few features on top of Elasticsearch that you’ll likely need. However, it’s limited when it comes to scaling – both in terms of number&size of nodes and Elasticsearch features.
If you already know your way around Elasticsearch, AWS Elasticsearch service will likely only make sense for small clusters. If you’re just getting started, you can go a longer way until it will start to pay off for you to boost your knowledge (e.g. via an Elasticsearch training) and install your own Elasticsearch cluster (maybe with the help of our consulting or support). Or you can delegate the whole scaling part to us by using Logsene, especially if your use-case is about logs or metrics.
Finally, if you think there are too many “if”s in the above paragraph, here’s a flowchart to cover all the options:
One of the things you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas – things like when to go for more leaders, when to go for more replicas and when to go for both. We discuss what you can do with them, how to control their creation and work with them. In this blog entry I would like to focus on one of the mentioned aspects – handling replica creation in SolrCloud. Note that while this is not limited to Solr on Docker deployment, if you are considering running Solr in Docker containers you will want to pay attention to this.
As the world of software is growing, so is the ecosystem of DevOps tools and resources – for monitoring, for logging, for alerting, for continuous integration and deployment, configuration management, etc. Nothing wrong with having lots of resources and tools, but here at Sematext we like to find answers as quickly and efficiently as possible; after all, we are known (among other things) for Search and Big Data expertise.
To that end, we’re happy to announce search-devops.com (aka “SD”), which aggregates, indexes and makes searchable all content repositories (mailing lists, source code, wikis, issue trackers, etc.) for a number of open source DevOps projects:
Activity has been steadily growing the past few months (see chart below for Docker example) and there is a ton of usable and helpful content available right now.
The Search-DevOps Suggestion Box is Open!
Got suggestions for new DevOps tools to add? Or new functionality on the site? Drop us an email or hit us on Twitter. We are eager to support the DevOps community and improve everyone’s work lives and tool sets.
If you are running Elasticsearch in Docker, you may have flipped through our Running High Performance Fault-tolerant Elasticsearch Clusters on Docker slide deck. Here is the video of the Running High Performance Fault-tolerant Elasticsearch Clusters on Docker talk given at Berlin Buzzwords in June 2016.
Last time, when talking about Solr 6 we learned how to use streaming expressions to automatically update data in a collection. You can imagine this is not the only cool thing you can do with streaming expressions. Today, we will see how to re-index data in your collection for fields that are using doc values. For that we will use Solr 6.1, because of a simple bug that was fixed for that version (details SOLR-9015)
Let’s assume we have two collections – one called video, which will be the source of the data. The second collection will be video_new and will be the target collection. We assume that collections will have slightly different structure – slightly different field names. The video collection will have the following fields:
- id – document identifier
- url – URL of the video
- likes – number of likes
- views – number of views
The second collection, video_new, will have the following fields:
- id – document identifier
- url – URL of the video
- num_likes – number of likes
- num_views – number of views
Exporting the data
First thing we need to figure out is a way to export data from the source collection in an efficient fashion. We can’t just set the rows parameter to gazillion, because it is not efficient and can lead to Solr going out of memory. So we will use the /export request handler. The only limitation of that request handler is that data needs to be sorted and needs to use doc values. That is not a problem for our data, however you should be aware of this requirement.
We will start by exporting the data using the standard Solr way – using the request params with the /export handler. The request looks like this:
curl -XGET 'localhost:8983/solr/video/export?q=*:*&sort=id+desc&fl=id,url,likes,views'
The above will result in Solr using the /export handler and returning all data, not only the first page of the results.
However, we want to use streaming expressions to re-index the data. Because of that we can change the above request to use the search streaming expression, which looks as follows:
search( video, zkHost="localhost:9983", qt="/export", q="*:*", fl="id,url,likes,views", sort="id desc")
The working command with the request looks like this:
curl --data-urlencode 'expr=search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc")' http://localhost:8983/solr/video/stream
We use the search streaming expression and provide the name of the collection, which is video in our case, the ZooKeeper host (yes, we can read from other clusters), the name of the request handler which is /export in our case and is required. Finally, we provide the match-all query, the list of fields that we are interested in, and the sorting expression. Please remember that when using the /export handler all fields listed in the fl parameter must use doc values.
Changing field names
Our collections have different field names and because of that the above search request is not enough. We need to alter the name of the fields by using the select streaming expression. We will change the name of the likes field to num_likes and the name of the views field to num_views. The expression that does that is:
select( search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"), id, url, likes as num_likes, views as num_views )
The select streaming expression lets us choose which fields should be used in the resulting tuples and how they will be named. In our case we take the id and url fields as is and we change the name of the likes and views fields.
To test the result of that expression you can simply use the following command:
curl --data-urlencode 'expr=select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views)' http://localhost:8983/solr/video/stream
Running the re-indexing
Finally, we have the data prepared and read in an efficient way, so we can send data to Solr for indexation. We do that using the update streaming expression simply by specifying the target collection name and the batch size, like this:
update( video_new, batchSize=100, select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))
And the command that we would send to Solr:
curl --data-urlencode 'expr=update(video_new,batchSize=100,select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))' http://localhost:8983/solr/video/stream
Please note that we send the command to the source collection /stream handler – in our case to the video collection. This is important.
Verifying the re-indexation
Once the task has been finished by Solr we can check the number of documents returned by each collection to verify that data has been re-indexed properly. We can do that by running these commands:
curl -XGET 'localhost:8983/solr/video/select?q=*:*&indent=true&rows=0'
curl -XGET 'localhost:8983/solr/video_new/select?q=*:*&indent=true&rows=0'
Both result in the following number of documents:
<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <bool name="zkConnected">true</bool> <int name="status">0</int> <int name="QTime">38</int> <lst name="params"> <str name="q">*:*</str> <str name="indent">true</str> <str name="rows">0</str> </lst> </lst> <result name="response" numFound="18" start="0"> </result> </response>
And that means that everything works as intended 🙂
Interested in Solr Streaming Expressions? Subscribe to this blog or follow @sematext – we have more Streaming Expressions blog posts in the queue. If you need any help with Solr / SolrCloud – don’t forget @sematext does Solr Consulting, Production Support, as well as Solr Training!