One way to create a better search experience is to understand user intent. One of the phases in that process is query understanding, and one simple step in that direction is query segmentation. In this post, we’ll cover what query segmentation is and when it is useful. We will also introduce Solr Query Segmenter, an open-source Solr component that we developed to improve the search experience.
Nowadays, more and more organizations are searching for fault-tolerant and highly available solutions for various parts of their infrastructure, including search, which has evolved from a merely “nice to have” feature into a first-class, “must have” element.
Apache Solr is a mature search solution that has been available for over a decade now. Its traditional master-slave deployment has been available since 2006, while the fully distributed deployment known as SolrCloud has been available for only a few years. Thus, naturally, many organizations are in the process of migrating from Solr master-slave to SolrCloud, or are at least considering the move. In this article, we will give you an overview of what needs to be done for the migration to SolrCloud to be as smooth as possible.
Docker is all the rage these days, but one doesn’t hear about running Solr on Docker very much.
Last month, we gave a talk on the topic of running containerized Solr at the Lucene Revolution conference in Boston, the biggest open source conference dedicated to Apache Lucene/Solr. The 40-minute talk included a live demo that shows how to actually do it, while addressing a number of important bits if you want to run Solr on Docker in production.
Curious to check out the presentation? You’ll find it below.
Or, interested in listening to the 40-minute talk? Check it below.
Indeed, a rapidly growing number of organizations are using Solr and Docker in production. If you also run Solr in Docker be sure to check out Docker + Solr How-to: Monitoring the Official Solr Docker Image.
Needless to say, monitoring Solr is essential in production, and Docker is disruptive in many ways, so there are many things that work slightly differently and are worth mentioning. For instance, creating, deploying, and running applications as containers keeps them small and lightweight compared to full virtual machines.
Not everyone uses Splunk or ELK stack for logs. A few weeks ago, at the Lucene/Solr Revolution conference in Boston, we gave a talk about using Solr for logging, along with lots of good info about how to tune the logging pipeline. The talk also goes over the best AWS instance types, optimal EBS setup, log shipping (see Top 5 Logstash Alternatives), and so on.
One of the things you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas – things like when to go for more leaders, when to go for more replicas and when to go for both. We discuss what you can do with them, how to control their creation and work with them. In this blog entry I would like to focus on one of the mentioned aspects – handling replica creation in SolrCloud. Note that while this is not limited to Solr on Docker deployment, if you are considering running Solr in Docker containers you will want to pay attention to this.
If you are using Solr and are looking for Solr training to quickly improve your Solr or SolrCloud skills, we’re running several Solr classes this October in San Francisco and New York.
Have two people attend the training from the same company? The second one gets 25% off.
To see the full course details, full outline and information about the instructor, click on the class names below.
New York City:
All classes include breakfast and lunch. If you have special dietary needs, please let us know ahead of time. If you have any questions about any of the classes you can use our live chat (see bottom-right of the page), email us at email@example.com or call us at 1-347-480-1610.
If you’ve missed our Core Solr training in October 2015 in New York, here is another chance – we’re running the 2-day Core Solr class again next month – June 13 & 14, 2016.
This course covers Solr 5.x as well as Solr 6.x! You can see the complete course outline under Solr & Elasticsearch Training Overview. The course is very comprehensive — it includes over 20 chapters and lots of hands-on exercises. Be prepared to learn a lot!
$1,200 early bird rate (valid through June 1) and $1,500 afterwards.
There’s also a 50% discount for the purchase of a 2nd seat!
462 7th Avenue, New York, NY 10018 – see map
Last time, when talking about Solr 6, we learned how to use streaming expressions to automatically update data in a collection. As you can imagine, this is not the only cool thing you can do with streaming expressions. Today, we will see how to re-index data in your collection for fields that use doc values. For that we will use Solr 6.1, because of a simple bug that was fixed in that version (see SOLR-9015).
Let’s assume we have two collections: one called video, which will be the source of the data, and a second called video_new, which will be the target collection. The two collections have a slightly different structure, namely slightly different field names. The video collection will have the following fields:
- id – document identifier
- url – URL of the video
- likes – number of likes
- views – number of views
The second collection, video_new, will have the following fields:
- id – document identifier
- url – URL of the video
- num_likes – number of likes
- num_views – number of views
Exporting the data
The first thing we need to figure out is a way to export data from the source collection efficiently. We can’t just set the rows parameter to a gazillion, because that is not efficient and can lead to Solr running out of memory. So we will use the /export request handler instead. The only limitation of that request handler is that the data needs to be sorted and the exported fields need to use doc values. That is not a problem for our data, but you should be aware of this requirement.
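As a reminder, doc values are enabled per field in the schema. A minimal sketch of what the source collection’s field definitions might look like (the field types here are assumptions for illustration, not the actual schema):

```xml
<!-- Hypothetical field definitions; /export requires docValues="true" on exported fields -->
<field name="id"    type="string" indexed="true" stored="true" docValues="true"/>
<field name="url"   type="string" indexed="true" stored="true" docValues="true"/>
<field name="likes" type="long"   indexed="true" stored="true" docValues="true"/>
<field name="views" type="long"   indexed="true" stored="true" docValues="true"/>
```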
We will start by exporting the data using the standard Solr way – using the request params with the /export handler. The request looks like this:
curl -XGET 'localhost:8983/solr/video/export?q=*:*&sort=id+desc&fl=id,url,likes,views'
The above will result in Solr using the /export handler and returning all data, not only the first page of the results.
However, we want to use streaming expressions to re-index the data. Because of that we can change the above request to use the search streaming expression, which looks as follows:
search( video, zkHost="localhost:9983", qt="/export", q="*:*", fl="id,url,likes,views", sort="id desc")
The working command with the request looks like this:
curl --data-urlencode 'expr=search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc")' http://localhost:8983/solr/video/stream
We use the search streaming expression and provide the name of the collection (video in our case), the ZooKeeper host (yes, we can read from other clusters), and the name of the request handler, which is /export in our case and is required. Finally, we provide the match-all query, the list of fields that we are interested in, and the sorting expression. Please remember that when using the /export handler, all fields listed in the fl parameter must use doc values.
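For illustration, the /stream handler returns the tuples as JSON, terminated by an EOF tuple. The response for the expression above would look roughly like this (the document values below are made up):

```json
{
  "result-set": {
    "docs": [
      {"id": "2", "url": "http://example.com/b", "likes": 5, "views": 42},
      {"id": "1", "url": "http://example.com/a", "likes": 10, "views": 100},
      {"EOF": true, "RESPONSE_TIME": 12}
    ]
  }
}
```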
Changing field names
Our collections have different field names and because of that the above search request is not enough. We need to alter the name of the fields by using the select streaming expression. We will change the name of the likes field to num_likes and the name of the views field to num_views. The expression that does that is:
select( search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"), id, url, likes as num_likes, views as num_views )
The select streaming expression lets us choose which fields should be used in the resulting tuples and how they will be named. In our case we take the id and url fields as is and we change the name of the likes and views fields.
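Conceptually, the renaming that select performs is a simple per-tuple field mapping. A rough Python sketch of the same transformation may make it clearer (the sample tuples are made up for illustration, and this is not how Solr implements it internally):

```python
# Hypothetical sample tuples, mimicking what the search expression emits
tuples = [
    {"id": "1", "url": "http://example.com/a", "likes": 10, "views": 100},
    {"id": "2", "url": "http://example.com/b", "likes": 5, "views": 42},
]

def select(stream, keep, rename):
    """Mimic select(): keep some fields as-is, rename others."""
    for t in stream:
        out = {f: t[f] for f in keep}
        out.update({new: t[old] for old, new in rename.items()})
        yield out

renamed = list(select(tuples, keep=["id", "url"],
                      rename={"likes": "num_likes", "views": "num_views"}))
print(renamed[0])
# {'id': '1', 'url': 'http://example.com/a', 'num_likes': 10, 'num_views': 100}
```

Fields not listed (here, the original likes and views names) simply don’t appear in the resulting tuples.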
To test the result of that expression you can simply use the following command:
curl --data-urlencode 'expr=select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views)' http://localhost:8983/solr/video/stream
Running the re-indexing
Finally, we have the data prepared and read in an efficient way, so we can send it to Solr for indexing. We do that using the update streaming expression, simply by specifying the target collection name and the batch size, like this:
update( video_new, batchSize=100, select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))
And the command that we would send to Solr:
curl --data-urlencode 'expr=update(video_new,batchSize=100,select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))' http://localhost:8983/solr/video/stream
Please note that we send the command to the /stream handler of the source collection – in our case, the video collection. This is important.
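The batchSize parameter controls how many tuples update accumulates before sending an indexing request to the target collection. The grouping behavior can be sketched in plain Python (a conceptual illustration, not Solr’s actual implementation):

```python
def batches(stream, batch_size):
    """Group a tuple stream into fixed-size batches, as update(..., batchSize=N) does."""
    batch = []
    for t in stream:
        batch.append(t)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# 250 dummy documents with batchSize=100 yield batches of 100, 100, and 50
docs = [{"id": str(i)} for i in range(250)]
sizes = [len(b) for b in batches(docs, 100)]
print(sizes)  # [100, 100, 50]
```

Larger batches mean fewer update requests to the target collection, at the cost of more memory held per batch.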
Verifying the re-indexing
Once Solr has finished the task, we can compare the number of documents returned by each collection to verify that the data has been re-indexed properly. We can do that by running these commands:
curl -XGET 'localhost:8983/solr/video/select?q=*:*&indent=true&rows=0'
curl -XGET 'localhost:8983/solr/video_new/select?q=*:*&indent=true&rows=0'
Both commands return the same number of documents:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <bool name="zkConnected">true</bool>
    <int name="status">0</int>
    <int name="QTime">38</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="indent">true</str>
      <str name="rows">0</str>
    </lst>
  </lst>
  <result name="response" numFound="18" start="0">
  </result>
</response>
And that means that everything works as intended 🙂
Interested in Solr Streaming Expressions? Subscribe to this blog or follow @sematext – we have more Streaming Expressions blog posts in the queue. If you need any help with Solr / SolrCloud – don’t forget @sematext does Solr Consulting, Production Support, as well as Solr Training!
One of the things that changed extensively in Solr 6.0 is Streaming Expressions and what we can do with them (hint: amazing stuff!). We already described Solr SQL support. Today, we’ll dig into the functionality that makes Solr SQL support possible – the Streaming Expressions. Using Streaming Expressions, we will put together a mechanism that lets us re-index data in a given Solr collection – all within Solr, without any external moving parts or dependencies.
Last week, in the Solr 6, SolrCloud and SQL Queries post, we described how the recent release of Solr 6 in its SolrCloud mode is able to understand SQL. But SQL is not the only new addition in SolrCloud / Solr 6. Another addition that we can use is the Solr JDBC driver. We can use it just like any other JDBC driver. In this blog post we will show how to use the Solr JDBC driver from our code, which should give you an idea of how to proceed when using this functionality elsewhere, say with Apache Zeppelin or any other data exploration or visualization tool that has JDBC support.