
Running Solr in Docker: How & Why

Docker is all the rage these days, but one doesn’t hear about running Solr on Docker very much.

Last month, we gave a talk on the topic of running containerized Solr at the Lucene Revolution conference in Boston, the biggest open source conference dedicated to Apache Lucene/Solr. The 40-minute talk included a live demo that shows how to actually do it, while addressing a number of important bits if you want to run Solr on Docker in production.

Curious? You can find both the slides and the full 40-minute video of the talk below.

Indeed, a rapidly growing number of organizations are using Solr and Docker in production. If you also run Solr in Docker, be sure to check out Docker + Solr How-to: Monitoring the Official Solr Docker Image.

Needless to say, monitoring Solr is essential in production, and Docker changes the operational picture in a number of small but important ways. For instance, containers let you build, ship, and run applications in a lightweight, reproducible manner, which can reduce deployment overhead and keep application footprints small.
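If you want to experiment along the way, the official Solr image on Docker Hub makes it easy to spin up a local instance. A minimal sketch (the container and core names are just examples):

# run the official Solr image, exposing the default Solr port 8983
docker run -d -p 8983:8983 --name my_solr solr

# create a core to experiment with inside the running container
docker exec -it my_solr bin/solr create_core -c demo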


Tuning Solr & Pipeline for Logs – Video & Slides

Not everyone uses Splunk or the ELK stack for logs. A few weeks ago, at the Lucene/Solr Revolution conference in Boston, we gave a talk about using Solr for logging, along with lots of good info about how to tune the logging pipeline. The talk also covers the best AWS instance types, optimal EBS setup, log shipping (see Top 5 Logstash Alternatives), and so on.

Handling Shards in SolrCloud


One of the things you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas: when to go for more leaders, when to go for more replicas, and when to go for both; what you can do with them; and how to control their creation. In this blog entry I would like to focus on one of those aspects: handling replica creation in SolrCloud, illustrated with a short sketch below. Note that while this is not limited to Solr on Docker deployments, if you are considering running Solr in Docker containers you will want to pay attention to this.
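For example, both the number of leader shards and the number of replicas are set when a collection is created, and replicas can be added later. A sketch using the Collections API (the collection and shard names are made up):

# create a collection with 2 leader shards, each with 2 replicas
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=2&replicationFactor=2&maxShardsPerNode=2'

# add one more replica to shard1 at a later time
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=products&shard=shard1'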

Read More


Solr Training, San Francisco & New York, October

If you are using Solr and are looking for Solr training to quickly improve your Solr or SolrCloud skills, we're running several Solr classes this October in San Francisco and New York.

Have two people from the same company attending the training? The second one gets 25% off.

To see the full course details, the outline, and information about the instructor, click on the class names below.

San Francisco:

New York City:

All classes include breakfast and lunch. If you have special dietary needs, please let us know ahead of time. If you have any questions about any of the classes you can use our live chat (see bottom-right of the page), email us at training@sematext.com or call us at 1-347-480-1610.


Apache Solr Training in NYC June 13-14

If you missed our Core Solr training in October 2015 in New York, here is another chance: we're running the 2-day Core Solr class again next month, on June 13 & 14, 2016.
This course covers both Solr 5.x and Solr 6.x! You can see the complete course outline under Solr & Elasticsearch Training Overview. The course is very comprehensive: it includes over 20 chapters and lots of hands-on exercises. Be prepared to learn a lot!

Cost:
$1,200 early bird rate (valid through June 1) and $1,500 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:

462 7th Avenue, New York, NY 10018 – see map

If you have any questions, please get in touch. To sign up, just register here.


DocValues Reindexing with Solr Streaming Expressions

Last time, when talking about Solr 6, we learned how to use streaming expressions to automatically update data in a collection. As you can imagine, this is not the only cool thing you can do with streaming expressions. Today, we will see how to re-index data in your collection for fields that use doc values. For that we will use Solr 6.1, because it contains the fix for a bug that affects this functionality (see SOLR-9015 for details).

Let's assume we have two collections: one called video, which will be the source of the data, and a second called video_new, which will be the target. The two collections have slightly different structures, namely different field names. The video collection will have the following fields:

  • id – document identifier
  • url – URL of the video
  • likes – number of likes
  • views – number of views

The second collection, video_new, will have the following fields:

  • id – document identifier
  • url – URL of the video
  • num_likes – number of likes
  • num_views – number of views
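If you want to follow along, the target collection and its fields can be set up front. A sketch that assumes a local SolrCloud instance with a default configset available; the shard and replica counts are arbitrary and the field type names may differ in your configset:

# create the target collection
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=video_new&numShards=1&replicationFactor=1'

# declare the renamed fields, with doc values enabled so /export works on the new collection too
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {"name":"url","type":"string","stored":true,"docValues":true},
  "add-field": {"name":"num_likes","type":"int","stored":true,"docValues":true},
  "add-field": {"name":"num_views","type":"int","stored":true,"docValues":true}
}' 'http://localhost:8983/solr/video_new/schema'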

Exporting the data

The first thing we need to figure out is how to export data from the source collection efficiently. We can't just set the rows parameter to a gazillion, because that is inefficient and can lead to Solr going out of memory. So we will use the /export request handler instead. The only limitation of that request handler is that the data needs to be sorted and the exported fields need to use doc values. That is not a problem for our data, but you should be aware of this requirement.
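For reference, a field that satisfies this requirement could be declared like this in the schema (illustrative only; type names vary between configsets):

<field name="likes" type="int" indexed="true" stored="true" docValues="true"/>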

We will start by exporting the data using the standard Solr way – using the request params with the /export handler. The request looks like this:

curl -XGET 'localhost:8983/solr/video/export?q=*:*&sort=id+desc&fl=id,url,likes,views'

The above will result in Solr using the /export handler and returning all data, not only the first page of the results.

However, we want to use streaming expressions to re-index the data. Because of that we can change the above request to use the search streaming expression, which looks as follows:

search(
  video,
  zkHost="localhost:9983",
  qt="/export",
  q="*:*",
  fl="id,url,likes,views",
  sort="id desc")

The working command with the request looks like this:

curl --data-urlencode 'expr=search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc")' http://localhost:8983/solr/video/stream

We use the search streaming expression and provide the name of the collection, which is video in our case, the ZooKeeper host (yes, we can read from other clusters), and the name of the request handler, which is /export in our case and is required. Finally, we provide the match-all query, the list of fields we are interested in, and the sorting expression. Please remember that when using the /export handler, all fields listed in the fl parameter must use doc values.
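Since zkHost can point to a different cluster, the very same expression can also pull data from a remote SolrCloud. A sketch with a made-up ZooKeeper address:

search(
  video,
  zkHost="remote-zk.example.com:2181",
  qt="/export",
  q="*:*",
  fl="id,url,likes,views",
  sort="id desc")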

Changing field names

Our collections have different field names and because of that the above search request is not enough. We need to alter the name of the fields by using the select streaming expression. We will change the name of the likes field to num_likes and the name of the views field to num_views. The expression that does that is:

select(
  search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),
  id, 
  url, 
  likes as num_likes,
  views as num_views
)

The select streaming expression lets us choose which fields should be used in the resulting tuples and how they will be named. In our case we take the id and url fields as is and we change the name of the likes and views fields.

To test the result of that expression you can simply use the following command:

curl --data-urlencode 'expr=select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views)' http://localhost:8983/solr/video/stream

Running the re-indexing

Finally, we have the data prepared and read in an efficient way, so we can send it to Solr for indexing. We do that using the update streaming expression, simply by specifying the target collection name and the batch size, like this:

update(
  video_new, 
  batchSize=100, 
  select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))

And the command that we would send to Solr:

curl --data-urlencode 'expr=update(video_new,batchSize=100,select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))' http://localhost:8983/solr/video/stream

Please note that we send the command to the source collection /stream handler – in our case to the video collection. This is important.

Verifying the re-indexing

Once Solr has finished the task, we can check the number of documents in each collection to verify that the data has been re-indexed properly. We can do that by running these commands:

curl -XGET 'localhost:8983/solr/video/select?q=*:*&indent=true&rows=0'

and

curl -XGET 'localhost:8983/solr/video_new/select?q=*:*&indent=true&rows=0'

Both return the same number of documents:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">38</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
    <str name="rows">0</str>
  </lst>
</lst>
<result name="response" numFound="18" start="0">
</result>
</response>

And that means that everything works as intended 🙂

Interested in Solr Streaming Expressions? Subscribe to this blog or follow @sematext – we have more Streaming Expressions blog posts in the queue. If you need any help with Solr / SolrCloud – don’t forget @sematext does Solr Consulting, Production Support, as well as Solr Training!


Solr Streaming Expressions for Collection auto-updating

One of the things that changed extensively in Solr 6.0 is Streaming Expressions and what we can do with them (hint: amazing stuff!). We already described Solr SQL support. Today, we'll dig into the functionality that makes Solr SQL support possible: the Streaming Expressions. Using Streaming Expressions we will put together a mechanism that lets us re-index data in a given Solr collection, all within Solr, without any external moving parts or dependencies.
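To give a taste of the mechanism: the daemon streaming expression can run another expression repeatedly inside Solr itself. A minimal sketch that wraps the re-indexing expression shown earlier on this page (the daemon id and run interval are made-up values; the interval is in milliseconds):

daemon(
  id="reindex-video",
  runInterval="60000",
  update(video_new, batchSize=100,
    select(
      search(video, zkHost="localhost:9983", qt="/export", q="*:*", fl="id,url,likes,views", sort="id desc"),
      id, url, likes as num_likes, views as num_views)))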

Read More


Solr 6 as JDBC Data Source

Last week, in the Solr 6, SolrCloud and SQL Queries post, we described how the recent release of Solr 6, in its SolrCloud mode, is able to understand SQL. But this is not the only SQL-related addition in Solr 6: there is also the Solr JDBC driver, which we can use just like any other JDBC driver. In this blog post we will show how to use the Solr JDBC driver from our own code, which should give you an idea of how to proceed when using this functionality elsewhere, say with Apache Zeppelin or any other data exploration or visualization tool that has JDBC support.

Read More


Solr 6 Cross-Data Center Replication

With the recent release of Solr 6.0, we got a host of new functionalities users have been eagerly waiting for: the Parallel SQL over MapReduce that we recently blogged about, the new default similarity model, changes to the default similarity configuration, the graph query, and cross-data center replication. We will discuss all of the new version's features over time, but today we will look at the cross-data center replication functionality: how it works, how to set it up, and what to keep in mind when using it.

Read More

Solr 6, SolrCloud and SQL Queries


With the recent release of Apache Lucene and Solr 6, we should familiarize ourselves with the juicy features that come with them. We have the new default Similarity implementation, BM25, instead of the previously used TF-IDF Similarity; improvements in the default Similarity configuration; new dimensional points; a spatial module that no longer relies on third-party libraries; and so on. We will look at all of that in the upcoming weeks, but for now let's dig into one of the biggest additions: Parallel SQL over MapReduce in SolrCloud.
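As a quick preview, SQL statements are posted to a collection's /sql handler. A sketch (the query and collection name are made up; the stmt and aggregationMode parameters follow the Solr 6 documentation):

curl --data-urlencode 'stmt=SELECT id, views FROM video ORDER BY views DESC LIMIT 10' 'http://localhost:8983/solr/video/sql?aggregationMode=map_reduce'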

Read More