Handling Shards in SolrCloud


One of the things you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas – things like when to go for more leaders, when to go for more replicas and when to go for both. We discuss what you can do with them, how to control their creation and work with them. In this blog entry I would like to focus on one of the mentioned aspects – handling replica creation in SolrCloud. Note that while this is not limited to Solr on Docker deployment, if you are considering running Solr in Docker containers you will want to pay attention to this.

Read More

Solr Training

Solr Training, San Francisco & New York, October

If you are using Solr and are looking for Solr training to quickly improve your Solr or SolrCloud skills, we’ve running several Solr classes this October in San Francisco and New York.

Have two people attend the training from the same company? The second one gets 25% off.

To see the full course details, full outline and information about the instructor, click on the class names below.

San Francisco:

New York City:

All classes include breakfast and lunch. If you have special dietary needs, please let us know ahead of time. If you have any questions about any of the classes you can use our live chat (see bottom-right of the page), email us at training@sematext.com or call us at 1-347-480-1610.

Sematext Solr Training

Apache Solr Training in NYC June 13-14

If you’ve missed our Core Solr training in October 2015 in New York, here is another chance – we’re running the 2-day Core Solr class again next month – June 13 & 14, 2016.
This course covers Solr 5.x as well as Solr 6.x!  You can see the complete course outline under Solr & Elasticsearch Training Overview . The course is very comprehensive — it includes over 20 chapters and lots of hands-on exercises.  Be prepared to learn a lot!

$1,200 early bird rate (valid through June 1) and $1,500 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!


462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.  To sign up, just register here.

reindex data

DocValues Reindexing with Solr Streaming Expressions

Last time, when talking about Solr 6 we learned how to use streaming expressions to automatically update data in a collection. You can imagine this is not the only cool thing you can do with streaming expressions. Today, we will see how to re-index data in your collection for fields that are using doc values. For that we will use Solr 6.1, because of a simple bug that was fixed for that version (details SOLR-9015)

Let’s assume we have two collections – one called video, which will be the source of the data. The second collection will be video_new and will be the target collection. We assume that collections will have slightly different structure – slightly different field names. The video collection will have the following fields:

  • id – document identifier
  • url – URL of the video
  • likes – number of likes
  • views – number of views

The second collection, video_new, will have the following fields:

  • id – document identifier
  • url – URL of the video
  • num_likes – number of likes
  • num_views – number of views

Exporting the data

First thing we need to figure out is a way to export data from the source collection in an efficient fashion. We can’t just set the rows parameter to gazillion, because it is not efficient and can lead to Solr going out of memory. So we will use the /export request handler. The only limitation of that request handler is that data needs to be sorted and needs to use doc values. That is not a problem for our data, however you should be aware of this requirement.

We will start by exporting the data using the standard Solr way – using the request params with the /export handler. The request looks like this:

curl -XGET 'localhost:8983/solr/video/export?q=*:*&sort=id+desc&fl=id,url,likes,views'

The above will result in Solr using the /export handler and returning all data, not only the first page of the results.

However, we want to use streaming expressions to re-index the data. Because of that we can change the above request to use the search streaming expression, which looks as follows:

  sort="id desc")

The working command with the request looks like this:

curl --data-urlencode 'expr=search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc")' http://localhost:8983/solr/video/stream

We use the search streaming expression and provide the name of the collection, which is video in our case, the ZooKeeper host (yes, we can read from other clusters), the name of the request handler which is /export in our case and is required. Finally, we provide the match-all query, the list of fields that we are interested in, and the sorting expression. Please remember that when using the /export handler all fields listed in the fl parameter must use doc values.

Changing field names

Our collections have different field names and because of that the above search request is not enough. We need to alter the name of the fields by using the select streaming expression. We will change the name of the likes field to num_likes and the name of the views field to num_views. The expression that does that is:

  search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),
  likes as num_likes,
  views as num_views

The select streaming expression lets us choose which fields should be used in the resulting tuples and how they will be named. In our case we take the id and url fields as is and we change the name of the likes and views fields.

To test the result of that expression you can simply use the following command:

curl --data-urlencode 'expr=select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views)' http://localhost:8983/solr/video/stream

Running the re-indexing

Finally, we have the data prepared and read in an efficient way, so we can send data to Solr for indexation. We do that using the update streaming expression simply by specifying the target collection name and the batch size, like this:

  select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))

And the command that we would send to Solr:

curl --data-urlencode 'expr=update(video_new,batchSize=100,select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))' http://localhost:8983/solr/video/stream

Please note that we send the command to the source collection /stream handler – in our case to the video collection. This is important.

Verifying the re-indexation

Once the task has been finished by Solr we can check the number of documents returned by each collection to verify that data has been re-indexed properly. We can do that by running these commands:

curl -XGET 'localhost:8983/solr/video/select?q=*:*&indent=true&rows=0'


curl -XGET 'localhost:8983/solr/video_new/select?q=*:*&indent=true&rows=0'

Both result in the following number of documents:

<?xml version="1.0" encoding="UTF-8"?>

<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">38</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
    <str name="rows">0</str>
<result name="response" numFound="18" start="0">

And that means that everything works as intended 🙂

Interested in Solr Streaming Expressions? Subscribe to this blog or follow @sematext – we have more Streaming Expressions blog posts in the queue. If you need any help with Solr / SolrCloud – don’t forget @sematext does Solr Consulting, Production Support, as well as Solr Training!

solr streaming expressions

Solr Streaming Expressions for Collection auto-updating

One of the things that was extensively changed in Solr 6.0 are the Streaming Expressions and what we can do with them (hint: amazing stuff!). We already described Solr SQL support. Today, we’ll dig into the functionality that makes Solr SQL support possible – the Streaming Expressions. Using Streaming Expressions we will put together a mechanism that lets us re-index data in a given Solr Collection – all within Solr, without any external moving parts or dependencies.

Read More

jdbc driver

Solr 6 as JDBC Data Source

Last week, in the Solr 6, SolrCloud and SQL Queries post, we described how the recent release of Solr 6 in its SolrCloud mode is able to understand SQL. But this is not the only SolrCloud / Solr 6. Another addition that we can use is the Solr JDBC driver. We can use it just like any other JDBC driver. In this blog post we will show how to use Solr JDBC driver from our code, which should give you an idea of how to proceed when using this functionality elsewhere, say with Apache Zeppelin or any other data exploration or visualization tool that has JDBC support.

Read More


Solr 6 Cross-Data Center Replication

With the recent release of Solr 6.0, we got a host of new functionalities users have been anxiously waiting for. We’ve got the Parallel SQL over MapReduce that we recently blogged about, the new default similarity model, changes to the default similarity model configuration, the graph query and the cross-data center replication. We will slowly discuss all of the features of new Solr version, but today we will look at the cross-data center replication functionality, how it works, how to set it up and what to keep in mind when using it.

Read More

Solr 6, SolrCloud and SQL Queries


With the recent release of Apache Lucene and Solr 6, we should familiarize ourselves with the juicy features that come with them. We have the new default Similarity implementation – BM25 – instead of the previously used TF-IDF Similarity, we have improvements in the default Similarity configuration, new dimensional points, spatial module not using third-party libraries, and so on. We will look at all that in the upcoming weeks, but for now let’s dig into one of the biggest additions – Parallel SQL over MapReduce in SolrCloud.

Read More

Core Solr Training in London


April 4 & 5 — Covers Solr 5.x

Hands-on — lab exercises follow each class section

Early bird pricing until February 29

Add a second seat for 50% off

Sematext is running a 2-day, very comprehensive, hands-on workshop in London on April 4 & 5 for Developers and DevOps who want to configure, tune and manage Solr at scale.

The workshop will be taught by Sematext engineer — and author of Solr books — Rafał Kuć. Attendees will go through several sequences of short lectures followed by interactive, group, hands-on exercises. There will be a Q&A session after each such lecture-practicum block.  See details, including training overview.



Target audience:

Developers who want to configure, tune and manage Solr at scale and learn about a wide array of Solr features this training covers in its 23 chapters – we mean it when we say this is comprehensive!

What you’ll get out of it:

In two days of training Rafal will:

  1. Bring Solr novices to the level where he/she would be comfortable with taking Solr to production
  2. Give experienced Solr users proven and practical advice based on years of experience designing, tuning, and operating numerous Solr clusters to help with their most advanced and pressing issues

When & Where:

  • Dates: April 4-5 (Monday & Tuesday)
  • Time: 9:00 am to 5:00 pm
  • Place: Imparando City of London Training Centre — 56 Commercial Road, Aldgate, London, E1 1LP (see map)
  • Cost: GBP £845.00 “Early Bird” rate (valid through February 29) and GBP £1.045.00 afterward.  There’s also a 50% discount for the purchase of a 2nd seat! (limit of 1 discounted seat per full-price seat)
  • Food/Drinks: Light morning & afternoon refreshments and Lunch will be provided

Got any questions or suggestions for the course? Just drop us a line or hit us @sematext!

Lastly, if you can’t make it…watch this space or follow @sematext — we’ll be adding more Solr training workshops in the US, Europe and possibly other locations in the coming months.  We are also known worldwide for our Solr Consulting Services and Solr Production Support.


Hope to see you in the London in April!  See detailed info about this training.



Docker + Solr How-to: Monitoring the Official Solr Docker Image

The official Solr Image on Docker Hub was released just a few weeks ago and already has 16K pulls. Why not more? Well, there are more than 200 different Solr images on Docker Hub — probably because no official Image was available!

A rapidly growing number of organizations are using Solr and Docker in production and they are probably happy about the new official Image. Needless to say, monitoring Solr is essential in production. Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning.  These include:

  1. Changed deployment for Solr and its monitoring tools using Dockerfile, Docker Compose or various Orchestration Tools
  2. There is a new Layer to monitor: Container Metrics and Events, see: Docker Events and Metrics monitoring and SPM for Docker
  3. Logging has changed: containers log to the console and logs need to be retrieved from Docker-Daemon instead getting them from the Solr log file.  Check out our post on the subject: Innovative Docker Log Management
  4. Official Images may not provide options for monitoring (such as JMX).  However, the official Image for Solr provides an option to pass parameters to the Java Runtime Environment.  We we will use this option for Solr monitoring in this post.

Next, I’m going to demonstrate the setup of a Solr node with SPM. The final setup will provide the full Solr & Docker Monitoring and Logging package:

  • Detailed Application Metrics for Solr, deployed on Docker
  • Detailed Container Metrics and Docker Events
  • Centralized Logs for all Containers by SPM for Docker

Let’s first decide on one of the following options to monitor Solr on Docker:

  1. Build your own Solr container with a mix of open-source monitoring/alerting tools. I’m not going to go into detail about this option today because dealing with a mix of open-source DevOps tools and a non-official Solr image doesn’t sound clean; plus, we can do better.
  2. Use a standalone monitoring agent, which queries metrics from the Solr container. This requires a setup for JMX and Docker networking configurations for the monitor and Solr. The metrics gathered by remote agents are limited and, in the Docker context, running an external monitoring process plus Solr processes consumes more resources.  And the next option …
  3. Inject an SPM in-process monitoring agent into Solr. This option has the lowest resource usage and has support for advanced monitoring functions like Transaction Tracing and AppMap.

We’ll go with Option #3 in this blog post, as it provides the best insights into Solr.  Sematext provides the SPM Client (this includes the monitoring agent and metrics sender) pre-installed in a Docker Image.  We refer to this dockerized SPM Client as “SPM Client Image/Container” in the following instructions.  The main trick here is to mount a volume from SPM Client Container into Solr Containers in order to load the monitoring library that’s part of the SPM Client Container.

Let’s have a look at the desired setup and how to get there:

Monitoring Setup

We’ll use the latest Docker-Compose Version (> v1.5) because we can than use environment variables substitution in Docker-Compose.

1) Configure and start SPM-Client Container

The SPM Token is a unique identifier for monitored applications – if you haven’t created an SPM App for Solr, then create one here first. Should take about 37 seconds.

# Set the SPM Token as Environment Variable
export SPM_TOKEN=4feb144c-4da8-4081-83b5-b0b8e06e743a
# Set the JVM Name, which appears in SPM JVM Metrics Report
# In addition we will use it as Hostname for the Solr container

2) Create SPM Client and Solr service in docker-compose.yml Note: you may copy this file to make changes for additional Solr options; all parameters are set as Environment Variables.

 image: sematext/spm-client
 container_name: spm-client-solr
 hostname: spm-client-solr
 - SPM_CONFIG=${SPM_TOKEN} solr javaagent ${JVM_NAME}

 image: solr
 hostname: solr1
 - "8983:8983"
 - spm-client-solr
 - SOLR_OPTS=-Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-solr.jar=${SPM_TOKEN}::${JVM_NAME}
 command: bin/solr -f

In the Environment variable “SOLR_OPTS” in the Docker-Compose file above we see options for the SPM in-process monitor to inject a .jar file from the SPM Client Volume.  The SOLR_OPTS string is taken from SPM install instructions.  It includes the SPM Token (the ${SPM_TOKEN} part) and provides the JVM name so we can distinguish between multiple Solr instances if we run N of them on the same host (the ::${JVM_NAME} part).

3) Run Solr and SPM Monitor  

We are now ready to fire up Solr:

    docker-compose up -d


All done! After about a minute, metrics for the Docker Host, JVMs and Solr nodes will appear in SPM.  Because we chose a consistent naming for Container hostname, and JVM name we can immediately see, in every chart, the relevant filters named “SOLR1”.  This is much better than some random Container IDs.


Solr Metrics Overview

But what about my Solr Logs and the Container Metrics?

Simply run SPM for Docker – it collect logs as well as container and host metrics.  It can also parse Solr logs and store them in Logsene (see Logsene 1-Click ELK Stack), which is awesome because it means you can have both Solr/OS/JVM metrics AND Solr logs all in one place!  Or do you maybe like to ssh to your servers and grep log files?

Docker Logs & Metrics Steps:

First we create the SPM App with the type “Docker” for Docker-specific metrics and then we create a Logsene App for our logs. Then we use the generated App Tokens to run Sematext Agent for Docker.

docker run -d -name sematext-agent -e SPM_TOKEN=SPM_DOCKER_APP_TOKEN -e LOGSENE_TOKEN=LOGSENE_APP_TOKEN sematext/sematext-agent-docker

After a few minutes, you will get Host and Container Metrics together with Events and Logs in SPM, as shown here:


Please note that logs from the containers are automatically shipped and parsed! No setup for log shippers? That is correct — there is NO complicated setup of syslog, Logstash, Docker log drivers, etc.  All this work is done by SPM for Docker. For example, each log line has a “node_name” field for the Solr node. It takes the timestamp, severity, class, thread and source from the Solr log and each log is automatically tagged with the container ID and image name. Moving from SPM Metrics to detailed Solr Logs including Exceptions and parsed Stack Traces is just another mouse click away! Look:

Multi-Line Exception, captured and parsed from Solr container



The filters next to field stats on the right side of the screen make it easy to identify containers with the most logs by choosing “container_name”.  That’s just a little detail in the Logsene UI – feel free to explore it by creating Alerts or Kibana 4 Dashboards for your container logs.

Like what you saw here? To monitor Docker and Solr with SPM just get a free account here!  And drop us an email or hit us on Twitter with suggestions, questions or comments.  Solr and Docker are topics we enjoy chatting about with the community!