Let’s say you get an alert that one or more queries are slow. Or your users complain, whichever comes first 🙂 We’ve all been there… How do you find the root cause of this slowness and then fix it?
In this article, I’ll go through my usual thought process: first, I’d try to find which queries are slow. Then, I’d dig deeper:
- Is the query itself expensive?
- Is there something in the environment, such as garbage collection, that’s impacting query performance?
- Last but not least, what are some ideas on fixing slow queries?
Let’s take a specific example and run through each step. I’ll use Sematext Cloud because it lets me easily analyze both Solr logs and performance metrics using a single UI.
Detecting Slow Queries
To find the slow queries, we’ll need to log them, either from Solr or from the application that queries it. Here, I’ll use Sematext Cloud’s Solr Logs integration to parse my Solr logs and give me some predefined dashboards to analyze them.
One of the predefined dashboards is for analyzing queries, so if there’s a latency spike, we can see it right away:
In this case, the 90th and 95th percentiles are normally around 100-150ms, but they suddenly spike to 400ms+.
We don’t have to be constantly on the lookout for such slow queries: we can filter them by QTime in the search bar and set up alerts (click the bell icon). In this case, for queries slower than 400ms:
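If you prefer to script such a check yourself, here’s a minimal sketch in Python. It assumes the default Solr request log format, where each query line contains `path=...` and `QTime=...` fields; the sample lines and the 400ms threshold are just illustrations:

```python
import re

# Matches the path and QTime fields in a standard Solr request log line,
# e.g. "... o.a.s.c.S.Request [techproducts] webapp=/solr path=/select
#       params={q=title:monitor} hits=12 status=0 QTime=447"
LINE_RE = re.compile(r"path=(?P<path>\S+).*?QTime=(?P<qtime>\d+)")

def slow_queries(lines, threshold_ms=400):
    """Yield (path, qtime) for every log line at or above threshold_ms."""
    for line in lines:
        m = LINE_RE.search(line)
        if m and int(m.group("qtime")) >= threshold_ms:
            yield m.group("path"), int(m.group("qtime"))

sample = [
    "webapp=/solr path=/select params={q=title:monitor} hits=12 status=0 QTime=447",
    "webapp=/solr path=/select params={q=title:disk} hits=3 status=0 QTime=52",
]
print(list(slow_queries(sample)))  # prints [('/select', 447)]
```

You could run something like this over your logs periodically, though a log shipper with alerting saves you from maintaining the plumbing.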
Once the alert fires, we can run the same filter to identify the offending query:
Bingo! Here’s our slow query of 447ms! But wait: it’s not expensive at all, since it looks for just one term (in this case, on the title field, a default text_general field). To get to the root cause, we need more information about the environment in which this query runs: metrics.
Correlating Solr Logs with Metrics
Sematext Cloud has a Solr Monitoring integration to complement the Logs integration I’ve used so far. Setting it up works in a similar fashion: an agent points to Solr, picks up the relevant metrics and sends them to Sematext Cloud – where you have predefined dashboards to explore those metrics.
We can bring up the monitoring dashboards without leaving the context of our logs. I marked each step in the screenshot below:
1. I clicked on the Split Screen button to bring another Sematext Cloud App side by side with the current dashboard.
2. I selected the SolrCloud Monitoring App used to monitor the same Solr cluster. You could bring in other Apps, for example a Zookeeper Monitoring App or a generic Logs App capturing logs from the application that queries Solr.
3. I picked the CPU & Memory dashboard from this App, because I like to start with system metrics, to get an overview of what’s going on.
4. I moved the mouse over the query latency spike in the Solr Logs App. The vertical line shows the same time on the right pane as well, indicating a CPU and load spike.
At this point, we can assume that the saturated CPU caused the cheap term query to be slow. But what caused the CPU spike in the first place? Maybe we have too many queries running at the same time?
After filtering for the /select handler to keep only queries, I can see the request rate is constant. So that’s not it. What about indexing?
Aha! Now we’re down to the root cause: a spike in indexing must have saturated the CPU, making our query slow.
Fixing Solr Slow Queries
In this particular case, we’d want to make sure that indexing doesn’t saturate the CPU: either by rate-limiting from the indexing application or by adding enough headroom (hardware) to the Solr cluster. Tuning the merge policy may also squeeze out more write performance, but it’s unlikely to solve the problem on its own (i.e., don’t expect miracles!). If anything, merges have more potential for improving query latency – when indexing isn’t an issue. Of course, indexing pressure is just one possible root cause. Other common culprits include:
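To make the rate-limiting idea concrete, here’s a sketch of a client-side token bucket you could put in front of your indexing code. This isn’t a Solr or Sematext feature, just a generic illustration; the rates and the update call in the usage comment are hypothetical:

```python
import time

class TokenBucket:
    """Client-side rate limiter: allow at most `rate` docs per second,
    with bursts of up to `capacity` docs."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep roughly long enough for the missing tokens to refill.
            time.sleep((n - self.tokens) / self.rate)

# Hypothetical usage, before each batch sent to Solr's update handler:
#   bucket = TokenBucket(rate=1000, capacity=5000)
#   bucket.acquire(len(batch))
#   requests.post(solr_update_url, json=batch)
```

A token bucket smooths out bursts instead of rejecting them, which fits indexing pipelines well: documents arrive a little later rather than failing, and queries keep their share of CPU.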
- The query is expensive. You may be able to make it cheaper with one of the following tricks:
- Querying many fields? Having a catch-all field (via copyField) may make things faster
- Analysis produces too many terms? Maybe you can be more conservative with things like ngram size, or use stemming or stopwords, to reduce the cardinality of a field
- Using too many phrases? Try reducing those pf, pf2 and pf3 fields
- Is there a costly filter, e.g. one based on a function? Run it as a post-filter by adding cache=false and a cost of 100 or higher
- Expensive facets? Use debug to find out whether the query or the facets take more time. Try JSON Facets if you’re not already using them. Mind your size, refinement options and preliminary sorting
- Multi-tenant search? Try one core/collection per client or use routing
- Garbage collection pauses. We either need more heap or to tune the garbage collector.
- Replication takes too many resources. We can throttle replication.
- Imbalanced data in a cluster. We can change the number of shards and replicas (though we may need to reindex) to even out the load.
- Cold caches. We can add warm up queries.
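As an illustration of the post-filter trick above, here’s how the request parameters might look, sketched in Python. The field names (in_stock, popularity, price) are made up; the key part is the {!frange} filter with cache=false and cost=200, which Solr applies as a post-filter, i.e. only to documents that already matched the query and the cheap filters:

```python
from urllib.parse import urlencode

# A hypothetical /select request: the cheap filter stays cached as usual,
# while the expensive function-based filter runs as a post-filter.
params = [
    ("q", "title:monitor"),
    ("fq", "in_stock:true"),  # cheap filter, cached normally
    # cache=false + cost >= 100 turns this into a post-filter:
    ("fq", "{!frange cache=false cost=200 l=10}mul(popularity,price)"),
    ("rows", "10"),
]
query_string = urlencode(params)
print(query_string)  # ready to append to http://<solr-host>/solr/<collection>/select?
```

The expensive function is then evaluated against far fewer documents, which is usually much cheaper than scoring it over the whole index.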
All these have something in common: you’d see symptoms in metrics, logs, or both. That’s where Sematext Cloud has you covered. You can see it in action in this article, where we show you in greater detail how to use Sematext to monitor and alert on Solr metrics and logs. Start your Sematext trial now and let us know what you think!