
Poll: Solr Index Size Monitoring

By sematext

As you may know, Sematext runs a service we internally call SPM – Scalable Performance Monitoring, a currently-still-free SaaS for monitoring performance of Solr, HBase, and soon a few other technologies we often help our clients with.  One of the things we monitor for Solr and other search technologies is the size of the index.  We monitor it by periodically checking its size, number of documents in it, number of deleted documents, number of index segments, files, etc.

Recently, we had an internal discussion about how to best report the index size when the index changes over time and decided we’d ask people who run Solr (or ElasticSearch or Sensei or…) – you – what you would like to see in this report.

For example, imagine that in some 5-minute time period (say 10:00 AM to 10:05 AM) we check the index 5 times (in reality we do it much more frequently) and each time we do that we find the index has a different number of documents in it: 10, 15, 20, 25, and finally 30 documents.  Now imagine this data as a graph showing the number of indexed documents over time, but with the smallest time period shown being a 5-minute interval.

At this point the question we have for you is: How many documents should this graph report for our example 10:00 – 10:05 AM period above? Should it show the minimum – 10?  Average – 20?  Median – 20?  Maximum – 30?  Something else?  Minimum, average, and maximum – 10, 20, 30?
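Using the example numbers above, the candidate summaries for one 5-minute bucket can be computed in a few lines. This is purely illustrative Python, not SPM's actual implementation; the "last" option reflects one suggestion from the comments below of reporting the value at the end of the period:

```python
import statistics

# Five hypothetical document-count samples taken within one 5-minute bucket
samples = [10, 15, 20, 25, 30]

summary = {
    "min": min(samples),                   # 10
    "mean": statistics.mean(samples),      # 20
    "median": statistics.median(samples),  # 20
    "max": max(samples),                   # 30
    "last": samples[-1],                   # 30 -- the value at the end of the period
}
print(summary)
```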

Any feedback and suggestions you give us regarding this will be greatly appreciated – thanks!

10 thoughts on “Poll: Solr Index Size Monitoring”

  1. To the author,
    Would you have any objections to me running this same poll on Javalobby and the Solr-Lucene Zone at I think it’d be interesting to compare and combine the data we get.

    -Mitch Pronschinske
    DZone Community Curator

  2. I would want the # at the end of the time period. Unless there is some weird anomaly (the index suddenly changes significantly every 5 minutes and then changes back) this would be “accurate enough”. If your sampling rate is higher than the granularity at which you present data, stop sampling so much. 🙂

  3. It depends on the Business need/goals.

    A Salesperson might want the Max to make the case for a sale.
    IT might want the Max for sizing/performance monitoring, although the sample size is small, and the variation large.

    Are there event triggers involved – too low/too high?

    It comes down to: what question is the data being used to answer?

    1. Jon – there are indeed triggers for alert purposes. Examples in English:

      • Alert me when the number of docs is > N (or < N).
      • Alert me when the number of documents in a given period is N% different from the number of documents in the previous period.
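      A minimal sketch of how those two alert rules could be evaluated (the function and parameter names here are hypothetical, not SPM's actual API):

```python
def check_alerts(current_docs, previous_docs, max_docs, max_change_pct):
    """Evaluate the two alert rules described above.

    Names are illustrative only -- this is not SPM's real interface.
    """
    alerts = []
    # Rule 1: absolute threshold on the document count
    if current_docs > max_docs:
        alerts.append("doc count above threshold")
    # Rule 2: percent change versus the previous period
    if previous_docs and (
        abs(current_docs - previous_docs) / previous_docs * 100 > max_change_pct
    ):
        alerts.append("doc count changed too much vs. previous period")
    return alerts

print(check_alerts(current_docs=35, previous_docs=20, max_docs=30, max_change_pct=50))
```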
  4. Minimum does not make much sense – it is the same as presenting the maximum, just 5 min later.
    I would go with the number that matches the time on the time axis, even if there are more measurements than presented on the graph. Whoever is interested in more details can change the time unit and get a new graph that is aligned with the first one.

    1. Emir: Good way to think about the minimum (assuming the index is always growing, which is true 99% of the time). Not sure I follow the rest. Are you saying exact/real document numbers should be present at each point in time when the measurement was made, and it is up to a person to zoom in to that level of granularity if knowledge about the exact number of documents is needed? If that is what you mean, I think the question still remains what to show on the graph for a time period that spans multiple measurements.

      1. That is what I was suggesting – presenting the number that corresponds to the time shown on the graph and ignoring measurements in between. Those are snapshots anyway, and some fluctuation between measurements may be lost anyway.
        And it is not supposed to be the single source of index changes. In combination with a graph showing the number of index changes, you can identify periods where the number of changes is larger than the number of new documents, and then use the measurements between two snapshots to see what happened in those 1% of cases.

  5. This is a case where average is useful. For many things, especially response times, median and percentiles are more stable than averages. Averages are thrown off by a single slow response, but once it is slower than the front-end timeout, it doesn’t really matter how slow it is. But if the 99th percentile goes over the front-end timeout, you know for sure that 1% of your search result pages are showing an error.
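    The point about a single slow response skewing the average is easy to demonstrate with made-up numbers (illustrative only, not real measurements):

```python
import statistics

# 99 fast responses plus one pathological outlier (latencies in ms)
latencies = [100] * 99 + [60_000]

mean = statistics.mean(latencies)      # dominated by the one slow response
median = statistics.median(latencies)  # unaffected by it
print(mean, median)
```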
