Elasticsearch Refresh Interval vs Indexing Performance

Elasticsearch is near-real-time, in the sense that when you index a document, you need to wait for the next refresh for that document to appear in a search. Refreshing is an expensive operation, and that is why by default it's performed at a regular interval instead of after each indexing operation. This interval is defined by the index.refresh_interval setting, which can go either in Elasticsearch's configuration or in each index's settings. If you use both, index settings override the configuration. The default is 1s, so newly indexed documents will appear in searches after 1 second at most.
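
For example, the per-index setting can be changed at any time through the index settings API. Below is a minimal sketch using Python and the requests library; the host and the index name ("logs") are placeholders:

    import requests

    ES = "http://localhost:9200"   # assumed address of an Elasticsearch node
    INDEX = "logs"                 # hypothetical index name

    # Per-index setting; it overrides any index.refresh_interval
    # from the node configuration
    requests.put("%s/%s/_settings" % (ES, INDEX),
                 json={"index": {"refresh_interval": "5s"}})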

Because refreshing is expensive, one way to improve indexing throughput is by increasing refresh_interval. Less refreshing means less load, and more resources can go to the indexing threads. How does all this translate into performance? Below is what our benchmarks revealed when we looked at it through the SPM lens.


Test conditions

For this benchmark, we indexed Apache logs in bulks of 3,000 documents each, on 2 threads. Those logs went into one index with 3 shards and 1 replica, hosted by 3 m1.small Amazon EC2 instances. The Elasticsearch version used was 0.90.0.
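
The indexing side of the benchmark looked roughly like the sketch below: build bulks of 3,000 log lines and send them to the _bulk API from 2 threads. This is an illustrative reconstruction in Python with the requests library, not the exact benchmark code; the index name, type name, and log file name are made up:

    import json
    import threading
    import requests

    ES = "http://localhost:9200"          # assumed node address
    INDEX, TYPE = "apache-logs", "log"    # hypothetical index and type names
    BULK_SIZE = 3000                      # documents per bulk request

    def index_bulk(docs):
        # The bulk API takes newline-delimited JSON: an action line per document,
        # followed by the document source, with a trailing newline at the end
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": INDEX, "_type": TYPE}}))
            lines.append(json.dumps(doc))
        requests.post("%s/_bulk" % ES, data="\n".join(lines) + "\n",
                      headers={"Content-Type": "application/x-ndjson"})

    def worker(log_lines):
        bulk = []
        for line in log_lines:
            bulk.append({"message": line.rstrip("\n")})
            if len(bulk) == BULK_SIZE:
                index_bulk(bulk)
                bulk = []
        if bulk:
            index_bulk(bulk)

    # Two indexing threads, each fed every other line of the log file
    with open("access.log") as f:
        lines = f.readlines()
    threads = [threading.Thread(target=worker, args=(lines[i::2],)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()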

As indexing throughput is a priority for this test, we also made some configuration changes in this direction (a rough sketch of how these settings could be applied follows the list):

  • index.store.type: mmapfs, because memory-mapped files make better use of OS caches
  • indices.memory.index_buffer_size: 30%, increased from the default 10%. Usually, the more buffer space, the better, but we don’t want to overdo it
  • index.translog.flush_threshold_ops: 50000. This makes commits from the translog to the actual Lucene index happen less often than with the default setting of 5000 operations
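
Put together, an index along the lines of the one used in this test could be created as in the sketch below (Python with the requests library; host and index name are placeholders). Note that the indexing buffer size is a node-level setting:

    import requests

    ES = "http://localhost:9200"   # assumed node address
    INDEX = "apache-logs"          # hypothetical index name

    requests.put("%s/%s" % (ES, INDEX), json={
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            "index.store.type": "mmapfs",                 # memory-mapped files
            "index.translog.flush_threshold_ops": 50000,  # flush less often than the default
        }
    })

    # indices.memory.index_buffer_size is a node-level setting, so it goes into
    # elasticsearch.yml (indices.memory.index_buffer_size: 30%) rather than here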

Results

First, we indexed documents with the default refresh_interval of 1s. Within 30 minutes, 3.6M new documents were indexed, at an average of 2K documents per second. Here’s how the indexing throughput looks in SPM for Elasticsearch:

refresh_interval: 1s

Then, refresh_interval was set to 5s. Within the same 30 minutes, 4.5M new documents were indexed, at an average of 2.5K documents per second. Indexing throughput increased by 25% compared to the initial setup:

refresh_interval: 5s

The last test was with refresh_interval set to 30s. This time 6.1M new documents were indexed, at an average of 3.4K documents per second. Indexing throughput increased by 70% compared to the default setting, and by 25% compared to the previous scenario:

refresh_interval: 30s

Other metrics from SPM, such as load, CPU usage, disk I/O, memory, JVM heap, or garbage collection, didn’t change significantly between runs. So configuring the refresh_interval really is a trade-off between indexing performance and how “fresh” your searches are. Which setting works best for you? It depends on the requirements of your use case. Here’s a summary of the indexing throughput values we got:

  • refresh_interval: 1s   – 2.0K docs/s
  • refresh_interval: 5s   – 2.5K docs/s
  • refresh_interval: 30s  – 3.4K docs/s

25 thoughts on “Elasticsearch Refresh Interval vs Indexing Performance”

  1. Hi, I observed a situation but I don’t know whether this is normal or not.
    I use Elasticsearch 1.7.2 and my observation is as below.
    I set an index with refresh 6s, and when I start adding documents to this index, every 6s I see a filesystem change, that is, *.cfe and *.cfs files are created. If I keep adding documents to the index, new files are created every 6s.
    Is this normal?
    Will refresh trigger a flush action if a change occurred? That means, if there is any pending document in the buffer (or a translog change), refresh will flush it to disk and generate a small segment file?

    1. It is somewhat normal in the sense that refreshes write new segments with whatever is in the indexing buffer at the moment, but those segments aren’t fsync()-ed to disk. Only a flush does that.

      1. I don’t fully understand what you mean by “those segments aren’t fsync()-ed to disk. Only a flush does that.”

        I noticed that when I add documents to the index, first I see two files created but at 0KB; after the refresh_interval (6s), the files grow to 2KB (depending on how many documents were added before).

        If “refreshes write new segments with whatever is in the indexing buffer at the moment”, what is the purpose of setting the index buffer? I thought that refresh would not flush segments if the buffer is not full?

          1. Oh, I don’t know what the directory implementation is on Windows (I suppose you’re running the default hybrid niofs+mmapfs?) but this might explain the behavior you’re seeing.

            As far as I understand, a refresh implies a Lucene flush (https://lucene.apache.org/core/5_0_0/core/org/apache/lucene/index/IndexWriter.html#flush%28boolean,%20boolean%29) while a flush is an actual commit (https://lucene.apache.org/core/5_0_0/core/org/apache/lucene/index/IndexWriter.html#commit%28%29). Only the latter guarantees that files are persisted to disk, while a refresh may or may not write something – depending on the Directory implementation and the operating system.
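
            If it helps, you can trigger the two operations explicitly and compare them. Here’s a rough sketch with Python’s requests library, with the host and index name assumed:

              import requests

              ES, INDEX = "http://localhost:9200", "logs"   # assumed host and index name

              # Refresh: makes recently indexed documents searchable; new segments may
              # still live only in the OS cache, nothing is guaranteed to be fsync()-ed
              requests.post("%s/%s/_refresh" % (ES, INDEX))

              # Flush: does a Lucene commit and clears the transaction log, so segments
              # are durably written to disk
              requests.post("%s/%s/_flush" % (ES, INDEX))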

            I hope this helps shed a bit of light.

          2. Radu, thanks for your kind reply. Does Elasticsearch provide any API that can easily check which filesystem implementation is currently being used?

          3. Hi Radu, thanks.
            I wonder, is choosing mmapfs better than the hybrid one?
            Do you have any experience with this comparison?

          4. Normally, mmapfs is better; it’s coming back as the default in 5.0. It makes better use of OS caches – check this nice blog post for an explanation: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

            The only thing you’d need to look out for is the explosion of open files. This depends on the number of segments (which in turn depends on the size of your data and the merge policy) but also on whether ES uses compound files. The defaults since 1.5 are sensible, though: https://github.com/elastic/elasticsearch/issues/8919

  2. Thank you for the good article.

    One quick question: When it says “refresh_interval” of 1 sec. – what does it mean?

    Does it mean after *every* 1 second keep refreshing the index (no matter if there are changes or not) – or does it mean once there is a change to the index then (and only then) after 1 second do the refresh?

    What I am trying to understand is: does the refresh( ) method keep getting called always or only as a trigger to some data changes?

    1. Hi, and sorry for the delayed reply. Refreshing always happens at the set interval – you will see load on your machines even if no data was added in that time. But in that case, the load shouldn’t be high enough to worry about anyway. It becomes problematic when refreshing slows down indexing (and invalidates caches, but that isn’t covered in this post).

  3. Yes, the difference in indexing throughput is there because if you up the refresh_interval, you basically lower the time ES spends on refresh operations – freeing up CPU for other operations (such as bulk indexing). As the refresh_interval increases, refresh time takes a less significant part of the overall computation, so increasing it further may not matter (it depends on what your data and hardware look like).

    The 30% indexing buffer has nothing to do with refreshes, but with flushes. An automatic flush is triggered when the index buffer gets full or when one of the transaction log thresholds is passed:
    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-translog.html

    Increasing the buffer and translog sizes will make ES flush less often and, again depending on your hardware, might decrease the overall time ES spends flushing.

    Elasticsearch exposes refresh and flush times and counts through stats APIs. You can monitor these times with SPM: http://sematext.com/spm/index.html
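
    For example, the refresh and flush totals for an index can be pulled straight out of the stats API. Here’s a small sketch with Python’s requests library (host and index name are assumed):

      import requests

      ES, INDEX = "http://localhost:9200", "logs"   # assumed host and index name

      stats = requests.get("%s/%s/_stats/refresh,flush" % (ES, INDEX)).json()
      totals = stats["_all"]["total"]
      print("refreshes: %s in %s ms" % (totals["refresh"]["total"],
                                        totals["refresh"]["total_time_in_millis"]))
      print("flushes:   %s in %s ms" % (totals["flush"]["total"],
                                        totals["flush"]["total_time_in_millis"]))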

    Whether you use SPM or not, monitoring can tell you what the bottleneck is for your use-case (it really depends here, though I hate to use this expression). For example, if you see tons of CPU I/O wait, you’ll know it’s disk I/O. The bottleneck for indexing performance is usually CPU, assuming the disks can handle the sustained writes – and in most cases they can, even with spinning disks.

  4. Great post. I wonder why we don’t see as much improvement when changing refresh_interval from 5s to 30s as we do when changing it from 1s to 5s.

    Is it that 30% of index buffer is not enough? Or is it disk IO? Did you guys care to check?

  5. Hi,

    Thanks for the post. I would like to see a test with the same data and refresh_interval: -1. Probably the difference between 30s and -1 is not as significant as between 1s and 30s, but still …

    1. Hi Tamas,

      Right, that should help (also check my reply below). Here I was assuming some sort of near-real-time search is needed. If you do batch indexing (for example, update stocks every night in a store or similar use-cases), then it makes sense to disable automatic refreshes altogether (set refresh_interval to -1) and re-enable them after you’re done.
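
      For a nightly batch load, the sequence would look roughly like the sketch below (Python with the requests library; host and index name are placeholders):

        import requests

        ES, INDEX = "http://localhost:9200", "store"   # assumed host and index name

        def set_refresh(value):
            requests.put("%s/%s/_settings" % (ES, INDEX),
                         json={"index": {"refresh_interval": value}})

        set_refresh("-1")    # disable automatic refreshes for the batch load
        # ... run the nightly bulk indexing here ...
        set_refresh("1s")    # restore the interval once indexing is done
        requests.post("%s/%s/_refresh" % (ES, INDEX))   # make everything searchable right away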

  6. Thanks for sharing the information. What was your average document size in your tests?

    We are running our test with a 144 KB document size on SSDs and good hardware, but aren’t getting anything above 200 documents per second. We are keeping the refresh interval at 1 second, since we want the data to be read as soon as it is written.

    Any ideas to improve performance are most welcome.

    Thanks
    Pranav.

    1. Hello Pranav,

      I don’t remember exactly, but I think they were Apache logs of about 240 bytes each. This is definitely the main cause of your smaller numbers: you have much bigger documents.

      Here are some things you may look at (a rough sketch of the settings side follows this list):
      – upgrade Elasticsearch to the latest version (there have been quite a few performance improvements lately)
      – increase index buffer size (see the post), which defaults to 10% of your heap
      – since you’re on SSDs, you should increase or remove the store throttling limit: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling
      – tune your merge policy for less merging (this will slow your searches down a bit): http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html (a good start would be to set index.merge.policy.segments_per_tier to something like 20 or 30)
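
      To illustrate the settings side of these suggestions, here’s a rough sketch with Python’s requests library; the host and index name are placeholders, and whether the merge policy setting can be updated live depends on your version (otherwise set it at index creation time):

        import requests

        ES, INDEX = "http://localhost:9200", "docs"   # assumed host and index name

        # The bigger indexing buffer is node-level, so it goes into elasticsearch.yml:
        #   indices.memory.index_buffer_size: 30%

        # Relax store throttling for SSDs (dynamic, cluster-wide)
        requests.put("%s/_cluster/settings" % ES,
                     json={"transient": {"indices.store.throttle.type": "none"}})

        # Less aggressive merging for one index (may slow searches down a bit)
        requests.put("%s/%s/_settings" % (ES, INDEX),
                     json={"index": {"merge.policy.segments_per_tier": 30}})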

  7. TD: which part specifically? When your documents are huge, you can’t overdo indices.memory.index_buffer_size, yes, as I think is implied here. Were you referring to something else?

  8. When the indexed documents are large, this doesn’t seem like a good idea. The suggestion to increase refresh_interval to 30s blew up my system, so be aware.
