So when we do a search in ES we do it in two roundtrips. To make sure it's consistent we register a search context on every shard during the first roundtrip. The second roundtrip passes the context ID on to make sure we operate on the same point-in-time snapshot as the first roundtrip. Now if something closes the search context, i.e. if it times out (5 min by default), you will see this message. This can happen for instance with scan/scroll: if the user doesn't come back in time we clean up the context. Do you use scan/scroll?
```python
for doc in helpers.scan(
    es,
    index="filebeat-*",
    doc_type="etp",
    query=scanQuery.format(args.since, "classname.keyword"),
    size=5000,
    scroll="5m",
    raise_on_error=False,  # Don't know why we sometimes get ScanError otherwise
    preserve_order=True,
):
```
Is the "default five minutes" you mentioned what the "scroll" parameter is about? Is this five minutes for a single iteration of the loop processing a single document, or for a batch of 5000 (I'm guessing here that the "size" parameter divides the results up into batches of 5000, but I don't see why that should be my business), or for the entirety of the loop processing all the query results?
Thanks. I think that's now enough information for me to experiment with various timings, batch sizes, error detection and retries. (Which, as ever, will no doubt turn out to be far more work than just getting the original logic right. :frowning_face:)
I see this error when the search context has timed out, so it's no longer present in the cluster. The `helpers.scan` function is a generator which hides the underlying multiple trips to elasticsearch. The `scroll` value is a per-request keep-alive, renewed on every scroll request, so it needs to cover the time to process one batch, not the whole result set. If you are not processing the results fast enough, @TimWard, the context can be cleaned up in elasticsearch, which will make the next request for that `scroll_id` (done internally in the helper) produce this error.
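To make the mechanics concrete, here is a minimal sketch of the multiple trips the generator hides. This is not the actual `helpers.scan` source; the `scan_sketch` name is mine, and it only assumes a client object with `search()` and `scroll()` methods shaped like elasticsearch-py's:

```python
# A minimal sketch (not the real helpers.scan implementation) of the
# roundtrips hidden behind the generator.
def scan_sketch(es, index, query, size=1000, scroll="5m"):
    # First trip: opens a search context on every shard and returns the
    # first batch plus a scroll_id identifying that context.
    resp = es.search(index=index, body=query, size=size, scroll=scroll)
    scroll_id = resp.get("_scroll_id")
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        # Each follow-up trip passes the scroll_id and renews the keep-alive.
        # If processing the previous batch took longer than `scroll`, the
        # context has been cleaned up and this call produces the error above.
        resp = es.scroll(scroll_id=scroll_id, scroll=scroll)
        scroll_id = resp.get("_scroll_id", scroll_id)
```

The important point for the timeout question: the clock restarts on every `scroll()` call, so the slow part is whatever happens between two batches, inside the `for doc in ...` loop body.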
To verify that this is the case you can turn on logging at the beginning of your script to see the individual requests being sent to elasticsearch:

```python
import logging
logging.basicConfig(level=logging.INFO)
```
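If the INFO-level output is too coarse, elasticsearch-py also exposes a dedicated `elasticsearch.trace` logger that emits each request as a reproducible curl-style command. A sketch (the log file name here is my own choice):

```python
import logging

logging.basicConfig(level=logging.INFO)

# The "elasticsearch.trace" logger records every request in a form you can
# replay with curl; writing it to a file keeps stdout readable.
tracer = logging.getLogger("elasticsearch.trace")
tracer.setLevel(logging.DEBUG)
tracer.addHandler(logging.FileHandler("es_trace.log"))
```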
You can also try setting the `scroll` param to a higher value to see if it goes away and then try to pinpoint the proper value for your environment.
@honzakral I wonder if we should allow scans to renew their context without consuming it; that way we could make this API easier to use. Like your client could go back while it's making progress and tell ES to keep things open? Just an idea...
hmm, that would definitely be a nice solution but I am a bit worried about the side effects that might not be immediately obvious to users - people who never finish iterating over the generator would then keep a context alive in elasticsearch indefinitely, which would definitely be bad.
Ultimately I think the current approach is good in that it promotes the idea that `scan`/`scroll` is to be used for quick export of data - not keeping a "cursor" open while you perform an expensive, long-running operation on every document. If that is your situation I feel you should use some form of background processing with a queue and a pool of workers anyway.
To improve the user experience, would there be a way to keep a tombstone of a search context to give the user more accurate info? "Your scroll timed out, try increasing your `scroll` parameter" would be so much more helpful in this case, if the overhead is not too big.
Another option would be the idea of a streaming API to/from elasticsearch where this would be done in the coordinating node (potentially same with `bulk`), that sounds to me though like more trouble than it's worth...