We've definitely observed the issue less often when using the default codec rather than best_compression, but we do still see it a couple of times a week.
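For context, the codec comparison above is the index-level `index.codec` setting; a minimal sketch of the two configurations we've been comparing (index name is just a placeholder) would be:

```sh
# Example only: create an index with best_compression (DEFLATE-based stored fields);
# omitting "codec", or setting it to "default", gives the default LZ4-based codec.
curl -XPUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '{
  "settings": {
    "index": {
      "codec": "best_compression"
    }
  }
}'
```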
I haven't had a chance yet to disable async, as I want to make sure I'm around to monitor things when we make the change. I will have it in place by tomorrow, so I'll see how things go over the weekend.
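To be explicit about the change I'm planning (assuming "async" here means the `index.translog.durability` setting, which is my reading of the earlier discussion), switching back to the default per-request fsync behaviour is a dynamic settings update along these lines:

```sh
# Sketch only: move the index from async translog fsyncs back to the default
# "request" durability; index name is a placeholder.
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{
  "index.translog.durability": "request"
}'
```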
Thanks @lunatic. I want to double check: when you made the synced flush call, was indexing stopped? If not, it's expected to fail. Regarding the searches, was indexing stopped there as well, and did you run a refresh?
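For clarity, these are the two calls I'm asking about (index name is a placeholder):

```sh
# Synced flush: expected to fail for shards that still have ongoing indexing.
curl -XPOST 'localhost:9200/my-index/_flush/synced'

# Refresh: makes recent writes visible to search before comparing counts.
curl -XPOST 'localhost:9200/my-index/_refresh'
```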
I presume that the above is OK. Can you reproduce this reliably? Since your index is so small, we can enable trace logging, which will help immensely in finding the source of the problem. If so, please run the following command on the cluster and try to reproduce the issue. When you do, please send the logs to first name at elastic.co, and please also send the shard stats so I know which operations to look at.
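The exact command isn't reproduced above; as a rough sketch (the specific logger package here is an assumption, not necessarily the one that was requested), trace logging is turned on through the cluster settings API, and shard-level stats can be pulled alongside it:

```sh
# Sketch only: enable trace logging via a transient cluster setting.
# The logger name org.elasticsearch.index.translog is an assumed example.
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "logger.org.elasticsearch.index.translog": "trace"
  }
}'

# Shard-level stats for the affected index (index name is a placeholder).
curl -XGET 'localhost:9200/my-index/_stats?level=shards'
```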
@Evesy the logs @lunatic supplied give me enough of a thread to pull on, and I think I've found the source of the issue, so there's no need for you to spend hours on experiments. But you can help confirm it: can you check your logs for failures around mapping updates? Grep for `put-mapping` or `put mappings`.
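Something along these lines should do it (the log path is an assumption; adjust it to wherever your nodes write their logs):

```sh
# Example only: look for failures around mapping updates in the node logs.
grep -iE 'put-mapping|put mappings' /var/log/elasticsearch/*.log
```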
Looks like this problem is also occurring on 6.2.4. Usually it's not a problem, but when I have to bounce a node, the shards with this 'corruption' have to be completely rebuilt instead of just copying over the translog. I'm also ingesting about 1TB per hour. If I can help with any details, I'd be glad to assist.