This was the issue, guys. We cleared the fielddata cache by running `curl -s -XPOST "http://$(hostname -i):9200/_all/_cache/clear"` on one of the master nodes, and the query then returned consistent results until the fielddata cache filled up again and started tripping the fielddata circuit breaker.
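For anyone who wants to watch the cache fill up and the breaker trip, here is a rough sketch of the calls that could be wrapped as helpers. The function names and the `ES_URL` override are made up for illustration, and it assumes a version that exposes `_cat/fielddata`, the `breaker` section of node stats, and the `fielddata` flag on the clear-cache API, so check it against your own cluster first.

```shell
# Base URL of the node to query. ES_URL is a hypothetical override;
# the default mirrors the address used in the command above.
es_base() {
  printf '%s' "${ES_URL:-http://$(hostname -i):9200}"
}

# Fielddata memory usage per node and per field.
fielddata_usage() {
  curl -s "$(es_base)/_cat/fielddata?v"
}

# Circuit breaker stats per node (limit, estimated size, tripped count).
breaker_stats() {
  curl -s "$(es_base)/_nodes/stats/breaker?pretty"
}

# Clear only the fielddata cache instead of every cache on every index.
clear_fielddata() {
  curl -s -XPOST "$(es_base)/_all/_cache/clear?fielddata=true"
}
```

Running `fielddata_usage` and `breaker_stats` before and after `clear_fielddata` should show whether the estimated size is creeping back toward the breaker limit.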

The only question now is why the shards that failed when the breaker tripped did not produce any exception in our logs.

Regardless, we're going to go back over our mappings and see whether we can refactor some of them to use `doc_values`.
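As a starting point for that refactor, this is a minimal sketch of what a `doc_values` mapping might look like. It assumes a 1.x-style mapping where `doc_values` is opt-in on `not_analyzed` string fields; the index name `my_index_v2`, the type `my_type`, and the field `status` are placeholders, not names from our cluster.

```shell
# Mapping body with doc_values enabled on one field.
# Assumption: 1.x-style syntax; index, type, and field names are hypothetical.
mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "my_type": {
      "properties": {
        "status": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
EOF
}

# Create a new index with that mapping. ES_URL is a hypothetical override;
# the default mirrors the address used earlier in this post.
create_index() {
  curl -s -XPUT "${ES_URL:-http://$(hostname -i):9200}/my_index_v2" -d "$(mapping_body)"
}
```

Existing documents would most likely need to be reindexed into the new index before the field stops loading into the fielddata cache, so this is something to try on a test index first.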