We have 6 Elasticsearch servers; all of them are data nodes and 3 of them are marked as master-eligible. Currently we have 5 TB of data stored across these nodes. All 6 nodes are load balanced using HAProxy.
On our platform these hosts require an OS restart (for a reason I have no clarity on). With reference to the links below, we framed a list of instructions on how to perform a rolling restart.
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/rolling-upgrades.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/restart-upgrade.html
1. Disable shard allocation and perform a synced flush
2. Shut down a single node
3. Did the OS restart, removed old logs from /var/log/elasticsearch
4. Start the node
5. Re-enable shard allocation
It took more than 12 hours to recover, i.e. to see the state change from yellow to green. It didn't show any new data that was sent from various components during these 12+ hours.
We didn't touch the other nodes or Logstash; they were up and running all the time. We invoked all the API calls using the LB URL, not the local URL.
Did we do anything wrong, and if so, what can we change in our approach? This cluster is used for monitoring by 25+ projects for log analysis, so a huge outage is not possible.
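A scripted form of the five steps listed in the question, added here as an example; it is not part of the original post or of the Elastic guides it links. It assumes a hypothetical node name (es-data-1, set via ES_NODE), that ES_URL points at the HTTP port of one node that stays up during this restart (not the node being restarted and not the HAProxy address, so that the polling keeps working while the target node is down), and optional ES_AUTH credentials read from the environment rather than embedded. The calls it makes are the ones the 6.2 guide uses: cluster.routing.allocation.enable set to "none" and reset with null through _cluster/settings, POST _flush/synced, and _cat/nodes and _cluster/health for the checks. Beyond the bare steps it verifies that the synced flush reported no failed shards (retrying otherwise), waits for the node to leave the cluster before the reboot and to rejoin it before allocation is re-enabled, and blocks until the cluster is green before the next node is touched.

```bash
#!/usr/bin/env bash
# Rolling restart of one Elasticsearch 6.x node, split into the phases an
# operator runs around the OS reboot. Added example, not from the original post.
# Assumed (hypothetical) environment:
#   ES_URL   HTTP endpoint of a node that stays up (not the HAProxy VIP, not ES_NODE)
#   ES_NODE  name of the node being restarted, as listed by GET _cat/nodes?h=name
#   ES_AUTH  optional "user:password" for curl; leave unset if unauthenticated
set -euo pipefail

ES_URL="${ES_URL:-http://localhost:9200}"
ES_NODE="${ES_NODE:-es-data-1}"
POLL_SECONDS="${POLL_SECONDS:-10}"

# Single chokepoint for HTTP calls: es_request METHOD PATH [JSON_BODY]. Prints the
# body; no --fail, because _flush/synced answers 409 and _cluster/health 408
# with bodies we still want to inspect.
es_request() {
    local method="$1" path="$2" body="${3:-}"
    local args=(-sS -X "$method" -H 'Content-Type: application/json')
    if [ -n "${ES_AUTH:-}" ]; then args+=(-u "$ES_AUTH"); fi
    if [ -n "$body" ]; then args+=(-d "$body"); fi
    curl "${args[@]}" "${ES_URL}${path}"
}

# True when the cluster-wide "_shards" block of a response reports no failures.
# ES emits that block first, so the first "failed" count is the cluster-wide one.
no_failed_shards() {
    local failed
    failed="$(printf '%s' "$1" | grep -o '"failed":[0-9]*' | head -n1 | cut -d: -f2 || true)"
    [ "${failed:-1}" -eq 0 ]
}

# none = stop shard allocation; default = remove the override (null), as in the 6.2 guide.
set_allocation() {
    local json='null' resp
    if [ "$1" = none ]; then json='"none"'; fi
    resp="$(es_request PUT /_cluster/settings \
        "{\"persistent\":{\"cluster.routing.allocation.enable\":${json}}}")"
    printf '%s' "$resp" | grep -q '"acknowledged":true'
}

# Step 1: stop allocation, then synced-flush. Shards with in-flight indexing come
# back as failed, so retry a few times rather than shutting down on a partial flush.
prepare_for_shutdown() {
    set_allocation none
    local attempt resp
    for attempt in 1 2 3; do
        resp="$(es_request POST /_flush/synced)"
        if no_failed_shards "$resp"; then
            return 0
        fi
        echo "synced flush attempt $attempt reported failed shards, retrying" >&2
        sleep "$POLL_SECONDS"
    done
    echo "synced flush still failing; do not stop $ES_NODE yet" >&2
    return 1
}

# Exit loudly if the coordinating node itself is unreachable, so that "absent"
# never means "curl failed".
node_in_cluster() {
    local names
    names="$(es_request GET '/_cat/nodes?h=name')" || { echo "cannot reach $ES_URL" >&2; exit 1; }
    printf '%s\n' "$names" | awk -v n="$ES_NODE" '$1 == n { found = 1 } END { exit !found }'
}

# Poll until ES_NODE is "absent" (after you stop the service) or "present"
# (after the reboot) in the cluster's node list.
wait_until_node() {
    local want="$1" have
    while :; do
        if node_in_cluster; then have=present; else have=absent; fi
        if [ "$have" = "$want" ]; then return 0; fi
        echo "node $ES_NODE not yet $want, polling" >&2
        sleep "$POLL_SECONDS"
    done
}

# Step 5: only once the node has rejoined, re-enable allocation and block until
# green. wait_for_status makes ES hold the request up to the timeout and answer
# 408 (body still carries the current status) if green was not reached.
finish_restart() {
    wait_until_node present
    set_allocation default
    local resp
    while :; do
        resp="$(es_request GET '/_cluster/health?wait_for_status=green&timeout=60s')"
        if printf '%s' "$resp" | grep -q '"status":"green"'; then
            return 0
        fi
        echo "still recovering: $resp" >&2
    done
}

case "${1:-}" in
    prepare)   prepare_for_shutdown; echo "flushed; stop elasticsearch on $ES_NODE, then run: $0 wait-down" ;;
    wait-down) wait_until_node absent; echo "$ES_NODE has left the cluster; reboot it, then run: $0 finish" ;;
    finish)    finish_restart; echo "cluster is green; move on to the next node" ;;
    "")        ;;   # no phase given: only define the functions
    *)         echo "usage: $0 prepare|wait-down|finish" >&2; exit 2 ;;
esac
```

Run the three phases in order for one node (prepare, stop the service, wait-down, reboot, finish) and only start the next node once finish has returned; prepare refuses to continue while the synced flush keeps reporting failed shards, and finish will not re-enable allocation until the restarted node is back in _cat/nodes.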