Subject: 7.2.1 cluster dies within minutes after restart


Hello,

We recently upgraded our clusters from 7.1 to 7.2.1. One collection (2 shards, 2 replicas) in particular is in a bad state almost continuously. After a proper restart the cluster is all green, but within minutes the logs are flooded with many bad omens:

o.a.z.ClientCnxn Client session timed out, have not heard from server in 22130ms (although zkClientTimeout is 30000).
o.a.s.c.Overseer could not read the data
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper failed:org.apache.solr.common.cloud.ZooKeeperException: A ZK error has occurred
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to proxy request for url
etc etc etc
2018-01-26 16:43:31.419 WARN  (OverseerAutoScalingTriggerThread-171411573518537026-logs4.gr.nl.openindex.io:8983_solr-n_0000001853) [   ] o.a.s.c.a.OverseerTriggerThread OverseerTriggerThread woken up but we are closed, exiting.
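For reference, the 30000 ms timeout mentioned above is the zkClientTimeout from our solr.xml. A trimmed sketch of the relevant fragment (roughly the stock 7.x layout; the exact surrounding settings here are illustrative, not copied from our file):

```xml
<!-- solr.xml fragment (trimmed to ZK-related settings) -->
<solr>
  <solrcloud>
    <str name="zkHost">${zkHost:}</str>
    <!-- ZK session timeout; defaults to 30000 ms in the stock config -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
  </solrcloud>
</solr>
```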

Soon most nodes are gone; maybe one is still green or yellow (recovering from another dead node).

A point of interest is that this collection is always under maximum load, receiving hundreds of queries per node per second. We disabled querying of the cluster and restarted it again; this time it kept running fine, and it continued to run fine even as we slowly restarted the tons of queries that need to be fired.

We have just reverted the modifications above: the cluster now receives the full query load as soon as it is available, everything was restarted, and everything is suddenly fine again.

We really have no clue why everything is fine for days, then we suddenly enter some weird state (the logs loaded with o.a.z.ClientCnxn Client session timed out messages) and it takes several full restarts for things to settle down. Then all is fine again, until this afternoon, when for two hours the cluster kept dying almost instantly. And at this moment all is well again, it seems. The only steady companion when things go bad is the ZK-related timeouts.

Under normal circumstances we do not time out due to GC; the heap is just 2 GB, and query response times are ~10 ms even under maximum load. We would like to know why and how the cluster enters this 'bad state' for no apparent reason. Any ideas?
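To rule out GC as the cause of the session timeouts, we look at the longest stop-the-world pauses in the GC log. A minimal sketch of that check, assuming the JVM logs pause lines via -XX:+PrintGCApplicationStoppedTime (the sample input here is fabricated; in practice we feed it our real solr_gc.log):

```shell
# Extract stop-the-world pause durations (seconds) and sort, longest first.
extract_pauses() {
  grep -oE 'stopped: [0-9]+\.[0-9]+' | awk '{ print $2 }' | sort -rn
}

# Fabricated sample lines standing in for a real solr_gc.log:
extract_pauses <<'EOF'
2018-01-26T16:40:01: Total time for which application threads were stopped: 0.0123000 seconds
2018-01-26T16:43:29: Total time for which application threads were stopped: 0.0457000 seconds
EOF
```

Nothing we see there comes anywhere near the 22-second silence ZK reports.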

Many thanks!
Markus

Side note: this cluster has always been a pain, but 7.2.1 made something worse. Reverting to 7.1 is not possible because the index is now too new, even though there were no notes in CHANGES indicating an index incompatibility between these two minor versions.