We have had a few incidents in which a service hit an unexpected
traffic/load spike, its containers started responding slowly or failing
health checks, and that caused massive instance rescheduling in Aurora.
This can become a vicious cycle: instances being rescheduled (restarting)
put more load on the remaining instances, so more and more of them get
knocked down. Can anyone share best practices or lessons learned for
preventing this kind of outage caused by dynamic rescheduling in a
production cluster?