Hey Tengfei,

The Aurora health checks cannot differentiate a service instance that has deadlocked from one that is merely extremely slow. The decision to restart is then made by the executor, without central coordination by the scheduler. Your best course of action is therefore to prevent the overload in the first place, for example via load shedding and graceful degradation. You can find further details in the Google SRE Book chapter on addressing cascading failures [1].
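To make the load-shedding idea concrete, here is a minimal sketch (not Aurora-specific; the cap value and class name are illustrative): the instance rejects new work with a 503 once its in-flight request count hits a cap, so it degrades gracefully instead of queueing until it fails its health checks.

```python
import threading

class LoadShedder:
    """Caps concurrent in-flight requests; excess requests are shed."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False  # shed: caller should fail fast with a 503
            self.in_flight += 1
            return True

    def release(self):
        with self.lock:
            self.in_flight -= 1

def handle(shedder, request):
    if not shedder.try_acquire():
        return 503  # overloaded: reject immediately rather than pile up
    try:
        return 200  # ... do the real work here ...
    finally:
        shedder.release()
```

The key property is that a shed request costs almost nothing, so the instance stays responsive to health checks even under a traffic spike.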

Specifically, you will want tighter health checking in your load balancers, so that instances drop out of rotation before they hit their capacity limit. In addition, I have had good experience protecting instances with a rate-limiting HAProxy/Nginx that runs as a side-car within the Aurora task.
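As a rough illustration of the side-car approach, an HAProxy in front of the local service can cap concurrent requests per instance. This is only a sketch; the ports, the maxconn value, and the timeouts are placeholders you would tune for your service:

```
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    # how long excess requests may wait before being rejected
    timeout queue   1s

frontend local_in
    # side-car listens where the load balancer sends traffic
    bind 127.0.0.1:8080
    default_backend app

backend app
    # cap concurrent requests to the co-located service instance;
    # requests beyond maxconn queue briefly, then get rejected,
    # instead of driving the instance past its capacity limit
    server local 127.0.0.1:8081 maxconn 100
```

The effect is similar to load shedding inside the application, but it requires no code changes to the service itself.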

I hope this gets you started.

Best regards,

[1] https://landing.google.com/sre/book/chapters/addressing-cascading-failures.html
On 18.06.18, 21:45, "Tengfei Mu" <[EMAIL PROTECTED]> wrote:

    We have had a few incidents where a service under an unexpected
    traffic/load spike starts to respond slowly or fail health checks,
    which caused massive instance rescheduling in Aurora. This can become
    a vicious cycle: instances being rescheduled (restarting) put more
    load on the remaining instances, and more and more instances get
    hammered down. Can anyone share best practices/lessons for preventing
    such outages caused by dynamic rescheduling in a production cluster?