Meghdoot bhattacharya 2017-07-17, 21:57
Based on the thread on the Mesos dev list, it looks like Mesos doesn't
persist task information, so it doesn't have the task IDs to send when it
detects that the agent is lost during failover. So unless this is changed on
the Mesos side, we need to act on the slaveLost message and mark all those
tasks as LOST in Aurora.
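
To make the slaveLost idea concrete, here is a rough sketch of what "mark all
those tasks as LOST" could look like. This is not Aurora code: the Task class,
the ACTIVE set, and on_agent_lost are hypothetical names over an in-memory
dict, used only to illustrate the bookkeeping.

```python
from dataclasses import dataclass


# Hypothetical in-memory model; none of these names come from Aurora or Mesos.
@dataclass
class Task:
    task_id: str
    agent_id: str
    status: str  # e.g. "RUNNING", "FINISHED", "LOST"


# Assumed set of statuses that count as "still on the agent" for this sketch.
ACTIVE = {"ASSIGNED", "STARTING", "RUNNING"}


def on_agent_lost(tasks, agent_id):
    """Mark every active task on the lost agent as LOST and return their IDs."""
    lost = []
    for task in tasks.values():
        if task.agent_id == agent_id and task.status in ACTIVE:
            task.status = "LOST"
            lost.append(task.task_id)
    return sorted(lost)


tasks = {
    "t1": Task("t1", "agent-a", "RUNNING"),
    "t2": Task("t2", "agent-a", "FINISHED"),
    "t3": Task("t3", "agent-b", "RUNNING"),
}
lost_ids = on_agent_lost(tasks, "agent-a")
```

Only the active task on the lost agent changes state; terminal tasks and tasks
on other agents are left alone.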

The alternative is to rely on reconciliation. If you want to reconcile more
often, keep the following in mind:

1) Implicit reconciliation sends one message to Mesos, and Mesos replies
immediately with N status updates, where N = number of running tasks. This
process is usually quick (on the order of seconds) because the status updates
are mostly NOOPs. When you have a large number of running tasks (say 100k+),
you may see some GC pressure due to the flood of status updates. If this
operation overlaps with another particularly expensive operation (like a
snapshot), it can cause a huge stop-the-world GC pause. But it does not
otherwise interfere with any operation.

2) Explicit reconciliation is done in batches: Aurora batches up all
running tasks and sends one batch at a time, staggered by some delay. The
benefit here is less GC pressure, but the drawback is that if you have a lot
of running tasks (again, 100k+), it will take over 10 minutes to complete.
So you have to make sure your reconciliation interval is aligned with this
(you can always increase the batch size to make this happen
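
As a back-of-the-envelope check on the explicit case, the duration of one full
pass is roughly (number of batches) x (delay between batches). The sketch
below works that out; the function names and the numbers (1000 tasks per
batch, 10 seconds between batches, a 600 second interval) are illustrative
assumptions, not Aurora's actual settings.

```python
import math


def explicit_reconciliation_seconds(running_tasks, batch_size, batch_delay_secs):
    """Rough duration of one full explicit pass: one delay per batch."""
    batches = math.ceil(running_tasks / batch_size)
    return batches * batch_delay_secs


def min_batch_size(running_tasks, interval_secs, batch_delay_secs):
    """Smallest batch size for which a full pass fits inside the interval."""
    max_batches = interval_secs // batch_delay_secs
    if max_batches < 1:
        raise ValueError("interval is shorter than a single batch delay")
    return math.ceil(running_tasks / max_batches)


# Illustrative values only (assumptions, not defaults).
duration = explicit_reconciliation_seconds(100_000, 1000, 10)
needed = min_batch_size(100_000, 600, 10)
```

With these made-up numbers a pass over 100k tasks takes 1000 seconds, so it
would not fit in a 600 second interval unless the batch size is raised to at
least 1667.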


On Sun, Jul 16, 2017 at 11:10 AM, Meghdoot bhattacharya <[EMAIL PROTECTED]> wrote: