We finally found out why the job was taking so long to start.
Network IO was higher than normal during job startup, so we checked
the size of the checkpoint topic on disk: it was ~21GB.
We then restarted the Kafka node that was the leader for the checkpoint
topic. The topic's size on disk went down to ~1.8GB and the job started
up fairly quickly.
It's probably due to a Kafka bug that caused the log cleaner to die
without us noticing: https://issues.apache.org/jira/browse/KAFKA-3894.
We have since been working on upgrading Kafka to avoid this bug.
Hope this helps if anyone else ever runs into it.
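
In case it is useful, here is a rough Python sketch of the on-disk size
check described above. It is not the exact command we ran. It assumes
the broker keeps each partition in a directory named <topic>-<partition>
under its log directory. The function name, the path and the topic name
in the usage note are made up for illustration.

```python
import os
import re


def topic_size_bytes(log_dir, topic):
    """Sum the size of all files in the partition directories of `topic`.

    Assumes partition directories are named "<topic>-<partition number>"
    and sit directly under `log_dir` (one of the broker's log directories).
    """
    # fullmatch so that "foo" does not also count a topic named "foo-bar"
    pattern = re.compile(re.escape(topic) + r"-\d+")
    total = 0
    for entry in os.scandir(log_dir):
        if entry.is_dir() and pattern.fullmatch(entry.name):
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
    return total
```

For example, topic_size_bytes("/var/kafka-logs", "my-checkpoint-topic")
run on the broker host returns the total in bytes for that log directory.
A value that keeps growing for a topic that should stay small is a hint
that something like the above may be going on.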

Xiaochuan Yu

On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <[EMAIL PROTECTED]> wrote: