Subject: Job recovers from an old, dangling checkpoint in a Job Cluster based Flink pipeline


OK, this happened again and it is bizarre (and is definitely not what I
think should happen). The job failed and I see the logs below: in essence
it keeps the last 5 externalized checkpoints but deletes the ZooKeeper
checkpoints directory.
2019-06-29 00:33:13,736 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5654 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5654' not discarded.
2019-06-29 00:33:13,786 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5655 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5655' not discarded.
2019-06-29 00:33:13,836 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5656 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5656' not discarded.
2019-06-29 00:33:13,886 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5657 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5657' not discarded.
2019-06-29 00:33:13,936 INFO  org.apache.flink.runtime.checkpoint.CompletedCheckpoint       - Checkpoint with ID 5658 at 'xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/00000000000000000000000000000005/chk-5658' not discarded.
2019-06-29 00:33:13,936 INFO  org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing /kafka-to-hdfs-v2/kafka-to-hdfs-v2/k8s/checkpoints/00000000000000000000000000000005 from ZooKeeper
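
For context, the job enables retained (externalized) checkpoints roughly as in
the sketch below. The class name and checkpoint interval are illustrative, and
the retained count of 5 comes from state.checkpoints.num-retained in
flink-conf.yaml rather than from code.

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative interval; the real job uses a different value.
        env.enableCheckpointing(60_000L);

        // Keep checkpoint data on cancellation/failure so the chk-NNNN
        // directories under state.checkpoints.dir are retained.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // In flink-conf.yaml (not in code):
        //   state.checkpoints.dir: hdfs://.../analytics_eng/kafka-to-hdfs-states
        //   state.checkpoints.num-retained: 5
    }
}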

The job then restarts, and this is the bizarre part. It does not find the ZK
checkpoint directory, but instead of going to state.checkpoints.dir to get its
last checkpoint, it restarts from the savepoint that we started this job with
about 15 days ago (resetting the checkpoint ID).


2019-06-29 00:33:20,045 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
2019-06-29 00:33:20,051 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 0 checkpoints from storage.
2019-06-29 00:33:20,051 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 0 checkpoints in ZooKeeper.
2019-06-29 00:33:20,053 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Starting job 00000000000000000000000000000005 from savepoint hdfs://nn-crunchy:8020/flink-savepoints_k8s/prod/kafka-to-hdfs/savepoint-000000-128f419cdc6f ()
2019-06-29 00:33:20,538 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Reset the checkpoint ID of job 00000000000000000000000000000005 to 4203.
2019-06-29 00:33:20,538 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
2019-06-29 00:33:20,549 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to retrieve checkpoint 4202.
2019-06-29 00:33:20,548 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 1 checkpoints in ZooKeeper.
2019-06-29 00:33:20,549 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 1 checkpoints from storage.
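
What I expected (and what we end up doing by hand) is a restore from the
newest retained checkpoint rather than from the original savepoint. A minimal
sketch of those restore settings, assuming the chk-5658 path from the first
log excerpt (hostname redacted, so the path is only illustrative) and Flink's
SavepointRestoreSettings API:

import org.apache.flink.runtime.jobgraph.SavepointRestoreSettings;

public class ManualRestoreSketch {
    public static void main(String[] args) {
        // Newest retained checkpoint from the log above; host is redacted
        // in this thread, so the path here is illustrative.
        String latestRetainedCheckpoint =
                "hdfs://xxxxxxxxxx:8020/analytics_eng/kafka-to-hdfs-states/"
                        + "00000000000000000000000000000005/chk-5658";

        // allowNonRestoredState = false: the job topology has not changed.
        SavepointRestoreSettings restore =
                SavepointRestoreSettings.forPath(latestRetainedCheckpoint, false);

        System.out.println(restore);
    }
}

In practice we pass that checkpoint path to the job cluster entrypoint when
redeploying; the sketch only shows the restore settings themselves.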

This just does not make sense....

On Wed, Jun 5, 2019 at 9:29 AM Vishal Santoshi <[EMAIL PROTECTED]>
wrote: