I'm wondering how Spark assigns the "index" of a task.
I'm asking because we have a job that consistently fails at
task index = 421.
When we increase the number of partitions, it then fails at index = 4421.
Increase it a little more, and it's 24421.
Our job is as simple as "(1) read JSON -> (2) group by session identifier ->
(3) write Parquet files", and it always fails somewhere in step (3) with a
CommitDeniedException. We've identified that some of the trouble is basically
due to uneven data distribution right after step (2), and we are now trying to
deepen our understanding of how Spark behaves here.
We're using Spark 1.5.2 with Scala 2.11, on top of Hadoop 2.6.0.
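For what it's worth, within a stage the task index is simply the index of the partition that task computes, and a `groupBy` uses `HashPartitioner` by default, which places a key via `key.hashCode` reduced modulo the partition count (Spark's `Utils.nonNegativeMod`). A minimal sketch of that routing logic (the `hotKey` value is purely hypothetical, for illustration) shows why one skewed session id always lands in a single, deterministic partition whose index shifts when the partition count changes:

```scala
// Sketch of Spark's HashPartitioner placement rule (assumption: default
// hash partitioning, as used by groupBy on a pair RDD).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  // Scala's % can return a negative remainder; shift it into [0, mod).
  rawMod + (if (rawMod < 0) mod else 0)
}

// Hypothetical hot session identifier carrying most of the data.
val hotKey = "hot-session"

// The same key maps to one fixed partition index per partition count,
// so the failing task index moves when numPartitions changes.
println(s"partition with  500 partitions: ${nonNegativeMod(hotKey.hashCode, 500)}")
println(s"partition with 5000 partitions: ${nonNegativeMod(hotKey.hashCode, 5000)}")
```

If that matches what you're seeing, the failing index is just "where the hot key hashes to", and the real issue is the skewed group, not anything special about that task slot.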
Head of Backend/Infrastructure
50, avenue Montaigne - 75008 Paris