Spark Streaming's Kafka 0.10 integration provides an option to commit offsets to Kafka using the commitAsync() API. However, the call only records the offset commit request; the actual commit is performed in compute() after the RDD for the next batch is created. Why is this so? Why not commit right when the API is called? The commit process itself is already asynchronous, with an option to provide a callback handler.
This adds a window where the application has requested a commit but it is not yet recorded in Kafka's internal offsets topic. Any failure during that window will cause the last batch to be recomputed.
My app sinks to an external system that can't be made idempotent, so the operations are assumed to be at-least-once. This seems to be one place where duplicates can be reduced.
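To make the window concrete, here is a minimal, self-contained sketch of the deferred-commit behavior described above. The class and method names (DeferredCommitter, computeNextBatch) are hypothetical stand-ins, not Spark's actual classes; the point is only that commitAsync() enqueues the request and nothing reaches Kafka until the next batch's compute() drains the queue.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical model of the deferred commit: commitAsync() only records the
// request, and the actual commit happens when the next batch is computed.
class DeferredCommitter {
    // Commit requests (partition -> untilOffset), appended by commitAsync().
    private final Queue<Map<Integer, Long>> commitQueue = new ConcurrentLinkedQueue<>();
    // Stand-in for offsets recorded in Kafka's internal offsets topic.
    final Map<Integer, Long> committed = new ConcurrentHashMap<>();

    // Records the request; nothing is committed to "Kafka" yet.
    void commitAsync(Map<Integer, Long> offsets) {
        commitQueue.add(offsets);
    }

    // Invoked when the next batch's RDD is created: drain and commit.
    void computeNextBatch() {
        Map<Integer, Long> req;
        while ((req = commitQueue.poll()) != null) {
            committed.putAll(req);
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        DeferredCommitter c = new DeferredCommitter();
        c.commitAsync(Map.of(0, 100L));  // batch finished, sink already written
        // A crash here loses the commit: the batch is replayed on restart.
        System.out.println(c.committed.isEmpty());  // true: nothing committed yet
        c.computeNextBatch();  // next batch's compute() performs the commit
        System.out.println(c.committed.get(0));     // 100
    }
}
```

The gap between the commitAsync() call and computeNextBatch() is exactly the duplicate-producing window for a non-idempotent sink.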
People may be calling commit from listeners, or who knows where. The point is that it's not thread safe. If it's really important to you, it should be pretty straightforward to hack on it and allow it at your own risk. There is a check for concurrent access in the consumer, so worst case you should get an exception.
On Sat, Oct 8, 2016 at 9:18 PM, Srikanth <[EMAIL PROTECTED]> wrote:
I'll probably add this behavior. It's a good balance between not having to rely on another external system just for offset management and reducing duplicates. I was more worried about the underlying framework using the consumer in parallel. Will watch out for ConcurrentModificationException.
BTW, the commitQueue used to store offset commit requests is only ever appended to; elements are never removed. Looks like a bug?
On Sun, Oct 9, 2016 at 1:03 AM, Cody Koeninger <[EMAIL PROTECTED]> wrote: