Subject: [pyspark 2.3+] Dedupe records


Hi All,

I have around 100B records, to which I receive new, update & delete
records over time. Update/delete records are not that frequent. I would
like some advice on the points below:

1) Should I use RDD + reduceByKey or a DataFrame window operation for data
of this size? Which one would outperform the other? Which is more reliable
and lower maintenance?
2) Also, how would you suggest we do incremental deduplication? Currently
we do full processing once a week and no dedupe on weekdays to avoid the
heavy processing. However, I would like to explore an incremental dedupe
option and weigh the pros/cons.
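To make the incremental idea concrete, here is a minimal in-memory sketch
of the merge semantics I am picturing (the op codes and tuple layout are
hypothetical; at our scale this would of course be a Spark join against
the last deduped snapshot, not a dict, but the per-key logic is the same):

```python
def merge_increment(snapshot, changes):
    """Apply a batch of changes to the last deduped snapshot.

    snapshot: {key: (ts, value)} -- deduped state from the previous run
    changes:  iterable of (key, ts, op, value), op in {"upsert", "delete"}
    """
    state = dict(snapshot)
    # Process changes in event-time order so later versions win
    for key, ts, op, value in sorted(changes, key=lambda c: c[1]):
        if key in state and state[key][0] >= ts:
            continue  # stale change, already superseded
        if op == "delete":
            state.pop(key, None)
        else:
            state[key] = (ts, value)
    return state


base = {"a": (1, "A1"), "b": (1, "B1")}
batch = [
    ("a", 3, "upsert", "A3"),
    ("a", 2, "upsert", "A2"),
    ("b", 2, "delete", None),
    ("c", 1, "upsert", "C1"),
]
merged = merge_increment(base, batch)
```

The trade-off I want to weigh is exactly this: a small nightly merge like
the above versus the current heavy weekly full dedupe.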

Any input is highly appreciated!

--
Regards,

Rishi Shah