Subject: [pyspark 2.3+] Dedupe records

A performant way would be to partition your dataset into reasonably small
chunks and use a Bloom filter to check whether an entity might be in your set
before you make a lookup.

Check the Bloom filter: if the entity might be in the set, rely on partition
pruning to read and backfill the relevant partition. If the entity isn't in
the set, just save it as new data.
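To make the flow concrete, here's a minimal stand-alone sketch in plain Python. The `BloomFilter` class and the `route_record` helper are hypothetical illustrations, not a Spark API; a real job would back this with Spark's partitioned storage:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, small false-positive rate."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes independent bit positions from one key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def route_record(bloom, key):
    """Decide whether a record needs a backfill/merge or is safe to append."""
    if bloom.might_contain(key):
        return "backfill"  # may already exist: read the partition and merge
    return "append"        # definitely new: just write it out

# Keys already present in the dataset go into the filter.
bloom = BloomFilter()
for existing_key in ["user-1", "user-2", "user-3"]:
    bloom.add(existing_key)

print(route_record(bloom, "user-2"))      # known key -> "backfill"
print(route_record(bloom, "user-99999"))  # unknown key -> almost certainly "append"
```

Because a Bloom filter never yields false negatives, the "append" path is always safe; the occasional false positive just costs one unnecessary partition read.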

Sooner or later you will probably want to compact the appended partitions
to reduce the number of small files.
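The compaction step is conceptually just "read many small files, rewrite as one". In Spark you would typically read the partition, `coalesce`/`repartition`, and overwrite it; the plain-file sketch below (hypothetical helper name, stdlib only) shows the same idea without a cluster:

```python
import os
import tempfile

def compact_partition(partition_dir, compacted_name="part-00000-compacted.txt"):
    """Merge the many small files in one partition directory into a single file.
    (In Spark: read the partition, coalesce/repartition, write with overwrite.)"""
    small_files = sorted(
        f for f in os.listdir(partition_dir)
        if f.startswith("part-") and f != compacted_name
    )
    out_path = os.path.join(partition_dir, compacted_name)
    with open(out_path, "w") as out:
        for name in small_files:
            path = os.path.join(partition_dir, name)
            with open(path) as src:
                out.write(src.read())
            os.remove(path)  # drop the small file once its rows are compacted
    return out_path

# Simulate a partition that accumulated many small append files.
part_dir = tempfile.mkdtemp(prefix="date=2019-01-01-")
for i in range(5):
    with open(os.path.join(part_dir, f"part-{i:05d}.txt"), "w") as f:
        f.write(f"row-{i}\n")

compact_partition(part_dir)
print(os.listdir(part_dir))  # only the single compacted file remains
```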

Delta Lake has update and compaction semantics built in, unless you want to
do it yourself.

Since 2.4.0, Spark is also able to prune buckets. But as far as I know
there's no way to backfill a single bucket. If there were, the combination of
partition and bucket pruning could dramatically limit the amount of data you
need to read/write from/to disk.
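The win from bucket pruning is easy to see with a toy model: rows are routed to a bucket by hashing the key, so a point lookup only has to scan one bucket instead of the whole partition. The hash below is a stand-in, not Spark's actual bucketing hash:

```python
NUM_BUCKETS = 8

def bucket_for(key: str) -> int:
    # Stand-in hash for illustration; Spark uses its own hash for bucketing.
    return sum(key.encode()) % NUM_BUCKETS

# Write side: each row lands in one bucket per partition.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for key in [f"user-{i}" for i in range(100)]:
    buckets[bucket_for(key)].append(key)

# Read side: to find one key, only its bucket is scanned --
# roughly 1/NUM_BUCKETS of the partition's data.
target = "user-42"
scanned = buckets[bucket_for(target)]
print(len(scanned), "rows scanned instead of 100")
```

Combined with partition pruning, that means a backfill touches one bucket of one partition rather than the whole table.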

RDD vs DataFrame: I'm not sure exactly how and when Tungsten can be used
with RDDs, if at all. Because of that, I always try to use DataFrames and
the built-in functions as long as possible, just to get the sweet off-heap
allocation and the "expressions to bytecode" thing, along with the Catalyst
optimizations. That will probably do more for your performance than anything
else. The memory overhead of JVM objects and GC runs can be brutal for your
performance and memory usage, depending on your dataset and use case.
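A rough stand-alone illustration of that object-overhead point (plain Python standing in for the JVM, so the exact numbers differ, but the shape of the problem is the same): one heap object per row costs far more memory than a packed columnar layout like Tungsten's off-heap format.

```python
import sys
from array import array

n = 10_000

# Row-of-objects representation (analogous to an RDD of per-row objects):
rows = [{"value": i} for i in range(n)]
per_object = sys.getsizeof(rows[0])  # each row pays full object overhead

# Packed columnar representation (analogous to Tungsten's off-heap layout):
column = array("q", range(n))  # 8 bytes per value plus one small header
per_value = column.itemsize

print(per_object, "bytes per row-object vs", per_value, "bytes per packed value")
```

The packed column also gives the garbage collector one buffer to track instead of one object per row, which is where the GC-pressure savings come from.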

