Subject: Using existing distribution for join when subset of keys


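You can use bucketBy to avoid the shuffle in your scenario: if both tables
are written bucketed on the keys they are already distributed by, a join
whose keys are a superset of the bucket keys can reuse that layout. This
test suite has some examples: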
https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343
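
For reference, here is a minimal sketch of what I mean. It is not taken
from the test suite. The table names, column names, and the bucket count
of 8 are made up for illustration, and it assumes a local SparkSession
with a writable warehouse directory. Whether the exchange actually
disappears depends on your Spark version and settings, so please check
the explain() output rather than take my word for it.

```scala
import org.apache.spark.sql.SparkSession

object BucketedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("bucketed-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Disable broadcast joins so these tiny tables are planned as a
    // sort-merge join, which is where bucketing matters.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical data: "i" is the existing distribution key, "j" is
    // the extra join key.
    val left = Seq((1, "a", 10), (2, "b", 20)).toDF("i", "j", "x")
    val right = Seq((1, "a", 100), (2, "b", 200)).toDF("i", "j", "y")

    // Bucket both sides on "i" only, with the same number of buckets.
    // bucketBy only works together with saveAsTable.
    left.write.bucketBy(8, "i").sortBy("i")
      .mode("overwrite").saveAsTable("left_bucketed")
    right.write.bucketBy(8, "i").sortBy("i")
      .mode("overwrite").saveAsTable("right_bucketed")

    // Join on (i, j), a superset of the bucket key.
    val joined = spark.table("left_bucketed")
      .join(spark.table("right_bucketed"), Seq("i", "j"))

    // Look for Exchange nodes under the join in the physical plan.
    joined.explain()

    spark.stop()
  }
}
```

If the plan still shows an Exchange on either side, check that both
tables use the same bucket count and that bucketing is enabled
(spark.sql.sources.bucketing.enabled).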

Thanks,
Terry

On Sun, May 31, 2020 at 7:43 AM Patrick Woody <[EMAIL PROTECTED]>
wrote: