Thanks, Lukasz. Unfortunately, decompressing the files is not an option for me.
I am trying to speed up the Reshuffle step, since it waits for all the data.
Here are two approaches I have tried:
1. Add timestamps to the PCollection's elements after reading (since it is a
bounded source), then apply windowing before Reshuffle; however, it still waits.
2. Run the pipeline with the --streaming flag, but it leads to an error:
Workflow failed. Causes: Expected custom source to have non-zero number of
splits. Also, I found in https://beam.apache.org/documentation/sdks/python-streaming/#dataflowrunner-specific-features
*DataflowRunner does not currently support the following Cloud Dataflow
specific features with Python streaming execution.*
I doubt whether this approach can solve my issue.
Thanks so much!
*From: *Lukasz Cwik <[EMAIL PROTECTED]>
*Date: *Tue, May 14, 2019 at 11:16 AM
Do you need to perform any joins across the files (e.g.