Thanks, Lukasz. Unfortunately, decompressing the files is not an option for
us.
I am trying to speed up the Reshuffle step, since it waits for all of the
data. Here are two approaches I have tried:

1.  Add timestamps to the PCollection's elements after reading (since it is
a bounded source), then apply windowing before the Reshuffle (see the first
sketch below). However, it still waits for all of the data.
2.  Run the pipeline with the --streaming flag (see the second sketch
below), but it fails with an error: "Workflow failed. Causes: Expected
custom source to have non-zero number of splits." Also, I found the
following in
https://beam.apache.org/documentation/sdks/python-streaming/#dataflowrunner-specific-features:

*DataflowRunner does not currently support the following Cloud Dataflow
specific features with Python streaming execution.*

   - *Streaming autoscaling*

So I doubt that this approach can solve my issue.
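
For reference, here is a minimal sketch of the first attempt (the input
path, the processing-time timestamps, and the 60s window size are
placeholders, not our real values):

    import time

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:
        (p
         # Placeholder path; the real input is our set of compressed files.
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.gz')
         # Attach a timestamp to each element of the bounded PCollection.
         | 'AddTimestamps' >> beam.Map(
             lambda elem: window.TimestampedValue(elem, time.time()))
         # Window before the Reshuffle (placeholder window size).
         | 'Window' >> beam.WindowInto(window.FixedWindows(60))
         | 'Reshuffle' >> beam.Reshuffle())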
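For the second attempt, the only change was turning on streaming mode for
the same pipeline, along these lines (the option values are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import (PipelineOptions,
                                                      StandardOptions)

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',                # placeholder
        temp_location='gs://my-bucket/tmp',  # placeholder
    )
    # Equivalent to passing --streaming on the command line.
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        ...  # same read/timestamp/window/Reshuffle steps as above

That is the run that fails with the "Expected custom source to have
non-zero number of splits" error.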
Thanks so much!

Allie

*From: *Lukasz Cwik <[EMAIL PROTECTED]>
*Date: *Tue, May 14, 2019 at 11:16 AM
*To: *dev
*Cc: *user

Do you need to perform any joins across the files (e.g.