You should probably also pass "--experiment=use_fastavro" (along with
Robert's suggestion), which makes Dataflow use the fastavro library for
intermediate files (those materialized at a fusion break).
Also, have you considered storing your input directly as Avro files with
block compression? Avro files should perform much better, since each block
is compressed independently and we can split at block boundaries.
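
To illustrate why block boundaries help, here is a rough sketch in plain
Python. It is NOT the Avro file format and not what Beam does internally;
the container layout, the SYNC marker value and the helper names
(write_blocks, read_block_split) are all made up for the example. The
idea it borrows from Avro is only this: blocks are compressed
independently and separated by a sync marker, so a reader can jump to any
byte offset, scan forward to the next marker, and start decoding there.

```python
import struct
import zlib

# Hypothetical 16-byte marker for this toy format. Like a real sync marker,
# it is assumed never to occur inside a compressed block body.
SYNC = b"\x00SYNC-MARKER-16\xff"


def write_blocks(records, records_per_block):
    """Pack records into independently compressed blocks.

    Layout per block: SYNC marker, 4-byte big-endian body length, zlib body.
    """
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        body = zlib.compress(b"".join(records[i:i + records_per_block]))
        out += SYNC + struct.pack(">I", len(body)) + body
    return bytes(out)


def read_block_split(data, start, end):
    """Decode every block whose sync marker begins in [start, end).

    Nothing before the first marker at or after `start` is decompressed.
    """
    out = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        header_end = pos + len(SYNC) + 4
        (size,) = struct.unpack(">I", data[pos + len(SYNC):header_end])
        out.append(zlib.decompress(data[header_end:header_end + size]))
        pos = data.find(SYNC, header_end + size)
    return b"".join(out)


records = [b"record-%06d\n" % i for i in range(10000)]
container = write_blocks(records, records_per_block=1000)

# Four byte ranges of equal size; they need not line up with block starts.
bounds = [len(container) * i // 4 for i in range(5)]
parts = [read_block_split(container, bounds[i], bounds[i + 1]) for i in range(4)]
```

Each block is decoded by exactly one split (the one in which its marker
starts), so the four workers together do no redundant decompression.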
Regarding splittablegzip, we have usually avoided supporting splitting by
reading and discarding data up to the split point, since this does not save
read time (and later split points have to read and discard more data). It
would be interesting, though, if users find this useful in real-world
scenarios.
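
For concreteness, the read-and-discard approach looks roughly like the
sketch below. This uses only the Python standard library and is not the
splittablegzip code itself; read_gzip_split and the sample data are
hypothetical. A gzip stream has no index, so a reader assigned the
uncompressed range [start, end) has to decompress and throw away
everything before start.

```python
import gzip
import io


def read_gzip_split(gz_bytes, start, end, chunk_size=64 * 1024):
    """Return (data, discarded) for uncompressed offsets [start, end).

    `discarded` counts the bytes that were decompressed only to be dropped.
    """
    discarded = 0
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as f:
        # Decompress and drop everything that precedes the split point.
        while discarded < start:
            chunk = f.read(min(chunk_size, start - discarded))
            if not chunk:
                break  # the stream ended before the split point
            discarded += len(chunk)
        data = f.read(end - start)
    return data, discarded


payload = b"".join(b"record-%06d\n" % i for i in range(10000))
compressed = gzip.compress(payload)

# Four equal splits over the uncompressed data.
split_size = len(payload) // 4
results = [
    read_gzip_split(compressed, i * split_size, (i + 1) * split_size)
    for i in range(4)
]
wasted = [discarded for _, discarded in results]
total_decompressed = sum(wasted) + sum(len(data) for data, _ in results)
```

With four splits, the last reader discards three quarters of the data
before producing anything, and the readers together decompress 2.5 times
the size of the payload. Any gain comes from running the downstream work
in parallel, not from cheaper reads.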
On Wed, May 15, 2019 at 8:49 AM Lukasz Cwik <[EMAIL PROTECTED]> wrote: