Thanks for the report, Peter. Generally speaking, autoscaling in a Dataflow
streaming application is based on two factors: backlog (e.g. in PubSub in
this case) and the CPU used for current throughput. Multiple stages make it a
bit more complex, but overall it is essentially trying to scale CPU resources
so that the extrapolated throughput handles the current backlog and backlog growth.
As you noted, cases where a small number of messages trigger a large amount of
work in the pipeline make backlog estimation tricky. That can cause delays
in upscaling (depending on how the fan-out happens in the application), but
I would not expect the pipeline to downscale too early when it still has a
lot of work pending. One possibility is that processing involves blocking
work, which might keep CPU utilization lower (say 40-50%).
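
To make the idea above concrete, here is a rough sketch of a backlog- and CPU-based worker estimate. It is not Dataflow's actual algorithm: the function name, its parameters, the 60-second drain target and the worker cap are all assumptions made up for illustration.

```python
import math


def estimate_workers(current_workers, throughput_per_sec, cpu_utilization,
                     backlog_msgs, backlog_growth_per_sec,
                     target_drain_secs=60.0, max_workers=100):
    """Hypothetical worker estimate; NOT Dataflow's real autoscaler."""
    if current_workers < 1 or not (0.0 < cpu_utilization <= 1.0):
        raise ValueError("need >= 1 worker and CPU utilization in (0, 1]")
    if throughput_per_sec <= 0:
        # No throughput signal to extrapolate from; keep the current size.
        return current_workers

    # Extrapolate what one worker could process if its CPU were fully busy.
    per_worker_capacity = throughput_per_sec / (current_workers * cpu_utilization)

    # Arrival rate = what we process now + how fast the backlog is growing.
    input_rate = throughput_per_sec + backlog_growth_per_sec

    # Keep up with arrivals and drain the existing backlog within the target.
    required_rate = input_rate + backlog_msgs / target_drain_secs

    needed = math.ceil(required_rate / per_worker_capacity)
    return max(1, min(max_workers, needed))


# 4 workers at 80% CPU, 1000 msg/s, no backlog: stay at 4.
print(estimate_workers(4, 1000, 0.8, 0, 0))
# Same job with a 60000-message backlog growing by 200 msg/s: scale up.
print(estimate_workers(4, 1000, 0.8, 60000, 200))
# Same throughput but CPU at 40% (e.g. blocking work): the estimate drops.
print(estimate_workers(4, 1000, 0.4, 0, 0))
```

The last call shows why blocking work matters for an estimator of this kind: when CPU looks mostly idle, extrapolating from CPU suggests the existing workers have spare capacity, so the estimate can come out lower than the number of workers actually needed.
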
I am certainly interested in a closer look at your specific job. Would you
mind asking about it on Stack Overflow (https://stackoverflow.com/questions/tagged/google-cloud-dataflow)?
It is also open to the public. Please provide the job_id so we can take a look.
AFAIK, Dataflow is the only Beam runner that currently supports
autoscaling based on changes in load. Others might need the user to trigger
rescaling (e.g. Flink).
On Fri, Nov 17, 2017 at 8:02 AM, Peter Mueller <[EMAIL PROTECTED]> wrote: