It is exciting to see this move forward; the ability to use Python opens
many new possibilities.
Regarding use of worker threads, this is a pattern that we are using
elsewhere (for example in the Kafka input operator). When the operator
performs blocking operations and consumes little memory and/or CPU, it is
more economical to first use threads to increase parallelism and
throughput (up to a limit), and only then add the more expensive containers
for horizontal scaling. In other words, use multiple threads to make good
use of container resources and then scale using the usual partitioning.
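
To make the thread pattern concrete, here is a minimal sketch in Python
(chosen only for brevity). It is not the Kafka input operator's code or any
platform API: ThreadedOperator, blocking_call, and run_window are
hypothetical names, and the sleep stands in for a blocking, low-CPU
operation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(record):
    # Hypothetical stand-in for a blocking operation that uses little CPU.
    time.sleep(0.01)
    return record * 2


class ThreadedOperator:
    """Hypothetical operator that hands each tuple to a worker pool."""

    def __init__(self, num_threads):
        self.pool = ThreadPoolExecutor(max_workers=num_threads)
        self.pending = []

    def process(self, record):
        # Submit and return immediately so the next tuple can be accepted.
        self.pending.append(self.pool.submit(blocking_call, record))

    def end_window(self):
        # Wait for all in-flight work before the window closes.
        # Results are collected in submission order here; completion
        # order across threads is not guaranteed.
        results = [future.result() for future in self.pending]
        self.pending = []
        return results

    def teardown(self):
        self.pool.shutdown()


def run_window(records, num_threads):
    # Drive one window through the operator and time it.
    op = ThreadedOperator(num_threads)
    start = time.monotonic()
    for record in records:
        op.process(record)
    out = op.end_window()
    op.teardown()
    return out, time.monotonic() - start
```

Raising num_threads overlaps the blocking calls inside one container;
partitioning would only come in once a single container is saturated.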
It is also correct that generally there is no ordering guarantee within a
streaming window, and that would be the case when multiple input ports are
present as well. (The platform cannot guarantee such ordering; this would
need to be done by the operator.)
Idempotency can be expensive (latency and/or space complexity), and not all
applications need it (like certain use cases that process record by record
and don't accumulate state). An example might be Python logic that is used
for scoring against a model that was built offline. Idempotency would
actually be rather difficult to implement, since the operator would need to
remember which tuples were emitted in a given interval and on replay block
until they are available (and also hold others that may be processed sooner
than in the original order). It may be easier to record emitted tuples to a
write-ahead log (WAL) and replay them from there instead of reprocessing.
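
A rough sketch of that log-and-replay idea, in Python for brevity. This is
an assumption about how it could look, not an existing platform facility:
EmitLog and process_window are hypothetical names, the log is one JSON file
per window, and output is held until the end of the window so that it can
be logged before it is emitted (which is part of the latency cost mentioned
above).

```python
import json
import os


class EmitLog:
    """Hypothetical log of emitted tuples, one JSON file per window."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, window_id):
        return os.path.join(self.directory, "window-%d.json" % window_id)

    def has(self, window_id):
        return os.path.exists(self._path(window_id))

    def write(self, window_id, tuples):
        # Write to a temporary file and rename it into place, so a crash
        # cannot leave a partially written entry under the final name.
        tmp = self._path(window_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(tuples, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self._path(window_id))

    def read(self, window_id):
        with open(self._path(window_id)) as f:
            return json.load(f)


def process_window(log, window_id, inputs, score):
    # On replay, return what was emitted the first time and skip scoring.
    if log.has(window_id):
        return log.read(window_id)
    out = [score(x) for x in inputs]
    log.write(window_id, out)  # log first, emit afterwards
    return out
```

On a replayed window the scoring function is never called, so the output is
the same even if processing order or timing would differ the second time.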
Regarding not emitting stragglers until the next input arrives, can this
not be accomplished using IdleTimeHandler?
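
Roughly what I have in mind, sketched in Python for brevity. Only the idea
of an idle-time callback comes from IdleTimeHandler; StragglerOperator, the
completed queue, and the emit callback are hypothetical names, and the
worker threads that would fill the queue are left out.

```python
import queue


class StragglerOperator:
    """Hypothetical operator whose workers put finished results on a queue."""

    def __init__(self, emit):
        self.emit = emit                # callback that sends a tuple downstream
        self.completed = queue.Queue()  # filled by worker threads (not shown)

    def _drain(self):
        # Emit everything the workers have finished so far.
        count = 0
        while True:
            try:
                item = self.completed.get_nowait()
            except queue.Empty:
                return count
            self.emit(item)
            count += 1

    def process(self, record):
        # A real operator would hand the record to a worker here; this
        # sketch only emits whatever has completed in the meantime.
        self._drain()

    def handle_idle_time(self):
        # Called when there is no input to process, so finished results
        # do not have to wait for the next input tuple.
        return self._drain()
```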
What is preventing the use of virtual environments?
On Tue, Dec 19, 2017 at 8:19 AM, Pramod Immaneni <[EMAIL PROTECTED]>