A few more questions:
- How is this better than using a cached remote file system, e.g. mounting
HDFS or S3 (yes, it's possible) and letting the OS and drivers handle
everything? Maybe it's better, but the KIP doesn't address how or why, and
I'd think this would be a trivial benchmark. If, for some reason, mounting
and writing directly to a remote store is approximately as performant, it
would be hard to argue for this KIP. I wouldn't be surprised if this were the case.
- Why wait until local segments expire before offloading them to cold
storage? Why not stream to HDFS/S3 on an ongoing basis? I'd think this
would reduce bursty behavior from periodic uploads.
- Can we write to multiple remote stores at the same time? Can we have some
topics go to S3 and some to HDFS? Since we're storing "RDIs" that point to
remote locations, can we generalize this to full URIs that may be in any
supported remote store? In particular, what happens when you want to switch
from HDFS to S3 -- can we add a new plugin and keep going? Can we fetch
s3:/// URIs from S3 and hdfs:/// URIs from HDFS?
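To make the multi-store idea concrete, here's a rough sketch of what scheme-based dispatch could look like if RDIs were generalized to full URIs. Everything here is hypothetical (the `RemoteSegmentReader` interface and `RemoteReaderRegistry` are my invention, not anything from the KIP); the point is just that adding a new store would mean registering one more plugin keyed by URI scheme:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical plugin interface: one implementation per remote store.
interface RemoteSegmentReader {
    byte[] fetch(URI segmentUri);
}

// Hypothetical registry: picks the plugin by the URI's scheme, so
// s3:// and hdfs:// segments can coexist in the same cluster.
class RemoteReaderRegistry {
    private final Map<String, RemoteSegmentReader> byScheme = new HashMap<>();

    void register(String scheme, RemoteSegmentReader reader) {
        byScheme.put(scheme, reader);
    }

    byte[] fetch(URI segmentUri) {
        RemoteSegmentReader reader = byScheme.get(segmentUri.getScheme());
        if (reader == null) {
            throw new IllegalArgumentException(
                "No plugin registered for scheme: " + segmentUri.getScheme());
        }
        return reader.fetch(segmentUri);
    }
}

public class Demo {
    public static void main(String[] args) {
        RemoteReaderRegistry registry = new RemoteReaderRegistry();
        // Stub lambdas standing in for real S3/HDFS plugins.
        registry.register("s3", uri -> ("from s3: " + uri).getBytes());
        registry.register("hdfs", uri -> ("from hdfs: " + uri).getBytes());

        System.out.println(new String(
            registry.fetch(URI.create("s3://bucket/topic-0/00001.log"))));
        System.out.println(new String(
            registry.fetch(URI.create("hdfs://nn/kafka/topic-0/00001.log"))));
    }
}
```

With something shaped like this, switching from HDFS to S3 would be a matter of registering the new plugin and writing new segments with s3:// URIs, while old hdfs:// URIs keep resolving.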
- Instead of having brokers do all this, what if we just expose an API that
lets external tooling register a URI for a given segment? If I've copied a
segment file to S3, say with a daily cron job, why not just tell Kafka
where to find it? Assuming I've got a plugin to _read_ from S3, that's all
Kafka would need to know.
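For the register-a-URI idea, the broker-side bookkeeping could be as small as a map from (topic-partition, segment base offset) to a URI. A minimal sketch, with all names made up for illustration (this is not an API the KIP proposes):

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical index the broker would consult on a fetch for an
// expired local segment. External tooling (e.g. a daily cron job that
// copies segments to S3) calls registerSegmentUri after each copy.
class SegmentLocationIndex {
    // Key: "<topicPartition>:<baseOffset>", value: where the segment lives now.
    private final Map<String, URI> locations = new ConcurrentHashMap<>();

    void registerSegmentUri(String topicPartition, long baseOffset, URI location) {
        locations.put(topicPartition + ":" + baseOffset, location);
    }

    Optional<URI> lookup(String topicPartition, long baseOffset) {
        return Optional.ofNullable(locations.get(topicPartition + ":" + baseOffset));
    }
}

public class RegisterDemo {
    public static void main(String[] args) {
        SegmentLocationIndex index = new SegmentLocationIndex();
        // The cron job tells Kafka where it put the segment.
        index.registerSegmentUri("mytopic-0", 1L,
            URI.create("s3://bucket/mytopic-0/00000000000000000001.log"));
        System.out.println(index.lookup("mytopic-0", 1L).get());
    }
}
```

The broker never uploads anything itself; it just resolves the registered URI through whatever read plugin handles that scheme.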
On Thu, Oct 24, 2019, 9:13 AM Eno Thereska <[EMAIL PROTECTED]> wrote: