Subject: [DISCUSS] KIP-405: Kafka Tiered Storage


Thanks Harsha, makes sense for the most part.

> tiered storage is to get away from this and make this transparent to the
> user

I think you are saying that this enables additional (potentially cheaper)
storage options without *requiring* an existing ETL pipeline. But it's not
really a replacement for the sort of pipelines people build with Connect,
Gobblin, etc. My point was that, if you are already offloading records via
an ETL pipeline, why do you need a new pipeline built into the broker to
ship the same data to the same place? I think in most cases this will be an
additional pipeline, not a replacement, because the segments written to
cold storage won't be useful outside Kafka. So you'd end up with one of:
1) cold segments that are only useful to Kafka; 2) the same data written to
HDFS/etc. twice, once for Kafka and once for everything else, in two
separate formats; or 3) you use your existing ETL pipeline and read cold
data directly.

To me, an ideal solution would let me spool segments from Kafka to any sink
I would like, and then let Kafka clients seamlessly access that cold data.
Today I can do that in the client, but ideally the broker would do it for
me via some HDFS/Hive/S3 plugin. The KIP seems to accomplish that -- just
without leveraging anything I've currently got in place.

Ryanne

On Mon, Feb 4, 2019 at 3:34 PM Harsha <[EMAIL PROTECTED]> wrote:

> Hi Eric,
>        Thanks for your questions. Answers are inline.
>
> "The high-level design seems to indicate that all of the logic for when and
> how to copy log segments to remote storage lives in the RLM class. The
> default implementation is then HDFS specific with additional
> implementations being left to the community. This seems like it would
> require anyone implementing a new RLM to also re-implement the logic for
> when to ship data to remote storage."
>
> RLM will be responsible for shipping log segments, and it will decide when
> a log segment is ready to be shipped over.
> Once a log segment is identified as rolled over, RLM will delegate the
> copying to a pluggable remote storage implementation. Users who want to
> add their own implementation to support other storage systems only need
> to implement the copy and read mechanisms; they do not need to
> re-implement RLM itself.
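>
> To make that concrete, here is a rough sketch of what such a pluggable
> interface could look like. The interface name, method names, and
> signatures below are only illustrative, not the actual API in the KIP:
>
> import java.io.File;
> import java.io.IOException;
>
> import org.apache.kafka.common.TopicPartition;
> import org.apache.kafka.common.record.Records;
>
> // Hypothetical plugin contract: RLM decides *when* a rolled-over segment
> // should be shipped; the plugin only implements *how* to copy and read it.
> public interface RemoteStorageHandler {
>
>     // Copy an immutable, rolled-over segment (plus its index files) to the
>     // remote store and return an identifier RLM can use for later reads.
>     String copyLogSegment(TopicPartition tp, File segment, File offsetIndex,
>                           File timeIndex) throws IOException;
>
>     // Read records starting at startOffset from a previously copied segment
>     // so the broker can serve fetch requests for cold data.
>     Records read(TopicPartition tp, String segmentId, long startOffset,
>                  int maxBytes) throws IOException;
>
>     // Remove remote data for segments that have gone past retention.
>     void deleteLogSegment(TopicPartition tp, String segmentId)
>             throws IOException;
> }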
>
>
> "Would it not be better for the Remote Log Manager implementation to be
> non-configurable, and instead have an interface for the remote storage
> layer? That way the "when" of the logic is consistent across all
> implementations and it's only a matter of "how," similar to how the Streams
> StateStores are managed."
>
> It's possible that we can make RLM non-configurable. But for the initial
> release, and to keep backward compatibility, we want to make this
> configurable so that users who are not interested in having log segments
> shipped to remote storage don't need to worry about it.
>
>
> Hi Ryanne,
>  Thanks for your questions.
>
> "How could this be used to leverage fast key-value stores, e.g. Couchbase,
> which can serve individual records but maybe not entire segments? Or is the
> idea to only support writing and fetching entire segments? Would it make
> sense to support both?"
>
> A LogSegment, once it's rolled over, is an immutable object, and we want to
> keep the current structure of LogSegments and their corresponding index
> files. It will be easier to copy the whole segment as it is, instead of
> re-reading each file and using a key/value store.
>
> "
> - Instead of defining a new interface and/or mechanism to ETL segment files
> from brokers to cold storage, can we just leverage Kafka itself? In
> particular, we can already ETL records to HDFS via Kafka Connect, Gobblin
> etc -- we really just need a way for brokers to read these records back.
> I'm wondering whether the new API could be limited to the fetch, and then
> existing ETL pipelines could be more easily leveraged. For example, if you
> already have an ETL pipeline from Kafka to HDFS, you could leave that in