Subject: KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations


Hi Anna and Jun,

Anna, thanks for your thoughtful feedback. Overall, I agree with what you
said. To summarize, you said that measuring time on server threads is not
easier to tune than a rate-based approach, and that it does not capture all
the load either, since the control requests are not taken into account.

I have been thinking about using an approach similar to the request quota,
and here are my thoughts:

1. With the current architecture of the controller, the ZK writes resulting
from a topic being created, expanded or deleted are spread among the api
handler threads and the controller thread. This makes it hard to measure
the real thread usage. I suppose that this will change with KIP-500 though.

2. I do agree with Anna that we care not only about requests tying up the
controller thread but also about the control requests. A rate-based
approach would allow us to define a single value which copes with both
dimensions.

3. Doing the accounting in the controller requires attaching a principal
and a client id to each topic, as the quotas are expressed per principal
and/or client id. I find this a little odd. This information would have to
be updated whenever a topic is expanded or deleted, and any subsequent
operation for that topic in the controller would be charged to the last
principal and client id. As an example, imagine a client that deletes
topics which have partitions on a dead broker. In this case, the deletion
would be retried until it succeeds. If that client has a low quota, that
may prevent it from doing any new operations until the delete succeeds.
This is strange behavior.

4. One important aspect of the proposal is that we want to be able to
reject requests when the quota is exhausted. With a usage-based approach,
it is difficult to compute the time to wait before being able to create the
topics. The only thing we could do is ask the client to wait until the
measured time drops back to the quota, to ensure that new operations can be
accepted. With a rate-based approach, we can precisely compute the time to
wait.

5. I find the experience for users slightly better with a rate-based
approach. When I say users here, I mean the users of the admin client or
the cli. When throttled, they would get an error saying: "can't create
because you have reached X partition mutations/sec". With the other
approach, we could only say: "quota exceeded".

6. I do agree that the rate-based approach is less generic in the long
term, especially if other resource types are added in the controller.
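
To illustrate points 4 and 5, here is a minimal sketch in Python of how the
time to wait could be computed with a rate-based quota. It is only an
illustration, not the KIP's implementation: it assumes the rate is measured
as mutations divided by the length of a single window, and that no other
mutations are recorded while the client waits. All names
(check_partition_mutations, ThrottleDecision, the parameters) are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThrottleDecision:
    accepted: bool
    wait_seconds: float
    message: str


def check_partition_mutations(observed_mutations: int,
                              requested_mutations: int,
                              quota_per_sec: float,
                              window_seconds: float) -> ThrottleDecision:
    """Check a request against a rate quota measured over one window.

    Assumption: rate = mutations / window length, and waiting simply
    stretches the window until the rate falls back to the quota.
    """
    total = observed_mutations + requested_mutations
    rate = total / window_seconds
    if rate <= quota_per_sec:
        return ThrottleDecision(True, 0.0, "")

    # Solve total / (window_seconds + wait) == quota_per_sec for wait.
    wait = total / quota_per_sec - window_seconds
    message = (f"can't create because you have reached "
               f"{quota_per_sec:g} partition mutations/sec")
    return ThrottleDecision(False, wait, message)


if __name__ == "__main__":
    # 40 mutations already seen in a 10 s window, 20 more requested,
    # against a quota of 5 mutations/sec.
    print(check_partition_mutations(40, 20, 5.0, 10.0))
```

With a usage-based measure there is no equivalent closed form, because the
thread time that the pending operations will consume is not known up front.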

Altogether, I am not convinced by a usage-based approach and I would rather
lean towards keeping the current proposal.

Best,
David

On Thu, May 7, 2020 at 2:11 AM Anna Povzner <[EMAIL PROTECTED]> wrote: