For GDPR: I am not sure about the regulatory requirements of GDPR, but I
would assume that the time for deletion starts counting from the time an
organisation received the data (i.e. the wall-clock ingestion time of the
data), and not from the "event time" of the data. Otherwise, an
organisation may be violating GDPR just by receiving e.g. 1-year-old data
of a user whose deletion policy is "you are allowed to keep them for 6
months".
Now for the discussion in this thread, I think that the scenario:
* Timestamp stored: Event timestamp
* Timestamp to check expiration: Processing Time
has the underlying assumption that there is a relationship between
event-time and processing time, which is not necessarily the case.
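To make the mismatch concrete, here is a minimal sketch in plain Python.
It is not Flink API: `is_expired`, `TTL_MS` and all the numbers are
hypothetical, and the expiration rule (stored timestamp + TTL <= current
processing time) is only my assumption of how such a check would look.

```python
# Toy model of "store the event timestamp, check expiration against
# processing time". Names and values are hypothetical, not Flink API.

DAY_MS = 24 * 60 * 60 * 1000
TTL_MS = 180 * DAY_MS  # assumed TTL of roughly 6 months


def is_expired(stored_event_ts_ms, processing_time_ms, ttl_ms=TTL_MS):
    """Return True if the entry counts as expired under the assumed rule."""
    return stored_event_ts_ms + ttl_ms <= processing_time_ms


now_ms = 1_554_714_720_000  # arbitrary wall-clock instant
year_ms = 365 * DAY_MS

# A record ingested "now" that carries a one-year-old event timestamp is
# already expired at the moment it arrives.
old_event_expired = is_expired(now_ms - year_ms, now_ms)

# A record whose event timestamp lies ten years in the future (it is just
# a user-defined attribute) is still not expired one year later.
future_event_expired = is_expired(now_ms + 10 * year_ms, now_ms + year_ms)

print(old_event_expired, future_event_expired)
```

In both cases the outcome is driven by the value the user put into the
timestamp attribute, not by how long the data has actually been stored.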
Event-time, although we call it "time", is just another user-defined column
or attribute of the data and can be anything. It is not an "objective" and
independently evolving attribute like wall-clock time. I am not sure what
the solution could be, as out-of-orderness can always lead to arbitrary,
non-reproducible and difficult-to-debug behaviour (e.g. a super-early
element arrives out-of-order and, because the succeeding elements set the
timestamp to lower values, it gets deleted by the state backend, although
the user-level windowing logic would expect it to be there).
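A small sketch of the order dependence I have in mind, again in plain
Python and not Flink code. It assumes a toy backend in which every update
overwrites the entry's stored timestamp with the event timestamp of the
element that touched it; `final_stored_timestamp`, `survives` and the
numbers are hypothetical.

```python
# Toy model: the same two elements, delivered in different orders, leave
# the state entry with different stored timestamps, so a TTL check at the
# same instant gives different answers. Hypothetical names, not Flink API.


def final_stored_timestamp(event_timestamps):
    """Stored timestamp after all updates, assuming each update overwrites it."""
    stored_ts = None
    for ts in event_timestamps:
        stored_ts = ts  # the latest arrival wins, regardless of its value
    return stored_ts


def survives(event_timestamps, ttl, check_time):
    """True if the entry is still alive at check_time under the assumed rule."""
    return final_stored_timestamp(event_timestamps) + ttl > check_time


in_order = [10, 100]      # elements arrive in event-time order
out_of_order = [100, 10]  # the "early" element with timestamp 100 arrives first

kept_in_order = survives(in_order, ttl=30, check_time=50)
kept_out_of_order = survives(out_of_order, ttl=30, check_time=50)

print(kept_in_order, kept_out_of_order)
```

The input is identical in both runs and only the arrival order differs,
yet in one run the entry is kept and in the other it is dropped.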
Given the last point above, and apart from the semantics of the proposed
feature, I think that we should also discuss whether it is a good idea to
have event-time TTL implemented at the state backend level in the first
place. Personally, I am not convinced that this is a good idea, as we
would introduce another (potentially competing) mechanism for handling
event time, apart from the user program. The example I described above is
one such case. This also defeats what is, in my opinion, one of the main
advantages of event time: the reproducibility of the results.
I may be wrong, but I would appreciate any opinions on this.
On Mon, Apr 8, 2019 at 11:12 AM Aljoscha Krettek <[EMAIL PROTECTED]>