Thanks for your input Belazs.
I am still pondering about this issue and I found very useful to follow your advice on seeing what's already in JIRA and doing some research on what others are doing.
I didn't find anything that exactly describes my problem on the Impala JIRA (I might have missed it), however I did find a closely related discussion that is occurring in the Hive JIRA: I am referring to JIRAs HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305
), HIVE-14412 (https://issues.apache.org/jira/browse/HIVE-14412
), and HIVE-16418 (https://issues.apache.org/jira/browse/HIVE-16418
I found also useful to read how two of the most used RDBMS are dealing with timestamps:
- Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html
- PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-datetime.html
Finally I am going to refer to the Java 8 API as described here: https://docs.oracle.com/javase/8/docs/api/overview-summary.html;
the reason is that a lot of Impala users (me included) connect to it via Java/JDBC and it is very important in my opinion that the Impala data types are correctly mapped to the correct Java classes to avoid confusion.
After this long preamble here are some notes that hopefully will help in this discussion.
- At the center of this issue is that there are two significantly different data types that we commonly refer to when we use the word 'timestamp':
- timestamp without time zone (most commonly referred to as just 'timestamp')
- timestamp with time zone
- These two types are like apples and oranges in my opinion; getting confused between the two (like I did) can cause a lot of frustration (my first post in this thread shows that pretty well..)
- The first data type ('timestamp without time zone' or just 'timestamp') is best described in this sentence in section 220.127.116.11 (Choosing a TIMESTAMP Data Type) in the Oracle document:
"Use the TIMESTAMP data type when you need a datetime value to record the time of an event without the time zone. For example, you can store information about the times when workers punch a time card in and out of their assembly line workstations. Because this is always a local time it is then not needed to store the timezone part"
- I think this is the kind of timestamp that is currently implemented in Impala (please correct me if I am wrong) and in my opinion it should be mapped to something like the Java type java.time.LocalDateTime; the Java 8 API for java.time.LocalDateTime has this sentence that I think provides a good insight on the meaning of this data type:
"This class does not store or represent a time-zone. Instead, it is a description of the date, as used for birthdays, combined with the local time as seen on a wall clock. It cannot represent an instant on the time-line without additional information such as an offset or time-zone."
- Also this timestamp (again in my opinion) should not be mapped to the Kudu type UNIXTIME_MICROS, because their meaning is different (and this is what triggered my initial confusion, and I suspect I am not the only one out there, who misunderstood this difference)
- For the second data type ('timestamp with time zone') I couldn't find a good definition in the Oracle document, however this sentence from the Java API for java.util.Date (of which java.sql.Timestamp is a subclass) offers a good insight:
"The class Date represents a specific instant in time"
- This is also indirectly implied by the following sentence in the Oracle document:
"Two TIMESTAMP WITH TIME ZONE values are considered identical if they represent the same instant in UTC, regardless of the TIME ZONE offsets stored in the data."
- The 'timestamp with time zone' is what I think should be mapped to the Java type 'java.sql.Timezone' and what also corresponds to the actual meaning of the Kudu type UNIXTIME_MICROS
- This is the kind of timestamp type that is most useful to businesses (financial, healthcase, security logs) because it captures the "absolute" moment in time a transaction or an event occurred, regardless of which timezone it occurred, or if it was daylight saving time or not
In conclusion the point I am at right now is that to really fix the issue in Impala, two things should occur:
- a new data type ('timestamp with time zone'), with the features mentioned above, should be implemented - this change would hopefully not break any of the existing applications
- the current 'timestamp' type should be mapped to the Java class java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely break existing applications, and therefore should be postponed to Impala 3.0
From: "Jeszy" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, December 18, 2017 7:49:52 AM
Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
Thanks for your feedback! I agree there are pain points with using
timestamps, especially together with other systems.
Is there any particular approach or solution you propose that would
work well for you? Have you found any jiras on issues.apache.org that
describe what you're asking for? Commenting on a jira will help the
team track your input better.
On 17 December 2017 at 00:38, Franco Venturi <[EMAIL PROTECTED]> wrote: