Please note that the discussion below refers to the following versions:
- Impala: v2.10.0-cdh5.13.0
- Kudu: 1.5.0-cdh5.13.0
- Everything runs on a standard Cloudera 5.13 installation
A few days ago I was writing some Java code to migrate several tables directly from Oracle to Kudu (to be queried later on by our developers and BI tools using Impala). Most of these tables have columns that are of type "timestamp" (to be exact, they come in as instances of class oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of this discussion I'll assume we only deal with objects of java.sql.Timestamp, to make things simple).
As you probably know, Kudu, starting I think with version 1.3.1, has a type called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala TIMESTAMP" data type (
A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the 'Apache Kudu Schema Design' document (, which says:
unixtime_micros (64-bit microseconds since the Unix epoch)
where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
With this understanding I went ahead and wrote my Java code; when I ran the first few tests, I noticed that the timestamp values returned by Impala (I created in Impala an 'external' table 'stored as kudu') were off by several hours compared to the values returned by the original table in Oracle (our servers, both the Oracle ones and the Impala/Kudu ones, are all configured in the 'America/New_York' timezone).
To investigate this difference, I created a simple table in Kudu with just two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a timestamp. I ran a few inserts and selects over this table in Impala and figured out that Impala stores a value that is more or less defined as follow:
number of microseconds since the Unix epoch (i.e. what I was expecting originally)
+ offset of the timestamp I inserted with respect to GMT (in my case this offset is the offset for EST or EDT depending if that timestamp was during EST (winter) or EDT (summer))
This is how Impala achieves what is described as:
Impala does not store timestamps using the local timezone, to avoid undesired results from unexpected time zone issues
That same page has caveats like the following, that sent a shiver down my spine:
If that value was written to a data file, and shipped off to a distant server to be analyzed alongside other data from far-flung locations, the dates and times would not match up precisely because of time zone differences
This means that if anyone is using (or even thinking about using) "Impala timestamps" to say store financial or health services (or security) events, they'll find some nasty "surprises" (even if they don't plan to ever move their servers and only do business in one timezone).
Consider for instance the case of anything that occurred between 1am and 2am EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT to EST) - there's no way to store the timestamps for these events in Kudu via Impala.

To prove this I wrote this simple piece of Java code (which uses Java 8 and all well documented and non-deprecated classes and methods) to do just an insert and a select via Impala JDBC of a timestamp row in the simple table that I mentioned above (primary key + timestamp column):

// run insert
long primaryKey = 1L;
PreparedStatement insert = connection.prepareStatement("insert into " + table + " values (?, ?)");
insert.setLong(1, primaryKey);
Timestamp timestampIn = new Timestamp(1509859800000L);
System.out.println("TimestampIn: " + timestampIn + " - getTime(): " + timestampIn.getTime());
insert.setTimestamp(2, timestampIn);
// run select
PreparedStatement select = connection.prepareStatement("select " + timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
select.setLong(1, primaryKey);
ResultSet resultSet = select.executeQuery();
while ( {
Timestamp timestampOut = resultSet.getTimestamp(1);
System.out.println("TimestampOut: " + timestampOut + " - getTime(): " + timestampOut.getTime());

and this is the output:
TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000

If this kind of timestamp had been designed say in 1972, I would have probably been OK with it, but in 2017, when people and companies do business 24/7, I find this approach (and sentences like "dates and times would not match up precisely") unacceptable.

To its defense, the Impala TIMESTAMP web page mentioned above spends several paragraphs about how to deal with "UTC timestamps" and problems like the ones shown above, but it basically has the user wrap everything in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.

This proposed solution - besides the performance impact of say having a query like this:
select ... where to_utc_timestamp(timestamp_column_a) < to_utc_timestamp(timestamp_column_b)
makes also much harder to write queries having timestamp fields (which is very common in my workplace) or to use it with automated BI tools.

It is definitely not what one would expect from an analytic MPP database, that aspires to be used at the Enterprise level in 2017.

Franco Venturi