I've encountered an issue with hive's predicate push down optimization when multi-group-by is used together with transform. Here is a simple testcase to illustrate my point:
CREATE TABLE IF NOT EXISTS my_table ( id INT, property1 INT, property2 INT, count INT );
EXPLAIN FROM ( SELECT TRANSFORM( id, property1, property2, count ) USING 'cat' AS ( id INT, property1 INT, count INT, property2 INT ) FROM my_table ) t INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test1' SELECT id, property1, SUM(count) GROUP BY id, property1 INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test2' SELECT id, property2, SUM(count) WHERE property1 != 0 GROUP BY id, property2;
When hive.optimize.ppd = true, hive moves the where clause from second select all the way down into the transform operator, which is obviously wrong, because it affects the first select as well. With hive.optimize.ppd=false everything works as expected. Without the transform, it works correctly as well.
I see this problem with Hive version 0.10.0 (cdh4.4.0). With Hive 0.7.1 the same query behaves correctly, regardless of hive.optimize.ppd settings. So it seems as a bug introduced with some ppd improvements in 0.8 or later. Can anyone confirm if this is still broken in newest versions? If it doesn't work with 0.12, I'll file a new issue in JIRA.