In the current frontend implementation of generate *, a single
projection operator with the star attribute set to true is created.
During the schema computation, instead of generating the schema of the
projection input, a tuple that contains the schema of the projection
input is created. This results in double wrapping. An example will
illustrate the problem.

grunt> a = load 'one' using PigStorage(' ') as (field1, field2, field3);
grunt> b = load 'two' as (field4, field5, field6);
grunt> c = cogroup a by $0, b by $0;
grunt> d = foreach c generate *;
grunt> describe d;

d: {c: (group: bytearray,a: {field1: bytearray,field2: bytearray,field3:
bytearray},b: {field4: bytearray,field5: bytearray,field6: bytearray})}

In the above example, the schema for operator d should have been
identical to that of operator c. Instead, the schema of operator c is
wrapped in a tuple and embedded within the schema of d. As a result, we
have a couple of issues:

1. It is not intuitive to users that the schemas of c and d are not
identical. They should be identical. For example, a column of c cannot
be referenced by name from d:

grunt> e = foreach d generate group;

2008-10-02 16:06:11,335 [main] ERROR
org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Invalid
alias: group in {c: (group: bytearray,a: {field1: bytearray,field2:
bytearray,field3: bytearray},b: {field4: bytearray,field5:
bytearray,field6: bytearray})}

2. As a workaround, we could flatten the contents of d and then access
the contents of c.

grunt> e = foreach d generate flatten($0);
grunt> describe e;

e: {c::group: bytearray,c::a: {field1: bytearray,field2:
bytearray,field3: bytearray},c::b: {field4: bytearray,field5:
bytearray,field6: bytearray}}

However, we will not be able to compute the lineage of the fields of
the relation, as demonstrated by the following example:

grunt> f = foreach e generate flatten(a), flatten(b);
grunt> g = foreach f generate field1 + 1;
grunt> describe g;

2008-10-02 16:26:20,655 [main] WARN  org.apache.pig.PigServer -
bytearray is implicitly casted to integer under LOAdd Operator
2008-10-02 16:26:20,655 [main] ERROR org.apache.pig.PigServer - Problem
resolving LOForEach schema Cannot resolve load function to use for
casting from bytearray to integer. Found more than one load function to
use: [org.apache.pig.builtin.PigStorage,
org.apache.pig.builtin.BinStorage]

This problem is confined to the frontend. In the backend, the double
wrapping issue is resolved under bug PIG-359. To resolve this issue in
the frontend, the project( * ) operator has to be translated into
project(0), project(1), ..., project(n - 2), project(n - 1), where n is
the number of columns in the relation.
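
The translation could look roughly like the sketch below. It is a
minimal Python illustration, not Pig's actual code: Project,
expand_star and input_arity are hypothetical names standing in for the
frontend's projection operator and the column count of its input.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Project:
    """Hypothetical stand-in for a projection operator in an inner plan."""
    column: Optional[int] = None  # column index; None when star is set
    star: bool = False


def expand_star(projections: List[Project], input_arity: int) -> List[Project]:
    """Replace each project(*) with project(0) .. project(n - 1)."""
    expanded = []
    for p in projections:
        if p.star:
            # One single-column projection per input column, in order.
            expanded.extend(Project(column=i) for i in range(input_arity))
        else:
            expanded.append(p)
    return expanded


# c in the example above has three columns: group, a and b.
plan = expand_star([Project(star=True)], input_arity=3)
print([p.column for p in plan])
```

With the star expanded this way, each output column maps to exactly one
input column, so no enclosing tuple is needed in the schema.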

The translation of project( * ) into multiple project operators cannot
be performed in the parser without major modifications. Instead, each
relational operator that has an inner plan can perform this
translation. In the current design, LOForEach, LOCogroup, LOSplitOutput,
LOSort and LOFilter have inner plans.

There are corner cases that need to be handled during the translation.
If the schema of the projection's input is not defined, then the schema
of the relation, or of the column in the relation that contains the
projection, could become undefined.

a = load 'one';
b = load 'two';
c = foreach a generate *, $0, $1; -- schema of c is undefined
d = cogroup a by *, b by ($0, $1); -- schema of column named group in
-- cogroup is undefined; also arity checking cannot be enforced
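
The foreach corner case can be sketched as follows. This is a
hypothetical Python model, not Pig's schema code: STAR and
generate_schema are made-up names, a schema is modeled as a list of
field names, and None stands for an undefined schema. How columns such
as $0 behave without a star on an undefined input is an assumption of
the sketch.

```python
STAR = "*"  # hypothetical marker for project(*)


def generate_schema(items, input_schema):
    """Return the output field names for a generate list.

    items: column indexes or STAR.
    input_schema: list of field names, or None when undefined.
    """
    if input_schema is None:
        if STAR in items:
            # The number of columns behind * is unknown, so the whole
            # output schema is undefined.
            return None
        # Assumption: arity is still known, but the fields are unnamed.
        return [None] * len(items)
    out = []
    for item in items:
        if item == STAR:
            out.extend(input_schema)  # expand * into every input column
        else:
            out.append(input_schema[item])
    return out


# c = foreach a generate *, $0, $1; with a loaded without a schema
print(generate_schema([STAR, 0, 1], None))
```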

Thoughts?

Thanks,
Santhosh