Subject: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem


Once more from the top.

There is a hadoop convention. It has nothing to do with the
MANIFEST.MF, as I read the code.

In the hadoop convention, if someone calls setJar on the job conf, the
'lib/' folder of the indicated jar will be unpacked and the jars in it
added to the classpath on whatever nodes the job runs code on. If no
one calls setJar, then the only thing in the classpath is the jar
itself, unless you make other arrangements (as with the distributed
cache).
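
To make the two cases concrete, here is a minimal sketch using the
old-style JobConf API (the class and method names are mine, for
illustration only):

    import org.apache.hadoop.mapred.JobConf;

    public class JarSetupSketch {

      // Case 1: setJarByClass (or setJar). Hadoop ships the jar that
      // contains this class out to the task nodes and unpacks its
      // lib/ folder onto the task classpath.
      public static JobConf withJobJar() {
        JobConf conf = new JobConf();
        conf.setJarByClass(JarSetupSketch.class);
        return conf;
      }

      // Case 2: nobody calls setJar. Tasks see only the job jar
      // itself on the classpath, unless the dependencies arrive some
      // other way (e.g. the distributed cache).
      public static JobConf withoutJobJar() {
        return new JobConf();
      }
    }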

I'm not an evangelist for the maven-shade-plugin, but my very
unscientific impression is that people walk up to mahout and expect
the mahout command to just 'work'. Unless someone can come up with a
way to script the use of the distributed cache, that means the jar
file the mahout command hands to the hadoop command has to use the
'lib/' convention, and has to have the correct structure of raw and
lib-ed classes.
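
For reference, the structure I mean looks roughly like this (the
artifact names are illustrative, not an exact listing of ours):

    $ jar tf mahout-examples-job.jar
    META-INF/MANIFEST.MF
    org/apache/mahout/clustering/kmeans/KMeansDriver.class
    ...                       (raw classes, visible immediately)
    lib/mahout-math.jar
    lib/commons-cli.jar
    ...                       (dependency jars, unpacked onto the
                               task classpath per the convention)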

Further, any unsophisticated user who goes to incorporate Mahout into
a larger structure has to do likewise.

We could avoid exciting uses of the shade plugin altogether if we
didn't have these static methods that initialize jobs and call
setJarByClass on themselves. However, I don't see that happening for
0.5 unless we want to push the schedule back and make a concerted
effort.
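
For context, the pattern I mean looks something like this (a
hypothetical driver, not a quote of any actual one):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public final class SomeClusteringDriver {

      private SomeClusteringDriver() {
      }

      // The static entry point builds its own Job and pins the job
      // jar to this class, so a caller embedding Mahout in a larger
      // application has no way to substitute its own jar.
      public static void runJob(Configuration conf)
          throws IOException, InterruptedException,
                 ClassNotFoundException {
        Job job = new Job(conf, "some-clustering");
        job.setJarByClass(SomeClusteringDriver.class);
        // mapper/reducer/path setup elided
        job.waitForCompletion(true);
      }
    }

If the job construction took the jar (or the anchoring class) as a
parameter instead, a caller could point Hadoop at its own properly
structured jar.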

Further, I am concerned, based on Jake's remarks, that even following
the hadoop lib/ convention correctly doesn't always work, and we have
no diagnostic insight into the nature of the failure.

So, at the moment, it seems as if our choices are to hold our noses
and shade, or to give up on a trivial command line that runs our jobs
without first pushing the dependencies out into the cluster.