Subject: -- misplaced calls to setJarByClass


Here's my view of the goals.

1: Mahout should function as a modular component that people can
incorporate into their application, either calling Mahout functions to
set up jobs, or using Mahout mappers and reducers in their own jobs.

2: Users should be able to treat the Mahout jar(s) as ordinary
libraries, and not need to unpack and repack it/them.

The particular design decisions in Hadoop related to the classpath
make it harder than we would like to achieve these goals. Indeed,
shipping as one jar does not, by itself, achieve both goals.

If we are going to provide functions that create jobs, then we have to
provide some mechanism for the user to control the call to
JobConf.setJarByClass or even, if they like, just call JobConf.setJar.

I can give you a few options:

1) Mahout doesn't call new JobConf. The user calls 'new JobConf'
and passes the job to Mahout, and Mahout only sets up the
Mahout-specific items.

2) Mahout has a static/singleton API for naming 'the job jar'. If
the user never calls it, things proceed as they do today. Since these
are static functions, I don't see how a static, global, 'set the job
jar' API will bother anyone.

3) As proposed before, add APIs that return the JobConf instead of
running it, so the user can call setJarByClass themselves, overriding
the Mahout default call.

4) Mahout has a family of subclasses of JobConf instead of a family of
driver classes that create and run jobs. To run some particular Mahout
task from their own code, the user would do new
SomeMahoutJobConf(....) and then call waitForCompletion, and the user
would, indeed, make the appropriate call to setJarByClass.
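
To make option 2 a bit more concrete, here is a rough sketch of what a
static 'set the job jar' API might look like. Everything in it is
hypothetical: MahoutJobJar and its methods do not exist in Mahout, the
jar names are made up, and StubJobConf is only a stand-in for the
setJar/getJar pair on Hadoop's JobConf so that the sketch compiles
without Hadoop on the classpath. A real driver would fall back to its
existing setJarByClass call rather than to a default jar name.

```java
import java.util.Optional;

// Hypothetical stand-in for the jar setter/getter on Hadoop's JobConf.
// It only stores the string; it does not locate or ship any jar.
class StubJobConf {
    private String jar;

    void setJar(String jar) { this.jar = jar; }

    String getJar() { return jar; }
}

// Hypothetical option-2 API: a process-wide override for the job jar.
final class MahoutJobJar {
    // null means the user never named a job jar.
    private static volatile String override;

    private MahoutJobJar() { }

    static void set(String jarPath) { override = jarPath; }

    static void clear() { override = null; }

    static Optional<String> get() { return Optional.ofNullable(override); }

    // What a driver might do while building a job: use the user's jar
    // if one was named, otherwise keep today's default behaviour
    // (modelled here as a default jar name, which is an assumption).
    static void apply(StubJobConf conf, String defaultJar) {
        conf.setJar(get().orElse(defaultJar));
    }
}

class JobJarSketch {
    public static void main(String[] args) {
        StubJobConf conf = new StubJobConf();

        // User never calls the API: the default is used.
        MahoutJobJar.apply(conf, "mahout-core.jar");
        System.out.println(conf.getJar());

        // User names their own job jar once, up front.
        MahoutJobJar.set("my-app-job.jar");
        MahoutJobJar.apply(conf, "mahout-core.jar");
        System.out.println(conf.getJar());
    }
}
```

The point of the sketch is only that the override is consulted in one
place, so existing callers who never touch it see no change.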

Do any of these appeal?