Subject: Processing 50 million files for LDA


It is hard to advise on detailed trade-offs for your case, but I am pretty
sure that there are better options than S3, which is, as you say, very slow
when you have to transfer lots of small objects.

One alternative, for instance, would be to use a long-lived MapR cluster to
store the files.  That would give you much better write and read
performance and could allow the data sources to write directly to the
cluster via NFS.  It would also allow you to snapshot your data before
processing so that you can have a reproducible workflow.  Depending on
your operational stance,
it may be possible to stop the nodes in the cluster and bring them back
into operation so that you can avoid paying the price of having the cluster
live at all times.
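
To make the NFS idea a bit more concrete, here is a rough Python sketch,
not a tested recipe.  It assumes the cluster is mounted at some path
(passed in as root), that each user has a single writer at a time, and
that one line per update is an acceptable layout.  The function names and
the hashed two-character shard directories are made up for illustration.
The collector appends each update to one file per user, so no separate
User1_1, User1_2 merge pass is needed, and a second pass packs everything
into one large file so that Hadoop is not handed millions of tiny inputs.

```python
import hashlib
import os
import tempfile


def append_user_update(root, user_id, text):
    """Append one update to the user's file under root (e.g. an NFS mount)."""
    # Hypothetical layout: shard by a hash prefix so that no single
    # directory ends up holding millions of files.
    shard = hashlib.md5(user_id.encode("utf-8")).hexdigest()[:2]
    shard_dir = os.path.join(root, shard)
    os.makedirs(shard_dir, exist_ok=True)
    path = os.path.join(shard_dir, user_id + ".txt")
    # Collapse whitespace so that each update occupies exactly one line.
    line = " ".join(text.split())
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return path


def pack_users(root, out_path):
    """Write one 'user_id<TAB>all text' line per user into one large file.

    out_path is assumed to live outside root so it is not picked up as input.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, dirs, files in os.walk(root):
            dirs.sort()  # visit shard directories in a stable order
            for name in sorted(files):
                if not name.endswith(".txt"):
                    continue
                with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                    body = " ".join(f.read().split())
                out.write(name[:-4] + "\t" + body + "\n")
                count += 1
    return count


if __name__ == "__main__":
    root = tempfile.mkdtemp()  # stand-in for the real mount point
    append_user_update(root, "User1", "first batch of content")
    append_user_update(root, "User1", "a later update")
    append_user_update(root, "User2", "other content")
    packed = os.path.join(tempfile.mkdtemp(), "packed.tsv")
    print(pack_users(root, packed), "users packed into", packed)
```

Turning the packed file into whatever input format your Mahout job
expects is a separate step that this sketch does not cover.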

Beyond such generalities, we should probably take a detailed conversation
to direct email to avoid making this list sound (too much) like an
advertising venue.

On Tue, Jun 4, 2013 at 3:06 AM, nishant rathore <[EMAIL PROTECTED]> wrote:

> Hi,
> We are running LDA on 50 million files.
> Each file is no larger than 5 MB. Each file represents the content of one
> user. The files keep being updated as we receive new information about
> the user.
> Currently we store all these files on EC2, and when we need to run LDA, we
> transfer those files to S3 and run the Mahout process. Transferring the
> files to S3 takes a long time. Also, the Hadoop job is not that efficient
> when the file size is less than 128 MB.
> Currently we are thinking of moving files directly to S3 whenever the
> collector gets the data. For example, User1 would have these files as we
> get the content: User1_1, User1_2, ... etc.
> Before running LDA, there would be another process that aggregates all
> the data for User1 and then feeds it to the step that converts it into
> vectors.
> Can you please help us design the workflow?
> Thanks,
> Nishant