Dmitriy Lyubimov

2010-12-30, 20:56

Dmitriy Lyubimov

2010-12-30, 21:03

Ted Dunning

2010-12-30, 21:04

Ted Dunning

2010-12-30, 21:05

Dmitriy Lyubimov

2010-12-30, 21:38

Dmitriy Lyubimov

2010-12-30, 23:57

Jake Mannix

2011-01-06, 21:45

Dmitriy Lyubimov

2011-01-06, 22:07

Hi,

I would like to try LSI processing of results produced by seq2sparse.

What's more, I need to be able to fold-in a bunch of new documents

afterwards.

Is there any support for fold-in indexing in Mahout?

If not, is there a quick way for me to gain an understanding of the seq2sparse

output?

In particular, if I wanted to add fold-in indexing, I would need to be able to

produce TF or TF-IDF vectors for new documents on the fly using the pre-existing

dictionary and word counts. What's the API for this dictionary?

Thank you.

-Dmitriy


PS. I've already been reading through SparseVectorsFromSequenceFiles.java; just

trying to figure out if I can do it faster by taking advice on more starting

points to look at.

Thanks in advance.

-Dmitriy

On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Hi,

>

> I would like to try LSI processing of results produced by seq2sparse.

>

> What's more, I need to be able to fold-in a bunch of new documents

> afterwards.

>

> Is there any support for fold-in indexing in Mahout?

>

> if not, is there a quick way for me to gain the understanding of seq2sparse

> output?

> In particular, if i wanted to add fold-in indexing, i need to be able to

> produce TF or TF-IDF of the new document on the fly using pre-existing

> dictionary and word counts. What's the api for this dictionary?

>

> Thank you.

> -Dmitriy

>


There are two dictionary-like systems in Mahout. Neither is quite right.

The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary. It

doesn't do the frequency counting you want.

The more complex one is in DictionaryVectorizer. Unfortunately, it is a

mass of static functions that depend on statically named files rather than

being a real API.

There is a third choice as well

in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It does

on-line IDF weighting and can be used underneath a text encoder to get

on-line TF-IDF weighting of the sort you desire. You can preset counts

using the getDictionary accessor.

A fourth choice is to simply use a static word encoder with hashed vectors

and do the IDF weighting as a vector element-wise multiplication. That way

you only need to keep around a vector of weights and no dictionary. That

should be much cheaper in memory.
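
To make the fourth option concrete, here is a minimal plain-Python sketch of the idea (this is not Mahout's API): tokens are hashed into a fixed-cardinality vector, and IDF weighting is applied afterwards as an element-wise multiplication against a precomputed weight vector, so the only per-corpus state kept around is the weight vector. The tiny DIM and the md5-based hash are illustrative assumptions only.

```python
import hashlib
import math

DIM = 16  # hashed vector cardinality; real settings would be much larger


def hash_index(word: str) -> int:
    # Stable hash of a token into a fixed-size vector index.
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM


def encode_tf(doc: list[str]) -> list[float]:
    # Static hashed term-frequency encoding: no dictionary lookup at all.
    v = [0.0] * DIM
    for w in doc:
        v[hash_index(w)] += 1.0
    return v


def idf_weights(corpus: list[list[str]]) -> list[float]:
    # Document frequency per hashed index, then idf = log(N / df).
    n = len(corpus)
    df = [0] * DIM
    for doc in corpus:
        for i in set(hash_index(w) for w in doc):
            df[i] += 1
    return [math.log(n / d) if d else 0.0 for d in df]


corpus = [["latent", "semantic", "indexing"],
          ["semantic", "hashing"],
          ["indexing", "new", "documents"]]
idf = idf_weights(corpus)


def tfidf(doc: list[str]) -> list[float]:
    # Element-wise product of hashed TF and the IDF weight vector.
    return [a * b for a, b in zip(encode_tf(doc), idf)]


print(tfidf(["semantic", "indexing"]))
```

Folding in a new document then only requires the saved weight vector, not the full dictionary, which is the memory saving described above.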

On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Hi,

>

> I would like to try LSI processing of results produced by seq2sparse.

>

> What's more, I need to be able to fold-in a bunch of new documents

> afterwards.

>

> Is there any support for fold-in indexing in Mahout?

>

> if not, is there a quick way for me to gain the understanding of seq2sparse

> output?

> In particular, if i wanted to add fold-in indexing, i need to be able to

> produce TF or TF-IDF of the new document on the fly using pre-existing

> dictionary and word counts. What's the api for this dictionary?

>

> Thank you.

> -Dmitriy

>


The fourth choice is what I would recommend in general unless you need very

easy reverse-engineering of your vectors.

On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>

> There are two dictionary-like systems in Mahout. Neither is quite right.

>

> The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary. It

> doesn't do the frequency counting you want.

>

> The more complex one is in DictionaryVectorizer. Unfortunately, it is a

> mass of static functions that depend on statically named files rather than

> being a real API.

>

> There is a third choice as well

> in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It does

> on-line IDF weighting and can be used underneath a text encoder to get

> on-line TF-IDF weighting of the sort you desire. You can preset counts

> using the getDictionary accessor.

>

> A fourth choice is to simply use a static word encoder with hashed vectors

> and do the IDF weighting as a vector element-wise multiplication. That way

> you only need to keep around a vector of weights and no dictionary. That

> should be much cheaper in memory.

>

>

> On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

>

>> Hi,

>>

>> I would like to try LSI processing of results produced by seq2sparse.

>>

>> What's more, I need to be able to fold-in a bunch of new documents

>> afterwards.

>>

>> Is there any support for fold-in indexing in Mahout?

>>

>> if not, is there a quick way for me to gain the understanding of

>> seq2sparse

>> output?

>> In particular, if i wanted to add fold-in indexing, i need to be able to

>> produce TF or TF-IDF of the new document on the fly using pre-existing

>> dictionary and word counts. What's the api for this dictionary?

>>

>> Thank you.

>> -Dmitriy

>>

>

>


Thank you, Ted.

On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> The fourth choice is what I would recommend in general unless you need very

> easy reverse-engineering of your vectors.

>

> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[EMAIL PROTECTED]>

> wrote:

>

> >

> > There are two dictionary-like systems in Mahout. Neither is quite right.

> >

> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.

> It

> > doesn't do the frequency counting you want.

> >

> > The more complex one is in DictionaryVectorizer. Unfortunately, it is a

> > mass of static functions that depend on statically named files rather

> than

> > being a real API.

> >

> > There is a third choice as well

> > in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It

> does

> > on-line IDF weighting and can be used underneath a text encoder to get

> > on-line TF-IDF weighting of the sort you desire. You can preset counts

> > using the getDictionary accessor.

> >

> > A fourth choice is to simply use a static word encoder with hashed

> vectors

> > and do the IDF weighting as a vector element-wise multiplication. That

> way

> > you only need to keep around a vector of weights and no dictionary. That

> > should be much cheaper in memory.

> >

> >

> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]

> >wrote:

> >

> >> Hi,

> >>

> >> I would like to try LSI processing of results produced by seq2sparse.

> >>

> >> What's more, I need to be able to fold-in a bunch of new documents

> >> afterwards.

> >>

> >> Is there any support for fold-in indexing in Mahout?

> >>

> >> if not, is there a quick way for me to gain the understanding of

> >> seq2sparse

> >> output?

> >> In particular, if i wanted to add fold-in indexing, i need to be able to

> >> produce TF or TF-IDF of the new document on the fly using pre-existing

> >> dictionary and word counts. What's the api for this dictionary?

> >>

> >> Thank you.

> >> -Dmitriy

> >>

> >

> >

>


Also, if I have a bunch of new documents to fold in, it looks like I'd need

to run a matrix multiplication job between the new document vectors and V, both

matrices represented row-wise. So DistributedRowMatrix should help me,

shouldn't it? Do I need to transpose the first matrix first?

Thank you once again, your help is really invaluable.

-Dmitriy

On Thu, Dec 30, 2010 at 1:38 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Thank you, Ted.

>

>

> On Thu, Dec 30, 2010 at 1:05 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>

>> The fourth choice is what I would recommend in general unless you need

>> very

>> easy reverse-engineering of your vectors.

>>

>> On Thu, Dec 30, 2010 at 1:04 PM, Ted Dunning <[EMAIL PROTECTED]>

>> wrote:

>>

>> >

>> > There are two dictionary-like systems in Mahout. Neither is quite

>> right.

>> >

>> > The simpler one is in org.apache.mahout.vectorizer.encoders.Dictionary.

>> It

>> > doesn't do the frequency counting you want.

>> >

>> > The more complex one is in DictionaryVectorizer. Unfortunately, it is a

>> > mass of static functions that depend on statically named files rather

>> than

>> > being a real API.

>> >

>> > There is a third choice as well

>> > in org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder. It

>> does

>> > on-line IDF weighting and can be used underneath a text encoder to get

>> > on-line TF-IDF weighting of the sort you desire. You can preset counts

>> > using the getDictionary accessor.

>> >

>> > A fourth choice is to simply use a static word encoder with hashed

>> vectors

>> > and do the IDF weighting as a vector element-wise multiplication. That

>> way

>> > you only need to keep around a vector of weights and no dictionary.

>> That

>> > should be much cheaper in memory.

>> >

>> >

>> > On Thu, Dec 30, 2010 at 12:56 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]

>> >wrote:

>> >

>> >> Hi,

>> >>

>> >> I would like to try LSI processing of results produced by seq2sparse.

>> >>

>> >> What's more, I need to be able to fold-in a bunch of new documents

>> >> afterwards.

>> >>

>> >> Is there any support for fold-in indexing in Mahout?

>> >>

>> >> if not, is there a quick way for me to gain the understanding of

>> >> seq2sparse

>> >> output?

>> >> In particular, if i wanted to add fold-in indexing, i need to be able

>> to

>> >> produce TF or TF-IDF of the new document on the fly using pre-existing

>> >> dictionary and word counts. What's the api for this dictionary?

>> >>

>> >> Thank you.

>> >> -Dmitriy

>> >>

>> >

>> >

>>

>

>


Dmitriy,

I'm not sure if you figured this out on your own and I didn't see the

email,

but if not:

On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Also, if i have a bunch of new documents to fold-in, it looks like i'd need

> to run a matrix multiplication job between new document vectors and V, both

> matrices represented row-wise. So DistributedRowMatrix should help me,

> shouldn't it? do i need to transpose the first matrix first?

>

If you have a dense matrix V of eigenvectors (i.e., it has K rows of dense

vectors, where K is a small number like a few hundred, each of cardinality M,

which may be large), which is a DistributedRowMatrix, and you have your original

document matrix C, which has N rows, each of which has cardinality M, then

you actually need to take the transpose of *both* matrices, then take

the DistributedRowMatrix.times() on these:

V_transpose = V.transpose();

C_transpose = C.transpose();

C_times_V_transpose = C_transpose.times(V_transpose);

This code will yield the mathematical result of C * V^T, which is probably

what you want.
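
A toy plain-Python illustration of the shapes involved (the dimensions and values here are made up; the distributed job does the same math at scale): C is N x M, V is K x M, and the fold-in projection C * V^T is N x K.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    # Naive dense matrix product, enough to show the shapes.
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


C = [[1.0, 0.0, 2.0],   # N = 2 documents, M = 3 terms
     [0.0, 1.0, 1.0]]
V = [[0.6, 0.0, 0.8],   # K = 2 basis rows, each of cardinality M = 3
     [0.0, 1.0, 0.0]]

projected = matmul(C, transpose(V))  # C * V^T, shape N x K
print(projected)  # each document expressed in the K-dimensional basis
```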

(it turns out that this set of operations could also be done in a custom

operation

using the row-paths of both V and C as inputs, but you'd still require two

MapReduce shuffles to get the answer, so it's not really a savings to do

this).

-jake


Thank you, Jake.

Yes, I have figured that out, and it seems that DRM.times does just that. I was

just not sure of the production quality of this code. It seems DRM has seen

a lot of fixes and discussions lately, including simple multiplication.

On a side note, one needs to compute C x V^t x Sigma^-1. But I have an

option in the stochastic SVD command line to compute V x Sigma^0.5 instead of

V and U x Sigma^0.5 instead of U, in which case the correction for singular

vectors indeed turns into the simple multiplication C x V^t, and the singular

values matrix can be ignored (especially if one wants to measure similarities

between a user and an item, not just user-user or item-item).

-d
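
The identity behind the scaled-output option can be checked numerically: if U' = U x Sigma^0.5 and V' = V x Sigma^0.5, then U' x V'^T = U x Sigma x V^T = C, so user-item similarities come out of plain dot products with no separate Sigma correction. A small hand-built sketch (the 2x2 factors are chosen by hand to form an exact SVD; none of this is Mahout code):

```python
import math

s = 1 / math.sqrt(2)
U = [[s, s], [s, -s]]   # orthonormal left singular vectors (rows)
V = [[s, s], [s, -s]]   # orthonormal right singular vectors (rows)
Sigma = [3.0, 2.0]      # singular values (diagonal entries)


def scale_cols(a, d, power):
    # Multiply column j of a by d[j] ** power, i.e. a * diag(d) ** power.
    return [[x * (dv ** power) for x, dv in zip(row, d)] for row in a]


def matmul_t(a, b):
    # Compute a * b^T for row-wise dense matrices.
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


C = matmul_t(scale_cols(U, Sigma, 1.0), V)   # U * Sigma * V^T
Up = scale_cols(U, Sigma, 0.5)               # U * Sigma^0.5
Vp = scale_cols(V, Sigma, 0.5)               # V * Sigma^0.5
reconstructed = matmul_t(Up, Vp)             # U' * V'^T

print(C)
print(reconstructed)  # equals C up to floating-point rounding
```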

On Thu, Jan 6, 2011 at 1:45 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:

> Dmitriy,

>

> I'm not sure if you figured this out on your own and I didn't see the

> email,

> but if not:

>

> On Thu, Dec 30, 2010 at 3:57 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]>

> wrote:

>

> > Also, if i have a bunch of new documents to fold-in, it looks like i'd

> need

> > to run a matrix multiplication job between new document vectors and V,

> both

> > matrices represented row-wise. So DistributedRowMatrix should help me,

> > shouldn't it? do i need to transpose the first matrix first?

> >

>

> If you have a dense matrix V of eigenvectors (ie, it has K (a small number

> like 100's) rows of dense vectors, each of which are cardinality M (which

> may large)), which is a DistributedRowMatrix, and you have your original

> document matrix C, which has N rows, each of which has cardinality M, then

> you actually need to take the transpose of *both* matrices, then take

> the DistributedRowMatrix.times() on these:

>

> V_transpose = V.transpose();

> C_transpose = C.transpose();

> C_times_V_transpose = C_transpose.times(V_transpose);

>

> This code will yield the mathematical result of C * V^T, which is probably

> what you want.

>

> (it turns out that this set of operations could also be done in a custom

> operation

> using the row-paths of both V and C as inputs, but you'd still require two

> MapReduce shuffles to get the answer, so it's not really a savings to do

> this).

>

> -jake

>
