Suneel Marthi

2012-01-20, 16:38

Sebastian Schelter

2012-01-20, 17:58

Suneel Marthi

2012-01-20, 22:02

Lance Norskog

2012-01-20, 22:48

Suneel Marthi

2012-01-31, 21:40

Lance Norskog

2012-02-01, 04:29

Sebastian Schelter

2012-02-01, 11:06

Vicky

2012-02-02, 11:47

Dan Brickley

2012-02-02, 12:59

Sebastian Schelter

2012-02-02, 13:08

Sebastian Schelter

2012-02-02, 13:08

Suneel Marthi

2012-02-02, 15:51

Lance Norskog

2012-02-03, 07:44

Sebastian Schelter

2012-02-04, 17:23

I am working on determining document similarity for a corpus using RowSimilarity.

Questions:

a) What do the parameters 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?

b) Are there any docs on RowSimilarityJob? The best I could find is this post on Sebastian's blog: http://ssc.io/rowsimilarityjob-on-steroids/ .

c) Also, do we have any docs on RowIdJob?

Thanks and Regards,

Suneel


Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows

(documents in your use case) to keep per document.

'excludeSelfSimilarity' means that rows (documents) should not be

compared to themselves.
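In other words, per row the job skips the trivial self-pair and keeps only the k best matches. A minimal Python sketch of that selection step (conceptual only, not Mahout's code; the function and variable names are made up):

```python
# Conceptual sketch only -- not Mahout's implementation. `scored_pairs`
# stands in for the similarities computed for one row; the two
# parameters control what is kept.

def top_similarities(row_id, scored_pairs, max_per_row, exclude_self):
    """scored_pairs: list of (other_row_id, similarity) for row_id."""
    if exclude_self:
        # 'excludeSelfSimilarity': drop the trivial (row, row) pair
        scored_pairs = [(r, s) for r, s in scored_pairs if r != row_id]
    # 'maxSimilaritiesPerRow': keep only the k most similar rows
    scored_pairs.sort(key=lambda p: p[1], reverse=True)
    return scored_pairs[:max_per_row]

print(top_similarities(0, [(0, 1.0), (1, 0.9), (2, 0.4), (3, 0.7)], 2, True))
# -> [(1, 0.9), (3, 0.7)]
```

With exclude_self set to False and no cap, every scored pair would be kept, including the (row, row) pair.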

Sorry for the lack of documentation; RowSimilarityJob was originally only

an internal job for the recommendation code. I'll try to add something

to the wiki in the next few days.

--sebastian

On 20.01.2012 17:38, Suneel Marthi wrote:

> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

>

> Questions:-

>

> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?

> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .

>

> c) Also do we have any docs on RowIdJob ?

>

> Thanks and Regards,

> Suneel

>


Thanks for the reply Sebastian.

For the task I am working on - 'determine similar documents in a collection' - would RowSimilarity be the right approach?

________________________________

From: Sebastian Schelter <[EMAIL PROTECTED]>

To: [EMAIL PROTECTED]

Sent: Friday, January 20, 2012 12:58 PM

Subject: Re: Question on RowSimilarityJob

Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows

(documents in your use case) to keep per document.

'excludeSelfSimilarity' means that rows (documents) should not be

compared to themselves.

Sry for the lack of documentation, RowSimilarityJob was originally only

an internal job for the recommendation code. I'll try to add something

on the wiki in the next days.

--sebastian

On 20.01.2012 17:38, Suneel Marthi wrote:

> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

>

> Questions:-

>

> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?

> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .

>

> c) Also do we have any docs on RowIdJob ?

>

> Thanks and Regards,

> Suneel

>


This can be done as a recommender or with clustering. As a

recommender, you would say, "find similar documents". In clustering,

you can take one document and find its neighbors in the cluster graph.

How much data do you have? An important point here is that most of

Mahout's recommender algorithms are in-memory programs. There is only

one Hadoop-based recommender. There are several Hadoop-based

clustering algorithms.

On Fri, Jan 20, 2012 at 2:02 PM, Suneel Marthi <[EMAIL PROTECTED]> wrote:

> Thanks for the reply Sebastian.

>

>

> For the task I am working on - 'determine similar documents in a collection' - would RowSimilarity be the right approach?

>

>

>

>

>

>

> ________________________________

> From: Sebastian Schelter <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]

> Sent: Friday, January 20, 2012 12:58 PM

> Subject: Re: Question on RowSimilarityJob

>

> Hi,

>

> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows

> (documents in your use case) to keep per document.

> 'excludeSelfSimilarity' means that rows (documents) should not be

> compared to themselves.

>

> Sry for the lack of documentation, RowSimilarityJob was originally only

> an internal job for the recommendation code. I'll try to add something

> on the wiki in the next days.

>

> --sebastian

>

>

> On 20.01.2012 17:38, Suneel Marthi wrote:

>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

>>

>> Questions:-

>>

>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?

>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .

>>

>> c) Also do we have any docs on RowIdJob ?

>>

>> Thanks and Regards,

>> Suneel

>>

--

Lance Norskog

[EMAIL PROTECTED]


Sebastian,

Question on the RowSimilarity job.

a) I created sequence files from my document corpus.

b) Created vectors from the sequence files with ngrams = 3, normalization = 2, minimum document frequency = 1 and minimum support = 1.

c) Ran the RowId job on the vectors generated in (b) - this gives me an M * N matrix where M = number of documents in my collection and N = cardinality of the vectors - correct?

d) Ran the RowSimilarity job on the matrix generated in (c) with the Cosine similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N.
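On the dimensions in the last two steps: the RowId output is indeed an M * N document-term matrix, and since RowSimilarityJob compares rows against rows, the result is conceptually a sparse M * M similarity matrix, thinned to at most maxSimilaritiesPerRow entries per row. A tiny in-memory analogue of that computation, assuming cosine similarity (toy data, not Mahout code):

```python
import math

# Toy M x N document-term matrix: M = 3 documents, N = 4 terms.
A = [
    [1.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# All pairwise row similarities -> conceptually an M x M matrix;
# RowSimilarityJob additionally truncates each row of this result
# to maxSimilaritiesPerRow entries.
S = [[cosine(A[i], A[j]) for j in range(len(A))] for i in range(len(A))]
print([[round(x, 2) for x in row] for row in S])
# -> [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Documents 0 and 1 are identical, so their similarity is 1.0; neither shares a term with document 2, so those similarities are 0.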

I am not sure I completely understand what's happening in the RowSimilarity job. I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand it, but I guess I need some help.

I am willing to put some docs on the wiki for RowSimilarityJob once I am done.

Thanks for your help.

________________________________

From: Sebastian Schelter <[EMAIL PROTECTED]>

To: [EMAIL PROTECTED]

Sent: Friday, January 20, 2012 12:58 PM

Subject: Re: Question on RowSimilarityJob

Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows

(documents in your use case) to keep per document.

'excludeSelfSimilarity' means that rows (documents) should not be

compared to themselves.

Sry for the lack of documentation, RowSimilarityJob was originally only

an internal job for the recommendation code. I'll try to add something

on the wiki in the next days.

--sebastian

On 20.01.2012 17:38, Suneel Marthi wrote:

> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

>

> Questions:-

>

> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?

> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .

>

> c) Also do we have any docs on RowIdJob ?

>

> Thanks and Regards,

> Suneel

>


In fact I will back off there. It is possible to use Hadoop to create

a data model, which you can then use to do recommendations for your

interactive search. It is similarly possible to do your clustering job

and use a disk-backed store to fetch a cluster.

I have not teased apart the various ways to compose a Hadoop-based

recommender job. There seem to be more options than I understood.

Lance

On Mon, Jan 23, 2012 at 6:06 AM, Suneel Marthi <[EMAIL PROTECTED]> wrote:

> Thanks Lance for your reply.

>

> I cannot go with any in-memory implementations as I am working with a corpus

> of ~600K documents.

>

> As regards clustering, I did try Mahout's various clustering implementations

> - kmeans, canopy and minhash. Minhash would have been the right algorithm

> for

> what I am trying to do but I never had good results from Mahout's minhash

> implementation.

>

> I had never tried any recommenders in Mahout, which one would you suggest; I

> would definitely need a Hadoop based solution as I am dealing with large

> data.

>

> Thanks,

> Suneel

>

> ________________________________

> From: Lance Norskog <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]; Suneel Marthi <[EMAIL PROTECTED]>

> Sent: Friday, January 20, 2012 5:48 PM

>

> Subject: Re: Question on RowSimilarityJob

>

> This can be done as a recommender or with clustering. As a

> recommender, you would say, "find similar documents". In clustering,

> you can take one document and find its neighbors in the cluster graph.

>

> How much data do you have? An important point here is that most of

> Mahout's recommender algorithms are in-memory programs. There is only

> one Hadoop-based recommender. There are several Hadoop-based

> clustering algorithms.

>

> On Fri, Jan 20, 2012 at 2:02 PM, Suneel Marthi <[EMAIL PROTECTED]>

> wrote:

>> Thanks for the reply Sebastian.

>>

>>

>> For the task I am working on - 'determine similar documents in a

>> collection' - would RowSimilarity be the right approach?

>>

>>

>>

>>

>>

>>

>> ________________________________

>> From: Sebastian Schelter <[EMAIL PROTECTED]>

>> To: [EMAIL PROTECTED]

>> Sent: Friday, January 20, 2012 12:58 PM

>> Subject: Re: Question on RowSimilarityJob

>>

>> Hi,

>>

>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows

>> (documents in your use case) to keep per document.

>> 'excludeSelfSimilarity' means that rows (documents) should not be

>> compared to themselves.

>>

>> Sry for the lack of documentation, RowSimilarityJob was originally only

>> an internal job for the recommendation code. I'll try to add something

>> on the wiki in the next days.

>>

>> --sebastian

>>

>>

>> On 20.01.2012 17:38, Suneel Marthi wrote:

>>> I am working on determining document similarity of a corpus I am working

>>> with using RowSimilarity.

>>>

>>> Questions:-

>>>

>>> a) What do the parameters - 'maxSimilaritiesPerRow' and

>>> 'excludeSelfSimilarity' mean?

>>> b) Are there any docs available on RowSimilarityJob available, this is

>>> the best I could find on Sebastian's blog -

>>> http://ssc.io/rowsimilarityjob-on-steroids/ .

>>>

>>> c) Also do we have any docs on RowIdJob ?

>>>

>>> Thanks and Regards,

>>> Suneel

>>>

>

>

>

> --

> Lance Norskog

> [EMAIL PROTECTED]

>

>

--

Lance Norskog

[EMAIL PROTECTED]


Here's a brief description of RowSimilarityJob:

The goal is to compute all pairwise similarities between the rows of a

sparse matrix A.

The computation should be executed in a way that only rows that have at

least one non-zero value in the same dimension (column) are compared. We

need this to avoid a quadratic number of pairwise comparisons.

Furthermore we should be able to 'embed' arbitrary similarity measures

and we should always be able to use a combiner in all MapReduce steps.

The computation is executed using three MapReduce passes:

In the first step, the rows of A are preprocessed via

VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

scaled to unit-length), a single number for each row of A is computed

via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

The second step operates on the rows of A' (the columns of A). The

mapper sums up all pairwise cooccurrences using

VectorSimilarityMeasure.aggregate() (as vectors, thereby using the

so-called 'stripes' pattern). The reducer sums up all cooccurrence

vectors for one row and uses the similarity measure and the precomputed

numbers from step one to compute all similarities via

VectorSimilarityMeasure.similarity().

The third step ensures that only the top k similar rows per row are kept.

It's hard to see from the code, but the job actually performs the matrix

multiplication AA' via outer products, with a modified

(similarity-measure-specific) dot product.
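The three passes can be mimicked on a single machine. The sketch below uses cosine as the embedded measure; its structure only loosely mirrors the VectorSimilarityMeasure calls named above, and it is illustrative only, not Mahout's implementation:

```python
import math
from collections import defaultdict

# Single-machine analogue of the three passes, with cosine as the
# embedded measure. Illustrative only -- not Mahout's implementation.

A = {  # sparse rows of A: row_id -> {column: value}
    0: {0: 1.0, 2: 2.0},
    1: {0: 1.0, 2: 2.0},
    2: {1: 3.0, 3: 1.0},
}

# Pass 1: precompute one number per row (here the L2 norm, which is
# what cosine needs) and form A' (the transpose).
norms = {i: math.sqrt(sum(v * v for v in row.values())) for i, row in A.items()}
At = defaultdict(dict)  # column -> {row: value}
for i, row in A.items():
    for j, v in row.items():
        At[j][i] = v

# Pass 2: walk the rows of A' (the columns of A); rows co-occurring in
# a column contribute a product to that pair ('stripes'-style
# aggregation). The aggregated dot products plus the precomputed norms
# give the similarities. Pairs with no shared non-zero column are
# never even formed -- this is how the quadratic blow-up is avoided.
dots = defaultdict(float)
for col in At.values():
    for a in col:
        for b in col:
            if a < b:
                dots[(a, b)] += col[a] * col[b]

similarities = {p: d / (norms[p[0]] * norms[p[1]]) for p, d in dots.items()}
# Pass 3 would now keep only the top-k entries per row.
print(sorted(similarities))
# -> [(0, 1)]   (rows 0 and 1 share columns; row 2 is compared to nobody)
```

Note that the pair (0, 2) never appears: rows 0 and 2 have no non-zero value in any common column, so no comparison is attempted.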

/s

On 31.01.2012 22:40, Suneel Marthi wrote:

> Sebastian,

>

> Question on the RowSimilarity job.

>

> a) I created sequence files from my document corpus.

> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1

> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct?

>

> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N.

>

>

> I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.

>

> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.

>

> Thanks for your help.

>

>

>

> ________________________________

> From: Sebastian Schelter <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]

> Sent: Friday, January 20, 2012 12:58 PM

> Subject: Re: Question on RowSimilarityJob

>

> Hi,

>

> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows

> (documents in your use case) to keep per document.

> 'excludeSelfSimilarity' means that rows (documents) should not be

> compared to themselves.

>

> Sry for the lack of documentation, RowSimilarityJob was originally only

> an internal job for the recommendation code. I'll try to add something

> on the wiki in the next days.

>

> --sebastian

>

>

> On 20.01.2012 17:38, Suneel Marthi wrote:

>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

>>

>> Questions:-

>>

>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?

>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .

>>

>> c) Also do we have any docs on RowIdJob ?

>>

>> Thanks and Regards,

>> Suneel

>>


Sebastian,

when you said,

>>The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

If there are four columns in the documents (columnA, columnB, columnC, columnD):

Doc1: columnA, columnB, columnC, columnD

Doc2: columnA, columnB, columnC, columnD

will the comparison be like Doc1:columnA against Doc2:columnA; Doc1:columnB against Doc2:columnB; Doc1:columnC against Doc2:columnC, and likewise?
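For intuition on this question: with a measure like cosine, matching columns are multiplied pairwise and the products are summed into one score per document pair, so a column that is zero in either document contributes nothing. A toy sketch with hypothetical values:

```python
# Hypothetical term weights for the four columns of two documents.
doc1 = {"columnA": 1.0, "columnB": 2.0, "columnC": 0.0, "columnD": 1.0}
doc2 = {"columnA": 3.0, "columnB": 0.0, "columnC": 4.0, "columnD": 1.0}

# Matching columns are multiplied pairwise and summed into ONE score
# for the document pair; any column that is zero in either document
# contributes nothing to the sum.
dot = sum(doc1[c] * doc2[c] for c in doc1)  # 1*3 + 2*0 + 0*4 + 1*1
print(dot)
# -> 4.0
```

Cosine similarity then divides this dot product by the product of the two documents' vector lengths.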

----- Original Message -----

From: Sebastian Schelter <[EMAIL PROTECTED]>

To: [EMAIL PROTECTED]

Cc:

Sent: Wednesday, February 1, 2012 6:06 AM

Subject: Re: Question on RowSimilarityJob

Here's a brief description of RowSimilarityJob:

The goal is to compute all pairwise similarities between the rows of a

sparse matrix A.

The computation should be executed in a way that only rows that have at

least one non-zero value in the same dimension (column) are compared. We

need this to avoid a quadratic number of pairwise comparisons.

Furthermore we should be able to 'embed' arbitrary similarity measures

and we should always be able to use a combiner in all MapReduce steps.

The computation is executed using three MapReduce passes:

In the first step, the rows of A are preprocessed via

VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

scaled to unit-length), a single number for each row of A is computed

via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

The second steps operates on the rows of A' (the columns of A). The

mapper sums up all pairwise cooccurrences using

VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

called 'stripes' pattern). The reducers sums up all cooccurrence vectors

for one row and uses the similarity measure and the precomputed numbers

from step one to compute all similarities via

VectorSimilarityMeasure.similarity().

The third step ensures that only the top k similar rows per row are kept.

It's hard to see from the code but actually the job performs the matrix

multiplication AA' via outer products with a modified (similarity

measure specific) dot product.

/s

On 31.01.2012 22:40, Suneel Marthi wrote:

> Sebastian,

>

> Question on the RowSimilarity job.

>

> a) I created sequence files from my document corpus.

> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1

> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct?

>

> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N.

>

>

> I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.

>

> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.

>

> Thanks for your help.

>

>

>

> ________________________________

> From: Sebastian Schelter <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]

> Sent: Friday, January 20, 2012 12:58 PM

> Subject: Re: Question on RowSimilarityJob

>

> Hi,

>

> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows

> (documents in your use case) to keep per document.

> 'excludeSelfSimilarity' means that rows (documents) should not be

> compared to themselves.

>

> Sry for the lack of documentation, RowSimilarityJob was originally only

> an internal job for the recommendation code. I'll try to add something

> on the wiki in the next days.

>

> --sebastian

>

>

> On 20.01.2012 17:38, Suneel Marthi wrote:

>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

when you said,

>>The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

If a document has four columns (columnA, columnB, columnC, columnD):

Doc1: columnA, columnB, columnC, columnD

Doc2: columnA, columnB, columnC, columnD

Will the comparison be Doc1:columnA against Doc2:columnA; Doc1:columnB against Doc2:columnB; Doc1:columnC against Doc2:columnC, and so on?

----- Original Message -----

From: Sebastian Schelter <[EMAIL PROTECTED]>

To: [EMAIL PROTECTED]

Cc:

Sent: Wednesday, February 1, 2012 6:06 AM

Subject: Re: Question on RowSimilarityJob

Here's a brief description of RowSimilarityJob:

The goal is to compute all pairwise similarities between the rows of a

sparse matrix A.

The computation should be executed in a way that only rows that have at

least one non-zero value in the same dimension (column) are compared. We

need this to avoid a quadratic number of pairwise comparisons.

Furthermore we should be able to 'embed' arbitrary similarity measures

and we should always be able to use a combiner in all MapReduce steps.

The computation is executed using three MapReduce passes:

In the first step, the rows of A are preprocessed via

VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

scaled to unit-length), a single number for each row of A is computed

via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

The second step operates on the rows of A' (the columns of A). The

mapper sums up all pairwise cooccurrences using

VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

called 'stripes' pattern). The reducer sums up all cooccurrence vectors

for one row and uses the similarity measure and the precomputed numbers

from step one to compute all similarities via

VectorSimilarityMeasure.similarity().

The third step ensures that only the top k similar rows per row are kept.

It's hard to see from the code but actually the job performs the matrix

multiplication AA' via outer products with a modified (similarity

measure specific) dot product.
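The three passes can be sketched in plain Python (an illustrative analogy, not the Mahout implementation; the normalize/norm/similarity steps here hard-code cosine, where Mahout's VectorSimilarityMeasure keeps them pluggable):

```python
import math
from collections import defaultdict

# Sparse matrix A: row -> {column: value}. Names are made up for the sketch.
A = {
    "doc1": {"t1": 2.0, "t2": 1.0},
    "doc2": {"t1": 1.0, "t3": 4.0},
    "doc3": {"t2": 3.0},
}

# Pass 1: one number per row (for cosine, the L2 norm) and form A' (transpose).
norms = {r: math.sqrt(sum(v * v for v in cols.values())) for r, cols in A.items()}
At = defaultdict(dict)
for r, cols in A.items():
    for c, v in cols.items():
        At[c][r] = v

# Pass 2: each row of A' yields cooccurring row pairs; sum their contributions,
# then turn the aggregates into similarities using the norms from pass 1.
dots = defaultdict(float)
for c, entries in At.items():
    rows = sorted(entries)
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            dots[(a, b)] += entries[a] * entries[b]

sims = {pair: d / (norms[pair[0]] * norms[pair[1]]) for pair, d in dots.items()}

# Pass 3 would keep only the top-k similar rows per row (maxSimilaritiesPerRow).
# Note that ("doc2", "doc3") never appears: they share no column, so they are
# never compared -- which is where the sparsity savings come from.
```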

/s

On 31.01.2012 22:40, Suneel Marthi wrote:

> Sebastian,

>

> Question on the RowSimilarity job.

>

> a) I created sequence files from my document corpus.

> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1

> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct?

>

> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N.

>

>

> I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.

>

> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.

>

> Thanks for your help.

>

>

>

> ________________________________

> From: Sebastian Schelter <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]

> Sent: Friday, January 20, 2012 12:58 PM

> Subject: Re: Question on RowSimilarityJob

>

> Hi,

>

> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows

> (documents in your use case) to keep per document.

> 'excludeSelfSimilarity' means that rows (documents) should not be

> compared to themselves.

>

> Sry for the lack of documentation, RowSimilarityJob was originally only

> an internal job for the recommendation code. I'll try to add something

> on the wiki in the next days.

>

> --sebastian

>

>

> On 20.01.2012 17:38, Suneel Marthi wrote:

>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.

On 1 February 2012 12:06, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> Here's a brief description of RowSimilarityJob:

>

> The goal is to compute all pairwise similarities between the rows of a

> sparse matrix A.

>

> The computation should be executed in a way that only rows that have at

> least one non-zero value in the same dimension (column) are compared. We

> need this to avoid a quadratic number of pairwise comparisons.

> Furthermore we should be able to 'embed' arbitrary similarity measures

> and we should always be able to use a combiner in all MapReduce steps.

>

> The computation is executed using three MapReduce passes:

>

> In the first step, the rows of A are preprocessed via

> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

> scaled to unit-length), a single number for each row of A is computed

> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

>

> The second steps operates on the rows of A' (the columns of A). The

> mapper sums up all pairwise cooccurrences using

> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

> called 'stripes' pattern). The reducers sums up all cooccurrence vectors

> for one row and uses the similarity measure and the precomputed numbers

> from step one to compute all similarities via

> VectorSimilarityMeasure.similarity().

>

> The third step ensures that only the top k similar rows per row are kept.

>

> It's hard to see from the code but actually the job performs the matrix

> multiplication AA' via outer products with a modified (similarity

> measure specific) dot product.

Thanks for this. I've taken liberty of dropping it in the Wiki at

https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob

(and linking this thread). Hope that's ok...

cheers,

Dan

The heart of RowSimilarityJob is simple cooccurrence counting (with some

tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (which could represent, e.g., items rated by users

or word counts per document):

row1: column1, column2

row2: column1, column3

row3: column1, column2

In the first MapReduce pass, we create an inverted index over the

columns (i.e., we transpose the matrix):

column1: row1, row2, row3

column2: row1, row3

column3: row2

In the second MapReduce pass, the mapper emits all cooccurring row

pairs per column:

for column1:

(row1,row2)

(row1,row3)

(row2,row1)

(row2,row3)

(row3,row1)

(row3,row2)

for column2:

(row1,row3)

(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply

needs to sum them up:

for row1:

(row1,row2)

(row1,row3)

(row1,row3)

row1,row2 -> 1 cooccurrence

row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)

(row2,row3)

row2,row1 -> 1 cooccurrence

row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)

(row3,row2)

(row3,row1)

row3,row1 -> 2 cooccurrences

row3,row2 -> 1 cooccurrence
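The counting walked through above can be reproduced with a short sketch (plain Python, not the Mahout code; row and column names match the example):

```python
from collections import defaultdict
from itertools import permutations

# The 3x3 example: each row lists the columns with a non-zero entry.
rows = {
    "row1": ["column1", "column2"],
    "row2": ["column1", "column3"],
    "row3": ["column1", "column2"],
}

# First pass: inverted index (column -> rows), i.e. the transpose.
index = defaultdict(list)
for row, cols in sorted(rows.items()):
    for col in cols:
        index[col].append(row)

# Second pass: emit every ordered pair of rows sharing a column, then sum.
# permutations(rs, 2) never pairs a row with itself (excludeSelfSimilarity).
counts = defaultdict(int)
for col, rs in index.items():
    for a, b in permutations(rs, 2):
        counts[(a, b)] += 1

print(counts[("row1", "row3")])  # 2: cooccur in column1 and column2
print(counts[("row1", "row2")])  # 1: cooccur in column1 only
```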

Hope that helps with understanding the approach.

--sebastian

On 02.02.2012 12:47, Vicky wrote:

> Sebastian,

> when you said,

>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

>

>

> if there are three columns in the document (columnA,columnB,columnC,columnD).

>

> Doc1: ColumnA, columnB, columnC,columnD

> Doc2:ColumnA,columnB,columnC,columnD

>

> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?

>

>

>

>

> ----- Original Message -----

> From: Sebastian Schelter <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]

> Cc:

> Sent: Wednesday, February 1, 2012 6:06 AM

> Subject: Re: Question on RowSimilarityJob

>

> Here's a brief description of RowSimilarityJob:

>

> The goal is to compute all pairwise similarities between the rows of a

> sparse matrix A.

>

> The computation should be executed in a way that only rows that have at

> least one non-zero value in the same dimension (column) are compared. We

> need this to avoid a quadratic number of pairwise comparisons.

> Furthermore we should be able to 'embed' arbitrary similarity measures

> and we should always be able to use a combiner in all MapReduce steps.

>

> The computation is executed using three MapReduce passes:

>

> In the first step, the rows of A are preprocessed via

> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

> scaled to unit-length), a single number for each row of A is computed

> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

>

> The second steps operates on the rows of A' (the columns of A). The

> mapper sums up all pairwise cooccurrences using

> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

> called 'stripes' pattern). The reducers sums up all cooccurrence vectors

> for one row and uses the similarity measure and the precomputed numbers

> from step one to compute all similarities via

> VectorSimilarityMeasure.similarity().

>

> The third step ensures that only the top k similar rows per row are kept.

>

> It's hard to see from the code but actually the job performs the matrix

> multiplication AA' via outer products with a modified (similarity

> measure specific) dot product.

>

> /s

>

>

>

> On 31.01.2012 22:40, Suneel Marthi wrote:

>> Sebastian,

>>

>> Question on the RowSimilarity job.

>>

>> a) I created sequence files from my document corpus.

>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1

>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct?

>>

>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N.

Great! Thank you!

On 02.02.2012 13:59, Dan Brickley wrote:

> On 1 February 2012 12:06, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

>> Here's a brief description of RowSimilarityJob:

>>

>> The goal is to compute all pairwise similarities between the rows of a

>> sparse matrix A.

>>

>> The computation should be executed in a way that only rows that have at

>> least one non-zero value in the same dimension (column) are compared. We

>> need this to avoid a quadratic number of pairwise comparisons.

>> Furthermore we should be able to 'embed' arbitrary similarity measures

>> and we should always be able to use a combiner in all MapReduce steps.

>>

>> The computation is executed using three MapReduce passes:

>>

>> In the first step, the rows of A are preprocessed via

>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

>> scaled to unit-length), a single number for each row of A is computed

>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

>>

>> The second steps operates on the rows of A' (the columns of A). The

>> mapper sums up all pairwise cooccurrences using

>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors

>> for one row and uses the similarity measure and the precomputed numbers

>> from step one to compute all similarities via

>> VectorSimilarityMeasure.similarity().

>>

>> The third step ensures that only the top k similar rows per row are kept.

>>

>> It's hard to see from the code but actually the job performs the matrix

>> multiplication AA' via outer products with a modified (similarity

>> measure specific) dot product.

>

> Thanks for this. I've taken liberty of dropping it in the Wiki at

> https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob

> (and linking this thread). Hope that's ok...

>

> cheers,

>

> Dan

Thanks for the details, Sebastian. This is indeed very helpful and I completely understand now what you had on your whiteboard.

I had been normalizing my vectors during seq2sparse (with the L2 norm, since I would be using Cosine/Tanimoto similarity in RowSimilarityJob). But since the first step of RowSimilarityJob normalizes the vectors anyway, I don't have to do that in seq2sparse; doing it there wouldn't affect the final outcome.

This has been a very useful thread and thanks for all your help.
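That reasoning can be checked directly: cosine similarity is scale-invariant, so scaling the vectors to unit length beforehand does not change the result (a plain-Python sanity check, not Mahout code):

```python
import math

def cosine(u, v):
    # Standard cosine similarity: dot product over the product of L2 norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def unit(v):
    # Scale a vector to unit length (what an L2 pre-normalization would do).
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

u, v = [3.0, 4.0], [1.0, 2.0]
# Normalizing first leaves the cosine similarity unchanged.
assert abs(cosine(u, v) - cosine(unit(u), unit(v))) < 1e-9
```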

________________________________

From: Sebastian Schelter <[EMAIL PROTECTED]>

To: Vicky <[EMAIL PROTECTED]>

Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>

Sent: Thursday, February 2, 2012 8:08 AM

Subject: Re: Question on RowSimilarityJob

The heart of RowSimilarityJob is simple cooccurrence counting (with some

tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with

word counts e.g.)

row1: column1, column2

row2: column1, column3

row3: column1, column2

In the first MapReduce pass, we created an inverted index over the

columns (which means we transpose the matrix)

column1: row1, row2, row3

column2: row1, row3

column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row

pairs per column:

for column1:

(row1,row2)

(row1,row3)

(row2,row1)

(row2,row3)

(row3,row1)

(row3,row2)

for column2:

(row1,row3)

(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply

needs to sum them up:

for row1:

(row1,row2)

(row1,row3)

(row1,row3)

row1,row2 -> 1 cooccurrence

row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)

(row2,row3)

row2,row1 -> 1 cooccurrence

row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)

(row3,row2)

(row3,row1)

row3,row1 -> 2 cooccurrences

row3,row2 -> 1 cooccurrence

Hope that helps with understanding the approach.

--sebastian

On 02.02.2012 12:47, Vicky wrote:

> Sebastian,

> when you said,

>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

>

>

> if there are three columns in the document (columnA,columnB,columnC,columnD).

>

> Doc1: ColumnA, columnB, columnC,columnD

> Doc2:ColumnA,columnB,columnC,columnD

>

> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?

>

>

>

>

> ----- Original Message -----

> From: Sebastian Schelter <[EMAIL PROTECTED]>

> To: [EMAIL PROTECTED]

> Cc:

> Sent: Wednesday, February 1, 2012 6:06 AM

> Subject: Re: Question on RowSimilarityJob

>

> Here's a brief description of RowSimilarityJob:

>

> The goal is to compute all pairwise similarities between the rows of a

> sparse matrix A.

>

> The computation should be executed in a way that only rows that have at

> least one non-zero value in the same dimension (column) are compared. We

> need this to avoid a quadratic number of pairwise comparisons.

> Furthermore we should be able to 'embed' arbitrary similarity measures

> and we should always be able to use a combiner in all MapReduce steps.

>

> The computation is executed using three MapReduce passes:

>

> In the first step, the rows of A are preprocessed via

> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

> scaled to unit-length), a single number for each row of A is computed

> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

>

> The second steps operates on the rows of A' (the columns of A). The

> mapper sums up all pairwise cooccurrences using

> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

> called 'stripes' pattern). The reducers sums up all cooccurrence vectors

> for one row and uses the similarity measure and the precomputed numbers

> from step one to compute all similarities via

> VectorSimilarityMeasure.similarity().

>

> The third step ensures that only the top k similar rows per row are kept.

Please add these details to the wiki page. I had no idea what this

thing does: I thought it was doing matrix multiplies.
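It effectively is one: as Sebastian notes, summing the per-column outer products of A (the rows of A') reproduces the matrix product AA'. A small pure-Python check (illustrative only, with a made-up 2x3 matrix):

```python
# A small check that summing outer products of the columns of A
# reproduces the Gram matrix A * A' computed via row dot products.
A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]

n_rows, n_cols = len(A), len(A[0])

# Sum of outer products col_k * col_k^T over all columns k.
S = [[0.0] * n_rows for _ in range(n_rows)]
for k in range(n_cols):
    col = [A[i][k] for i in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_rows):
            S[i][j] += col[i] * col[j]

# Direct AA' via dot products of row pairs.
G = [[sum(A[i][k] * A[j][k] for k in range(n_cols)) for j in range(n_rows)]
     for i in range(n_rows)]

assert S == G  # both are [[5.0, 2.0], [2.0, 10.0]]
```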

On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> The heart of RowSimilarityJob is simple cooccurrence counting (with some

> tweaks and optimizations). I'll give a simplified example:

>

> We have a 3x3 matrix (could represent items rated by users, docs with

> word counts e.g.)

>

> row1: column1, column2

> row2: column1, column3

> row3: column1, column2

>

> In the first MapReduce pass, we created an inverted index over the

> columns (which means we transpose the matrix)

>

> column1: row1, row2, row3

> column2: row1, row3

> column3: row2

>

> In the second MapReduce pass, we the mapper emits all coocurring row

> pairs per column:

>

> for column1:

>

> (row1,row2)

> (row1,row3)

> (row2,row1)

> (row2,row3)

> (row3,row1)

> (row3,row2)

>

> for column2:

>

> (row1,row3)

> (row3,row1)

>

> for column3 there's nothing to emit.

>

> The reducer receives all cooccurrences for a particular row and simply

> needs to sum them up:

>

> for row1:

>

> (row1,row2)

> (row1,row3)

> (row1,row3)

>

> row1,row2 -> 1 cooccurrence

> row1,row3 -> 2 cooccurrences

>

> for row2:

>

> (row2,row1)

> (row2,row3)

>

> row2,row1 -> 1 cooccurrence

> row2,row3 -> 1 cooccurrence

>

> for row3:

>

> (row3,row1)

> (row3,row2)

> (row3,row1)

>

> row3,row1 -> 2 cooccurrences

> row3,row2 -> 1 cooccurrence

>

>

> Hope that helps with understanding the approach.

>

> --sebastian

>

>

>

>

> On 02.02.2012 12:47, Vicky wrote:

>> Sebastian,

>> when you said,

>>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

>>

>>

>> if there are three columns in the document (columnA,columnB,columnC,columnD).

>>

>> Doc1: ColumnA, columnB, columnC,columnD

>> Doc2:ColumnA,columnB,columnC,columnD

>>

>> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?

>>

>>

>>

>>

>> ----- Original Message -----

>> From: Sebastian Schelter <[EMAIL PROTECTED]>

>> To: [EMAIL PROTECTED]

>> Cc:

>> Sent: Wednesday, February 1, 2012 6:06 AM

>> Subject: Re: Question on RowSimilarityJob

>>

>> Here's a brief description of RowSimilarityJob:

>>

>> The goal is to compute all pairwise similarities between the rows of a

>> sparse matrix A.

>>

>> The computation should be executed in a way that only rows that have at

>> least one non-zero value in the same dimension (column) are compared. We

>> need this to avoid a quadratic number of pairwise comparisons.

>> Furthermore we should be able to 'embed' arbitrary similarity measures

>> and we should always be able to use a combiner in all MapReduce steps.

>>

>> The computation is executed using three MapReduce passes:

>>

>> In the first step, the rows of A are preprocessed via

>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or

>> scaled to unit-length), a single number for each row of A is computed

>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

>>

>> The second steps operates on the rows of A' (the columns of A). The

>> mapper sums up all pairwise cooccurrences using

>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so

>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors

>> for one row and uses the similarity measure and the precomputed numbers

>> from step one to compute all similarities via

>> VectorSimilarityMeasure.similarity().

>>

>> The third step ensures that only the top k similar rows per row are kept.

>>

>> It's hard to see from the code but actually the job performs the matrix

>> multiplication AA' via outer products with a modified (similarity

>> measure specific) dot product.

>>

>> /s

>>

>>

>>

>> On 31.01.2012 22:40, Suneel Marthi wrote:

>>> Sebastian,

>>>

>>> Question on the RowSimilarity job.

>>>

>>> a) I created sequence files from my document corpus.

>>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1

Lance Norskog

[EMAIL PROTECTED]

thing does: I though it was doing matrix multiplies.

On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> The heart of RowSimilarityJob is simple cooccurrence counting (with some

> tweaks and optimizations). I'll give a simplified example:

>

> We have a 3x3 matrix (which could e.g. represent items rated by users

> or docs with word counts):

>

> row1: column1, column2

> row2: column1, column3

> row3: column1, column2

>

> In the first MapReduce pass, we create an inverted index over the

> columns (which means we transpose the matrix):

>

> column1: row1, row2, row3

> column2: row1, row3

> column3: row2

>

> In the second MapReduce pass, the mapper emits all cooccurring row

> pairs per column:

>

> for column1:

>

> (row1,row2)

> (row1,row3)

> (row2,row1)

> (row2,row3)

> (row3,row1)

> (row3,row2)

>

> for column2:

>

> (row1,row3)

> (row3,row1)

>

> for column3 there's nothing to emit.

>

> The reducer receives all cooccurrences for a particular row and simply

> needs to sum them up:

>

> for row1:

>

> (row1,row2)

> (row1,row3)

> (row1,row3)

>

> row1,row2 -> 1 cooccurrence

> row1,row3 -> 2 cooccurrences

>

> for row2:

>

> (row2,row1)

> (row2,row3)

>

> row2,row1 -> 1 cooccurrence

> row2,row3 -> 1 cooccurrence

>

> for row3:

>

> (row3,row1)

> (row3,row2)

> (row3,row1)

>

> row3,row1 -> 2 cooccurrences

> row3,row2 -> 1 cooccurrence
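
The two passes in this example can be sketched in a few lines of Python (a toy reimplementation of the idea, not Mahout code):

```python
from collections import defaultdict
from itertools import permutations

# row -> set of non-zero columns, matching the example above
rows = {
    "row1": {"column1", "column2"},
    "row2": {"column1", "column3"},
    "row3": {"column1", "column2"},
}

# Pass 1: build the inverted index (transpose): column -> rows
columns = defaultdict(set)
for row, cols in rows.items():
    for col in cols:
        columns[col].add(row)

# Pass 2: emit every ordered pair of cooccurring rows per column, then sum
cooccurrences = defaultdict(int)
for rs in columns.values():
    for a, b in permutations(rs, 2):
        cooccurrences[(a, b)] += 1

print(cooccurrences[("row1", "row3")])  # 2
print(cooccurrences[("row2", "row3")])  # 1
```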

>

>

> Hope that helps with understanding the approach.

>

> --sebastian

>

>

>

>

> On 02.02.2012 12:47, Vicky wrote:

>> Sebastian,

>> when you said,

>>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

>>

>>

>> if there are four columns in the document (columnA, columnB, columnC, columnD):

>>

>> Doc1: columnA, columnB, columnC, columnD

>> Doc2: columnA, columnB, columnC, columnD

>>

>> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?

>>

>>

>>

>>

>> ----- Original Message -----

>> From: Sebastian Schelter <[EMAIL PROTECTED]>

>> To: [EMAIL PROTECTED]

>> Cc:

>> Sent: Wednesday, February 1, 2012 6:06 AM

>> Subject: Re: Question on RowSimilarityJob

>>

>> Here's a brief description of RowSimilarityJob:

>>

>> The goal is to compute all pairwise similarities between the rows of a

>> sparse matrix A.

>>

>> [...]

Lance Norskog

[EMAIL PROTECTED]

Well these details are only a very, very simplified view of the job's

actual implementation, so it might confuse the readers.

And actually RowSimilarityJob is executing a kind of matrix

multiplication: if A is a 0-1 matrix and we create AA' by summing outer

products, we end up with a computation that is very similar to the example.
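
That equivalence is easy to check on the toy matrix from the example; a small sketch in plain Python (no Mahout involved):

```python
# 0-1 matrix from the example: rows over columns (column1..column3)
A = [
    [1, 1, 0],  # row1
    [1, 0, 1],  # row2
    [1, 1, 0],  # row3
]

# A·A' by explicit sums: entry (i, j) counts the columns where both rows
# are 1, i.e. exactly the cooccurrence counts from the earlier example.
n, m = len(A), len(A[0])
AAt = [[sum(A[i][k] * A[j][k] for k in range(m)) for j in range(n)]
       for i in range(n)]

print(AAt[0][2])  # 2  (row1/row3 cooccur in column1 and column2)
print(AAt[0][1])  # 1  (row1/row2 cooccur only in column1)
```

The diagonal entries count each row against itself, which is why the job offers 'excludeSelfSimilarity'.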

--sebastian

On 03.02.2012 08:44, Lance Norskog wrote:

> Please add these details to the wiki page. I had no idea what this

> thing does: I thought it was doing matrix multiplies.

>

> On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

>> [...]