Sebastian,
when you said,
>>The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
If there are four columns in the document (columnA, columnB, columnC, columnD):
Doc1: columnA, columnB, columnC, columnD
Doc2: columnA, columnB, columnC, columnD
will the comparison be Doc1:columnA against Doc2:columnA, Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC, and so on?
----- Original Message -----
From: Sebastian Schelter <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Sent: Wednesday, February 1, 2012 6:06 AM
Subject: Re: Question on RowSimilarityJob
Here's a brief description of RowSimilarityJob:
The goal is to compute all pairwise similarities between the rows of a
sparse matrix A.
The computation should be executed in a way that only rows that have at
least one non-zero value in the same dimension (column) are compared. We
need this to avoid a quadratic number of pairwise comparisons.
Furthermore, we should be able to 'embed' arbitrary similarity measures,
and we should always be able to use a combiner in all MapReduce steps.
The computation is executed using three MapReduce passes:
In the first step, the rows of A are preprocessed via
VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
scaled to unit-length), a single number for each row of A is computed
via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
The second step operates on the rows of A' (the columns of A). The
mapper sums up all pairwise cooccurrences using
VectorSimilarityMeasure.aggregate() (as vectors, thereby using the
so-called 'stripes' pattern). The reducer sums up all cooccurrence vectors
for one row and uses the similarity measure and the precomputed numbers
from step one to compute all similarities via
VectorSimilarityMeasure.similarity().
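To make those hooks concrete, here is a minimal single-machine sketch of a cosine-style measure expressed in the same four terms. It is an illustration only: the simplified signatures (sparse rows as Map<Integer, Double>) are my own, and Mahout's actual VectorSimilarityMeasure interface differs (its cosine implementation may, for instance, fold the scaling into normalize()).

import java.util.Map;

/** Illustration only: a cosine measure expressed via the four hooks described above. */
class CosineSketch {

  /** normalize(): a no-op here; a real measure might binarize or scale rows to unit length. */
  static Map<Integer, Double> normalize(Map<Integer, Double> row) {
    return row;
  }

  /** norm(): one number per row, here the L2 norm. */
  static double norm(Map<Integer, Double> row) {
    double sumOfSquares = 0.0;
    for (double v : row.values()) {
      sumOfSquares += v * v;
    }
    return Math.sqrt(sumOfSquares);
  }

  /** aggregate(): the contribution of one shared dimension to the cooccurrence sum. */
  static double aggregate(double valueA, double valueB) {
    return valueA * valueB;
  }

  /** similarity(): combine the summed cooccurrences with the precomputed norms from step one. */
  static double similarity(double summedCooccurrences, double normA, double normB) {
    return summedCooccurrences / (normA * normB);
  }
}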
The third step ensures that only the top k similar rows per row are kept.
It's hard to see from the code, but the job actually performs the matrix
multiplication AA' via outer products with a modified (similarity-measure-specific)
dot product.
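That outer-product formulation can also be sketched on a single machine: walk over the columns of A and, for every pair of rows with a non-zero entry in the same column, add their contribution to that pair's running dot product. Row pairs that never share a column simply never appear, which is exactly the pruning described above. This is a hedged sketch, not the actual mapper/reducer code, and the column representation is assumed.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: pairwise dot products via outer products over the columns of A. */
class OuterProductSketch {

  /**
   * columns.get(c) holds the non-zero entries of column c as (rowId -> value).
   * Returns the summed cooccurrences for every row pair sharing at least one column.
   */
  static Map<String, Double> pairwiseDots(List<Map<Integer, Double>> columns) {
    Map<String, Double> dots = new HashMap<>();
    for (Map<Integer, Double> column : columns) {
      for (Map.Entry<Integer, Double> a : column.entrySet()) {
        for (Map.Entry<Integer, Double> b : column.entrySet()) {
          if (a.getKey() < b.getKey()) {                       // each unordered pair once
            String pair = a.getKey() + "," + b.getKey();
            double contribution = a.getValue() * b.getValue(); // = aggregate() for cosine
            dots.merge(pair, contribution, Double::sum);
          }
        }
      }
    }
    return dots;
  }
}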
/s
On 31.01.2012 22:40, Suneel Marthi wrote:
> Sebastian,
>
> Question on the RowSimilarity job.
>
> a) I created sequence files from my document corpus.
> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
> c) Ran the RowId job on the vectors generated in (b) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vectors - correct?
>
> d) Ran the RowSimilarity job on the matrix generated in (c) with the Cosine similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N.
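As a concrete illustration of step (d), the job can also be driven from Java via Hadoop's ToolRunner rather than bin/mahout. This is a sketch under assumptions: the option names are the ones I recall from the 0.6-era RowSimilarityJob and the package path has moved between releases, so verify both against the job's --help output for your Mahout version; paths and the column count are placeholders.

import org.apache.hadoop.util.ToolRunner;
// Assumption: package path as of the 0.6-era refactor; it has moved between releases.
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class RowSimilarityDriverSketch {
  public static void main(String[] args) throws Exception {
    // Input is the 'matrix' part written by the RowId job in step (c).
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/path/to/rowid/matrix",
        "--output", "/path/to/similarities",
        "--numberOfColumns", "20000",              // N as reported by the RowId step (placeholder)
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "100",
        "--excludeSelfSimilarity", "true"
    });
  }
}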
>
>
> I am not sure I completely understand what's happening in the RowSimilarityJob. I did read the paper at
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard
> (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while, but I guess I need some help.
>
> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>
> Thanks for your help.
>
>
>
> ________________________________
> From: Sebastian Schelter <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, January 20, 2012 12:58 PM
> Subject: Re: Question on RowSimilarityJob
>
> Hi,
>
> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
> (documents in your use case) to keep per document.
> 'excludeSelfSimilarity' means that rows (documents) should not be
> compared to themselves.
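Put differently, after all similarities have been computed, each row keeps at most 'maxSimilaritiesPerRow' of its most similar other rows, and with 'excludeSelfSimilarity' the trivial row-vs-itself pair is dropped. Here is a minimal single-machine sketch of that filtering step (not Mahout's actual code; the plain-Map representation of one row's similarity list is assumed):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustration only: what 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' boil down to. */
class TopKSketch {

  /** Keep the k most similar other rows for one row, optionally dropping the row itself. */
  static List<Map.Entry<Integer, Double>> topK(int row,
                                               Map<Integer, Double> similarRows,
                                               int k,
                                               boolean excludeSelfSimilarity) {
    List<Map.Entry<Integer, Double>> candidates = new ArrayList<>(similarRows.entrySet());
    if (excludeSelfSimilarity) {
      candidates.removeIf(e -> e.getKey() == row);   // drop the row compared against itself
    }
    candidates.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // most similar first
    return candidates.subList(0, Math.min(k, candidates.size()));
  }
}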
>
> Sorry for the lack of documentation; RowSimilarityJob was originally only
> an internal job for the recommendation code. I'll try to add something
> to the wiki in the next few days.
>
> --sebastian
>
>
> On 20.01.2012 17:38, Suneel Marthi wrote:
>> I am working on determining document similarity for a corpus using RowSimilarityJob.