Subject: Question on RowSimilarityJob


The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)

row1: column1, column2
row2: column1, column3
row3: column1, column2

In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)

column1: row1, row2, row3
column2: row1, row3
column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:

for column1:

(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)

for column2:

(row1,row3)
(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:

for row1:

(row1,row2)
(row1,row3)
(row1,row3)

row1,row2 -> 1 cooccurrence
row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)
(row2,row3)

row2,row1 -> 1 cooccurrence
row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)
(row3,row2)
(row3,row1)

row3,row1 -> 2 cooccurrences
row3,row2 -> 1 cooccurrence
Hope that helps with understanding the approach.

--sebastian
On 02.02.2012 12:47, Vicky wrote: