Stefan Wienert

2011-06-14, 17:15

Jake Mannix

2011-06-14, 17:34

Jake Mannix

2011-06-14, 17:36

Fernando Fernández

2011-06-14, 17:51

Stefan Wienert

2011-06-14, 18:00

Stefan Wienert

2011-06-14, 18:39

Sean Owen

2011-06-14, 18:54

Sebastian Schelter

2011-06-14, 19:09

Fernando Fernández

2011-06-14, 19:23

Stefan Wienert

2011-06-14, 21:04

Stefan Wienert

2011-06-14, 21:28

Stefan Wienert

2011-06-14, 21:28

Dmitriy Lyubimov

2011-06-14, 21:35

Dmitriy Lyubimov

2011-06-14, 21:59

Stefan Wienert

2011-06-14, 22:09

Dmitriy Lyubimov

2011-06-14, 22:35

Dmitriy Lyubimov

2011-06-14, 22:37

Jake Mannix

2011-06-14, 23:09

Dmitriy Lyubimov

2011-06-14, 23:23

Ted Dunning

2011-06-15, 08:17

Fernando Fernández

2011-06-15, 08:44

Stefan Wienert

2011-06-15, 09:10

Sean Owen

2011-06-15, 09:27

Fernando Fernández

2011-06-15, 10:57

Jake Mannix

2011-06-15, 16:31

Stefan Wienert

2011-06-15, 17:06

Jake Mannix

2011-06-15, 17:44

Ted Dunning

2011-06-15, 18:32

Hey guys,

I have some strange results in my LSA pipeline.

First, let me explain the steps my data goes through:

1) Extract the term-document matrix (TDM) from a Lucene datastore, using TF-IDF as the weighting
2) Transpose the TDM
3a) Run Mahout SVD (Lanczos) on the transposed TDM
3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
3c) Use no dimension reduction (for testing purposes)
4) Transpose the result (ONLY for none / SVD)
5) Calculate the cosine similarity (from Mahout)

Now some strange things happen.

First of all: the demo data shows the similarity of document 1 to all other documents.

The results using only cosine similarity (without dimension reduction):
http://the-lord.de/img/none.png

The results using SVD, rank 10:
http://the-lord.de/img/svd-10.png
Some points fall down to the bottom.

The results using SSVD, rank 10:
http://the-lord.de/img/ssvd-10.png

The results using SVD, rank 100:
http://the-lord.de/img/svd-100.png
More points fall down to the bottom.

The results using SSVD, rank 100:
http://the-lord.de/img/ssvd-100.png

The results using SVD, rank 200:
http://the-lord.de/img/svd-200.png
Even more points fall down to the bottom.

The results using SVD, rank 1000:
http://the-lord.de/img/svd-1000.png
Most points are at the bottom.

Please note the scale:
- the avg from none: 0.8712
- the avg from SVD rank 10: 0.2648
- the avg from SVD rank 100: 0.0628
- the avg from SVD rank 200: 0.0238
- the avg from SVD rank 1000: 0.0116

So my question is:
Can you explain this behavior? Why are the documents getting more equal with higher SVD ranks? I thought it would be the opposite.

Cheers
Stefan

You are running into "the curse of dimensionality". The higher the

dimension you are in, the further apart (random) vectors are.
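The effect can be illustrated with a small Java sketch. It assumes random vectors with zero-mean Gaussian entries, which is only a stand-in for "random" here and not a model of TF-IDF data; the class and method names are made up for this example.

```java
// Sketch: the mean absolute cosine between random zero-mean vectors
// shrinks as the dimension grows. Not related to Mahout's implementation.
final class RandomCosineDemo {

    /** Plain cosine of the angle between two equal-length vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Mean |cosine| over `pairs` random vector pairs of the given dimension. */
    static double meanAbsCosine(int dim, int pairs, long seed) {
        java.util.Random rnd = new java.util.Random(seed);
        double sum = 0.0;
        for (int p = 0; p < pairs; p++) {
            double[] a = new double[dim];
            double[] b = new double[dim];
            for (int i = 0; i < dim; i++) {
                a[i] = rnd.nextGaussian();
                b[i] = rnd.nextGaussian();
            }
            sum += Math.abs(cosine(a, b));
        }
        return sum / pairs;
    }

    public static void main(String[] args) {
        // Print the mean |cosine| for the ranks mentioned in the thread.
        for (int dim : new int[] {10, 100, 1000}) {
            System.out.println(dim + " dims: " + meanAbsCosine(dim, 200, 42L));
        }
    }
}
```

Whether real document vectors behave like this depends on the data, so the sketch only shows the general tendency.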

What you should do to compare quality is find the documents that you can
manually label as being "very similar" to document #1, and then see at what
rank they show up in a list of "most similar to document 1" for each of the
various similarity metrics you've produced. The metric which puts the
"known similar" documents highest in rank order *relative to the rest of the
documents* will be the one you think is best.

-jake
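A rank-based comparison along these lines could be sketched as follows. The similarity arrays, the hand-labelled indices, and the names `rankOf` and `meanRank` are all hypothetical and only serve the illustration.

```java
// Sketch: compare similarity metrics by the rank of hand-labelled
// "known similar" documents, not by the absolute similarity values.
final class RankCheck {

    /** 1-based rank of `target` when documents are sorted by descending similarity. */
    static int rankOf(double[] similarityToDoc1, int target) {
        int rank = 1;
        for (int d = 0; d < similarityToDoc1.length; d++) {
            if (d != target && similarityToDoc1[d] > similarityToDoc1[target]) {
                rank++;
            }
        }
        return rank;
    }

    /** Mean rank of the known-similar documents; lower is better. */
    static double meanRank(double[] similarityToDoc1, int[] knownSimilar) {
        double sum = 0.0;
        for (int doc : knownSimilar) {
            sum += rankOf(similarityToDoc1, doc);
        }
        return sum / knownSimilar.length;
    }

    public static void main(String[] args) {
        // Made-up scores: metric A has high values overall, metric B low ones.
        double[] metricA = {0.90, 0.88, 0.87, 0.95, 0.86};
        double[] metricB = {0.02, 0.30, 0.01, 0.25, 0.015};
        int[] knownSimilar = {1, 3}; // hypothetical manual labels
        System.out.println("A: " + meanRank(metricA, knownSimilar));
        System.out.println("B: " + meanRank(metricB, knownSimilar));
    }
}
```

In this toy data metric B ranks the labelled documents higher even though its absolute values are much smaller, which is why the averages alone say little about quality.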

Actually, wait: are your graphs showing *similarity* or *distance*? In higher
dimensions, *distance* (1 - cos(angle)) and the angle itself should grow, but
on the other hand *similarity* (cos(angle)) should go toward 0.
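For reference, the usual textbook definitions for two vectors a and b with angle θ between them are given below. They are general definitions and are not taken from the Mahout source.

```latex
\[
\operatorname{sim}(a, b) = \cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad
\operatorname{dist}(a, b) = 1 - \cos\theta
\]
```

As θ approaches 90 degrees, the similarity goes toward 0 and the distance toward 1.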

Actually that's what your results are showing, aren't they? With rank 1000

the similarity avg is the lowest...

But... why do I get such different results with cosine similarity and
no dimension reduction (with 100,000 dimensions)?

Actually, I'm using RowSimilarityJob() with

--input input
--output output
--numberOfColumns documentCount
--maxSimilaritiesPerRow documentCount
--similarityClassname SIMILARITY_UNCENTERED_COSINE

I am not really sure what SIMILARITY_UNCENTERED_COSINE calculates.
The source says: "distributed implementation of cosine similarity that
does not center its data"

So this seems to be the similarity and not the distance?

Cheers,
Stefan
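As a rough illustration of what "does not center its data" can mean, the following Java sketch contrasts a plain (uncentered) cosine with a centered one, where each vector's mean is subtracted first (which gives the Pearson correlation). This is a single-machine toy with made-up names, not Mahout's distributed implementation.

```java
// Sketch: uncentered vs. centered cosine on small dense vectors.
final class CosineVariants {

    /** Cosine of the angle between the raw vectors. */
    static double uncenteredCosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns a copy of v with its mean subtracted from every entry. */
    static double[] center(double[] v) {
        double mean = 0.0;
        for (double x : v) {
            mean += x;
        }
        mean /= v.length;
        double[] centered = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            centered[i] = v[i] - mean;
        }
        return centered;
    }

    /** Cosine after centering both vectors (the Pearson correlation). */
    static double centeredCosine(double[] a, double[] b) {
        return uncenteredCosine(center(a), center(b));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {2, 3, 4};
        System.out.println("uncentered: " + uncenteredCosine(a, b));
        System.out.println("centered:   " + centeredCosine(a, b));
    }
}
```

Both variants are similarities: larger values mean the vectors point in more similar directions.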

It is a similarity, not a distance. Higher values mean more

similarity, not less.

I agree that similarity ought to decrease with more dimensions. That

is what you observe -- except that you see quite high average

similarity with no dimension reduction!

An average cosine similarity of 0.87 sounds "high" to me for anything

but a few dimensions. What's the dimensionality of the input without

dimension reduction?

Something is amiss in this pipeline. It is an interesting question!

Hi Stefan,

I checked the implementation of RowSimilarityJob and we might still have

a bug in the 0.5 release... (f**k). I don't know if your problem is

caused by that, but the similarity scores might not be correct...

We had this issue in 0.4 already, when someone realized that

cooccurrences were mapped out inconsistently, so for 0.5 we made sure

that we always map the smaller row as first value. But apparently I did

not adjust the value setting for the Cooccurrence object...

In 0.5 the code is:

  if (rowA <= rowB) {
    rowPair.set(rowA, rowB, weightA, weightB);
  } else {
    rowPair.set(rowB, rowA, weightB, weightA);
  }
  // the values are set in the original order, even when the rows were swapped
  coocurrence.set(column.get(), valueA, valueB);

But it should be (already fixed in current trunk some days ago):

  if (rowA <= rowB) {
    rowPair.set(rowA, rowB, weightA, weightB);
    coocurrence.set(column.get(), valueA, valueB);
  } else {
    // rows swapped, so swap the values as well
    rowPair.set(rowB, rowA, weightB, weightA);
    coocurrence.set(column.get(), valueB, valueA);
  }

Maybe you could rerun your test with the current trunk?

--sebastian
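The idea behind the fix, keeping each value attached to its row when the pair is put into a canonical order, can be sketched in isolation. The `OrderedPair` class below is hypothetical and much simpler than Mahout's own classes.

```java
// Sketch: canonical ordering of a row pair that keeps values aligned with rows.
final class OrderedPair {
    final int firstRow;
    final int secondRow;
    final double firstValue;
    final double secondValue;

    private OrderedPair(int firstRow, int secondRow, double firstValue, double secondValue) {
        this.firstRow = firstRow;
        this.secondRow = secondRow;
        this.firstValue = firstValue;
        this.secondValue = secondValue;
    }

    /** Puts the smaller row first and swaps the values together with the rows. */
    static OrderedPair of(int rowA, int rowB, double valueA, double valueB) {
        return rowA <= rowB
            ? new OrderedPair(rowA, rowB, valueA, valueB)
            : new OrderedPair(rowB, rowA, valueB, valueA);
    }

    public static void main(String[] args) {
        OrderedPair p = OrderedPair.of(5, 2, 0.5, 0.9);
        // Row 2 comes first and keeps its own value 0.9.
        System.out.println(p.firstRow + " -> " + p.firstValue);
        System.out.println(p.secondRow + " -> " + p.secondValue);
    }
}
```

If only the rows were swapped, any computation that treats the two sides differently would see row 2 paired with the value that belongs to row 5.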

Hi Stefan,

Are you sure you need to transpose the input matrix? I thought that what you
get from a Lucene index is already a document (rows) by term (columns) matrix,
but you say that you obtain a term-document matrix and transpose it. Is this
correct? What are you using to obtain this matrix from Lucene? Is it
possible that you are calculating similarities with the wrong matrix in one
of the two cases (with/without dimension reduction)?

Best,
Fernando.
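The orientation question can be made concrete with a toy dense matrix. The sketch below assumes a row-wise similarity step compares rows with each other, so whatever sits in the rows (documents or terms) is what gets compared; the matrix values and names are invented for the example.

```java
// Sketch: transposing a term-document matrix so that rows become documents.
final class TransposeDemo {

    /** Returns the transpose of a rectangular dense matrix. */
    static double[][] transpose(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[][] t = new double[cols][rows];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                t[c][r] = m[r][c];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        // 3 terms (rows) x 2 documents (columns), made-up weights.
        double[][] termDocument = {
            {1.0, 0.0},
            {0.5, 2.0},
            {0.0, 3.0}
        };
        double[][] documentTerm = transpose(termDocument);
        // Now there are 2 rows (documents) with 3 columns (terms) each.
        System.out.println(documentTerm.length + " x " + documentTerm[0].length);
    }
}
```

Checking the row and column counts of the matrix fed into each step is a quick way to see which orientation is actually in use.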


>>>

>>> 2011/6/14 Stefan Wienert<[EMAIL PROTECTED]>:

>>>

>>>> but... why do I get the different results with cosine similarity with

>>>> no dimension reduction (with 100,000 dimensions) ?

>>>>

>>>> 2011/6/14 Fernando Fernández<[EMAIL PROTECTED]>:

>>>>

>>>>> Actually that's what your results are showing, aren't they? With rank

>>>>> 1000

>>>>> the similarity avg is the lowest...

>>>>>

>>>>>

>>>>> 2011/6/14 Jake Mannix<[EMAIL PROTECTED]>

>>>>>

>>>>> actually, wait - are your graphs showing *similarity*, or *distance*?

>>>>>> In

>>>>>> higher

>>>>>> dimensions, *distance* (and cosine angle) should grow, but on the

>>>>>> other

>>>>>> hand,

>>>>>> *similarity* (1-cos(angle)) should go toward 0.

>>>>>>

>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<[EMAIL PROTECTED]>

>>>>>> wrote:

>>>>>>

>>>>>> Hey Guys,

>>>>>>>

>>>>>>> I have some strange results in my LSA-Pipeline.

>>>>>>>

>>>>>>> First, I explain the steps my data is making:

>>>>>>> 1) Extract Term-Dokument-Matrix from a Lucene datastore using TFIDF

>>>>>>> as

>>>>>>> weighter

>>>>>>> 2) Transposing TDM

>>>>>>> 3a) Using Mahout SVD (Lanczos) with the transposed TDM

>>>>>>> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM

So... let's check the dimensions:

First step, Lucene output:
227 rows (= docs) and 107909 cols (= terms)

transposed to:
107909 rows and 227 cols

reduced with SVD (rank 100) to:
99 rows and 227 cols

transposed to (actually there was a bug here, with no effect on the SVD
result but with an effect on the NONE result):
227 rows and 99 cols

So... now the cosine results are very similar to SVD 200.

Results are added.

@Sebastian: I will check if the bug affects my results.
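The shape bookkeeping above can be replayed on a toy matrix. This is only a sketch: `ShapeCheck`, `transpose` and `shape` are hypothetical helpers written for illustration, not Mahout or Lucene calls, and the 4 x 9 matrix merely stands in for the 227 x 107909 one. It checks that each transpose swaps rows and columns, so the matrix handed to a row-wise similarity step has one row per document.

```java
// Hypothetical helper for illustration; not part of Mahout or Lucene.
final class ShapeCheck {

    /** Returns the transpose of a dense, rectangular, non-empty matrix. */
    static double[][] transpose(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[][] t = new double[cols][rows];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    /** Formats the shape as "rows x cols". */
    static String shape(double[][] m) {
        return m.length + " x " + m[0].length;
    }

    public static void main(String[] args) {
        // Small stand-in for the 227 x 107909 document-term matrix.
        double[][] docTerm = new double[4][9];
        double[][] termDoc = transpose(docTerm);
        System.out.println("doc-term: " + shape(docTerm));
        System.out.println("term-doc: " + shape(termDoc));
        // Transposing twice gets back to one row per document.
        System.out.println("round trip: " + shape(transpose(termDoc)));
    }
}
```

Printing the shape after every stage is a cheap way to notice a missing or doubled transpose before the similarity job runs.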

2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

> Hi Stefan,

>

> Are you sure you need to transpose the input marix? I thought that what you

> get from lucene index was already document(rows)-term(columns) matrix, but

> you say that you obtain term-document matrix and transpose it. Is this

> correct? What are you using to obtain this matrix from Lucene? Is it

> possible that you are calculating similarities with the wrong matrix in some

> of the two cases? (With/without dimension reduction).

>

> Best,

> Fernando.

>

> 2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>

>

>> Hi Stefan,

>>

>> I checked the implementation of RowSimilarityJob and we might still have a

>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

>> that, but the similarity scores might not be correct...

>>

>> We had this issue in 0.4 already, when someone realized that cooccurrences

>> were mapped out inconsistently, so for 0.5 we made sure that we always map

>> the smaller row as first value. But apparently I did not adjust the value

>> setting for the Cooccurrence object...

>>

>> In 0.5 the code is:

>>

>> if (rowA <= rowB) {

>> rowPair.set(rowA, rowB, weightA, weightB);

>> } else {

>> rowPair.set(rowB, rowA, weightB, weightA);

>> }

>> coocurrence.set(column.get(), valueA, valueB);

>>

>> But I should be (already fixed in current trunk some days ago):

>>

>> if (rowA <= rowB) {

>> rowPair.set(rowA, rowB, weightA, weightB);

>> coocurrence.set(column.get(), valueA, valueB);

>> } else {

>> rowPair.set(rowB, rowA, weightB, weightA);

>> coocurrence.set(column.get(), valueB, valueA);

>> }

>>

>> Maybe you could rerun your test with the current trunk?

>>

>> --sebastian

>>

>>

>> On 14.06.2011 20:54, Sean Owen wrote:

>>

>>> It is a similarity, not a distance. Higher values mean more

>>> similarity, not less.

>>>

>>> I agree that similarity ought to decrease with more dimensions. That

>>> is what you observe -- except that you see quite high average

>>> similarity with no dimension reduction!

>>>

>>> An average cosine similarity of 0.87 sounds "high" to me for anything

>>> but a few dimensions. What's the dimensionality of the input without

>>> dimension reduction?

>>>

>>> Something is amiss in this pipeline. It is an interesting question!

>>>

>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<[EMAIL PROTECTED]>

>>> wrote:

>>>

>>>> Actually I'm using RowSimilarityJob() with

>>>> --input input

>>>> --output output

>>>> --numberOfColumns documentCount

>>>> --maxSimilaritiesPerRow documentCount

>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE

>>>>

>>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE

>>>> calculates...

>>>> the source says: "distributed implementation of cosine similarity that

>>>> does not center its data"

>>>>

>>>> So... this seems to be the similarity and not the distance?

>>>>

>>>> Cheers,

>>>> Stefan

>>>>

>>>>

>>>>

>>>> 2011/6/14 Stefan Wienert<[EMAIL PROTECTED]>:

>>>>

>>>>> but... why do I get the different results with cosine similarity with

>>>>> no dimension reduction (with 100,000 dimensions) ?

>>>>>

>>>>> 2011/6/14 Fernando Fernández<[EMAIL PROTECTED]>:

>>>>>

>>>>>> Actually that's what your results are showing, aren't they? With rank

>>>>>> 1000

>>>>>> the similarity avg is the lowest...

>>>>>>

>>>>>>

>>>>>> 2011/6/14 Jake Mannix<[EMAIL PROTECTED]>

>>>>>>

>>>>>> actually, wait - are your graphs showing *similarity*, or *distance*?

>>>>>>> In

>>>>>>> higher

>>>>>>> dimensions, *distance* (and cosine angle) should grow, but on the

Hi Sebastian,

the bug does not affect me with:

NONE > bugcheck.pdf

SVD > bugcheck2.pdf

(although it was active)

Cheers,

Stefan
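One possible reason the results did not change, offered as an assumption and not something confirmed in this thread: if the cosine measure only accumulates the per-column product of the two values and divides by the product of the two row norms, then swapping the two values cannot change the score. The sketch below uses a hypothetical `CosineSymmetry` class and does not reproduce the RowSimilarityJob code path. An asymmetric measure would not have this property.

```java
// Hypothetical illustration; this is not the RowSimilarityJob implementation.
final class CosineSymmetry {

    /** Uncentered cosine: dot(a, b) / (|a| * |b|). Vectors must have equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // per-column product, symmetric in a and b
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] rowA = {0.2, 0.0, 1.5, 0.7};
        double[] rowB = {0.9, 0.4, 0.1, 0.0};
        // Swapping the arguments mirrors swapping valueA and valueB.
        System.out.println(cosine(rowA, rowB));
        System.out.println(cosine(rowB, rowA));
    }
}
```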

2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>:

> Hi Stefan,

>

> I checked the implementation of RowSimilarityJob and we might still have a

> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

> that, but the similarity scores might not be correct...

>

> We had this issue in 0.4 already, when someone realized that cooccurrences

> were mapped out inconsistently, so for 0.5 we made sure that we always map

> the smaller row as first value. But apparently I did not adjust the value

> setting for the Cooccurrence object...

>

> In 0.5 the code is:

>

> if (rowA <= rowB) {

> rowPair.set(rowA, rowB, weightA, weightB);

> } else {

> rowPair.set(rowB, rowA, weightB, weightA);

> }

> coocurrence.set(column.get(), valueA, valueB);

>

> But I should be (already fixed in current trunk some days ago):

>

> if (rowA <= rowB) {

> rowPair.set(rowA, rowB, weightA, weightB);

> coocurrence.set(column.get(), valueA, valueB);

> } else {

> rowPair.set(rowB, rowA, weightB, weightA);

> coocurrence.set(column.get(), valueB, valueA);

> }

>

> Maybe you could rerun your test with the current trunk?

>

> --sebastian

>

> On 14.06.2011 20:54, Sean Owen wrote:

>>

>> It is a similarity, not a distance. Higher values mean more

>> similarity, not less.

>>

>> I agree that similarity ought to decrease with more dimensions. That

>> is what you observe -- except that you see quite high average

>> similarity with no dimension reduction!

>>

>> An average cosine similarity of 0.87 sounds "high" to me for anything

>> but a few dimensions. What's the dimensionality of the input without

>> dimension reduction?

>>

>> Something is amiss in this pipeline. It is an interesting question!

>>

>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<[EMAIL PROTECTED]> wrote:

>>>

>>> Actually I'm using RowSimilarityJob() with

>>> --input input

>>> --output output

>>> --numberOfColumns documentCount

>>> --maxSimilaritiesPerRow documentCount

>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE

>>>

>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE

>>> calculates...

>>> the source says: "distributed implementation of cosine similarity that

>>> does not center its data"

>>>

>>> So... this seems to be the similarity and not the distance?

>>>

>>> Cheers,

>>> Stefan

>>>

>>>

>>>

>>> 2011/6/14 Stefan Wienert<[EMAIL PROTECTED]>:

>>>>

>>>> but... why do I get the different results with cosine similarity with

>>>> no dimension reduction (with 100,000 dimensions) ?

>>>>

>>>> 2011/6/14 Fernando Fernández<[EMAIL PROTECTED]>:

>>>>>

>>>>> Actually that's what your results are showing, aren't they? With rank

>>>>> 1000

>>>>> the similarity avg is the lowest...

>>>>>

>>>>>

>>>>> 2011/6/14 Jake Mannix<[EMAIL PROTECTED]>

>>>>>

>>>>>> actually, wait - are your graphs showing *similarity*, or *distance*?

>>>>>> In

>>>>>> higher

>>>>>> dimensions, *distance* (and cosine angle) should grow, but on the

>>>>>> other

>>>>>> hand,

>>>>>> *similarity* (1-cos(angle)) should go toward 0.

>>>>>>

>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<[EMAIL PROTECTED]>

>>>>>> wrote:

>>>>>>

>>>>>>> Hey Guys,

>>>>>>>

>>>>>>> I have some strange results in my LSA-Pipeline.

>>>>>>>

>>>>>>> First, I explain the steps my data is making:

>>>>>>> 1) Extract Term-Dokument-Matrix from a Lucene datastore using TFIDF

>>>>>>> as

>>>>>>> weighter

>>>>>>> 2) Transposing TDM

>>>>>>> 3a) Using Mahout SVD (Lanczos) with the transposed TDM

>>>>>>> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM

>>>>>>> 3c) Using no dimension reduction (for testing purpose)

>>>>>>> 4) Transpose result (ONLY none / svd)

>>>>>>> 5) Calculating Cosine Similarty (from Mahout)

>>>>>>>

>>>>>>> Now... Some strange thinks happen:

>>>>>>> First of all: The demo data shows the similarity from document 1 to

>>>>>>> all other documents.

>

One last question: for cosine similarity, the results are sometimes
negative (which means the angle between the vectors is greater than 90°).
But what does this mean for the similarity?

Cheers,

Stefan
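A small sketch of when a negative cosine can appear. TF-IDF vectors have no negative components, so their dot product, and with it the cosine, cannot drop below zero. Vectors projected into an SVD space can have components of either sign, so their cosine can range over [-1, 1]. A negative value then simply says the two vectors point into opposing half-spaces; it is commonly read as "dissimilar", though that reading is a convention and not something stated in this thread. `NegativeCosine` and the sample vectors are made up for illustration.

```java
// Hypothetical illustration with made-up vectors.
final class NegativeCosine {

    /** Uncentered cosine: dot(a, b) / (|a| * |b|). Vectors must have equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Non-negative weights, as in a TF-IDF row: the cosine is never negative.
        double[] tfidfA = {0.5, 0.0, 1.2};
        double[] tfidfB = {0.0, 0.7, 0.3};
        System.out.println("tf-idf space:  " + cosine(tfidfA, tfidfB));

        // Mixed-sign coordinates, as in a reduced space: the angle can exceed 90 degrees.
        double[] reducedA = {0.8, -0.3};
        double[] reducedB = {-0.6, 0.4};
        System.out.println("reduced space: " + cosine(reducedA, reducedB));
    }
}
```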

2011/6/14 Stefan Wienert <[EMAIL PROTECTED]>:

> So... lets check the dimensions:

>

> First step: Lucene Output:

> 227 rows (=docs) and 107909 cols (=tems)

>

> transposed to:

> 107909 rows and 227 cols

>

> reduced with svd (rank 100) to:

> 99 rows and 227 cols

>

> transposed to: (actually there was a bug (with no effect on the SVD

> result but on NONE result))

> 227 rows and 99 cols

>

> So... now the cosine results are very similar to SVD 200.

>

> Results are added.

>

> @Sebastian: I will check if the bug affects my results.

>

> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>> Hi Stefan,

>>

>> Are you sure you need to transpose the input marix? I thought that what you

>> get from lucene index was already document(rows)-term(columns) matrix, but

>> you say that you obtain term-document matrix and transpose it. Is this

>> correct? What are you using to obtain this matrix from Lucene? Is it

>> possible that you are calculating similarities with the wrong matrix in some

>> of the two cases? (With/without dimension reduction).

>>

>> Best,

>> Fernando.

>>

>> 2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>

>>

>>> Hi Stefan,

>>>

>>> I checked the implementation of RowSimilarityJob and we might still have a

>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

>>> that, but the similarity scores might not be correct...

>>>

>>> We had this issue in 0.4 already, when someone realized that cooccurrences

>>> were mapped out inconsistently, so for 0.5 we made sure that we always map

>>> the smaller row as first value. But apparently I did not adjust the value

>>> setting for the Cooccurrence object...

>>>

>>> In 0.5 the code is:

>>>

>>> if (rowA <= rowB) {

>>> rowPair.set(rowA, rowB, weightA, weightB);

>>> } else {

>>> rowPair.set(rowB, rowA, weightB, weightA);

>>> }

>>> coocurrence.set(column.get(), valueA, valueB);

>>>

>>> But I should be (already fixed in current trunk some days ago):

>>>

>>> if (rowA <= rowB) {

>>> rowPair.set(rowA, rowB, weightA, weightB);

>>> coocurrence.set(column.get(), valueA, valueB);

>>> } else {

>>> rowPair.set(rowB, rowA, weightB, weightA);

>>> coocurrence.set(column.get(), valueB, valueA);

>>> }

>>>

>>> Maybe you could rerun your test with the current trunk?

>>>

>>> --sebastian

>>>

>>>

>>> On 14.06.2011 20:54, Sean Owen wrote:

>>>

>>>> It is a similarity, not a distance. Higher values mean more

>>>> similarity, not less.

>>>>

>>>> I agree that similarity ought to decrease with more dimensions. That

>>>> is what you observe -- except that you see quite high average

>>>> similarity with no dimension reduction!

>>>>

>>>> An average cosine similarity of 0.87 sounds "high" to me for anything

>>>> but a few dimensions. What's the dimensionality of the input without

>>>> dimension reduction?

>>>>

>>>> Something is amiss in this pipeline. It is an interesting question!

>>>>

>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<[EMAIL PROTECTED]>

>>>> wrote:

>>>>

>>>>> Actually I'm using RowSimilarityJob() with

>>>>> --input input

>>>>> --output output

>>>>> --numberOfColumns documentCount

>>>>> --maxSimilaritiesPerRow documentCount

>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE

>>>>>

>>>>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE

>>>>> calculates...

>>>>> the source says: "distributed implementation of cosine similarity that

>>>>> does not center its data"

>>>>>

>>>>> So... this seems to be the similarity and not the distance?

>>>>>

>>>>> Cheers,

>>>>> Stefan

>>>>>

>>>>>

>>>>>

>>>>> 2011/6/14 Stefan Wienert<[EMAIL PROTECTED]>:

>>>>>

>>>>>> but... why do I get the different results with cosine similarity with

>>>>>> no dimension reduction (with 100,000 dimensions) ?

>>>>>>

>>>>>> 2011/6/14 Fernando Fernández<[EMAIL PROTECTED]>:

Interesting.

(One point of confusion on my side regarding Lanczos: is it computing the U eigenvectors or V? The doc says "eigenvectors" but doesn't say left or right. If it's V (the right eigenvectors), this sequence should be fine.)

With SSVD I don't do the transpose; I just compute U, which produces the document singular vectors directly.

Also, I am not sure that Lanczos actually normalizes the eigenvectors, but SSVD does (or it multiplies the normalized version by the square root of the singular value, whichever is requested). So depending on which space you rotate your results into, the cosine similarities may differ. I assume you used the normalized (true) eigenvectors from SSVD.
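A minimal sketch of why the choice of space can matter for cosine similarity. It assumes two hypothetical document rows in a 2-dimensional latent space and hypothetical singular values 100 and 1; none of these numbers come from the thread, and the code uses plain Java arrays rather than any Mahout class.

```java
final class CosineScalingSketch {

    // Plain cosine similarity: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Multiply coordinate i of v by weights[i] (e.g. the square root of singular value i).
    static double[] scaleColumns(double[] v, double[] weights) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] * weights[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical document rows (assumption, not data from the thread).
        double[] doc1 = {1.0, 1.0};
        double[] doc2 = {1.0, -1.0};
        // Hypothetical singular values 100 and 1; the weights are their square roots.
        double[] sqrtSigma = {Math.sqrt(100.0), Math.sqrt(1.0)};

        double plain = cosine(doc1, doc2);
        double scaled = cosine(scaleColumns(doc1, sqrtSigma), scaleColumns(doc2, sqrtSigma));
        // Scaling a whole vector by one constant: every coordinate gets the same factor.
        double uniform = cosine(scaleColumns(doc1, new double[] {5.0, 5.0}), doc2);

        System.out.println("unscaled rows:         " + plain);
        System.out.println("columns * sqrt(sigma): " + scaled);
        System.out.println("doc1 * 5 (uniform):    " + uniform);
    }
}
```

With these made-up numbers the unscaled rows are orthogonal (cosine 0), while weighting each coordinate by sqrt(sigma) gives 99/101, about 0.98. Multiplying a whole vector by a single constant leaves the cosine unchanged, so only per-dimension weighting has this effect.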

It would also be interesting to know which oversampling parameter (p) you used.

Thanks.

-d

On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> So... lets check the dimensions:

>

> First step: Lucene Output:

> 227 rows (=docs) and 107909 cols (=tems)

>

> transposed to:

> 107909 rows and 227 cols

>

> reduced with svd (rank 100) to:

> 99 rows and 227 cols

>

> transposed to: (actually there was a bug (with no effect on the SVD

> result but on NONE result))

> 227 rows and 99 cols

>

> So... now the cosine results are very similar to SVD 200.

>

> Results are added.

>

> @Sebastian: I will check if the bug affects my results.

>

> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>> Hi Stefan,

>>

>> Are you sure you need to transpose the input marix? I thought that what you

>> get from lucene index was already document(rows)-term(columns) matrix, but

>> you say that you obtain term-document matrix and transpose it. Is this

>> correct? What are you using to obtain this matrix from Lucene? Is it

>> possible that you are calculating similarities with the wrong matrix in some

>> of the two cases? (With/without dimension reduction).

>>

>> Best,

>> Fernando.

>>

>> 2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>

>>

>>> Hi Stefan,

>>>

>>> I checked the implementation of RowSimilarityJob and we might still have a

>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

>>> that, but the similarity scores might not be correct...

>>>

>>> We had this issue in 0.4 already, when someone realized that cooccurrences

>>> were mapped out inconsistently, so for 0.5 we made sure that we always map

>>> the smaller row as first value. But apparently I did not adjust the value

>>> setting for the Cooccurrence object...

>>>

>>> In 0.5 the code is:

>>>

>>> if (rowA <= rowB) {

>>> rowPair.set(rowA, rowB, weightA, weightB);

>>> } else {

>>> rowPair.set(rowB, rowA, weightB, weightA);

>>> }

>>> coocurrence.set(column.get(), valueA, valueB);

>>>

>>> But I should be (already fixed in current trunk some days ago):

>>>

>>> if (rowA <= rowB) {

>>> rowPair.set(rowA, rowB, weightA, weightB);

>>> coocurrence.set(column.get(), valueA, valueB);

>>> } else {

>>> rowPair.set(rowB, rowA, weightB, weightA);

>>> coocurrence.set(column.get(), valueB, valueA);

>>> }

>>>

>>> Maybe you could rerun your test with the current trunk?

>>>

>>> --sebastian

>>>

>>>

>>> On 14.06.2011 20:54, Sean Owen wrote:

>>>

>>>> It is a similarity, not a distance. Higher values mean more

>>>> similarity, not less.

>>>>

>>>> I agree that similarity ought to decrease with more dimensions. That

>>>> is what you observe -- except that you see quite high average

>>>> similarity with no dimension reduction!

>>>>

>>>> An average cosine similarity of 0.87 sounds "high" to me for anything

>>>> but a few dimensions. What's the dimensionality of the input without

>>>> dimension reduction?

>>>>

>>>> Something is amiss in this pipeline. It is an interesting question!

>>>>

>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<[EMAIL PROTECTED]>

>>>> wrote:

>>>>

>>>>> Actually I'm using RowSimilarityJob() with

>>>>> --input input

>>>>> --output output

>>>>> --numberOfColumns documentCount

>>>>> --maxSimilaritiesPerRow documentCount

>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE

Still, I am indeed perplexed by the number of documents in the SVD results whose cosine similarity is >0.9. That basically means they should all have an almost identical set of infrequent words, and from your graph that looks like almost 20-30% or so.

On Tue, Jun 14, 2011 at 2:35 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> Interesting.

>

> (I have one confusion of mine RE: lanczos -- is it computing U

> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or

> right. if it's V (right eigenvectors) this sequence should be fine).

>

> With ssvd i don't do transpose, i just do coputation of U which will

> produce document singular vectors directly.

>

> Also, i am not sure that Lanczos actually normalizes the eigenvectors,

> but SSVD does (or multiplies normalized version by square root of a

> singlular value, whichever requested). So depending on which space

> your rotate results in, cosine similarities may be different. I assume

> you used normalized (true) eigenvectors from ssvd.

>

> Also would be interesting to know what oversampling parameter you (p) you used.

>

> Thanks.

> -d

>

>

> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>> So... lets check the dimensions:

>>

>> First step: Lucene Output:

>> 227 rows (=docs) and 107909 cols (=tems)

>>

>> transposed to:

>> 107909 rows and 227 cols

>>

>> reduced with svd (rank 100) to:

>> 99 rows and 227 cols

>>

>> transposed to: (actually there was a bug (with no effect on the SVD

>> result but on NONE result))

>> 227 rows and 99 cols

>>

>> So... now the cosine results are very similar to SVD 200.

>>

>> Results are added.

>>

>> @Sebastian: I will check if the bug affects my results.

>>

>> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>>> Hi Stefan,

>>>

>>> Are you sure you need to transpose the input marix? I thought that what you

>>> get from lucene index was already document(rows)-term(columns) matrix, but

>>> you say that you obtain term-document matrix and transpose it. Is this

>>> correct? What are you using to obtain this matrix from Lucene? Is it

>>> possible that you are calculating similarities with the wrong matrix in some

>>> of the two cases? (With/without dimension reduction).

>>>

>>> Best,

>>> Fernando.

>>>

>>> 2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>

>>>

>>>> Hi Stefan,

>>>>

>>>> I checked the implementation of RowSimilarityJob and we might still have a

>>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

>>>> that, but the similarity scores might not be correct...

>>>>

>>>> We had this issue in 0.4 already, when someone realized that cooccurrences

>>>> were mapped out inconsistently, so for 0.5 we made sure that we always map

>>>> the smaller row as first value. But apparently I did not adjust the value

>>>> setting for the Cooccurrence object...

>>>>

>>>> In 0.5 the code is:

>>>>

>>>> if (rowA <= rowB) {

>>>> rowPair.set(rowA, rowB, weightA, weightB);

>>>> } else {

>>>> rowPair.set(rowB, rowA, weightB, weightA);

>>>> }

>>>> coocurrence.set(column.get(), valueA, valueB);

>>>>

>>>> But I should be (already fixed in current trunk some days ago):

>>>>

>>>> if (rowA <= rowB) {

>>>> rowPair.set(rowA, rowB, weightA, weightB);

>>>> coocurrence.set(column.get(), valueA, valueB);

>>>> } else {

>>>> rowPair.set(rowB, rowA, weightB, weightA);

>>>> coocurrence.set(column.get(), valueB, valueA);

>>>> }

>>>>

>>>> Maybe you could rerun your test with the current trunk?

>>>>

>>>> --sebastian

>>>>

>>>>

>>>> On 14.06.2011 20:54, Sean Owen wrote:

>>>>

>>>>> It is a similarity, not a distance. Higher values mean more

>>>>> similarity, not less.

>>>>>

>>>>> I agree that similarity ought to decrease with more dimensions. That

>>>>> is what you observe -- except that you see quite high average

>>>>> similarity with no dimension reduction!

>>>>>

>>>>> An average cosine similarity of 0.87 sounds "high" to me for anything

>>>>> but a few dimensions. What's the dimensionality of the input without

Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see

http://the-lord.de/img/beispielwerte.pdf

for better results.

First... U or V are the singular values not the eigenvectors ;)

Lanczos-SVD in Mahout computes the eigenvectors of M*M (it multiplies the input matrix with the transposed one).

In fact, I don't need U, just V, so I need to transpose M (because the eigenvectors of MM* = V).

So... normalizing the eigenvectors: isn't the cosine similarity already doing this, i.e. ignoring the length of the vectors?

http://en.wikipedia.org/wiki/Cosine_similarity

my parameters for ssvd:

--rank 100

--oversampling 10

--blockHeight 227

--computeU false

--input

--output

The rest should be on default.

Actually, I do not really know what this oversampling parameter means...

2011/6/14 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

> Interesting.

>

> (I have one confusion of mine RE: lanczos -- is it computing U

> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or

> right. if it's V (right eigenvectors) this sequence should be fine).

>

> With ssvd i don't do transpose, i just do coputation of U which will

> produce document singular vectors directly.

>

> Also, i am not sure that Lanczos actually normalizes the eigenvectors,

> but SSVD does (or multiplies normalized version by square root of a

> singlular value, whichever requested). So depending on which space

> your rotate results in, cosine similarities may be different. I assume

> you used normalized (true) eigenvectors from ssvd.

>

> Also would be interesting to know what oversampling parameter you (p) you used.

>

> Thanks.

> -d

>

>

> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>> So... lets check the dimensions:

>>

>> First step: Lucene Output:

>> 227 rows (=docs) and 107909 cols (=tems)

>>

>> transposed to:

>> 107909 rows and 227 cols

>>

>> reduced with svd (rank 100) to:

>> 99 rows and 227 cols

>>

>> transposed to: (actually there was a bug (with no effect on the SVD

>> result but on NONE result))

>> 227 rows and 99 cols

>>

>> So... now the cosine results are very similar to SVD 200.

>>

>> Results are added.

>>

>> @Sebastian: I will check if the bug affects my results.

>>

>> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>>> Hi Stefan,

>>>

>>> Are you sure you need to transpose the input marix? I thought that what you

>>> get from lucene index was already document(rows)-term(columns) matrix, but

>>> you say that you obtain term-document matrix and transpose it. Is this

>>> correct? What are you using to obtain this matrix from Lucene? Is it

>>> possible that you are calculating similarities with the wrong matrix in some

>>> of the two cases? (With/without dimension reduction).

>>>

>>> Best,

>>> Fernando.

>>>

>>> 2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>

>>>

>>>> Hi Stefan,

>>>>

>>>> I checked the implementation of RowSimilarityJob and we might still have a

>>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

>>>> that, but the similarity scores might not be correct...

>>>>

>>>> We had this issue in 0.4 already, when someone realized that cooccurrences

>>>> were mapped out inconsistently, so for 0.5 we made sure that we always map

>>>> the smaller row as first value. But apparently I did not adjust the value

>>>> setting for the Cooccurrence object...

>>>>

>>>> In 0.5 the code is:

>>>>

>>>> if (rowA <= rowB) {

>>>> rowPair.set(rowA, rowB, weightA, weightB);

>>>> } else {

>>>> rowPair.set(rowB, rowA, weightB, weightA);

>>>> }

>>>> coocurrence.set(column.get(), valueA, valueB);

>>>>

>>>> But I should be (already fixed in current trunk some days ago):

>>>>

>>>> if (rowA <= rowB) {

>>>> rowPair.set(rowA, rowB, weightA, weightB);

>>>> coocurrence.set(column.get(), valueA, valueB);

>>>> } else {

>>>> rowPair.set(rowB, rowA, weightB, weightA);

>>>> coocurrence.set(column.get(), valueB, valueA);

>>>> }

>>>>

>>>> Maybe you could rerun your test with the current trunk?

I beg to differ... U and V are the left and right eigenvectors, and the singular values are denoted as Sigma (the square roots of the eigenvalues of AA', as you correctly pointed out).
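For reference, the standard relations behind this point, written out for a real matrix A with (thin) SVD. The symbols follow the notation used in this mail; lambda_i is introduced here only to name the eigenvalues.

```latex
A = U \Sigma V^{\top}, \qquad
A^{\top} A = V \Sigma^{2} V^{\top}, \qquad
A A^{\top} = U \Sigma^{2} U^{\top}, \qquad
\sigma_i = \sqrt{\lambda_i}
```

So the columns of V are eigenvectors of A'A, the columns of U are eigenvectors of AA', and each singular value sigma_i is the square root of the corresponding shared nonzero eigenvalue lambda_i.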

Yes, so I figured Lanczos must be computing V (otherwise your dimensions wouldn't match). Also, I guess "eigenvectors" implies the right ones, not the left ones, by default.

Normalization means that the 2-norm of each column of the eigenvector matrix is 1. In classic SVD, A=U*Sigma*V', even a thin one, U and V are orthonormal. I might be wrong, but I was under the impression that I saw some discussion saying the Lanczos singular vector matrix is not necessarily orthonormal (although its columns do form an orthogonal basis).

Anyway, I know for sure that SSVD gives the option to rotate into either the eigenspace or the space scaled by the square roots of the eigenvalues. The latter gives a single space for row items and column items and enables similarity measures among them.

The oversampling parameter is the -p parameter you give to SSVD. (Didn't you give it?) What was your command line for SSVD?

Basically, it means that for a rank-10 thin SVD you give something like k=10, p=90, which means the algorithm actually computes a 100-dimensional random projection, computes the SVD on it (or rather, actually, the eigendecomposition of BB'), and then throws away 90 singular values, and 90 latent factors as well, from the result.
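The arithmetic above can be sketched as follows. This is only an illustration of "compute k + p, keep k"; the class and method names are hypothetical and this is not Mahout's SSVD code. The singular values in the last step are made up.

```java
import java.util.Arrays;

final class OversamplingSketch {

    // Width of the random projection: requested rank k plus oversampling p.
    static int projectionWidth(int k, int p) {
        return k + p;
    }

    // Keep only the k leading singular values out of the k + p that were computed.
    static double[] keepLeading(double[] singularValues, int k) {
        return Arrays.copyOf(singularValues, k);
    }

    public static void main(String[] args) {
        // The example from this mail: k=10, p=90.
        System.out.println(projectionWidth(10, 90));   // 100
        // The settings reported earlier in the thread: --rank 100 --oversampling 10.
        System.out.println(projectionWidth(100, 10));  // 110

        // Hypothetical singular values of a k + p = 5 dimensional projection,
        // sorted in descending order; with k = 2 the last three are discarded.
        double[] computed = {9.0, 4.0, 2.5, 0.7, 0.1};
        System.out.println(Arrays.toString(keepLeading(computed, 2)));
    }
}
```

The sketch only shows the bookkeeping: the extra p dimensions are worked with internally and are discarded from the output.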

On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see

> http://the-lord.de/img/beispielwerte.pdf

> for better results.

>

> First... U or V are the singular values not the eigenvectors ;)

>

> Lanczos-SVD in mahout is computing the eigenvectors of M*M (it

> multiplies the input matrix with the transposed one)

>

> As a fact, I don't need U, just V, so I need to transpose M (because

> the eigenvectors of MM* = V).

>

> So... normalizing the eigenvectors: Is the cosine similarity not doing

> this? or ignoring the length of the vectors?

> http://en.wikipedia.org/wiki/Cosine_similarity

>

> my parameters for ssvd:

> --rank 100

> --oversampling 10

> --blockHeight 227

> --computeU false

> --input

> --output

>

> the rest should be on default.

>

> acutally I do not really know what these oversampling parameter means...

>

> 2011/6/14 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>> Interesting.

>>

>> (I have one confusion of mine RE: lanczos -- is it computing U

>> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or

>> right. if it's V (right eigenvectors) this sequence should be fine).

>>

>> With ssvd i don't do transpose, i just do coputation of U which will

>> produce document singular vectors directly.

>>

>> Also, i am not sure that Lanczos actually normalizes the eigenvectors,

>> but SSVD does (or multiplies normalized version by square root of a

>> singlular value, whichever requested). So depending on which space

>> your rotate results in, cosine similarities may be different. I assume

>> you used normalized (true) eigenvectors from ssvd.

>>

>> Also would be interesting to know what oversampling parameter you (p) you used.

>>

>> Thanks.

>> -d

>>

>>

>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> So... lets check the dimensions:

>>>

>>> First step: Lucene Output:

>>> 227 rows (=docs) and 107909 cols (=tems)

>>>

>>> transposed to:

>>> 107909 rows and 227 cols

>>>

>>> reduced with svd (rank 100) to:

>>> 99 rows and 227 cols

>>>

>>> transposed to: (actually there was a bug (with no effect on the SVD

>>> result but on NONE result))

>>> 227 rows and 99 cols

>>>

>>> So... now the cosine results are very similar to SVD 200.

>>>

>>> Results are added.

>>>

>>> @Sebastian: I will check if the bug affects my results.

>>>

>>> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>>>> Hi Stefan,

>>>>

>>>> Are you sure you need to transpose the input marix? I thought that what you

>>>> get from lucene index was already document(rows)-term(columns) matrix, but

singular values is denoted as Sigma (which is a square root of eigen

values of the AA' as you correctly pointed out) .

Yes so i figured Lanczos must be doing V (otherwise your dimensions

wouldn't match) . Also i guess eigenvector implies the right ones not

the left ones by default.

Normalization means that second norm of columns in the eigenvector

matrix (i.e. all columns) is 1. In classic SVD A=U*Sigma*V', even if

it is a thin one, U and V are orthonormal. I might be wrong but i was

under impression that i saw some discussion saying Lanczos singular

vector matrix is not necessarily orthonormal (although columns do form

orthogonal basis). I might be wrong about it.

Anyway i know for sure that SSVD gives option to rotate in both

eigenspace and the space scaled by square roots of eigenvalues. The

latter allows single space for row items and column items and enables

similarity measures among them.

Oversampling parameter is parameter -p you give to SSVD. (didn't you

give it? ) What's your command line for SSVD was?

Basically it means that for 10-rank thin SVD you need to give

something like k=10 p=90 which means the algorithm actually computes

100 dimentional random projection and computes SVD on it (or rather

actaully indeed eigendecomposition of BB') and then throws away 90

singular values and 90 latent factors as well from the result.

On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see

> http://the-lord.de/img/beispielwerte.pdf

> for better results.

>

> First... U or V are the singular values not the eigenvectors ;)

>

> Lanczos-SVD in mahout is computing the eigenvectors of M*M (it

> multiplies the input matrix with the transposed one)

>

> As a fact, I don't need U, just V, so I need to transpose M (because

> the eigenvectors of MM* = V).

>

> So... normalizing the eigenvectors: Is the cosine similarity not doing

> this? or ignoring the length of the vectors?

> http://en.wikipedia.org/wiki/Cosine_similarity

>

> my parameters for ssvd:

> --rank 100

> --oversampling 10

> --blockHeight 227

> --computeU false

> --input

> --output

>

> the rest should be on default.

>

> acutally I do not really know what these oversampling parameter means...

>

> 2011/6/14 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>> Interesting.

>>

>> (I have one confusion of mine RE: lanczos -- is it computing U

>> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or

>> right. if it's V (right eigenvectors) this sequence should be fine).

>>

>> With ssvd i don't do transpose, i just do coputation of U which will

>> produce document singular vectors directly.

>>

>> Also, i am not sure that Lanczos actually normalizes the eigenvectors,

>> but SSVD does (or multiplies normalized version by square root of a

>> singlular value, whichever requested). So depending on which space

>> your rotate results in, cosine similarities may be different. I assume

>> you used normalized (true) eigenvectors from ssvd.

>>

>> Also would be interesting to know what oversampling parameter you (p) you used.

>>

>> Thanks.

>> -d

>>

>>

>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> So... lets check the dimensions:

>>>

>>> First step: Lucene Output:

>>> 227 rows (=docs) and 107909 cols (=tems)

>>>

>>> transposed to:

>>> 107909 rows and 227 cols

>>>

>>> reduced with svd (rank 100) to:

>>> 99 rows and 227 cols

>>>

>>> transposed to: (actually there was a bug (with no effect on the SVD

>>> result but on NONE result))

>>> 227 rows and 99 cols

>>>

>>> So... now the cosine results are very similar to SVD 200.

>>>

>>> Results are added.

>>>

>>> @Sebastian: I will check if the bug affects my results.

>>>

>>> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>>>> Hi Stefan,

>>>>

>>>> Are you sure you need to transpose the input matrix? I thought that what you

>>>> get from lucene index was already document(rows)-term(columns) matrix, but

That actually looks more like it: not so many documents are similar to a
randomly picked one.

On Tue, Jun 14, 2011 at 3:09 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Dmitriy, be aware: there was a bug in none.png... I deleted it a minute ago, see

> http://the-lord.de/img/beispielwerte.pdf

> for better results.

>

> First... U or V are the singular values not the eigenvectors ;)

>

> Lanczos-SVD in mahout is computing the eigenvectors of M*M (it

> multiplies the input matrix with the transposed one)

>

> As a fact, I don't need U, just V, so I need to transpose M (because

> the eigenvectors of MM* = V).

>

> So... normalizing the eigenvectors: Is the cosine similarity not doing

> this? or ignoring the length of the vectors?

> http://en.wikipedia.org/wiki/Cosine_similarity

>

> my parameters for ssvd:

> --rank 100

> --oversampling 10

> --blockHeight 227

> --computeU false

> --input

> --output

>

> the rest should be on default.

>

> Actually, I do not really know what the oversampling parameter means...

>

> 2011/6/14 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>> Interesting.

>>

>> (I have one confusion of mine RE: lanczos -- is it computing U

>> eigenvectors or V? The doc says "eigenvectors" but doesn't say left or

>> right. if it's V (right eigenvectors) this sequence should be fine).

>>

>> With ssvd i don't do transpose, i just do computation of U which will

>> produce document singular vectors directly.

>>

>> Also, i am not sure that Lanczos actually normalizes the eigenvectors,

>> but SSVD does (or multiplies normalized version by square root of a

>> singular value, whichever is requested). So depending on which space
>> you rotate the results into, cosine similarities may be different. I assume

>> you used normalized (true) eigenvectors from ssvd.

>>

>> Also would be interesting to know what oversampling parameter (p) you used.

>>

>> Thanks.

>> -d

>>

>>

>> On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> So... lets check the dimensions:

>>>

>>> First step: Lucene Output:

>>> 227 rows (=docs) and 107909 cols (=terms)

>>>

>>> transposed to:

>>> 107909 rows and 227 cols

>>>

>>> reduced with svd (rank 100) to:

>>> 99 rows and 227 cols

>>>

>>> transposed to: (actually there was a bug (with no effect on the SVD

>>> result but on NONE result))

>>> 227 rows and 99 cols

>>>

>>> So... now the cosine results are very similar to SVD 200.

>>>

>>> Results are added.

>>>

>>> @Sebastian: I will check if the bug affects my results.

>>>

>>> 2011/6/14 Fernando Fernández <[EMAIL PROTECTED]>:

>>>> Hi Stefan,

>>>>

>>>> Are you sure you need to transpose the input matrix? I thought that what you

>>>> get from lucene index was already document(rows)-term(columns) matrix, but

>>>> you say that you obtain term-document matrix and transpose it. Is this

>>>> correct? What are you using to obtain this matrix from Lucene? Is it

>>>> possible that you are calculating similarities with the wrong matrix in some

>>>> of the two cases? (With/without dimension reduction).

>>>>

>>>> Best,

>>>> Fernando.

>>>>

>>>> 2011/6/14 Sebastian Schelter <[EMAIL PROTECTED]>

>>>>

>>>>> Hi Stefan,

>>>>>

>>>>> I checked the implementation of RowSimilarityJob and we might still have a

>>>>> bug in the 0.5 release... (f**k). I don't know if your problem is caused by

>>>>> that, but the similarity scores might not be correct...

>>>>>

>>>>> We had this issue in 0.4 already, when someone realized that cooccurrences

>>>>> were mapped out inconsistently, so for 0.5 we made sure that we always map

>>>>> the smaller row as first value. But apparently I did not adjust the value

>>>>> setting for the Cooccurrence object...

>>>>>

>>>>> In 0.5 the code is:

>>>>>

>>>>> if (rowA <= rowB) {
>>>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>>> } else {
>>>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>>> }
>>>>> coocurrence.set(column.get(), valueA, valueB);

>>>>>

>>>>> But it should be (already fixed in current trunk some days ago):

>>>>>

>>>>> if (rowA <= rowB) {

On Tue, Jun 14, 2011 at 3:35 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

>

> Normalization means that second norm of columns in the eigenvector

> matrix (i.e. all columns) is 1. In classic SVD A=U*Sigma*V', even if

> it is a thin one, U and V are orthonormal. I might be wrong but i was

> under impression that i saw some discussion saying Lanczos singular

> vector matrix is not necessarily orthonormal (although columns do form

> orthogonal basis). I might be wrong about it.

>

LanczosSolver normalizes the singular vectors (LanczosSolver.java, line 162),
and yes, it returns V, not U. Here U is documents x latent factors (it gives the
projection of each input document onto the reduced basis), and V is
latent factors x terms (its rows show which terms each latent factor is
made up of). The Lanczos solver doesn't keep track of documents (partly for
scalability: documents can be thought of as "training" your latent factor
model); instead it returns the latent factor by term "model": V.
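To make the shapes concrete, here is a minimal sketch (not Mahout code) of how document vectors could be recovered from V alone: multiply the documents x terms matrix by the transpose of V, which gives a documents x latent factors matrix. The class name `ProjectDocs`, the method `project`, and the toy matrices are made up for illustration; the toy V merely has orthonormal rows and is not the SVD of the toy input.

```java
import java.util.Arrays;

public class ProjectDocs {
    // Multiplies a (docs x terms) by the transpose of v (factors x terms),
    // giving a (docs x factors) matrix of document coordinates.
    public static double[][] project(double[][] a, double[][] v) {
        int docs = a.length, factors = v.length, terms = v[0].length;
        double[][] out = new double[docs][factors];
        for (int d = 0; d < docs; d++) {
            for (int f = 0; f < factors; f++) {
                double sum = 0;
                for (int t = 0; t < terms; t++) {
                    sum += a[d][t] * v[f][t];
                }
                out[d][f] = sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 0, 1}, {0, 1, 1}}; // toy input: 2 docs x 3 terms
        double[][] v = {{0, 0, 1}, {1, 0, 0}}; // toy basis: 2 factors x 3 terms
        for (double[] row : project(a, v)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With a real V the rows of the result would be the documents expressed in the reduced basis, and those rows are what one would feed into the cosine similarity step.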

-jake

On Tue, Jun 14, 2011 at 4:09 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:

> On Tue, Jun 14, 2011 at 3:35 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

>>

>> Normalization means that second norm of columns in the eigenvector

>> matrix (i.e. all columns) is 1. In classic SVD A=U*Sigma*V', even if

>> it is a thin one, U and V are orthonormal. I might be wrong but i was

>> under impression that i saw some discussion saying Lanczos singular

>> vector matrix is not necessarily orthonormal (although columns do form

>> orthogonal basis). I might be wrong about it.

>>

>

> LanczosSolver normalizes the singular vectors (LanczosSolver.java, line 162),
> and yes, it returns V, not U. Here U is documents x latent factors (it gives the
> projection of each input document onto the reduced basis), and V is
> latent factors x terms (its rows show which terms each latent factor is
> made up of). The Lanczos solver doesn't keep track of documents (partly for
> scalability: documents can be thought of as "training" your latent factor
> model); instead it returns the latent factor by term "model": V.

>

> -jake

One question that I think has not been answered yet is that of the
negative similarities. In the literature you can find that similarity = -1 means
that "documents talk about opposite topics", but I think this is quite an
abstract idea... I just ignore them; when I'm trying to find the top-k similar
documents these surely won't be useful. I read recently that this has to do
with the assumptions in SVD, which is designed for normal distributions (this
implies the possibility of negative values). There are other techniques
(non-negative factorization) that try to solve this. I don't know if
there's something in Mahout about this.

Best,

Fernando.

2011/6/15 Ted Dunning <[EMAIL PROTECTED]>

> The normal terminology is to name U and V in SVD as "singular vectors" as

> opposed to eigenvectors. The term eigenvectors is normally reserved for

> the

> symmetric case of U S U' (more generally, the Hermitian case, but we only

> support real values).

>

> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <[EMAIL PROTECTED]

> >wrote:

>

> > I beg to differ... U and V are left and right eigenvectors, and

> > singular values is denoted as Sigma (which is a square root of eigen

> > values of the AA' as you correctly pointed out) .

> >

>

Ignoring them is not an option... so I have to interpret these values.
Can one say that documents with similarity = -1 are the least similar
documents? I don't think this is right.

Any other assumptions?

2011/6/15 Fernando Fernández <[EMAIL PROTECTED]>:

> One question that I think has not been answered yet is that of the
> negative similarities. In the literature you can find that similarity = -1 means
> that "documents talk about opposite topics", but I think this is quite an
> abstract idea... I just ignore them; when I'm trying to find the top-k similar
> documents these surely won't be useful. I read recently that this has to do
> with the assumptions in SVD, which is designed for normal distributions (this
> implies the possibility of negative values). There are other techniques
> (non-negative factorization) that try to solve this. I don't know if
> there's something in Mahout about this.

>

> Best,

>

> Fernando.

>

> 2011/6/15 Ted Dunning <[EMAIL PROTECTED]>

>

>> The normal terminology is to name U and V in SVD as "singular vectors" as

>> opposed to eigenvectors. The term eigenvectors is normally reserved for

>> the

>> symmetric case of U S U' (more generally, the Hermitian case, but we only

>> support real values).

>>

>> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <[EMAIL PROTECTED]

>> >wrote:

>>

>> > I beg to differ... U and V are left and right eigenvectors, and

>> > singular values is denoted as Sigma (which is a square root of eigen

>> > values of the AA' as you correctly pointed out) .

>> >

>>

>

The features all take on non-negative values here, right?

Then the cosine can't be negative.

In another context, where features could be negative, cosine could

indeed be negative. -1 means most dissimilar of all -- the feature

vectors are exactly opposed.

On Wed, Jun 15, 2011 at 10:10 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Ignoring is no option... so I have to interpret these values.

> Can one say that documents with similarity = -1 are the less similar

> documents? I don't think this is right.

> Any other assumptions?

I think that LanczosSolver provides negative values as well; I don't know
about SSVD.
I guess that if the similarity has a large negative value, you can say that the
documents talk about things that almost never appear together in the same
text (if term A appears, then term B won't appear), but I think this is
almost impossible in practice (at least the most extreme case with
similarity = -1), as there are always common expressions that appear in many
documents. I think that's why avg(similarity) is always above 0 in your
case.

2011/6/15 Sean Owen <[EMAIL PROTECTED]>

> The features all take on non-negative values here, right?

> Then the cosine can't be negative.

>

> In another context, where features could be negative, cosine could

> indeed be negative. -1 means most dissimilar of all -- the feature

> vectors are exactly opposed.

>

> On Wed, Jun 15, 2011 at 10:10 AM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

> > Ignoring is no option... so I have to interpret these values.

> > Can one say that documents with similarity = -1 are the less similar

> > documents? I don't think this is right.

> > Any other assumptions?

>

While your original vectors never had similarity less than zero, after
projection onto the SVD space you may "project away" the similarities
between two vectors, so that they are now negatively correlated in this
space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
space spanned by (1,-1,0): they go from having similarity +1/2
to similarity -1).

I always interpret all similarities <= 0 as "maximally dissimilar",

even if technically -1 is where this is exactly true.
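The projection example above can be checked numerically. The following is only a small stand-alone sketch in plain Java (the class and method names `ProjectionCosine`, `cosine` and `coordinate` are invented here, nothing from Mahout). It computes the cosine of (1,0,1) and (0,1,1) before projection, which works out to 1/(sqrt(2)*sqrt(2)) = 1/2, and then the cosine of their coordinates along (1,-1,0), which have opposite signs.

```java
public class ProjectionCosine {
    // Cosine similarity of two equal-length vectors.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / Math.sqrt(na * nb);
    }

    // Coordinate of x along the direction d (d need not have unit length).
    public static double coordinate(double[] x, double[] d) {
        double dot = 0, nd = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * d[i];
            nd += d[i] * d[i];
        }
        return dot / Math.sqrt(nd);
    }

    public static void main(String[] args) {
        double[] a = {1, 0, 1};
        double[] b = {0, 1, 1};
        double[] d = {1, -1, 0};
        double before = cosine(a, b);
        // After projection each vector is a single number along d.
        double after = cosine(new double[] {coordinate(a, d)},
                              new double[] {coordinate(b, d)});
        System.out.println("before = " + before + ", after = " + after);
    }
}
```

The shared third component, which was the only source of similarity, is orthogonal to (1,-1,0) and so disappears in the projection.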

-jake

On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Ignoring is no option... so I have to interpret these values.

> Can one say that documents with similarity = -1 are the less similar

> documents? I don't think this is right.

> Any other assumptions?

>

> 2011/6/15 Fernando Fernández <[EMAIL PROTECTED]>:

> > One question that I think has not been answered yet is that of the
> > negative similarities. In the literature you can find that similarity = -1
> > means that "documents talk about opposite topics", but I think this is quite
> > an abstract idea... I just ignore them; when I'm trying to find the top-k
> > similar documents these surely won't be useful. I read recently that this
> > has to do with the assumptions in SVD, which is designed for normal
> > distributions (this implies the possibility of negative values). There are
> > other techniques (non-negative factorization) that try to solve this. I
> > don't know if there's something in Mahout about this.
> >
> > Best,
> >
> > Fernando.
> >
> > 2011/6/15 Ted Dunning <[EMAIL PROTECTED]>
> >
> >> The normal terminology is to name U and V in SVD as "singular vectors"
> >> as opposed to eigenvectors. The term eigenvectors is normally reserved
> >> for the symmetric case of U S U' (more generally, the Hermitian case, but
> >> we only support real values).
> >>
> >> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >> > I beg to differ... U and V are left and right eigenvectors, and
> >> > singular values are denoted as Sigma (the square roots of the
> >> > eigenvalues of AA', as you correctly pointed out).

> >> >

> >>

> >

>

>

>

> --

> Stefan Wienert

>

> http://www.wienert.cc

> [EMAIL PROTECTED]

>

> Telefon: +495251-2026838

> Mobil: +49176-40170270

>

Hmm. It seems I have plenty of negative results (nearly half of the
similarities). If I add +0.3, the most negative results end up
near 0. This is not optimal...

I can project the results to [0..1].
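As a rough sketch of what mapping to [0..1] could look like, here are two options side by side: a linear rescale of the cosine from [-1, 1], and a clamp that treats everything at or below zero as 0 (the interpretation Jake describes in the quoted text below). The class and method names (`SimilarityRescale`, `rescale`, `clamp`) are invented for this example and are not part of Mahout.

```java
public class SimilarityRescale {
    // Linear map from [-1, 1] to [0, 1]. Ordering is preserved,
    // but an orthogonal pair (cosine 0) now scores 0.5.
    public static double rescale(double cosine) {
        return (cosine + 1.0) / 2.0;
    }

    // Treats every non-positive cosine as "maximally dissimilar" (0)
    // and leaves positive values unchanged.
    public static double clamp(double cosine) {
        return Math.max(0.0, cosine);
    }

    public static void main(String[] args) {
        double[] samples = {-1.0, -0.3, 0.0, 0.7, 1.0};
        for (double s : samples) {
            System.out.println(s + " -> rescale " + rescale(s) + ", clamp " + clamp(s));
        }
    }
}
```

The rescale keeps the full ranking, including the order among negative values; the clamp throws that order away, which only matters if the dissimilar end of the list is of interest.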

Any other suggestions or comments?

Cheers

Stefan

2011/6/15 Jake Mannix <[EMAIL PROTECTED]>:

> While your original vectors never had similarity less than zero, after

> projection onto the SVD space, you may "project away" similarities

> between two vectors, and they are now negatively correlated in this

> space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
> space spanned by (1,-1,0) - they go from having similarity +1/2
> to similarity -1).

>

> I always interpret all similarities <= 0 as "maximally dissimilar",

> even if technically -1 is where this is exactly true.

>

> -jake

>

> On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> Ignoring is no option... so I have to interpret these values.

>> Can one say that documents with similarity = -1 are the less similar

>> documents? I don't think this is right.

>> Any other assumptions?

>>

>> 2011/6/15 Fernando Fernández <[EMAIL PROTECTED]>:

>> > One question that I think has not been answered yet is that of the
>> > negative similarities. In the literature you can find that similarity = -1
>> > means that "documents talk about opposite topics", but I think this is quite
>> > an abstract idea... I just ignore them; when I'm trying to find the top-k
>> > similar documents these surely won't be useful. I read recently that this
>> > has to do with the assumptions in SVD, which is designed for normal
>> > distributions (this implies the possibility of negative values). There are
>> > other techniques (non-negative factorization) that try to solve this. I
>> > don't know if there's something in Mahout about this.
>> >
>> > Best,
>> >
>> > Fernando.
>> >
>> > 2011/6/15 Ted Dunning <[EMAIL PROTECTED]>
>> >
>> >> The normal terminology is to name U and V in SVD as "singular vectors"
>> >> as opposed to eigenvectors. The term eigenvectors is normally reserved
>> >> for the symmetric case of U S U' (more generally, the Hermitian case, but
>> >> we only support real values).
>> >>
>> >> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <[EMAIL PROTECTED]>
>> >> wrote:
>> >>
>> >> > I beg to differ... U and V are left and right eigenvectors, and
>> >> > singular values are denoted as Sigma (the square roots of the
>> >> > eigenvalues of AA', as you correctly pointed out).

>> >> >

>> >>

>> >

>>

>>

>>

>> --

>> Stefan Wienert

>>

>> http://www.wienert.cc

>> [EMAIL PROTECTED]

>>

>> Telefon: +495251-2026838

>> Mobil: +49176-40170270

>>

>

On Wed, Jun 15, 2011 at 10:06 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Hmm. Seems I have plenty of negative results (nearly half of the

> similarity). I can add +0.3 then the greatest negative results are

> near 0. This is not optimal...

> I can project the results to [0..1].

>

Looking for *dissimilar* results seems odd. What are you trying to do?

What people normally do is look for clusters of similar documents, or

just the top-N most similar documents to each document. In both of these

cases, you don't care about the documents whose similarity to anyone is

zero, or less than zero.
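A small sketch of that top-N idea in plain Java, assuming the similarities of one document to all others are already available as an array: skip the document itself and anything with similarity <= 0, then keep the N best. `TopNSimilar` and `topN` are names made up for this example, not Mahout classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopNSimilar {
    // Returns the indices of up to n most similar documents, best first,
    // skipping the query document itself and any similarity <= 0.
    public static List<Integer> topN(double[] sims, int self, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < sims.length; i++) {
            if (i != self && sims[i] > 0) {
                candidates.add(i);
            }
        }
        // Sort by similarity, highest first.
        candidates.sort(Comparator.comparingDouble((Integer i) -> sims[i]).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    public static void main(String[] args) {
        // Toy similarities of document 0 to documents 0..4.
        double[] sims = {1.0, 0.2, -0.5, 0.9, 0.0};
        System.out.println(topN(sims, 0, 2));
    }
}
```

With this kind of filter the exact value of a negative similarity never matters, since those documents never make it into the candidate list.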

-jake

> Any other suggestions or comments?

>

> Cheers

> Stefan

>

> 2011/6/15 Jake Mannix <[EMAIL PROTECTED]>:

> > While your original vectors never had similarity less than zero, after

> > projection onto the SVD space, you may "project away" similarities

> > between two vectors, and they are now negatively correlated in this

> > space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
> > space spanned by (1,-1,0) - they go from having similarity +1/2
> > to similarity -1).

> >

> > I always interpret all similarities <= 0 as "maximally dissimilar",

> > even if technically -1 is where this is exactly true.

> >

> > -jake

> >

> > On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

> >

> >> Ignoring is no option... so I have to interpret these values.

> >> Can one say that documents with similarity = -1 are the less similar

> >> documents? I don't think this is right.

> >> Any other assumptions?

> >>

> >> 2011/6/15 Fernando Fernández <[EMAIL PROTECTED]>:

> >> > One question that I think has not been answered yet is that of the

> >> > negative similarities. In the literature you can find that

> >> > similarity=-1 means that "documents talk about opposite topics", but I

> >> > think this is quite an abstract idea... I just ignore them; when I'm

> >> > trying to find top-k similar documents these surely won't be useful. I

> >> > read recently that this has to do with the assumptions in SVD, which is

> >> > designed for normal distributions (this implies the possibility of

> >> > negative values). There are other techniques (non-negative

> >> > factorization) that try to solve this. I don't know if there's

> >> > something in Mahout about this.

> >> >

> >> > Best,

> >> >

> >> > Fernando.

> >> >

> >> > 2011/6/15 Ted Dunning <[EMAIL PROTECTED]>

> >> >

> >> >> The normal terminology is to name U and V in SVD as "singular vectors"

> >> >> as opposed to eigenvectors. The term eigenvectors is normally reserved

> >> >> for the symmetric case of U S U' (more generally, the Hermitian case,

> >> >> but we only support real values).

> >> >>

> >> >> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <

> [EMAIL PROTECTED]

> >> >> >wrote:

> >> >>

> >> >> > I beg to differ... U and V are left and right eigenvectors, and

> >> >> > the singular values are denoted Sigma (the square roots of the

> >> >> > eigenvalues of AA', as you correctly pointed out).
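
The projection example quoted in this message can be checked numerically. The following is a small sketch in plain Python; the helper names `dot`, `cosine` and `project_onto_line` are made up for illustration. It computes the cosine of (1,0,1) and (0,1,1) in the original 3-d space and again after projecting both onto the 1-d space spanned by (1,-1,0). With these exact vectors the starting cosine is 1/2 (dot product 1, each norm sqrt(2)), and after projection the two coordinates have opposite signs, so the cosine becomes -1.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def project_onto_line(v, direction):
    """Coordinate of v along the unit vector in `direction` (1-d result)."""
    norm = math.sqrt(dot(direction, direction))
    return [dot(v, direction) / norm]


x = (1.0, 0.0, 1.0)
y = (0.0, 1.0, 1.0)
d = (1.0, -1.0, 0.0)

before = cosine(x, y)
after = cosine(project_onto_line(x, d), project_onto_line(y, d))

if __name__ == "__main__":
    # Positive similarity before the projection, negative after it.
    print("before:", before)
    print("after: ", after)
```

The shared third component, which made the two vectors similar, is orthogonal to (1,-1,0) and is therefore discarded by the projection; only the components in which the vectors differ survive.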


I have been intermittently following this point.

Some folks have said that having higher dimensional SVDs should change the

distribution of distances.

Actually, that isn't quite true. SVD preserves dot products as much as

possible. With lower dimensional projections you lose some information, but

as the singular values decline, you lose less and less information.

It *is* however true that *random* unit vectors in higher dimension have a

dot product that is more and more tightly clustered around zero. This is a

different case entirely from the case that we are talking about where you

have real data projected down into a lower dimensional space.
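
The statement about random unit vectors can be illustrated with a short simulation. This is only a sketch using the Python standard library; `random_unit_vector` and `dot_spread` are hypothetical helper names, and the sample sizes are arbitrary. It draws pairs of random unit vectors (normalized Gaussian samples) in several dimensions and reports the mean and standard deviation of their dot products. For uniformly random unit vectors in d dimensions the standard deviation is 1/sqrt(d), so the spread around zero shrinks as d grows. It says nothing about real document vectors projected by SVD, which is the different case described above.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniformly random direction: normalize a vector of Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot_spread(dim, pairs=500, seed=0):
    """Mean and standard deviation of dot products of random unit vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        dots.append(sum(x * y for x, y in zip(a, b)))
    mean = sum(dots) / len(dots)
    var = sum((d - mean) ** 2 for d in dots) / len(dots)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for dim in (3, 10, 100, 300):
        mean, std = dot_spread(dim)
        # The last column is the theoretical spread 1/sqrt(dim).
        print(dim, round(mean, 3), round(std, 3), round(1 / math.sqrt(dim), 3))
```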

On Wed, Jun 15, 2011 at 7:44 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:

> On Wed, Jun 15, 2011 at 10:06 AM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

>

> > Hmm. Seems I have plenty of negative results (nearly half of the

> > similarity). I can add +0.3 then the greatest negative results are

> > near 0. This is not optimal...

> > I can project the results to [0..1].

> >

>

> Looking for *dissimilar* results seems odd. What are you trying to do?

>

> What people normally do is look for clusters of similar documents, or

> just the top-N most similar documents to each document. In both of these

> cases, you don't care about the documents whose similarity to anyone is

> zero, or less than zero.

>

> -jake

>

>

> > Any other suggestions or comments?

> >

> > Cheers

> > Stefan

> >

> > 2011/6/15 Jake Mannix <[EMAIL PROTECTED]>:

> > > While your original vectors never had similarity less than zero, after

> > > projection onto the SVD space, you may "project away" similarities

> > > between two vectors, and they are now negatively correlated in this

> > > space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector

> > > space spanned by (1,-1,0) - they go from having similarity +1/2

> > > to similarity -1).

> > >

> > > I always interpret all similarities <= 0 as "maximally dissimilar",

> > > even if technically -1 is where this is exactly true.

> > >

> > > -jake

> > >

> > > On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <[EMAIL PROTECTED]>

> > wrote:

> > >

> > >> Ignoring is no option... so I have to interpret these values.

> > >> Can one say that documents with similarity = -1 are the least similar

> > >> documents? I don't think this is right.

> > >> Any other assumptions?

> > >>

> > >> 2011/6/15 Fernando Fernández <[EMAIL PROTECTED]>:

> > >> > One question that I think has not been answered yet is that of the

> > >> > negative similarities. In the literature you can find that

> > >> > similarity=-1 means that "documents talk about opposite topics", but

> > >> > I think this is quite an abstract idea... I just ignore them; when

> > >> > I'm trying to find top-k similar documents these surely won't be

> > >> > useful. I read recently that this has to do with the assumptions in

> > >> > SVD, which is designed for normal distributions (this implies the

> > >> > possibility of negative values). There are other techniques

> > >> > (non-negative factorization) that try to solve this. I don't know if

> > >> > there's something in Mahout about this.

> > >> >

> > >> > Best,

> > >> >

> > >> > Fernando.

> > >> >

> > >> > 2011/6/15 Ted Dunning <[EMAIL PROTECTED]>

> > >> >

> > >> >> The normal terminology is to name U and V in SVD as "singular

> > >> >> vectors" as opposed to eigenvectors. The term eigenvectors is

> > >> >> normally reserved for the symmetric case of U S U' (more generally,

> > >> >> the Hermitian case, but we only support real values).

> > >> >>

> > >> >> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <

> > [EMAIL PROTECTED]

> > >> >> >wrote:

> > >> >>

> > >> >> > I beg to differ... U and V are left and right eigenvectors, and

> > >> >> > the singular values are denoted Sigma (the square roots of the

> > >> >> > eigenvalues of AA', as you correctly pointed out).
