Thread index:

Stefan Wienert (2011-06-05, 20:16)
Ted Dunning (2011-06-06, 10:08)
Stefan Wienert (2011-06-06, 11:44)
Stefan Wienert (2011-06-06, 12:22)
Stefan Wienert (2011-06-06, 13:35)
Danny Bickson (2011-06-06, 14:34)
Stefan Wienert (2011-06-06, 16:57)
Stefan Wienert (2011-06-06, 16:58)
Danny Bickson (2011-06-06, 17:08)
Stefan Wienert (2011-06-06, 17:30)
Danny Bickson (2011-06-06, 17:36)
Ted Dunning (2011-06-06, 18:11)
Stefan Wienert (2011-06-06, 19:08)
Ted Dunning (2011-06-06, 21:41)
Dmitriy Lyubimov (2011-06-07, 18:57)
Stefan Wienert (2011-06-07, 19:55)
Dmitriy Lyubimov (2011-06-07, 19:58)
Stefan Wienert (2011-06-07, 20:01)
Dmitriy Lyubimov (2011-06-07, 20:08)
Dmitriy Lyubimov (2011-06-07, 20:14)
Dmitriy Lyubimov (2011-06-07, 20:17)
Stefan Wienert (2011-06-07, 20:23)
Lance Norskog (2011-06-08, 01:46)
Sean Owen (2011-06-08, 07:27)
Dmitriy Lyubimov (2011-06-08, 18:33)

Hi,

after reading this:

https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

This looks similar to LSA/LSI, but I have some questions:

In this example, the "tfidf-vectors" matrix has 6,076,937 rows and

20,444 columns.

My first question is: do the rows represent the documents and the

columns the terms, as in a traditional term-document matrix?

So after the SVD job, you get these 87 eigenvectors, each with 20,444

entries (representing the terms).

These seem to be the eigenvectors of tfidf-vectors, but reduced to only

87 documents? What are they mathematically?

And why do you calculate tfidf-vectors^T * svdOut^T? I cannot find

an explanation for it myself.

Compared to the SVD, is the result the "right singular vectors"?

I know it works, but I don't understand some of these steps. Please help... :)

--

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838 (new number since 20.06.10)

Mobil: +49176-40170270
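[Editor's note: for readers following along, the term-document convention and the shapes of the SVD factors can be illustrated with a toy matrix. This is an editorial sketch in numpy, not part of the original mail; the sizes are stand-ins for the real 6,076,937 x 20,444 matrix.]

```python
import numpy as np

# Toy stand-in for the tfidf-vectors matrix: 5 documents x 4 terms
# (the real one is 6,076,937 x 20,444; rows = documents, columns = terms).
rng = np.random.default_rng(0)
M = rng.random((5, 4))

# Full (dense) SVD: M = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)

print(U.shape)   # one row per document
print(Vt.shape)  # one row per "concept", one column per term

# Sanity check: the factors reconstruct M.
assert np.allclose(M, U @ np.diag(S) @ Vt)
```

The rows of `Vt` are the term vectors Ted refers to below; the rows of `U` (scaled by `S`) are the document vectors.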

Yes. These are term vectors, not document vectors.

There is an additional step that can be run to produce document vectors.

On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> compared to SVD, is the result is the "right singular value"?

>

https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

the last step is the matrix multiplication:

--arg --numRowsA --arg 20444 \

--arg --numColsA --arg 6076937 \

--arg --numRowsB --arg 20444 \

--arg --numColsB --arg 87 \

so the result is a 6,076,937 x 87 matrix.

The input has 6,076,937 documents (each with 20,444 terms), so judging

by the dimensions, the result of the matrix multiplication has to be

the right singular vectors.

So the result is the "concept-document vector matrix" (I think this is

also called the "document vectors"?)

2011/6/6 Ted Dunning <[EMAIL PROTECTED]>:

> Yes. These are term vectors, not document vectors.

>

> There is an additional step that can be run to produce document vectors.

>

> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> compared to SVD, is the result is the "right singular value"?

>>

>

--

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
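[Editor's note: the multiplication described above can be sketched with toy sizes standing in for 6,076,937 / 20,444 / 87. `svd_rows` is a hypothetical name for the SVD job's concept-by-term output; this illustrates the shapes only, not Mahout's actual job.]

```python
import numpy as np

rng = np.random.default_rng(1)
docs, terms, concepts = 6, 4, 2  # stand-ins for 6076937, 20444, 87

M = rng.random((docs, terms))             # tf-idf matrix (docs x terms)
svd_rows = rng.random((concepts, terms))  # SVD job output (concepts x terms)

# The matrix-multiply step effectively computes M @ svd_rows.T,
# projecting every document onto the concept space.
doc_vectors = M @ svd_rows.T

print(doc_vectors.shape)  # (docs, concepts)
```

Judging by the shapes alone, this projection is exactly the docs-by-concepts matrix the message asks about.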

https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

What is done:

Input:

tf-idf-matrix (docs x terms) 6076937 x 20444

"SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

eigenvalues) of tf-idf-matrix, called:

svd (concepts x terms) 87 x 20444

transpose tf-idf-matrix:

tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

transpose svd:

svd-transpose (terms x concepts) 20444 x 87

matrix multiply:

tf-idf-matrix-transpose x svd-transpose = result

(terms x docs) x (terms x concepts) = (docs x concepts)

so... I understand that the "svd" here is not the SVD from Wikipedia.

It only runs the Lanczos algorithm plus some magic, which produces:

> Instead either the left or right (but usually the right) eigenvectors premultiplied by the diagonal or the square root of the

> diagonal element.

from http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E

So my question: what is the output of the SVD in Mahout? And what do I

have to calculate to get the right singular vectors from it?

Thanks,

Stefan

2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>

> the last step is the matrix multiplication:

> --arg --numRowsA --arg 20444 \

> --arg --numColsA --arg 6076937 \

> --arg --numRowsB --arg 20444 \

> --arg --numColsB --arg 87 \

> so the result is a 6,076,937 x 87 matrix

>

> the input has 6,076,937 (each with 20,444 terms). so the result of

> matrix multiplication has to be the right singular value regarding to

> the dimensions.

>

> so the result is the "concept-document vector matrix" (as I think,

> these is also called "document vectors" ?)

>

> 2011/6/6 Ted Dunning <[EMAIL PROTECTED]>:

>> Yes. These are term vectors, not document vectors.

>>

>> There is an additional step that can be run to produce document vectors.

>>

>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>

>>> compared to SVD, is the result is the "right singular value"?

>>>

>>

>

>

>

> --

> Stefan Wienert

>

> http://www.wienert.cc

> [EMAIL PROTECTED]

>

> Telefon: +495251-2026838

> Mobil: +49176-40170270

>

--

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
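[Editor's note: the identity underlying this message, that the right singular vectors of M are eigenvectors of M'M, can be checked on a toy matrix. An illustrative numpy sketch, not Mahout code:]

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((6, 4))

# Right singular vectors of M ...
_, _, Vt = np.linalg.svd(M, full_matrices=False)

# ... are eigenvectors of the symmetric matrix A = M' M.
A = M.T @ M
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order

# Compare the dominant eigenvector with the first right singular vector.
# Eigenvector sign is arbitrary, so compare up to sign.
v_svd = Vt[0]
v_eig = eigvecs[:, -1]
assert np.allclose(np.abs(v_svd), np.abs(v_eig))
```

The same comparison holds column by column (up to sign) whenever the singular values are distinct.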

After reading this thread:

http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

Wiki-SVD: M = U S V* (* = transposed)

The output of Mahout-SVD is (U S), right?

So... How do I get V from (U S) and M?

Is V = M (U S)* (because that is what the calculation in the example does)?

Thanks

Stefan

2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>

> What is done:

>

> Input:

> tf-idf-matrix (docs x terms) 6076937 x 20444

>

> "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

> eigenvalues) of tf-idf-matrix, called:

> svd (concepts x terms) 87 x 20444

>

> transpose tf-idf-matrix:

> tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

>

> transpose svd:

> svd-transpose (terms x concepts) 20444 x 87

>

> matrix multiply:

> tf-idf-matrix-transpose x svd-transpose = result

> (terms x docs) x (terms x concepts) = (docs x concepts)

>

> so... I do understand, that the "svd" here is not SVD from wikipedia.

> It only does the Lanczos algorithm and some magic which produces the

>> Instead either the left or right (but usually the right) eigenvectors premultiplied by the diagonal or the square root of the

>> diagonal element.

> from http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E

>

> so my question: what is the output of the SVD in mahout. And what do I

> have to calculate to get the "right singular value" from svd?

>

> Thanks,

> Stefan

>

> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>>

>> the last step is the matrix multiplication:

>> --arg --numRowsA --arg 20444 \

>> --arg --numColsA --arg 6076937 \

>> --arg --numRowsB --arg 20444 \

>> --arg --numColsB --arg 87 \

>> so the result is a 6,076,937 x 87 matrix

>>

>> the input has 6,076,937 (each with 20,444 terms). so the result of

>> matrix multiplication has to be the right singular value regarding to

>> the dimensions.

>>

>> so the result is the "concept-document vector matrix" (as I think,

>> these is also called "document vectors" ?)

>>

>> 2011/6/6 Ted Dunning <[EMAIL PROTECTED]>:

>>> Yes. These are term vectors, not document vectors.

>>>

>>> There is an additional step that can be run to produce document vectors.

>>>

>>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>

>>>> compared to SVD, is the result is the "right singular value"?

>>>>

>>>

>>

>>

>>

>> --

>> Stefan Wienert

>>

>> http://www.wienert.cc

>> [EMAIL PROTECTED]

>>

>> Telefon: +495251-2026838

>> Mobil: +49176-40170270

>>

>

>

>

> --

> Stefan Wienert

>

> http://www.wienert.cc

> [EMAIL PROTECTED]

>

> Telefon: +495251-2026838

> Mobil: +49176-40170270

>

--

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
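[Editor's note: a quick way to see why multiplying M by the term-vector output recovers document vectors: if the SVD job returns the rows of V', then M V = U S V' V = U S. A toy numpy check of that identity, assuming that interpretation of the output:]

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.random((6, 4))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# If the SVD job's output rows are V' (term vectors), then
# M @ V = U @ diag(S) @ V' @ V = U @ diag(S): the scaled document vectors.
US = M @ Vt.T
assert np.allclose(US, U @ np.diag(S))
```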

Hi

Mahout's SVD implementation computes the Lanczos iteration:

http://en.wikipedia.org/wiki/Lanczos_algorithm

Denote the non-square input matrix as M. First, a symmetric matrix A is

computed as A = M'*M.

Then an approximating tridiagonal matrix T and a vector matrix V are

computed such that A =~ V*T*V'

(this process is done in a distributed way).

Next, the matrix T is decomposed into eigenvectors and eigenvalues,

which is the returned result. (This process is serial.)

The third step makes the returned eigenvectors orthogonal to each other

(which is optional IMHO).

The heart of the code is found at:

./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

At least that is where it was in version 0.4; I am not sure if there

are changes in version 0.5.

Anyway, Mahout does not directly compute the SVD. If you are interested

in learning more about the relation to SVD,

look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

subsection: relation to eigenvalue decomposition.

Hope this helps,

Danny Bickson

On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> After reading this thread:

>

> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

>

> Wiki-SVD: M = U S V* (* = transposed)

>

> The output of Mahout-SVD is (U S) right?

>

> So... How do I get V from (U S) and M?

>

> Is V = M (U S)* (because this is, what the calculation in the example is)?

>

> Thanks

> Stefan

>

> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

> >

> > What is done:

> >

> > Input:

> > tf-idf-matrix (docs x terms) 6076937 x 20444

> >

> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

> > eigenvalues) of tf-idf-matrix, called:

> > svd (concepts x terms) 87 x 20444

> >

> > transpose tf-idf-matrix:

> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

> >

> > transpose svd:

> > svd-transpose (terms x concepts) 20444 x 87

> >

> > matrix multiply:

> > tf-idf-matrix-transpose x svd-transpose = result

> > (terms x docs) x (terms x concepts) = (docs x concepts)

> >

> > so... I do understand, that the "svd" here is not SVD from wikipedia.

> > It only does the Lanczos algorithm and some magic which produces the

> >> Instead either the left or right (but usually the right) eigenvectors

> premultiplied by the diagonal or the square root of the

> >> diagonal element.

> > from

> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E

> >

> > so my question: what is the output of the SVD in mahout. And what do I

> > have to calculate to get the "right singular value" from svd?

> >

> > Thanks,

> > Stefan

> >

> > 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

> >>

> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

> >>

> >> the last step is the matrix multiplication:

> >> --arg --numRowsA --arg 20444 \

> >> --arg --numColsA --arg 6076937 \

> >> --arg --numRowsB --arg 20444 \

> >> --arg --numColsB --arg 87 \

> >> so the result is a 6,076,937 x 87 matrix

> >>

> >> the input has 6,076,937 (each with 20,444 terms). so the result of

> >> matrix multiplication has to be the right singular value regarding to

> >> the dimensions.

> >>

> >> so the result is the "concept-document vector matrix" (as I think,

> >> these is also called "document vectors" ?)

> >>

> >> 2011/6/6 Ted Dunning <[EMAIL PROTECTED]>:

> >>> Yes. These are term vectors, not document vectors.

> >>>

> >>> There is an additional step that can be run to produce document

> vectors.

> >>>

> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

> >>>

> >>>> compared to SVD, is the result is the "right singular value"?

> >>>>

> >>>

> >>

> >>

> >>

> >> --

> >> Stefan Wienert

> >>

> >> http://www.wienert.cc

> >> [EMAIL PROTECTED]

> >>

> >> Telefon: +495251-2026838

> >> Mobil: +49176-40170270
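[Editor's note: Danny's three steps can be sketched in a few lines of numpy. This is a plain Lanczos iteration without reorthogonalization, an editorial illustration rather than Mahout's actual LanczosSolver; with k equal to the full dimension, A = V T V' holds exactly.]

```python
import numpy as np

def lanczos(A, k, seed=4):
    """Plain Lanczos (no reorthogonalization): builds orthonormal V and
    tridiagonal T with V' A V = T, for symmetric A."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V = np.zeros((n, k))
    alphas, betas = [], []
    v = rng.random(n)
    v /= np.linalg.norm(v)
    beta, v_prev = 0.0, np.zeros(n)
    for j in range(k):
        V[:, j] = v
        w = A @ v - beta * v_prev          # three-term recurrence
        alpha = w @ v
        w -= alpha * v
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if j < k - 1:
            betas.append(beta)
            v_prev, v = v, w / beta
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    return V, T

# Step 1: symmetric A = M' M from a non-square M.
M = np.random.default_rng(5).random((8, 4))
A = M.T @ M

# Step 2: tridiagonalize (full rank here, so the relation is exact).
V, T = lanczos(A, 4)
assert np.allclose(V.T @ A @ V, T, atol=1e-8)

# Step 3: eigendecompose T; its eigenvalues are those of A.
assert np.allclose(np.sort(np.linalg.eigvalsh(T)),
                   np.sort(np.linalg.eigvalsh(A)), atol=1e-8)
```

In practice k is much smaller than n and only the top part of the spectrum is approximated, which is the point of the method.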

Hi.

Thanks for the help.

The important points from wikipedia are:

- The left singular vectors of M are eigenvectors of M*M' .

- The right singular vectors of M are eigenvectors of M'*M.

As you describe, the Mahout Lanczos solver calculates A = M'*M (I think

it actually does A = M*M', but that is not a problem). Therefore it

already calculates the right (or left) singular vectors of M.

But my question is: how can I get the other singular vectors? I could

transpose M, but then I would have to calculate two SVDs, one for the

right and one for the left singular vectors... I think there is a

better way :)

Hope you can help me with this...

Thanks

Stefan

2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> Hi

> Mahout SVD implementation computes the Lanzcos iteration:

> http://en.wikipedia.org/wiki/Lanczos_algorithm

> Denote the non-square input matrix as M. First a symmetric matrix A is

> computed by A=M'*M

> Then an approximating tridiagonal matrix T and a vector matrix V are

> computed such that A =~ V*T*V'

> (this process is done in a distributed way).

>

> Next the matrix T is next decomposed into eigenvectors and eignevalues.

> Which is the returned result. (This process

> is serial).

>

> The third step makes the returned eigenvectors orthogonal to each other

> (which is optional IMHO).

>

> The heart of the code is found at:

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> At least that is where it was in version 0.4 I am not sure if there are

> changes in version 0.5

>

> Anyway, Mahout does not compute directly SVD. If you are interested in

> learning more about the relation to SVD

> look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> subsection: relation to eigenvalue decomposition.

>

> Hope this helps,

>

> Danny Bickson

>

> On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> After reading this thread:

>>

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

>>

>> Wiki-SVD: M = U S V* (* = transposed)

>>

>> The output of Mahout-SVD is (U S) right?

>>

>> So... How do I get V from (U S) and M?

>>

>> Is V = M (U S)* (because this is, what the calculation in the example is)?

>>

>> Thanks

>> Stefan

>>

>> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >

>> > What is done:

>> >

>> > Input:

>> > tf-idf-matrix (docs x terms) 6076937 x 20444

>> >

>> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

>> > eigenvalues) of tf-idf-matrix, called:

>> > svd (concepts x terms) 87 x 20444

>> >

>> > transpose tf-idf-matrix:

>> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

>> >

>> > transpose svd:

>> > svd-transpose (terms x concepts) 20444 x 87

>> >

>> > matrix multiply:

>> > tf-idf-matrix-transpose x svd-transpose = result

>> > (terms x docs) x (terms x concepts) = (docs x concepts)

>> >

>> > so... I do understand, that the "svd" here is not SVD from wikipedia.

>> > It only does the Lanczos algorithm and some magic which produces the

>> >> Instead either the left or right (but usually the right) eigenvectors

>> premultiplied by the diagonal or the square root of the

>> >> diagonal element.

>> > from

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E

>> >

>> > so my question: what is the output of the SVD in mahout. And what do I

>> > have to calculate to get the "right singular value" from svd?

>> >

>> > Thanks,

>> > Stefan

>> >

>> > 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> >>

>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >>

>> >> the last step is the matrix multiplication:

>> >> --arg --numRowsA --arg 20444 \

>> >> --arg --numColsA --arg 6076937 \

>> >> --arg --numRowsB --arg 20444 \

>> >> --arg --numColsB --arg 87 \

>> >> so the result is a 6,076,937 x 87 matrix

>> >>

>

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
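[Editor's note: a second SVD should indeed not be needed. Given the right singular vectors V and the singular values S, the other factor follows from M V = U diag(S), i.e. U = M V S^{-1}. A toy numpy sketch of that identity:]

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.random((6, 4))
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Given V and the singular values, the left singular vectors follow
# from M V = U diag(S): divide each column of M V by its singular value.
U_recovered = (M @ Vt.T) / S
assert np.allclose(U_recovered, U)
```

This only needs one extra matrix multiplication and a column scaling, so no second Lanczos run on the transposed matrix is required.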

Thanks for the help.

The important points from wikipedia are:

- The left singular vectors of M are eigenvectors of M*M' .

- The right singular vectors of M are eigenvectors of M'*M.

as you describe, the mahout lanczos solver calculate A=M'*M (I think

it does A=M*M', but it is not a problem). Therefore it does already

calculate the right (or left) singular vector of M.

But my question is, how can I get the other singular vector? I can

transpose M, but then I have to calculated two SVDs, one for the right

and one for the left singular value... I think there is a better way

:)

Hope you can help me with this...

Thanks

Stefan

2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> Hi

> Mahout SVD implementation computes the Lanzcos iteration:

> http://en.wikipedia.org/wiki/Lanczos_algorithm

> Denote the non-square input matrix as M. First a symmetric matrix A is

> computed by A=M'*M

> Then an approximating tridiagonal matrix T and a vector matrix V are

> computed such that A =~ V*T*V'

> (this process is done in a distributed way).

>

> Next the matrix T is next decomposed into eigenvectors and eignevalues.

> Which is the returned result. (This process

> is serial).

>

> The third step makes the returned eigenvectors orthogonal to each other

> (which is optional IMHO).

>

> The heart of the code is found at:

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> At least that is where it was in version 0.4 I am not sure if there are

> changes in version 0.5

>

> Anyway, Mahout does not compute directly SVD. If you are interested in

> learning more about the relation to SVD

> look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> subsection: relation to eigenvalue decomposition.

>

> Hope this helps,

>

> Danny Bickson

>

> On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> After reading this thread:

>>

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

>>

>> Wiki-SVD: M = U S V* (* = transposed)

>>

>> The output of Mahout-SVD is (U S) right?

>>

>> So... How do I get V from (U S) and M?

>>

>> Is V = M (U S)* (because this is, what the calculation in the example is)?

>>

>> Thanks

>> Stefan

>>

>> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >

>> > What is done:

>> >

>> > Input:

>> > tf-idf-matrix (docs x terms) 6076937 x 20444

>> >

>> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

>> > eigenvalues) of tf-idf-matrix, called:

>> > svd (concepts x terms) 87 x 20444

>> >

>> > transpose tf-idf-matrix:

>> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

>> >

>> > transpose svd:

>> > svd-transpose (terms x concepts) 20444 x 87

>> >

>> > matrix multiply:

>> > tf-idf-matrix-transpose x svd-transpose = result

>> > (terms x docs) x (terms x concepts) = (docs x concepts)

>> >

>> > so... I do understand, that the "svd" here is not SVD from wikipedia.

>> > It only does the Lanczos algorithm and some magic which produces the

>> >> Instead either the left or right (but usually the right) eigenvectors

>> premultiplied by the diagonal or the square root of the

>> >> diagonal element.

>> > from

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E

>> >

>> > so my question: what is the output of the SVD in mahout. And what do I

>> > have to calculate to get the "right singular value" from svd?

>> >

>> > Thanks,

>> > Stefan

>> >

>> > 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> >>

>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >>

>> >> the last step is the matrix multiplication:

>> >> --arg --numRowsA --arg 20444 \

>> >> --arg --numColsA --arg 6076937 \

>> >> --arg --numRowsB --arg 20444 \

>> >> --arg --numColsB --arg 87 \

>> >> so the result is a 6,076,937 x 87 matrix

>> >>

>

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270

Thanks for the help.

The important points from wikipedia are:

- The left singular vectors of M are eigenvectors of M*M' .

- The right singular vectors of M are eigenvectors of M'*M.

as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

it actually does A=M*M', but that is not a problem). Therefore it already

calculates the right (or left) singular vectors of M.

But my question is: how can I get the other singular vectors? I can

transpose M, but then I would have to compute two SVDs, one for the right

and one for the left singular vectors... I think there is a better way

:)

Hope you can help me with this...

Thanks

Stefan

2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> Hi

> Mahout's SVD implementation computes the Lanczos iteration:

> http://en.wikipedia.org/wiki/Lanczos_algorithm

> Denote the non-square input matrix as M. First a symmetric matrix A is

> computed by A=M'*M

> Then an approximating tridiagonal matrix T and a vector matrix V are

> computed such that A =~ V*T*V'

> (this process is done in a distributed way).

>

> Next, the matrix T is decomposed into eigenvectors and eigenvalues,

> which are the returned result. (This process

> is serial).

>

> The third step makes the returned eigenvectors orthogonal to each other

> (which is optional IMHO).

>

> The heart of the code is found at:

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> At least that is where it was in version 0.4 I am not sure if there are

> changes in version 0.5

>

> Anyway, Mahout does not compute the SVD directly. If you are interested in

> learning more about the relation to SVD

> look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> subsection: relation to eigenvalue decomposition.

>

> Hope this helps,

>

> Danny Bickson

>

> On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> After reading this thread:

>>

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

>>

>> Wiki-SVD: M = U S V* (* = transposed)

>>

>> The output of Mahout-SVD is (U S) right?

>>

>> So... How do I get V from (U S) and M?

>>

>> Is V = M (U S)* (because this is what the calculation in the example does)?

>>

>> Thanks

>> Stefan

>>

>> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >

>> > What is done:

>> >

>> > Input:

>> > tf-idf-matrix (docs x terms) 6076937 x 20444

>> >

>> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

>> > eigenvalues) of tf-idf-matrix, called:

>> > svd (concepts x terms) 87 x 20444

>> >

>> > transpose tf-idf-matrix:

>> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

>> >

>> > transpose svd:

>> > svd-transpose (terms x concepts) 20444 x 87

>> >

>> > matrix multiply:

>> > tf-idf-matrix-transpose x svd-transpose = result

>> > (terms x docs) x (terms x concepts) = (docs x concepts)

>> >

>> > so... I do understand that the "svd" here is not SVD from wikipedia.

>> > It only does the Lanczos algorithm and some magic which produces the

>> >> Instead either the left or right (but usually the right) eigenvectors

>> premultiplied by the diagonal or the square root of the

>> >> diagonal element.

>> > from

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E

>> >

>> > so my question: what is the output of the SVD in mahout. And what do I

>> > have to calculate to get the "right singular value" from svd?

>> >

>> > Thanks,

>> > Stefan

>> >

>> > 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> >>

>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >>

>> >> the last step is the matrix multiplication:

>> >> --arg --numRowsA --arg 20444 \

>> >> --arg --numColsA --arg 6076937 \

>> >> --arg --numRowsB --arg 20444 \

>> >> --arg --numColsB --arg 87 \

>> >> so the result is a 6,076,937 x 87 matrix

>> >>

>

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
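The two Wikipedia facts quoted above are easy to check numerically. A small NumPy sketch of the relationship (illustrative only, not Mahout code):

```python
import numpy as np

# Toy "docs x terms" matrix M.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 4))

# Full SVD: M = U S V'
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigendecomposition of the symmetric matrix M'M,
# sorted to descending eigenvalue order (eigh returns ascending).
w, Q = np.linalg.eigh(M.T @ M)
order = np.argsort(w)[::-1]
w, Q = w[order], Q[:, order]

# Eigenvalues of M'M are the squared singular values of M...
assert np.allclose(np.sqrt(w), s)

# ...and its eigenvectors are the right singular vectors, up to sign.
for k in range(len(s)):
    assert np.allclose(Q[:, k], Vt[k]) or np.allclose(Q[:, k], -Vt[k])
```

The sign ambiguity is expected: each singular vector is only determined up to a factor of -1.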

Hi Stefan!

For a positive semidefinite matrix, the left and right singular vectors are

identical.

See the SVD Wikipedia text: When *M* is also positive

semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

the decomposition *M* = *U**D**U* * is also a singular value decomposition.

So you don't need to be worried about the other singular vectors.

Hope this helps!

On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Hi.

>

> Thanks for the help.

>

> The important points from wikipedia are:

> - The left singular vectors of M are eigenvectors of M*M' .

> - The right singular vectors of M are eigenvectors of M'*M.

>

> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

> it does A=M*M', but it is not a problem). Therefore it does already

> calculate the right (or left) singular vector of M.

>

> But my question is, how can I get the other singular vector? I can

> transpose M, but then I have to calculate two SVDs, one for the right

> and one for the left singular value... I think there is a better way

> :)

>

> Hope you can help me with this...

> Thanks

> Stefan

>

>

> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > Hi

> > Mahout SVD implementation computes the Lanczos iteration:

> > http://en.wikipedia.org/wiki/Lanczos_algorithm

> > Denote the non-square input matrix as M. First a symmetric matrix A is

> > computed by A=M'*M

> > Then an approximating tridiagonal matrix T and a vector matrix V are

> > computed such that A =~ V*T*V'

> > (this process is done in a distributed way).

> >

> > Next, the matrix T is decomposed into eigenvectors and eigenvalues.

> > Which is the returned result. (This process

> > is serial).

> >

> > The third step makes the returned eigenvectors orthogonal to each other

> > (which is optional IMHO).

> >

> > The heart of the code is found at:

> >

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> > At least that is where it was in version 0.4 I am not sure if there are

> > changes in version 0.5

> >

> > Anyway, Mahout does not compute the SVD directly. If you are interested in

> > learning more about the relation to SVD

> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> > subsection: relation to eigenvalue decomposition.

> >

> > Hope this helps,

> >

> > Danny Bickson

> >

> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

> >

> >> After reading this thread:

> >>

> >>

> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

> >>

> >> Wiki-SVD: M = U S V* (* = transposed)

> >>

> >> The output of Mahout-SVD is (U S) right?

> >>

> >> So... How do I get V from (U S) and M?

> >>

> >> Is V = M (U S)* (because this is, what the calculation in the example

> is)?

> >>

> >> Thanks

> >> Stefan

> >>

> >> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

> >> >

> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

> >> >

> >> > What is done:

> >> >

> >> > Input:

> >> > tf-idf-matrix (docs x terms) 6076937 x 20444

> >> >

> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

> >> > eigenvalues) of tf-idf-matrix, called:

> >> > svd (concepts x terms) 87 x 20444

> >> >

> >> > transpose tf-idf-matrix:

> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

> >> >

> >> > transpose svd:

> >> > svd-transpose (terms x concepts) 20444 x 87

> >> >

> >> > matrix multiply:

> >> > tf-idf-matrix-transpose x svd-transpose = result

> >> > (terms x docs) x (terms x concepts) = (docs x concepts)

> >> >

> >> > so... I do understand, that the "svd" here is not SVD from wikipedia.

> >> > It only does the Lanczos algorithm and some magic which produces the

> >> >> Instead either the left or right (but usually the right) eigenvectors

> >> premultiplied by the diagonal or the square root of the

> >> >> diagonal element.

> >> > from

> >>

> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+[EMAIL PROTECTED]%3E
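The Wikipedia statement Danny cites can also be checked directly: for a symmetric positive semidefinite matrix such as A = M'M, the eigendecomposition A = Q D Q' is itself a valid singular value decomposition (U = V = Q). A quick NumPy sketch (illustrative only, not Mahout code):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 4))
A = M.T @ M                       # symmetric positive semidefinite

# Eigendecomposition, sorted to descending eigenvalue order.
w, Q = np.linalg.eigh(A)
order = np.argsort(w)[::-1]
w, Q = w[order], Q[:, order]

# Compare against NumPy's SVD of A.
U, s, Vt = np.linalg.svd(A)

# Singular values of a PSD matrix equal its eigenvalues,
assert np.allclose(s, w)
# and Q diag(w) Q' reproduces A, so U = V = Q works as an SVD.
assert np.allclose(Q @ np.diag(w) @ Q.T, A)
```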


Hi Danny!

I understand that for M*M' (and for M'*M) the left and right

eigenvectors are identical. But that is not exactly what I want. The

lanczos solver from mahout gives me the eigenvectors of M*M', which

are the left singular vectors of M. But I need the right singular

vectors of M (and not M*M'). How do I get them?

Sorry, my matrix math is not as good as it should be, but I hope you

can help me!

Thanks,

Stefan

2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> Hi Stefan!

> For a positive semidefinite matrix, the left and right singular vectors are

> identical.

> See the SVD Wikipedia text: When *M* is also positive

> semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

> the decomposition *M* = *U**D**U* * is also a singular value decomposition.

> So you don't need to be worried about the other singular vectors.

>

> Hope this helps!

>

> On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> Hi.

>>

>> Thanks for the help.

>>

>> The important points from wikipedia are:

>> - The left singular vectors of M are eigenvectors of M*M' .

>> - The right singular vectors of M are eigenvectors of M'*M.

>>

>> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

>> it does A=M*M', but it is not a problem). Therefore it does already

>> calculate the right (or left) singular vector of M.

>>

>> But my question is, how can I get the other singular vector? I can

>> transpose M, but then I have to calculate two SVDs, one for the right

>> and one for the left singular value... I think there is a better way

>> :)

>>

>> Hope you can help me with this...

>> Thanks

>> Stefan

>>

>>

>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>> > Hi

>> > Mahout SVD implementation computes the Lanczos iteration:

>> > http://en.wikipedia.org/wiki/Lanczos_algorithm

>> > Denote the non-square input matrix as M. First a symmetric matrix A is

>> > computed by A=M'*M

>> > Then an approximating tridiagonal matrix T and a vector matrix V are

>> > computed such that A =~ V*T*V'

>> > (this process is done in a distributed way).

>> >

>> > Next, the matrix T is decomposed into eigenvectors and eigenvalues.

>> > Which is the returned result. (This process

>> > is serial).

>> >

>> > The third step makes the returned eigenvectors orthogonal to each other

>> > (which is optional IMHO).

>> >

>> > The heart of the code is found at:

>> >

>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

>> > At least that is where it was in version 0.4 I am not sure if there are

>> > changes in version 0.5

>> >

>> > Anyway, Mahout does not compute the SVD directly. If you are interested in

>> > learning more about the relation to SVD

>> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

>> > subsection: relation to eigenvalue decomposition.

>> >

>> > Hope this helps,

>> >

>> > Danny Bickson

>> >

>> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]>

>> wrote:

>> >

>> >> After reading this thread:

>> >>

>> >>

>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

>> >>

>> >> Wiki-SVD: M = U S V* (* = transposed)

>> >>

>> >> The output of Mahout-SVD is (U S) right?

>> >>

>> >> So... How do I get V from (U S) and M?

>> >>

>> >> Is V = M (U S)* (because this is, what the calculation in the example

>> is)?

>> >>

>> >> Thanks

>> >> Stefan

>> >>

>> >> 2011/6/6 Stefan Wienert <[EMAIL PROTECTED]>:

>> >> >

>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

>> >> >

>> >> > What is done:

>> >> >

>> >> > Input:

>> >> > tf-idf-matrix (docs x terms) 6076937 x 20444

>> >> >

>> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and

>> >> > eigenvalues) of tf-idf-matrix, called:

>> >> > svd (concepts x terms) 87 x 20444

>> >> >

>> >> > transpose tf-idf-matrix:

>> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937

>> >> >

>> >> > transpose svd:

>> >> > svd-transpose (terms x concepts) 20444 x 87

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
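For completeness, the second SVD Stefan worries about is not needed: given M and one side of the decomposition, the other side follows from a single matrix product, since M = U S V' implies M'U = V S, hence V = M' U S^{-1} (and symmetrically U = M V S^{-1}). A NumPy sketch of that identity (illustrative only, not the Mahout pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 4))   # docs x terms

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# M = U S V'  =>  M' U = V S  =>  V = M' U S^{-1}
V_recovered = (M.T @ U) / s       # broadcasting divides column k by s[k]

assert np.allclose(V_recovered, Vt.T)
```

This is essentially what the wiki example's final multiply does: one pass over M against the already-computed singular vectors, instead of a second decomposition.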


If I understand your question correctly, you need to simply transpose M

before you start the run and that way

you will get the other singular vectors.

May I ask what problem you are working on and why you need the

singular vectors?

Could you consider using another matrix decomposition technique, for example

alternating least squares,

which gives you two lower-rank matrices whose product approximates the large

decomposed matrix?

On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Hi Danny!

>

> I understand that for M*M' (and for M'*M) the left and right

> eigenvectors are identical. But that is not exactly what I want. The

> lanczos solver from mahout gives me the eigenvectors of M*M', which

> are the left singular vectors of M. But I need the right singular

> vectors of M (and not M*M'). How do I get them?

>

> Sorry, my matrix math is not as good as it should be, but I hope you

> can help me!

>

> Thanks,

> Stefan

>

> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > Hi Stefan!

> > For a positive semidefinite matrix, the left and right singular vectors are

> > identical.

> > See the SVD Wikipedia text: When *M* is also positive

> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

> > the decomposition *M* = *U**D**U* * is also a singular value

> decomposition.

> > So you don't need to be worried about the other singular vectors.

> >

> > Hope this helps!

> >

> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

> >

> >> Hi.

> >>

> >> Thanks for the help.

> >>

> >> The important points from wikipedia are:

> >> - The left singular vectors of M are eigenvectors of M*M' .

> >> - The right singular vectors of M are eigenvectors of M'*M.

> >>

> >> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

> >> it does A=M*M', but it is not a problem). Therefore it does already

> >> calculate the right (or left) singular vector of M.

> >>

> >> But my question is, how can I get the other singular vector? I can

> >> transpose M, but then I have to calculate two SVDs, one for the right

> >> and one for the left singular value... I think there is a better way

> >> :)

> >>

> >> Hope you can help me with this...

> >> Thanks

> >> Stefan

> >>

> >>

> >> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> >> > Hi

> >> > Mahout SVD implementation computes the Lanczos iteration:

> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm

> >> > Denote the non-square input matrix as M. First a symmetric matrix A is

> >> > computed by A=M'*M

> >> > Then an approximating tridiagonal matrix T and a vector matrix V are

> >> > computed such that A =~ V*T*V'

> >> > (this process is done in a distributed way).

> >> >

> >> > Next the matrix T is next decomposed into eigenvectors and

> eigenvalues.

> >> > Which is the returned result. (This process

> >> > is serial).

> >> >

> >> > The third step makes the returned eigenvectors orthogonal to each

> other

> >> > (which is optional IMHO).

> >> >

> >> > The heart of the code is found at:

> >> >

> >>

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> >> > At least that is where it was in version 0.4 I am not sure if there

> are

> >> > changes in version 0.5

> >> >

> >> > Anyway, Mahout does not compute the SVD directly. If you are interested in

> >> > learning more about the relation to SVD

> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> >> > subsection: relation to eigenvalue decomposition.

> >> >

> >> > Hope this helps,

> >> >

> >> > Danny Bickson

> >> >

> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <[EMAIL PROTECTED]>

> >> wrote:

> >> >

> >> >> After reading this thread:

> >> >>

> >> >>

> >>

> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[EMAIL PROTECTED]%3E

> >> >>

> >> >> Wiki-SVD: M = U S V* (* = transposed)

> >> >>

> >> >> The output of Mahout-SVD is (U S) right?

> >> >>

> >> >> So... How do I get V from (U S) and M?
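Danny's suggestion works because transposing the input swaps the two sides of the decomposition: if M = U S V', then M' = V S U', so the left singular vectors of M' are exactly the right singular vectors of M. A NumPy check (illustrative only, not Mahout code):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
U2, s2, V2t = np.linalg.svd(M.T, full_matrices=False)

# Same singular values either way.
assert np.allclose(s, s2)

# Left singular vectors of M' match right singular vectors of M, up to sign.
for k in range(len(s)):
    assert np.allclose(U2[:, k], Vt[k]) or np.allclose(U2[:, k], -Vt[k])
```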


The current arrangement makes it easier to compute the singular values of

new documents. It requires an extra operation either to transpose or to do

a multiply to get the vectors for the documents (rows).

On Mon, Jun 6, 2011 at 10:36 AM, Danny Bickson <[EMAIL PROTECTED]>wrote:

> If I understand your question correctly, you need to simply transpose M

> before you start the run and that way

> you will get the other singular vectors.

>

> May I ask what is the problem you are working on and why do you need the

> singular vectors?

> Can you consider using another matrix decomposition technique for example

> alternating least squares

> which gives you two lower rank matrices which simulates the large

> decomposed

> matrix?

>

> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

> > Hi Danny!

> >

> > I understand that for M*M' (and for M'*M) the left and right

> > eigenvectors are identical. But that is not exactly what I want. The

> > lanczos solver from mahout gives me the eigenvectors of M*M', which

> > are the left singular vectors of M. But I need the right singular

> > vectors of M (and not M*M'). How do I get them?

> >

> > Sorry, my matrix math is not as good as it should be, but I hope you

> > can help me!

> >

> > Thanks,

> > Stefan

> >

> > 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > > Hi Stefan!

> > > For a positive semidefinite matrix, the left and right singular vectors are

> > > identical.

> > > See the SVD Wikipedia text: When *M* is also positive

> > > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

> > > the decomposition *M* = *U**D**U* * is also a singular value

> > decomposition.

> > > So you don't need to be worried about the other singular vectors.

> > >

> > > Hope this helps!

> > >

> > > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

> > wrote:

> > >

> > >> Hi.

> > >>

> > >> Thanks for the help.

> > >>

> > >> The important points from wikipedia are:

> > >> - The left singular vectors of M are eigenvectors of M*M' .

> > >> - The right singular vectors of M are eigenvectors of M'*M.

> > >>

> > >> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

> > >> it does A=M*M', but it is not a problem). Therefore it does already

> > >> calculate the right (or left) singular vector of M.

> > >>

> > >> But my question is, how can I get the other singular vector? I can

> > >> transpose M, but then I have to calculate two SVDs, one for the right

> > >> and one for the left singular value... I think there is a better way

> > >> :)

> > >>

> > >> Hope you can help me with this...

> > >> Thanks

> > >> Stefan

> > >>

> > >>

> > >> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > >> > Hi

> > >> > Mahout SVD implementation computes the Lanczos iteration:

> > >> > http://en.wikipedia.org/wiki/Lanczos_algorithm

> > >> > Denote the non-square input matrix as M. First a symmetric matrix A

> is

> > >> > computed by A=M'*M

> > >> > Then an approximating tridiagonal matrix T and a vector matrix V are

> > >> > computed such that A =~ V*T*V'

> > >> > (this process is done in a distributed way).

> > >> >

> > >> > Next the matrix T is decomposed into eigenvectors and

> > eigenvalues.

> > >> > Which is the returned result. (This process

> > >> > is serial).

> > >> >

> > >> > The third step makes the returned eigenvectors orthogonal to each

> > other

> > >> > (which is optional IMHO).

> > >> >

> > >> > The heart of the code is found at:

> > >> >

> > >>

> >

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> > >> > At least that is where it was in version 0.4 I am not sure if there

> > are

> > >> > changes in version 0.5

> > >> >

> > >> > Anyway, Mahout does not compute directly SVD. If you are interested

> in

> > >> > learning more about the relation to SVD

> > >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> > >> > subsection: relation to eigenvalue decomposition.

> > >> >

> > >> > Hope this helps,

new documents. It requires an extra operation either to transpose or to do

a multiply to get the vectors for the documents (rows).
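The extra operation described above can be sketched in a few lines of numpy (an illustrative stand-in using a random matrix, not Mahout code): an eigendecomposition of M'M yields the right singular vectors V (one per term), and one extra multiply projects the original rows (documents) through V to recover their coordinates.

```python
import numpy as np

# Hypothetical small document-term matrix: rows = documents, columns = terms.
rng = np.random.default_rng(0)
M = rng.random((6, 4))

# Eigendecomposition of M'M gives the right singular vectors V (term space),
# which is what a Lanczos run on M'M returns.
evals, V = np.linalg.eigh(M.T @ M)
order = np.argsort(evals)[::-1]                  # sort by decreasing eigenvalue
V = V[:, order]
sigma = np.sqrt(np.clip(evals[order], 0, None))  # singular values of M

# The extra multiply: push the rows (documents) through V to get document
# coordinates; dividing by sigma recovers the left singular vectors U.
U = (M @ V) / sigma

# Cross-check against a direct SVD (columns can only differ by sign).
U2, s2, Vt2 = np.linalg.svd(M, full_matrices=False)
assert np.allclose(sigma, s2, atol=1e-6)
assert np.allclose(np.abs(U), np.abs(U2), atol=1e-6)
```

The same identity, U = M V Sigma^-1, is what the wiki's `tfidf-vectors^T * svdOut^T` step computes, up to the missing Sigma^-1 scaling.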

On Mon, Jun 6, 2011 at 10:36 AM, Danny Bickson <[EMAIL PROTECTED]>wrote:

> If I understand your question correctly, you need to simply transpose M

> before you start the run and that way

> you will get the other singular vectors.

>

> May I ask what is the problem you are working on and why do you need the

> singular vectors?

> Can you consider using another matrix decomposition technique for example

> alternating least squares

> which gives you two lower rank matrices which simulates the large

> decomposed

> matrix?

>

> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

> > Hi Danny!

> >

> > I understand that for M*M' (and for M'*M) the left and right

> > eigenvectors are identical. But that is not exactly what I want. The

> > lanczos solver from mahout gives me the eigenvectors of M*M', which

> > are the left singular vectors of M. But I need the right singular

> > vectors of M (and not M*M'). How do I get them?

> >

> > Sorry, my matrix math is not as good as it should be, but I hope you

> > can help me!

> >

> > Thanks,

> > Stefan

> >

> > 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > > Hi Stefan!

> > > For a positive semidefinite matrix, the left and right eigenvectors are

> > > identical.

> > > See SVD Wikipedia text: When *M* is also positive

> > > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

> > > the decomposition *M* = *U**D**U* * is also a singular value

> > decomposition.

> > > So you don't need to be worried about the other singular vectors.

> > >

> > > Hope this helps!

> > >

> > > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

> > wrote:

> > >

> > >> Hi.

> > >>

> > >> Thanks for the help.

> > >>

> > >> The important points from wikipedia are:

> > >> - The left singular vectors of M are eigenvectors of M*M' .

> > >> - The right singular vectors of M are eigenvectors of M'*M.

> > >>

> > >> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

> > >> it does A=M*M', but that is not a problem). Therefore it already

> > >> calculates the right (or left) singular vectors of M.

> > >>

> > >> But my question is, how can I get the other singular vector? I can

> > >> transpose M, but then I have to calculated two SVDs, one for the right

> > >> and one for the left singular value... I think there is a better way

> > >> :)

> > >>

> > >> Hope you can help me with this...

> > >> Thanks

> > >> Stefan

> > >>

> > >>

> > >> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > >> > Hi

> > >> > Mahout SVD implementation computes the Lanczos iteration:

> > >> > http://en.wikipedia.org/wiki/Lanczos_algorithm

> > >> > Denote the non-square input matrix as M. First a symmetric matrix A

> is

> > >> > computed by A=M'*M

> > >> > Then an approximating tridiagonal matrix T and a vector matrix V are

> > >> > computed such that A =~ V*T*V'

> > >> > (this process is done in a distributed way).

> > >> >

> > >> > Next the matrix T is decomposed into eigenvectors and

> > eigenvalues.

> > >> > Which is the returned result. (This process

> > >> > is serial).

> > >> >

> > >> > The third step makes the returned eigenvectors orthogonal to each

> > other

> > >> > (which is optional IMHO).

> > >> >

> > >> > The heart of the code is found at:

> > >> >

> > >>

> >

> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

> > >> > At least that is where it was in version 0.4 I am not sure if there

> > are

> > >> > changes in version 0.5

> > >> >

> > >> > Anyway, Mahout does not compute directly SVD. If you are interested

> in

> > >> > learning more about the relation to SVD

> > >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,

> > >> > subsection: relation to eigenvalue decomposition.

> > >> >

> > >> > Hope this helps,

Yeah :) That's what I assumed. So this problem is solved. Thanks!

And now your questions:

I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

Document-Term-Matrix and reduce it using Lanczos. So I get the

Document-Concept-Vectors. These Vectors will be compared with cosine

similarity to find similar documents. I can also cluster these

documents with k-means...

If you have any suggestions, feel free to tell me. I'm also interested

in other document-similarity-techniques.

Cheers,

Stefan
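The comparison step in the plan above can be sketched in numpy (the vectors here are made-up stand-ins for real doc-concept vectors): normalize each document vector, then a single matrix product gives all pairwise cosine similarities.

```python
import numpy as np

# Illustrative doc-concept vectors (rows = documents, columns = concepts),
# i.e. the reduced output of an LSA run; the numbers are invented.
docs = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.2, 0.1],
                 [0.0, 0.1, 0.9]])

# Cosine similarity: normalize rows to unit length, then dot products.
norms = np.linalg.norm(docs, axis=1, keepdims=True)
unit = docs / norms
sim = unit @ unit.T

# Documents 0 and 1 point in nearly the same concept direction; 0 and 2 do not.
assert sim[0, 1] > 0.9
assert sim[0, 2] < 0.3
```

The same normalized vectors can be handed to k-means; with unit-length rows, Euclidean distance and cosine distance rank neighbors identically.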

2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> If I understand your question correctly, you need to simply transpose M

> before you start the run and that way

> you will get the other singular vectors.

>

> May I ask what is the problem you are working on and why do you need the

> singular vectors?

> Can you consider using another matrix decomposition technique for example

> alternating least squares

> which gives you two lower rank matrices which simulates the large decomposed

> matrix?

>

> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>

>> Hi Danny!

>>

>> I understand that for M*M' (and for M'*M) the left and right

>> eigenvectors are identical. But that is not exactly what I want. The

>> lanczos solver from mahout gives me the eigenvectors of M*M', which

>> are the left singular vectors of M. But I need the right singular

>> vectors of M (and not M*M'). How do I get them?

>>

>> Sorry, my matrix math is not as good as it should be, but I hope you

>> can help me!

>>

>> Thanks,

>> Stefan

>>

>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>> > Hi Stefan!

>> > For a positive semidefinite matrix, the left and right eigenvectors are

>> > identical.

>> > See SVD Wikipedia text: When *M* is also positive

>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

>> > the decomposition *M* = *U**D**U* * is also a singular value

>> decomposition.

>> > So you don't need to be worried about the other singular vectors.

>> >

>> > Hope this helps!

>> >

>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

>> wrote:

>> >

>> >> Hi.

>> >>

>> >> Thanks for the help.

>> >>

>> >> The important points from wikipedia are:

>> >> - The left singular vectors of M are eigenvectors of M*M' .

>> >> - The right singular vectors of M are eigenvectors of M'*M.

>> >>

>> >> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

>> >> it does A=M*M', but that is not a problem). Therefore it already

>> >> calculates the right (or left) singular vectors of M.

>> >>

>> >> But my question is, how can I get the other singular vector? I can

>> >> transpose M, but then I have to calculated two SVDs, one for the right

>> >> and one for the left singular value... I think there is a better way

>> >> :)

>> >>

>> >> Hope you can help me with this...

>> >> Thanks

>> >> Stefan

>> >>

>> >>

>> >> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>> >> > Hi

>> >> > Mahout SVD implementation computes the Lanczos iteration:

>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm

>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is

>> >> > computed by A=M'*M

>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are

>> >> > computed such that A =~ V*T*V'

>> >> > (this process is done in a distributed way).

>> >> >

>> >> > Next the matrix T is decomposed into eigenvectors and

>> eigenvalues.

>> >> > Which is the returned result. (This process

>> >> > is serial).

>> >> >

>> >> > The third step makes the returned eigenvectors orthogonal to each

>> other

>> >> > (which is optional IMHO).

>> >> >

>> >> > The heart of the code is found at:

>> >> >

>> >>

>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java

>> >> > At least that is where it was in version 0.4 I am not sure if there

>> are

>> >> > changes in version 0.5

>> >> >

>> >> > Anyway, Mahout does not compute directly SVD. If you are interested in

>> >> > learning more about the relation to SVD

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270


One thing to watch out for in clustering SVD stuff is eigenspokes.

www.cs.cmu.edu/~badityap/papers/*eigenspokes*-pakdd10.pdf

On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Yeah :) That's what I assumed. So this problem is solved. Thanks!

>

> And now your questions:

>

> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

> Document-Term-Matrix and reduce it using Lanczos. So I get the

> Document-Concept-Vectors. These Vectors will be compared with cosine

> similarity to find similar documents. I can also cluster these

> documents with k-means...

>

> If you have any suggestions, feel free to tell me. I'm also interested

> in other document-similarity-techniques.

>

> Cheers,

> Stefan

>

> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> > If I understand your question correctly, you need to simply transpose M

> > before you start the run and that way

> > you will get the other singular vectors.

> >

> > May I ask what is the problem you are working on and why do you need the

> > singular vectors?

> > Can you consider using another matrix decomposition technique for example

> > alternating least squares

> > which gives you two lower rank matrices which simulates the large

> decomposed

> > matrix?

> >

> > On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]>

> wrote:

> >

> >> Hi Danny!

> >>

> >> I understand that for M*M' (and for M'*M) the left and right

> >> eigenvectors are identical. But that is not exactly what I want. The

> >> lanczos solver from mahout gives me the eigenvectors of M*M', which

> >> are the left singular vectors of M. But I need the right singular

> >> vectors of M (and not M*M'). How do I get them?

> >>

> >> Sorry, my matrix math is not as good as it should be, but I hope you

> >> can help me!

> >>

> >> Thanks,

> >> Stefan

> >>

> >> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> >> > Hi Stefan!

> >> > For a positive semidefinite matrix, the left and right eigenvectors

> are

> >> > identical.

> >> > See SVD Wikipedia text: When *M* is also positive

> >> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

> >> > the decomposition *M* = *U**D**U* * is also a singular value

> >> decomposition.

> >> > So you don't need to be worried about the other singular vectors.

> >> >

> >> > Hope this helps!

> >> >

> >> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

> >> wrote:

> >> >

> >> >> Hi.

> >> >>

> >> >> Thanks for the help.

> >> >>

> >> >> The important points from wikipedia are:

> >> >> - The left singular vectors of M are eigenvectors of M*M' .

> >> >> - The right singular vectors of M are eigenvectors of M'*M.

> >> >>

> >> >> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

> >> >> it does A=M*M', but that is not a problem). Therefore it already

> >> >> calculates the right (or left) singular vectors of M.

> >> >>

> >> >> But my question is, how can I get the other singular vector? I can

> >> >> transpose M, but then I have to calculated two SVDs, one for the

> right

> >> >> and one for the left singular value... I think there is a better way

> >> >> :)

> >> >>

> >> >> Hope you can help me with this...

> >> >> Thanks

> >> >> Stefan

> >> >>

> >> >>

> >> >> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

> >> >> > Hi

> >> >> > Mahout SVD implementation computes the Lanczos iteration:

> >> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm

> >> >> > Denote the non-square input matrix as M. First a symmetric matrix A

> is

> >> >> > computed by A=M'*M

> >> >> > Then an approximating tridiagonal matrix T and a vector matrix V

> are

> >> >> > computed such that A =~ V*T*V'

> >> >> > (this process is done in a distributed way).

> >> >> >

> >> >> > Next the matrix T is decomposed into eigenvectors and

> >> eigenvalues.

> >> >> > Which is the returned result. (This process

> >> >> > is serial).

> >> >> >

> >> >> > The third step makes the returned eigenvectors orthogonal to each

> >> other


I also do LSI/LSA and wrote a stochastic SVD that is capable of

producing exact U, V and Sigma,

or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

On top of that, it is adjusted to be LSI-friendly if your keys are not

necessarily DoubleWritable (you can still keep your original document

paths or whatever they are identified with), that info is migrated as

keys of the output of U matrix. So you can run it directly on the

output of the Mahout's seq2sparse (I actually tested that on reuters

dataset example from Mahout In Action).

The computation has a stochastic noise to it, but LSI problems are

never exact problems anyway and their results are subject to corpus

selection.

If you are interested in helping me work the kinks out in Mahout's

version, I'd be grateful, since I don't run the method on 0.20.2 but on

CDH in a customized way. But be warned that it may require a number of

patches before it works for you.

Here is (a little bit too wordy) command line manual for Mahout 0.5.

http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

Thanks.

-D
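For readers wondering what "stochastic" means here, the following is a toy numpy sketch of the randomized-projection idea such methods build on; it is an outline of the general technique, not Mahout's SSVD implementation. Project the matrix onto a small random subspace, decompose the small projected problem exactly, then lift the factors back.

```python
import numpy as np

def randomized_svd(M, k, p=10, seed=0):
    """Toy sketch of randomized (stochastic) SVD: k = target rank,
    p = oversampling. Not Mahout's implementation."""
    rng = np.random.default_rng(seed)
    # A random projection captures the dominant column space of M.
    omega = rng.standard_normal((M.shape[1], k + p))
    Q, _ = np.linalg.qr(M @ omega)           # orthonormal basis, k+p columns
    # Solve the small (k+p)-row problem exactly, then lift U back.
    B = Q.T @ M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
M = rng.random((200, 50))
U, s, Vt = randomized_svd(M, k=5)
exact = np.linalg.svd(M, compute_uv=False)
# The projection can only shrink singular values, and the dominant one
# is recovered very closely despite the randomness.
assert np.all(s <= exact[:5] + 1e-8)
assert s[0] > 0.99 * exact[0]
```

This is why the output carries "a stochastic noise to it": the random omega changes the subspace slightly from run to run, which matters little for LSI since the corpus itself dominates the variability.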

On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Yeah :) That's what I assumed. So this problem is solved. Thanks!

>

> And now your questions:

>

> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

> Document-Term-Matrix and reduce it using Lanczos. So I get the

> Document-Concept-Vectors. These Vectors will be compared with cosine

> similarity to find similar documents. I can also cluster these

> documents with k-means...

>

> If you have any suggestions, feel free to tell me. I'm also interested

> in other document-similarity-techniques.

>

> Cheers,

> Stefan

>

> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>> If I understand your question correctly, you need to simply transpose M

>> before you start the run and that way

>> you will get the other singular vectors.

>>

>> May I ask what is the problem you are working on and why do you need the

>> singular vectors?

>> Can you consider using another matrix decomposition technique for example

>> alternating least squares

>> which gives you two lower rank matrices which simulates the large decomposed

>> matrix?

>>

>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>

>>> Hi Danny!

>>>

>>> I understand that for M*M' (and for M'*M) the left and right

>>> eigenvectors are identical. But that is not exactly what I want. The

>>> lanczos solver from mahout gives me the eigenvectors of M*M', which

>>> are the left singular vectors of M. But I need the right singular

>>> vectors of M (and not M*M'). How do I get them?

>>>

>>> Sorry, my matrix math is not as good as it should be, but I hope you

>>> can help me!

>>>

>>> Thanks,

>>> Stefan

>>>

>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>> > Hi Stefan!

>>> > For a positive semidefinite matrix, the left and right eigenvectors are

>>> > identical.

>>> > See SVD Wikipedia text: When *M* is also positive

>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

>>> > the decomposition *M* = *U**D**U* * is also a singular value

>>> decomposition.

>>> > So you don't need to be worried about the other singular vectors.

>>> >

>>> > Hope this helps!

>>> >

>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

>>> wrote:

>>> >

>>> >> Hi.

>>> >>

>>> >> Thanks for the help.

>>> >>

>>> >> The important points from wikipedia are:

>>> >> - The left singular vectors of M are eigenvectors of M*M' .

>>> >> - The right singular vectors of M are eigenvectors of M'*M.

>>> >>

>>> >> as you describe, the Mahout Lanczos solver calculates A=M'*M (I think

>>> >> it does A=M*M', but that is not a problem). Therefore it already

>>> >> calculates the right (or left) singular vectors of M.

>>> >>

>>> >> But my question is, how can I get the other singular vector? I can

>>> >> transpose M, but then I have to calculated two SVDs, one for the right

>>> >> and one for the left singular value... I think there is a better way


Hmm... looks nice...

So there is a Lanczos implementation of SVD and the stochastic version

of SVD. Both produce the doc-concept vectors that I need.

I actually get my TFIDF vectors directly from a Lucene index (and have

to do some magic to get IntWritable, VectorWritable).

Still, I do not exactly understand what you're trying to say. Your SSVD

does not run with Mahout but on CDH (what is this, btw)? Or is it

available for mahout? So what has to be modified to run it with

mahout?

Thanks

Stefan

2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

> I also do LSI/LSA and wrote a stochastic svd that is capable to

> produce exact U, V and Sigma ,

> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>

> On top of that, it is adjusted to be LSI-friendly if your keys are not

> necessarily DoubleWritable (you can still keep your original document

> paths or whatever they are identified with), that info is migrated as

> keys of the output of U matrix. So you can run it directly on the

> output of the Mahout's seq2sparse (I actually tested that on reuters

> dataset example from Mahout In Action).

>

> The computation has a stochastic noise to it, but LSI problems are

> never exact problems anyway and their result are subject to corpus

> selection.

>

> If you are interested to help me to work kinks out in Mahout's

> version, I'd be grateful since i don't run the method on 0.20.2 but on

> CDH in a customized way. But be warned that it may require a number of

> patches before it works for you.

>

> Here is (a little bit too wordy) command line manual for Mahout 0.5.

> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

>

> Thanks.

>

> -D

>

>

> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>> Yeah :) That's what I assumed. So this problem is solved. Thanks!

>>

>> And now your questions:

>>

>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

>> Document-Term-Matrix and reduce it using Lanczos. So I get the

>> Document-Concept-Vectors. These Vectors will be compared with cosine

>> similarity to find similar documents. I can also cluster these

>> documents with k-means...

>>

>> If you have any suggestions, feel free to tell me. I'm also interested

>> in other document-similarity-techniques.

>>

>> Cheers,

>> Stefan

>>

>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>> If I understand your question correctly, you need to simply transpose M

>>> before you start the run and that way

>>> you will get the other singular vectors.

>>>

>>> May I ask what is the problem you are working on and why do you need the

>>> singular vectors?

>>> Can you consider using another matrix decomposition technique for example

>>> alternating least squares

>>> which gives you two lower-rank matrices whose product approximates the large decomposed

>>> matrix?

>>>

>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>

>>>> Hi Danny!

>>>>

>>>> I understand that for M*M' (and for M'*M) the left and right

>>>> eigenvectors are identical. But that is not exactly what I want. The

>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which

>>>> are the left singular vectors of M. But I need the right singular

>>>> vectors of M (and not M*M'). How do I get them?

>>>>

>>>> Sorry, my matrix math is not as good as it should be, but I hope you

>>>> can help me!

>>>>

>>>> Thanks,

>>>> Stefan

>>>>

>>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>>> > Hi Stefan!

>>>> > For a positive semidefinite matrix, the left and right eigenvectors are

>>>> > identical.

>>>> > See the SVD Wikipedia text: When *M* is also positive

>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,

>>>> > the decomposition M = U D U* is also a singular value

>>>> decomposition.

>>>> > So you don't need to be worried about the other singular vectors.

>>>> >

>>>> > Hope this helps!

>>>> >

>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <[EMAIL PROTECTED]>

>>>> wrote:

>>>> >

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270
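[Editor's note] The relationship debated above (Lanczos on M*M' yields the left singular vectors of M; the right ones follow from V = M' U Sigma^-1, or from running on the transposed matrix as Danny suggests) can be checked in a small NumPy sketch. This is illustrative code only, not Mahout's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))  # toy "documents x terms" matrix

# Reference decomposition: M = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigen-decomposing M*M' (what the Lanczos solver effectively does)
# yields the LEFT singular vectors of M; the eigenvalues are the
# squared singular values.
evals = np.linalg.eigvalsh(M @ M.T)
assert np.allclose(np.sort(evals)[::-1][:4], s**2)

# The RIGHT singular vectors can be recovered without a second run:
# V = M' * U * Sigma^-1
V = M.T @ U @ np.diag(1.0 / s)
assert np.allclose(V, Vt.T)
print("right singular vectors recovered")
```

Transposing M before the run, as suggested in the thread, is equivalent: the eigenvectors of M'*M are exactly these right singular vectors.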


I am saying I did not test it with 0.20.2

Yes it is integrated in 0.5 release but there might be problems with

hadoop 0.20.2

On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Hmm... looks nice...

>

> So there is a Lanczos implementation of SVD and the stochastic version

> of SVD. Both produce the doc-concept vectors that I need.

> I actually get my TFIDF vectors directly from a Lucene index (and have

> to do some magic to get IntWritable, VectorWritable).

>

> Still, I do not exactly understand what you are trying to say. Your SSVD

> does not run with Mahout but on CDH (what is this, btw?)? Or is it

> available for Mahout? So what has to be modified to run it with

> mahout?

>

> Thanks

> Stefan

>

>

>

> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>> I also do LSI/LSA and wrote a stochastic SVD that is capable of

>> producing exact U, V and Sigma,

>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>

>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>> necessarily DoubleWritable (you can still keep your original document

>> paths or whatever they are identified with); that info is migrated as

>> keys of the output of the U matrix. So you can run it directly on the

>> output of Mahout's seq2sparse (I actually tested that on the Reuters

>> dataset example from Mahout In Action).

>>

>> The computation has stochastic noise in it, but LSI problems are

>> never exact problems anyway and their results are subject to corpus

>> selection.

>>

>> If you are interested in helping me work the kinks out in Mahout's

>> version, I'd be grateful, since I don't run the method on 0.20.2 but on

>> CDH in a customized way. But be warned that it may require a number of

>> patches before it works for you.

>>

>> Here is (a little bit too wordy) command line manual for Mahout 0.5.

>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

>>

>> Thanks.

>>

>> -D

>>

>>

>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!

>>>

>>> And now your questions:

>>>

>>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

>>> Document-Term-Matrix and reduce it using Lanczos. So I get the

>>> Document-Concept-Vectors. These vectors will be compared with cosine

>>> similarity to find similar documents. I can also cluster these

>>> documents with k-means...

>>>

>>> If you have any suggestions, feel free to tell me. I'm also interested

>>> in other document-similarity-techniques.

>>>

>>> Cheers,

>>> Stefan

>>>

>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>>> If I understand your question correctly, you need to simply transpose M

>>>> before you start the run and that way

>>>> you will get the other singular vectors.

>>>>

>>>> May I ask what is the problem you are working on and why do you need the

>>>> singular vectors?

>>>> Can you consider using another matrix decomposition technique for example

>>>> alternating least squares

>>>> which gives you two lower-rank matrices whose product approximates the large decomposed

>>>> matrix?

>>>>

>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>

>>>>> Hi Danny!

>>>>>

>>>>> I understand that for M*M' (and for M'*M) the left and right

>>>>> eigenvectors are identical. But that is not exactly what I want. The

>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which

>>>>> are the left singular vectors of M. But I need the right singular

>>>>> vectors of M (and not M*M'). How do I get them?

>>>>>

>>>>> Sorry, my matrix math is not as good as it should be, but I hope you

>>>>> can help me!

>>>>>

>>>>> Thanks,

>>>>> Stefan

>>>>>

>>>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>>>> > Hi Stefan!

>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are

>>>>> > identical.

>>>>> > See the SVD Wikipedia text: When *M* is also positive

>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
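[Editor's note] Stefan's pipeline (TFIDF matrix, rank-k reduction, then cosine similarity between document-concept vectors, optionally k-means on top) can be sketched in a few lines of NumPy. The toy matrix, `k`, and the helper name are illustrative assumptions, not Mahout API:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two document-concept vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
tfidf = rng.random((8, 20))     # stand-in for a tf-idf docs x terms matrix

# Truncated SVD: keep the top-k "concepts".
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 3
doc_concept = U[:, :k] * s[:k]  # document-concept vectors (U_k * Sigma_k)

# Compare two documents in concept space instead of raw term space.
sim = cosine(doc_concept[0], doc_concept[1])
assert -1.0 <= sim <= 1.0
print(f"doc0 vs doc1 similarity: {sim:.3f}")
```

The same `doc_concept` rows are what one would feed into k-means for clustering.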


Before I rewrite my program, is there any advantage over the Lanczos SVD?

2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

> I am saying i did not test it with 0.20.2

>

> Yes it is integrated in 0.5 release but there might be problems with

> hadoop 0.20.2

>

> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>> Hmm... looks nice...

>>

>> So there is a Lanczos implementation of SVD and the stochastic version

>> of SVD. Both produce the doc-concept vectors that I need.

>> I actually get my TFIDF vectors directly from a Lucene index (and have

>> to do some magic to get IntWritable, VectorWritable).

>>

>> Still, I do not exactly understand what you are trying to say. Your SSVD

>> does not run with Mahout but on CDH (what is this, btw?)? Or is it

>> available for Mahout? So what has to be modified to run it with

>> mahout?

>>

>> Thanks

>> Stefan

>>

>>

>>

>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>> I also do LSI/LSA and wrote a stochastic SVD that is capable of

>>> producing exact U, V and Sigma,

>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>>

>>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>>> necessarily DoubleWritable (you can still keep your original document

>>> paths or whatever they are identified with); that info is migrated as

>>> keys of the output of the U matrix. So you can run it directly on the

>>> output of Mahout's seq2sparse (I actually tested that on the Reuters

>>> dataset example from Mahout In Action).

>>>

>>> The computation has stochastic noise in it, but LSI problems are

>>> never exact problems anyway and their results are subject to corpus

>>> selection.

>>>

>>> If you are interested in helping me work the kinks out in Mahout's

>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on

>>> CDH in a customized way. But be warned that it may require a number of

>>> patches before it works for you.

>>>

>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.

>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

>>>

>>> Thanks.

>>>

>>> -D

>>>

>>>

>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!

>>>>

>>>> And now your questions:

>>>>

>>>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the

>>>> Document-Concept-Vectors. These vectors will be compared with cosine

>>>> similarity to find similar documents. I can also cluster these

>>>> documents with k-means...

>>>>

>>>> If you have any suggestions, feel free to tell me. I'm also interested

>>>> in other document-similarity-techniques.

>>>>

>>>> Cheers,

>>>> Stefan

>>>>

>>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>>>> If I understand your question correctly, you need to simply transpose M

>>>>> before you start the run and that way

>>>>> you will get the other singular vectors.

>>>>>

>>>>> May I ask what is the problem you are working on and why do you need the

>>>>> singular vectors?

>>>>> Can you consider using another matrix decomposition technique for example

>>>>> alternating least squares

>>>>> which gives you two lower-rank matrices whose product approximates the large decomposed

>>>>> matrix?

>>>>>

>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>>

>>>>>> Hi Danny!

>>>>>>

>>>>>> I understand that for M*M' (and for M'*M) the left and right

>>>>>> eigenvectors are identical. But that is not exactly what I want. The

>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which

>>>>>> are the left singular vectors of M. But I need the right singular

>>>>>> vectors of M (and not M*M'). How do I get them?

>>>>>>

>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you

>>>>>> can help me!

>>>>>>

>>>>>> Thanks,

>>>>>> Stefan

>>>>>>

>>>>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>>>>> > Hi Stefan!

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270


You can check what I summed up in the javadoc for SSVDSolver to get some idea.

But on top of what I have already mentioned, the main difference is that

it presumably doesn't require as many resources (the Reuters dataset runs

through with the default -Xmx200m per-task setting just fine) and the

theoretical flop count is quite a bit lower than that of Lanczos, which

probably means it should rip through the same volume faster. (My practical

run showed that it takes about the same time as the seq2sparse job that

produced its input, and that was before the sparse-vector bug fix, which

had caused a significant slowdown on sparse data.)

I don't think you should 'modify' your program, especially if you need

to show a practical result within a week. But you might help me by

creating a version of your program that uses it, and we can see

together what it takes to get it working for you.

-d

On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Before I rewrite my program, is there any advantage over the lanczos svd?

>

> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>> I am saying i did not test it with 0.20.2

>>

>> Yes it is integrated in 0.5 release but there might be problems with

>> hadoop 0.20.2

>>

>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> Hmm... looks nice...

>>>

>>> So there is a Lanczos implementation of SVD and the stochastic version

>>> of SVD. Both produce the doc-concept vectors that I need.

>>> I actually get my TFIDF vectors directly from a Lucene index (and have

>>> to do some magic to get IntWritable, VectorWritable).

>>>

>>> Still, I do not exactly understand what you are trying to say. Your SSVD

>>> does not run with Mahout but on CDH (what is this, btw?)? Or is it

>>> available for Mahout? So what has to be modified to run it with

>>> mahout?

>>>

>>> Thanks

>>> Stefan

>>>

>>>

>>>

>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>> I also do LSI/LSA and wrote a stochastic SVD that is capable of

>>>> producing exact U, V and Sigma,

>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>>>

>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>>>> necessarily DoubleWritable (you can still keep your original document

>>>> paths or whatever they are identified with); that info is migrated as

>>>> keys of the output of the U matrix. So you can run it directly on the

>>>> output of Mahout's seq2sparse (I actually tested that on the Reuters

>>>> dataset example from Mahout In Action).

>>>>

>>>> The computation has stochastic noise in it, but LSI problems are

>>>> never exact problems anyway and their results are subject to corpus

>>>> selection.

>>>>

>>>> If you are interested in helping me work the kinks out in Mahout's

>>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on

>>>> CDH in a customized way. But be warned that it may require a number of

>>>> patches before it works for you.

>>>>

>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.

>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

>>>>

>>>> Thanks.

>>>>

>>>> -D

>>>>

>>>>

>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!

>>>>>

>>>>> And now your questions:

>>>>>

>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF

>>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the

>>>>> Document-Concept-Vectors. These vectors will be compared with cosine

>>>>> similarity to find similar documents. I can also cluster these

>>>>> documents with k-means...

>>>>>

>>>>> If you have any suggestions, feel free to tell me. I'm also interested

>>>>> in other document-similarity-techniques.

>>>>>

>>>>> Cheers,

>>>>> Stefan

>>>>>

>>>>> 2011/6/6 Danny Bickson <[EMAIL PROTECTED]>:

>>>>>> If I understand your question correctly, you need to simply transpose M

>>>>>> before you start the run and that way

>>>>>> you will get the other singular vectors.

>>>
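[Editor's note] For readers unfamiliar with the stochastic SVD family of methods Dmitriy refers to, the core idea (random range sampling, QR, then an exact SVD of the small projected matrix) can be sketched as follows. This is a generic randomized-SVD illustration in the style of Halko/Martinsson/Tropp, not the actual SSVDSolver code:

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Sketch of a stochastic SVD: sample the range of A with a random
    (k+p)-column projection, orthonormalize it, and solve the small
    (k+p) x n problem exactly."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + p))
    Q, _ = np.linalg.qr(A @ Omega)   # orthonormal basis for the sampled range
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# On an (exactly) low-rank matrix the sketch recovers the spectrum.
rng = np.random.default_rng(2)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 40))
U, s, Vt = randomized_svd(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
assert np.allclose(s, s_exact)
print("top-5 singular values recovered")
```

The expensive passes over A are just matrix products, which is why the distributed version can get away with modest per-task memory.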


The tradeoff is of course the stochastic noise (Lanczos would be

significantly more precise, especially on random data, i.e. data without

clear trends; but then why would you try to find trends in random

data?), but you get to manage precision by increasing the p parameter at

the expense of more memory and flops.

(The Reuters example I ran was done with k+p=200 singular values, of

which LSI should probably use no more than k=50.)
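[Editor's note] The precision-versus-cost trade controlled by p can be demonstrated numerically. Using nested random sampling subspaces makes the comparison deterministic (the approximate singular values can only improve as the subspace grows). Again an illustrative NumPy sketch, not Mahout's SSVD:

```python
import numpy as np

def approx_singvals(A, k, Omega):
    """Top-k singular values of A compressed onto the subspace sampled by Omega."""
    Q, _ = np.linalg.qr(A @ Omega)
    return np.linalg.svd(Q.T @ A, compute_uv=False)[:k]

rng = np.random.default_rng(3)
# Matrix with a clearly decaying spectrum ("trends", not pure noise).
U, _ = np.linalg.qr(rng.standard_normal((200, 50)))
V, _ = np.linalg.qr(rng.standard_normal((80, 50)))
A = U @ np.diag(1.0 / np.arange(1.0, 51.0)) @ V.T

k = 10
exact = np.linalg.svd(A, compute_uv=False)[:k]
Omega = rng.standard_normal((80, k + 25))  # widest sampling matrix
err_p2 = np.linalg.norm(approx_singvals(A, k, Omega[:, :k + 2]) - exact)
err_p25 = np.linalg.norm(approx_singvals(A, k, Omega) - exact)

# Larger p buys precision at the expense of memory and flops.
assert err_p25 <= err_p2 + 1e-12
print(f"err(p=2)={err_p2:.2e}  err(p=25)={err_p25:.2e}")
```

Because the p=2 subspace is contained in the p=25 subspace here, the compressed singular values interlace, so the larger-p error is never worse.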

On Tue, Jun 7, 2011 at 1:08 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> You can check what I summed up in the javadoc for SSVDSolver to get some idea.

>

> But on top of what I have already mentioned, the main difference is that

> it presumably doesn't require as many resources (the Reuters dataset runs

> through with the default -Xmx200m per-task setting just fine) and the

> theoretical flop count is quite a bit lower than that of Lanczos, which

> probably means it should rip through the same volume faster. (My practical

> run showed that it takes about the same time as the seq2sparse job that

> produced its input, and that was before the sparse-vector bug fix, which

> had caused a significant slowdown on sparse data.)

>

> I don't think you should 'modify' your program, especially if you need

> to show a practical result within a week. But you might help me by

> creating a version of your program that uses it, and we can see

> together what it takes to get it working for you.

>

> -d

>

> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>> Before I rewrite my program, is there any advantage over the Lanczos SVD?

>>

>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>> I am saying i did not test it with 0.20.2

>>>

>>> Yes it is integrated in 0.5 release but there might be problems with

>>> hadoop 0.20.2

>>>

>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>> Hmm... looks nice...

>>>>

>>>> So there is a Lanczos implementation of SVD and the stochastic version

>>>> of SVD. Both produce the doc-concept vectors that I need.

>>>> I acctualy get my TFIDF Vectors directly from a lucene index (and have

>>>> to do some magic to get IntWritable, VectorWritable).

>>>>

>>>> Still, I do not exactly understand what you trying to say. Your SSVD

>>>> does not run with mahout but on CDH (what is this btw?)? Or is it

>>>> available for mahout? So what has to be modified to run it with

>>>> mahout?

>>>>

>>>> Thanks

>>>> Stefan

>>>>

>>>>

>>>>

>>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable to

>>>>> produce exact U, V and Sigma ,

>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>>>>

>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>>>>> necessarily DoubleWritable (you can still keep your original document

>>>>> paths or whatever they are identified with), that info is migrated as

>>>>> keys of the output of U matrix. So you can run it directly on the

>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters

>>>>> dataset example from Mahout In Action).

>>>>>

>>>>> The computation has a stochastic noise to it, but LSI problems are

>>>>> never exact problems anyway and their result are subject to corpus

>>>>> selection.

>>>>>

>>>>> If you are interested to help me to work kinks out in Mahout's

>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on

>>>>> CDH in a customized way. But be warned that it may require a number of

>>>>> patches before it works for you.

>>>>>

>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.

>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

>>>>>

>>>>> Thanks.

>>>>>

>>>>> -D

>>>>>

>>>>>

>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>>> Yeha :) That's what I assumed. So this problem is solved. Thanks!

>>>>>>

>>>>>> And now your questions:

>>>>>>

>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF

>>>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the

PPS: another non-distributed implementation of this is redsvd (on

Google Code), which claims amazing benchmark speed but apparently

hits a memory ceiling at some point; I don't think I saw a test there

larger than the equivalent of a 10k x 10k dense matrix.

On Tue, Jun 7, 2011 at 1:14 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

> The tradoff is of course the stochastic noise (lanczos would be

> significantly more precise, esp. on random data (i.e. without clear

> trends), but then why would you try to figure trends on a random

> data?), but you get to manage precision by increasing p parameter at

> expense of more memory and flops.

>

> (the reuters example i ran was done on k+p=200 singular values of

> which LSI probably should use no more than k=50).

>

> On Tue, Jun 7, 2011 at 1:08 PM, Dmitriy Lyubimov <[EMAIL PROTECTED]> wrote:

>> You can check what i summed up in javadoc for SSVDSolver to get some idea.

>>

>> but on top what i have already mentioned, the main difference is that

>> it presumably doesn't require as much resources (reuters dataset runs

>> thru with default -Xmx200m per task setting just fine) and theoretical

>> flops are quite less than that of Lanczos which probably means it

>> should rip thru the same volume faster. (my practical run showed that

>> it takes about the same time as seq2sparse that produced it's input,

>> and that was before sparse vector bug fix which caused significant

>> slow down on sparse data).

>>

>> I don't think you should 'modify' your progam, especially if you need

>> to show practical result within a week. but you might help me by

>> creating a version of your program that takes it and we can see

>> together what it takes to get it working for you.

>>

>> -d

>>

>> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> Before I rewrite my program, is there any advantage over the lanczos svd?

>>>

>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>> I am saying i did not test it with 0.20.2

>>>>

>>>> Yes it is integrated in 0.5 release but there might be problems with

>>>> hadoop 0.20.2

>>>>

>>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>> Hmm... looks nice...

>>>>>

>>>>> So there is a Lanczos implementation of SVD and the stochastic version

>>>>> of SVD. Both produce the doc-concept vectors that I need.

>>>>> I acctualy get my TFIDF Vectors directly from a lucene index (and have

>>>>> to do some magic to get IntWritable, VectorWritable).

>>>>>

>>>>> Still, I do not exactly understand what you trying to say. Your SSVD

>>>>> does not run with mahout but on CDH (what is this btw?)? Or is it

>>>>> available for mahout? So what has to be modified to run it with

>>>>> mahout?

>>>>>

>>>>> Thanks

>>>>> Stefan

>>>>>

>>>>>

>>>>>

>>>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable to

>>>>>> produce exact U, V and Sigma ,

>>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>>>>>

>>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>>>>>> necessarily DoubleWritable (you can still keep your original document

>>>>>> paths or whatever they are identified with), that info is migrated as

>>>>>> keys of the output of U matrix. So you can run it directly on the

>>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters

>>>>>> dataset example from Mahout In Action).

>>>>>>

>>>>>> The computation has a stochastic noise to it, but LSI problems are

>>>>>> never exact problems anyway and their result are subject to corpus

>>>>>> selection.

>>>>>>

>>>>>> If you are interested to help me to work kinks out in Mahout's

>>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on

>>>>>> CDH in a customized way. But be warned that it may require a number of

>>>>>> patches before it works for you.

>>>>>>

>>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.

>>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html


Hmm... I think it is no problem to implement my tool so that it works with

both versions.

My intention is to get document similarity using clustering or cosine

similarity (with dimension reduction via SVD beforehand).

So I think a less precise SVD is no problem if it is fast :)

In any case, you have described the parallelization process exactly, which

will help me a lot in writing this part of my master's thesis ;).

I will try to implement this tomorrow...

Thanks

Stefan
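The similarity step described above is straightforward once the SVD has produced reduced doc-concept vectors: pairwise document similarity is just cosine similarity between rows. A small NumPy sketch with made-up vectors (these are hypothetical, not Mahout output):

```python
import numpy as np

# Hypothetical reduced doc-concept vectors (one row per document),
# e.g. rows of U*Sigma from an SVD of the tf-idf matrix.
docs = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.8, 0.2, 0.1],   # doc 1: close to doc 0
    [0.0, 0.1, 0.9],   # doc 2: about something else
])

# Normalize rows to unit length, then a single matrix product
# yields the full pairwise cosine-similarity matrix.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
sim = unit @ unit.T
```

Here `sim[0, 1]` comes out high and `sim[0, 2]` low, so nearest neighbors in `sim` give the "similar documents"; the same reduced vectors can be fed directly to k-means for clustering.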

2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

> You can check what i summed up in javadoc for SSVDSolver to get some idea.

>

> but on top what i have already mentioned, the main difference is that

> it presumably doesn't require as much resources (reuters dataset runs

> thru with default -Xmx200m per task setting just fine) and theoretical

> flops are quite less than that of Lanczos which probably means it

> should rip thru the same volume faster. (my practical run showed that

> it takes about the same time as seq2sparse that produced it's input,

> and that was before sparse vector bug fix which caused significant

> slow down on sparse data).

>

> I don't think you should 'modify' your progam, especially if you need

> to show practical result within a week. but you might help me by

> creating a version of your program that takes it and we can see

> together what it takes to get it working for you.

>

> -d

>

> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>> Before I rewrite my program, is there any advantage over the lanczos svd?

>>

>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>> I am saying i did not test it with 0.20.2

>>>

>>> Yes it is integrated in 0.5 release but there might be problems with

>>> hadoop 0.20.2

>>>

>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>> Hmm... looks nice...

>>>>

>>>> So there is a Lanczos implementation of SVD and the stochastic version

>>>> of SVD. Both produce the doc-concept vectors that I need.

>>>> I acctualy get my TFIDF Vectors directly from a lucene index (and have

>>>> to do some magic to get IntWritable, VectorWritable).

>>>>

>>>> Still, I do not exactly understand what you trying to say. Your SSVD

>>>> does not run with mahout but on CDH (what is this btw?)? Or is it

>>>> available for mahout? So what has to be modified to run it with

>>>> mahout?

>>>>

>>>> Thanks

>>>> Stefan

>>>>

>>>>

>>>>

>>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable to

>>>>> produce exact U, V and Sigma ,

>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>>>>

>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>>>>> necessarily DoubleWritable (you can still keep your original document

>>>>> paths or whatever they are identified with), that info is migrated as

>>>>> keys of the output of U matrix. So you can run it directly on the

>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters

>>>>> dataset example from Mahout In Action).

>>>>>

>>>>> The computation has a stochastic noise to it, but LSI problems are

>>>>> never exact problems anyway and their result are subject to corpus

>>>>> selection.

>>>>>

>>>>> If you are interested to help me to work kinks out in Mahout's

>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on

>>>>> CDH in a customized way. But be warned that it may require a number of

>>>>> patches before it works for you.

>>>>>

>>>>> Here is (a little bit too wordy) command line manual for Mahout 0.5.

>>>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html

>>>>>

>>>>> Thanks.

>>>>>

>>>>> -D

>>>>>

>>>>>

>>>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>>> Yeha :) That's what I assumed. So this problem is solved. Thanks!

>>>>>>

>>>>>> And now your questions:

>>>>>>

>>>>>> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF

>>>>>> Document-Term-Matrix and reduce it using Lanczos. So I get the

Stefan Wienert

http://www.wienert.cc

[EMAIL PROTECTED]

Telefon: +495251-2026838

Mobil: +49176-40170270

CDH is the Cloudera distribution of Hadoop. The Hadoop people screwed

up their "forward motion" in API design, so now the Hadoop versions

are screwy. The Cloudera distribution is Hadoop 0.20.something plus

various bug fixes and features. Mahout runs on normal 0.20.something

and on Cloudera. Amazon Elastic Map-Reduce is also 0.20.whatsit.

Lance

On Tue, Jun 7, 2011 at 1:23 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

> Hmm... I think it is no problem to implement my tool so it works with

> both versions.

>

> My intention is to get document similarity using clustering or cosine

> similarity (and before, dimension reduction via svd).

>

> So I think there is no problem with a less precise svd if it is fast :)

>

> At least you have exactly described the parallelization process which

> would help me a lot writing this part of my master thesis ;).

>

> I will try to implement this tomorrow...

>

> Thanks

> Stefan

>

> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>> You can check what i summed up in javadoc for SSVDSolver to get some idea.

>>

>> but on top what i have already mentioned, the main difference is that

>> it presumably doesn't require as much resources (reuters dataset runs

>> thru with default -Xmx200m per task setting just fine) and theoretical

>> flops are quite less than that of Lanczos which probably means it

>> should rip thru the same volume faster. (my practical run showed that

>> it takes about the same time as seq2sparse that produced it's input,

>> and that was before sparse vector bug fix which caused significant

>> slow down on sparse data).

>>

>> I don't think you should 'modify' your progam, especially if you need

>> to show practical result within a week. but you might help me by

>> creating a version of your program that takes it and we can see

>> together what it takes to get it working for you.

>>

>> -d

>>

>> On Tue, Jun 7, 2011 at 1:01 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>> Before I rewrite my program, is there any advantage over the lanczos svd?

>>>

>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>> I am saying i did not test it with 0.20.2

>>>>

>>>> Yes it is integrated in 0.5 release but there might be problems with

>>>> hadoop 0.20.2

>>>>

>>>> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <[EMAIL PROTECTED]> wrote:

>>>>> Hmm... looks nice...

>>>>>

>>>>> So there is a Lanczos implementation of SVD and the stochastic version

>>>>> of SVD. Both produce the doc-concept vectors that I need.

>>>>> I acctualy get my TFIDF Vectors directly from a lucene index (and have

>>>>> to do some magic to get IntWritable, VectorWritable).

>>>>>

>>>>> Still, I do not exactly understand what you trying to say. Your SSVD

>>>>> does not run with mahout but on CDH (what is this btw?)? Or is it

>>>>> available for mahout? So what has to be modified to run it with

>>>>> mahout?

>>>>>

>>>>> Thanks

>>>>> Stefan

>>>>>

>>>>>

>>>>>

>>>>> 2011/6/7 Dmitriy Lyubimov <[EMAIL PROTECTED]>:

>>>>>> I also do LSI/LSA and wrote a stochastic svd that is capable to

>>>>>> produce exact U, V and Sigma ,

>>>>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.

>>>>>>

>>>>>> On top of that, it is adjusted to be LSI-friendly if your keys are not

>>>>>> necessarily DoubleWritable (you can still keep your original document

>>>>>> paths or whatever they are identified with), that info is migrated as

>>>>>> keys of the output of U matrix. So you can run it directly on the

>>>>>> output of the Mahout's seq2sparse (I actually tested that on reuters

>>>>>> dataset example from Mahout In Action).

>>>>>>

>>>>>> The computation has a stochastic noise to it, but LSI problems are

>>>>>> never exact problems anyway and their result are subject to corpus

>>>>>> selection.

>>>>>>

>>>>>> If you are interested to help me to work kinks out in Mahout's

>>>>>> version, I'd be grateful since i don't run the method on 0.20.2 but on

>>>>>> CDH in a customized way. But be warned that it may require a number of

>>>>>> patches before it works for you.

Lance Norskog

[EMAIL PROTECTED]


Hadoop is Hadoop, so I don't know that any roadmap inconsistency between CDH

and Hadoop is somehow Hadoop's fault.

I don't think it's that ambiguous. Mahout runs on 0.20.2; Amazon EMR runs

0.20.2. The latest Hadoop version is 0.20.203.0. CDH is indeed somewhere

in between, but that's CDH.

On Wed, Jun 8, 2011 at 2:46 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> CDH is the Cloudera distribution of Hadoop. The Hadoop people screwed

> up their "forward motion" in api design and so now the Hadoop versions

> are screwy. The Cloudera distribution is Hadoop 0.20.something plus

> various bug fixes and features. Mahout runs on normal 0.20.something

> and Cloudera. Amazon Elastic Map-Reduce is also 0.20.whatsit.

>

> Lance

>

>


I guess the only thing that creates such demand for CDH is the fact

that the Hadoop project twisted everybody's arm by deprecating the entire

MR API over what seem to be perceived OO design issues rather than

functional issues. Even that would've been OK if it weren't for the

fact that they did not provide a full replacement of the functionality

in the new API for well over a year, and 0.21, which should contain

most of the replacements, is recognized not to be production grade.

On Wed, Jun 8, 2011 at 12:27 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> Hadoop is Hadoop, so I don't know that any roadmap inconsistency between CDH

> and Hadoop is somehow Hadoop's fault.

>

> I don't think it's this ambiguous. Mahout runs on 0.20.2. Amazon EMR runs

> 0.20.2. The latest Hadoop version is 0.20.203.0. CDH is indeed somewhere

> in between, but that's CDH.

>

> On Wed, Jun 8, 2011 at 2:46 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:

>

>> CDH is the Cloudera distribution of Hadoop. The Hadoop people screwed

>> up their "forward motion" in API design and so now the Hadoop versions

>> are screwy. The Cloudera distribution is Hadoop 0.20.something plus

>> various bug fixes and features. Mahout runs on normal 0.20.something

>> and Cloudera. Amazon Elastic Map-Reduce is also 0.20.whatsit.

>>

>> Lance

>>

>>

>
