Eshwaran Vijaya Kumar

2011-06-03, 23:48

Jake Mannix

2011-06-04, 01:23

Dmitriy Lyubimov

2011-06-04, 01:26

Jake Mannix

2011-06-04, 01:46

Hector Yee

2011-06-04, 01:55

Eshwaran Vijaya Kumar

2011-06-04, 03:16

Ted Dunning

2011-06-06, 10:32

Eshwaran Vijaya Kumar

2011-06-07, 23:06

Ted Dunning

2011-06-08, 06:12

Eshwaran Vijaya Kumar

2011-06-08, 14:39

Ted Dunning

2011-06-08, 22:03

Hello all,

We are trying to build a clustering system which will have an SVD component. I believe Mahout has two SVD solvers: DistributedLanczosSolver and SSVD. Could someone give me some tips on which would be a better choice of solver, given that the size of the data will be roughly 100 million rows with each row having roughly 50K dimensions (100 million × 50,000)? We will be working with text data, so the resultant matrix should be relatively sparse to begin with.

Thanks
Eshwaran


With 50k columns, you're well within the "sweet spot" for traditional SVD via Lanczos, so give it a try.

SSVD will probably run faster, but you lose some information on what the singular vectors "mean". If you don't need this information, SSVD may be better for you.

What would be awesome for *us* is if you tried both and told us what you found, in terms of performance and relevance. :)

-jake



What you really need to worry about is probably not the number of dimensions, but the average number of non-zero elements per row (density). How dense is the data?
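Measuring the density Dmitriy asks about is cheap if the data is already in a compressed sparse format. A small SciPy sketch (a stand-in for the real term-document data, with assumed dimensions):

```python
# Sketch: estimating average non-zeros per row and overall density
# on a CSR matrix standing in for the 100M x 50k term-document data.
import numpy as np
from scipy.sparse import random as sparse_random

rows, cols = 1_000, 5_000                 # small stand-in dimensions
A = sparse_random(rows, cols, density=0.002, format="csr", random_state=0)

nnz_per_row = np.diff(A.indptr)           # non-zeros in each row of a CSR
print("avg non-zeros per row:", nnz_per_row.mean())
print("overall density:", A.nnz / (rows * cols))
```

For text data the average non-zeros per row is roughly the number of distinct terms per document, which is what actually drives the cost of the matrix products in either solver.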



The overall number of columns matters to Lanczos - too many and you run out of RAM.

-jake
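A back-of-envelope way to see why column count matters: a Lanczos-style solver keeps dense basis vectors with one double per column, so driver memory grows with columns × desired rank. The rank below is an assumed illustrative value, not a figure from the thread:

```python
# Rough arithmetic (illustration, not measured Mahout numbers):
# Lanczos keeps dense basis vectors of length = #columns,
# one per requested singular vector.
cols = 50_000            # columns in the 100M x 50k matrix
rank = 300               # desired number of singular vectors (assumed)
bytes_per_double = 8

basis_bytes = cols * rank * bytes_per_double
print(f"Lanczos basis: {basis_bytes / 2**20:.0f} MiB")   # comfortably in RAM

cols_wide = 5_000_000    # a hypothetical much wider matrix
print(f"Wide case: {cols_wide * rank * bytes_per_double / 2**30:.1f} GiB")
```

At 50k columns the basis fits easily in memory, which is why Jake calls this the sweet spot; with millions of columns the same arithmetic blows past a single machine's heap.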



If it's very sparse you can try https://issues.apache.org/jira/browse/MAHOUT-703

Instead of minimizing reconstruction error, it tries to enforce that the words in your document rank higher than words not present in your document.

Example of some results from this approach: https://docs.google.com/present/edit?id=0AQC247eq7Jp5ZGZ6NXpyOWhfMjlmM2pzdjRkZw&authkey=CNj2h98P&hl=en_US
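A generic sketch of the ranking idea Hector describes (the actual MAHOUT-703 formulation may well differ): embeddings are nudged with a pairwise hinge update so that a word present in the document outscores a randomly sampled absent word, rather than minimizing reconstruction error.

```python
# Pairwise-hinge ranking sketch (illustrative, not Mahout's code):
# push words in the document above sampled absent words by a margin.
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, lr, margin = 100, 8, 0.1, 1.0
W = rng.normal(scale=0.1, size=(n_words, dim))    # word embeddings
doc_words = [3, 17, 42]                           # words in one document
d = W[doc_words].mean(axis=0)                     # document vector

for _ in range(500):
    pos = rng.choice(doc_words)                   # word in the document
    neg = int(rng.integers(n_words))              # likely-absent word
    if neg in doc_words:
        continue
    # hinge violation: want score(pos) > score(neg) + margin
    if margin - d @ W[pos] + d @ W[neg] > 0:
        W[pos] += lr * d
        W[neg] -= lr * d
        d = W[doc_words].mean(axis=0)             # refresh document vector

present = np.mean([d @ W[w] for w in doc_words])
absent = np.mean([d @ W[w] for w in range(n_words) if w not in doc_words])
print(present > absent)                           # present words rank higher
```

Because only observed (word, document) pairs and a handful of sampled negatives are touched per step, this kind of objective stays cheap on very sparse data.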


--

Yee Yang Li Hector

http://hectorgon.blogspot.com/ (tech + travel)

http://hectorgon.com (book reviews)


Hi Jake,

Thank you for your reply. Good to know that we can use Lanczos. I will have to look into the SSVD algorithm more closely to figure out whether the information loss is worth the gain in speed (and computational efficiency). I guess we will have to run more tests to see which works best before deciding which path to take.

Esh



I would push for SSVD as well if you want a real SVD.

Also, I don't think that you lose information about which vectors are which (or, as Jake put it, "what they mean"). The stochastic decomposition gives a very accurate estimate of the top-k singular vectors. It does this by using the random projection to project the top singular vectors into a sub-space and then correcting the results obtained back into the original space. This is not the same as simply doing the decomposition on the random projection and then using that decomposition.
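The project-then-correct scheme Ted describes can be written in a few lines of NumPy (a minimal Halko-Martinsson-Tropp-style sketch, not Mahout's actual SSVD implementation):

```python
# Stochastic SVD sketch: project with a random matrix, orthonormalize,
# solve a small SVD, then correct the factors back into the original space.
import numpy as np

rng = np.random.default_rng(0)

def ssvd(A, k, p=10, q=2):
    """Top-k SVD via random projection, oversampling p, q power iterations."""
    Y = A @ rng.normal(size=(A.shape[1], k + p))   # random projection
    for _ in range(q):
        Y = A @ (A.T @ Y)                          # power iterations sharpen Y
    Q, _ = np.linalg.qr(Y)                         # orthonormal range basis
    B = Q.T @ A                                    # small (k+p) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]          # correct U back to R^m

# Check against the exact SVD on a small matrix with a decaying spectrum.
A = rng.normal(size=(300, 60)) * np.linspace(1, 0.01, 60)
U, s, Vt = ssvd(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.max(np.abs(s - s_exact) / s_exact))       # small relative error
```

The correction step `Q @ Ub` is what brings the singular vectors back into the original row space; this is why the result is not merely "the SVD of a random projection".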



One of the algorithms we would like to try out is PCA, where it seems best to avoid subtracting the mean matrix (m*m^T) from the data matrix, say A, so as not to destroy sparsity (refer: http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0 ). From what I understand of the Lanczos algorithm, it shouldn't be too hard to modify the solver code so that I can pass A and m*m^T separately, without combining them into a single matrix, and then do the repeated multiplications. Unfortunately, I have not yet had time to look at SSVD, so it would be extremely helpful if someone who has looked at the problem more closely could comment on how to make these (potential?) modifications to the SSVD code to avoid having to deal with dense matrices.

Thanks in advance

Eshwaran



I think that incorporating mean subtraction into the SSVD code should be relatively straightforward. The trick is that you have to project the original matrix and the mean separately and then combine the results. There are later steps that require similar mods, but since it is all based on a simple product that can always be factored, it should be straightforward.
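Ted's "project separately, then combine" suggestion can be sketched in NumPy (assumed, not Mahout's actual SSVD code): every product with the centered matrix A - 1 m^T is factored into a sparse part and a rank-one mean part, including the power-iteration and B-computation steps he alludes to.

```python
# Factored mean subtraction inside a stochastic SVD: A is never densified.
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(1)
A = sparse_random(500, 200, density=0.05, format="csr", random_state=1)
m = np.asarray(A.mean(axis=0)).ravel()          # column means
ones = np.ones(A.shape[0])

def centered(X):        # (A - 1 m^T) X, factored
    return A @ X - np.outer(ones, m @ X)

def centered_t(Y):      # (A - 1 m^T)^T Y, factored
    return A.T @ Y - np.outer(m, ones @ Y)

k, p, q = 5, 10, 2
Y = centered(rng.normal(size=(A.shape[1], k + p)))   # projection step
for _ in range(q):
    Y = centered(centered_t(Y))                      # power iterations, factored
Q, _ = np.linalg.qr(Y)
B = centered_t(Q).T                                  # B = Q^T (A - 1 m^T)
Ub, s, Vt = np.linalg.svd(B, full_matrices=False)

s_exact = np.linalg.svd(A.toarray() - m, compute_uv=False)[:k]
print(np.max(np.abs(s[:k] - s_exact) / s_exact))     # small relative error
```

Each step touches the sparse A and the dense rank-one mean term separately, which is exactly the "simple product that can always be factored" Ted points to.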



Thanks Ted. That is good news.


On Jun 7, 2011, at 11:12 PM, Ted Dunning wrote:

> I think that incorporating mean subtraction into the SSVD code should

> be relatively straightforward. The trick is that you have to project

> the orginal matrix and the mean separately and then combine the

> results. There are later steps that require similar mods, but since

> it is all based on a simple product that can always be factored, it

> should be straightforward.

>

> On Tue, Jun 7, 2011 at 4:06 PM, Eshwaran Vijaya Kumar

> <[EMAIL PROTECTED]> wrote:

>> One of the algorithms we would like to try out is PCA where it seems to be good to avoid subtracting the mean matrix (m* m^T) from the data matrix, say A, to avoid destroying sparsity ( Refer : http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0 ). From what I understand about the Lanczos algorithm, it shouldn't be too hard to modify the solver code so that I can pass A and m*m^T without combining them into a single matrix and then do repeated multiplications. Unfortunately, I have not yet had time to look at SSVD; So it would be extremely helpful if someone who has looked at the problem more closer can comment on how to do these (potential?) modifications to the SSVD code to avoid having to deal with dense matrices.

>>

>>

>> Thanks in advance

>> Eshwaran

>>

>>

>>

>> On Jun 6, 2011, at 3:32 AM, Ted Dunning wrote:

>>

>>> I would push for SSVD as well if you want a real SVD.

>>>

>>> Also, I don't think that you lose information about which vectors are which

>>> (or as Jake put it "what they mean"). The stochastic decomposition gives a

>>> very accurate estimate of the top-k singular vectors. It does this by using

>>> the random projection to project the top singular vectors into a sub-space

>>> and then correcting the results obtained back into the original space. This

>>> is not the same as simply doing the decomposition on the random projection

>>> and then using that decomposition.

>>>

>>> On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar <

>>> [EMAIL PROTECTED]> wrote:

>>>

>>>> Hi Jake,

>>>> Thank you for your reply. Good to know that we can use Lanczos. I will

>>>> have to look into the SSVD algorithm more closely to figure out whether the

>>>> information loss is worth the gain in speed (and computational efficiency).

>>>> I guess we will have to run more tests to see which works best before

>>>> deciding which path to take.

>>>>

>>>>

>>>> Esh


I would encourage you to take a stab at a patch on this. You aren't

the only person to have expressed interest in scaling PCA, but you

aren't a member of a large horde, either.

On Wed, Jun 8, 2011 at 7:39 AM, Eshwaran Vijaya Kumar

<[EMAIL PROTECTED]> wrote:

> Thanks Ted. That is good news.

> On Jun 7, 2011, at 11:12 PM, Ted Dunning wrote:

>

>> I think that incorporating mean subtraction into the SSVD code should

>> be relatively straightforward. The trick is that you have to project

>> the original matrix and the mean separately and then combine the

>> results. There are later steps that require similar mods, but since

>> it is all based on a simple product that can always be factored, it

>> should be straightforward.

>>

>> On Tue, Jun 7, 2011 at 4:06 PM, Eshwaran Vijaya Kumar

>> <[EMAIL PROTECTED]> wrote:

>>> One of the algorithms we would like to try out is PCA, where it seems best to avoid subtracting the mean matrix (m* m^T) from the data matrix, say A, so as not to destroy sparsity (refer: http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0 ). From what I understand of the Lanczos algorithm, it shouldn't be too hard to modify the solver code so that I can pass A and m*m^T without combining them into a single matrix and then do the repeated multiplications. Unfortunately, I have not yet had time to look at SSVD, so it would be extremely helpful if someone who has looked at the problem more closely could comment on how to make these (potential?) modifications to the SSVD code and avoid having to deal with dense matrices.

>>>

>>>

>>> Thanks in advance

>>> Eshwaran

