Subject: tf-idf + svd + cosine similarity


I have been following this thread intermittently.

Some folks have said that using a higher-dimensional SVD should change the
distribution of distances.

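Actually, that isn't quite true.  SVD preserves dot products as much as
possible: keeping the top k singular vectors gives the best rank-k
approximation in the least-squares sense.  With lower-dimensional
projections you lose some information, but because the singular values
decline, each additional dimension you drop costs you less and less.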
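
A quick way to check this numerically is sketched below.  It is only an
illustration: the matrix is synthetic (low-rank structure plus a little
noise, standing in for a real tf-idf matrix), and the helper name
gram_error_by_rank is made up for this example.

```python
import numpy as np

def gram_error_by_rank(A, ranks):
    """Relative Frobenius error of the row dot products (A A^T) when
    only the top-k singular directions are kept."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    G = A @ A.T                      # exact dot products between rows
    errors = {}
    for k in ranks:
        P = U[:, :k] * s[:k]         # rows projected into k dimensions
        errors[k] = np.linalg.norm(G - P @ P.T) / np.linalg.norm(G)
    return errors

rng = np.random.default_rng(0)
# Assumption: rank-10 signal plus small noise as a stand-in for real data.
A = (rng.standard_normal((200, 10)) @ rng.standard_normal((10, 500))
     + 0.1 * rng.standard_normal((200, 500)))

errors = gram_error_by_rank(A, [2, 5, 10, 50])
for k, err in errors.items():
    print(k, err)
```

The error can only shrink as k grows, and once k covers the dominant
singular values the remaining loss is tiny.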

It *is* however true that *random* unit vectors in higher dimensions have
dot products that are more and more tightly clustered around zero (the
standard deviation shrinks like 1/sqrt(d) in d dimensions).  This is a
different case entirely from the one we are talking about, where real data
is projected down into a lower-dimensional space.
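
The random-vector effect is easy to see in a small simulation.  This is a
rough sketch, not anything from Mahout: the function name, the dimensions
and the sample size are arbitrary choices for illustration.

```python
import numpy as np

def random_unit_dot_products(dim, n_pairs, rng):
    """Dot products of n_pairs independent pairs of random unit vectors."""
    x = rng.standard_normal((n_pairs, dim))
    y = rng.standard_normal((n_pairs, dim))
    # Normalizing Gaussian vectors gives uniformly random directions.
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    return np.sum(x * y, axis=1)

rng = np.random.default_rng(0)
spread = {d: random_unit_dot_products(d, 2000, rng).std()
          for d in (10, 100, 1000)}
for d, sd in spread.items():
    # The empirical spread should sit close to 1/sqrt(d).
    print(d, sd, 1 / np.sqrt(d))
```

The spread falls off roughly as 1/sqrt(d), which is the clustering around
zero described above.  Nothing here involves projecting real data.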

On Wed, Jun 15, 2011 at 7:44 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: