Subject: Clustering Question (from a newbie)


I would recommend that you work with the original counts instead of
percentages.  That allows you to use statistical similarity measures based
on the multinomial distribution.  The important thing that the counts
provide over percentages is an understanding of how certain the estimated
distribution really is: 3 out of 10 and 300 out of 1000 are both 30%, but
the second is supported by far more evidence.
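
As a rough sketch of what a count-based comparison can look like, here is
the log-likelihood ratio (G^2) statistic for two count vectors over the
same categories.  Picking G^2 is my assumption for illustration; it is only
one of several multinomial-based measures, and the function name is made
up.

```python
import math

def g2_statistic(counts_a, counts_b):
    """Log-likelihood ratio (G^2) for two count vectors over the same categories.

    Hypothetical helper for illustration: larger values mean the two sets of
    counts are less plausibly drawn from one shared multinomial.
    """
    row_totals = [sum(counts_a), sum(counts_b)]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    n = sum(row_totals)
    g2 = 0.0
    for row, row_total in zip((counts_a, counts_b), row_totals):
        for k, col_total in zip(row, col_totals):
            if k > 0:  # a zero count contributes nothing (k * log k -> 0)
                g2 += k * math.log(k * n / (row_total * col_total))
    return 2.0 * g2

# Same percentages (60/40 vs 40/60), very different amounts of evidence.
small = g2_statistic([6, 4], [4, 6])
large = g2_statistic([600, 400], [400, 600])
print(small, large)
```

Both pairs have identical percentages, but the second pair yields a much
larger statistic because the counts are 100 times bigger.  That is the
information you lose when you keep only the percentages.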

If you do move forward with the percentages, I would consider using
something like Kullback-Leibler divergence as a measure of dissimilarity.
You would need to smooth the probabilities when you derive them from the
counts.  The simplest method for this is to introduce a simple prior into
your estimates by adding a small positive constant \delta to every count.
Then, if the count for each category i is k_i, you would estimate the
percentage p_i as

    p_i = (k_i + \delta) / \sum_j (k_j + \delta)

This prevents you from ever estimating exactly 0 or 1 for these percentages
and thus helps avoid log 0.  It will also tend to give you better results,
particularly for categories with small counts.
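
A minimal sketch of the smoothing formula above, followed by
Kullback-Leibler divergence on the smoothed estimates.  The function names,
the default \delta = 0.5, and the symmetrized variant are my assumptions
for illustration, not something prescribed here; choose \delta to suit your
data.

```python
import math

def smoothed_probs(counts, delta=0.5):
    """Estimate p_i = (k_i + delta) / sum_j (k_j + delta)."""
    total = sum(k + delta for k in counts)
    return [(k + delta) / total for k in counts]

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Assumes every entry of p and q is strictly positive, which the
    smoothing above guarantees when delta > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """KL is not symmetric; summing both directions is one common workaround."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Raw counts with zeros, which would give log 0 without smoothing.
a = smoothed_probs([12, 0, 3])
b = smoothed_probs([5, 4, 0])
print(symmetric_kl(a, b))
```

Since clustering usually wants a symmetric dissimilarity, the sketch sums
the divergence in both directions; plain D(p || q) works as well if the
asymmetry is acceptable for your algorithm.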

On Tue, Nov 22, 2011 at 1:46 PM, Fernando O. <[EMAIL PROTECTED]> wrote: