Grant Ingersoll - 2009-07-14, 13:41
Ted Dunning - 2009-07-28, 01:42
Benson Margulies - 2009-07-28, 01:51
Ted Dunning - 2009-07-28, 04:48
Grant Ingersoll - 2009-07-28, 10:55
Benson Margulies - 2009-07-28, 18:49
Ted Dunning - 2009-07-28, 20:36
Grant Ingersoll - 2009-08-18, 13:55
Grant Ingersoll - 2009-08-18, 14:32
Ted Dunning - 2009-08-18, 17:04

Ted,

On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data. A natural
> measure for this is to measure how much of the entropy of the data is
> captured by cluster membership. For k-means and its natural L_2 metric,
> the natural cluster quality metric is the squared distance from the
> nearest centroid adjusted by the log_2 of the number of clusters. This
> can be compared to the squared magnitude of the original data or the
> squared deviation from the centroid for all of the data. The idea is
> that you are changing the representation of the data by allocating some
> of the bits in your original representation to represent which cluster
> each point is in. If those bits aren't made up by the residue being
> small, then your clustering is making a bad trade-off.
>
> In the past, I have used other, more heuristic measures as well. One of
> the key characteristics that I would like to see out of a clustering is
> a degree of stability. Thus, I look at the fractions of points that are
> assigned to each cluster or the distribution of distances from the
> cluster centroid. These values should be relatively stable when applied
> to held-out data.
>
> For text, you can actually compute perplexity, which measures how well
> cluster membership predicts what words are used. This is nice because
> you don't have to worry about the entropy of real-valued numbers.
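The bits-versus-residual trade-off described in the quote above can be sketched numerically. This is a minimal illustration, not Mahout code; the function name `kmeans_tradeoff_score` and the choice to report raw sums rather than a calibrated bit count are assumptions made here for clarity:

```python
import numpy as np

def kmeans_tradeoff_score(centroids, held_out):
    """Compare the bits spent naming a cluster (log2 k per point) against
    the reduction in squared residual on held-out data.

    Hypothetical helper for illustration. The residual and the assignment
    bits are only commensurable under some assumed noise model, so treat
    the returned numbers as raw ingredients, not a finished metric."""
    # Squared distance from each held-out point to every centroid.
    d2 = ((held_out[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Residual: squared distance to the *nearest* centroid, summed.
    residual = d2.min(axis=1).sum()
    # Baseline: squared deviation from the single global centroid (k = 1).
    baseline = ((held_out - held_out.mean(axis=0)) ** 2).sum()
    # Bits spent recording which cluster each point belongs to.
    assignment_bits = len(held_out) * np.log2(len(centroids))
    return residual, baseline, assignment_bits
```

If `baseline - residual` is not large enough to pay for `assignment_bits` (under whatever noise model converts squared error to bits), the clustering is making the bad trade-off the quote describes.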

Do you have any references on any of the above approaches?

Thanks,

Grant


(Vastly delayed response ... huge distractions competing with more-than-two-minute answers are to blame.)

Grant,

For evaluating clustering for symbol sequences:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275

Most of the other references I have found talk about quality relative to gold-standard judgments about whether exemplars are in the same class, or relative to similarity/distinctiveness ratios. Neither is all that satisfactory.

My preference is an entropic measure that describes how much of the information in your data is captured by the clustering vs. how much residual info there is.

The other reference I am looking for may be in David MacKay's book. The idea is that you measure the quality of the approximation by looking at the entropy in the cluster assignment relative to the residual required to precisely specify the original data relative to the quantized value.

This is also related to trading off signal/noise in a vector quantizer.
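A rough two-part-code version of this idea (entropy of the cluster assignment plus residual bits to recover each point from its centroid) might look like the sketch below. The spherical Gaussian with scale `sigma` is an assumption introduced here to turn squared residuals into bits; the thread does not specify a noise model:

```python
import numpy as np

def description_length(X, labels, sigma=1.0):
    """Two-part code sketch: assignment entropy plus residual bits.

    Assumes (for illustration only) a spherical Gaussian residual with
    scale `sigma`, so each squared deviation costs 0.5*r^2/sigma^2 nats."""
    n = len(X)
    bits = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        p = len(members) / n
        # Part 1: bits to name the cluster for each member point.
        bits += -len(members) * np.log2(p)
        # Part 2: residual bits to specify each point given its centroid.
        resid = members - members.mean(axis=0)
        bits += 0.5 * (resid ** 2).sum() / (sigma ** 2) / np.log(2)
    return bits
```

On well-separated data, the extra assignment bits of a correct 2-cluster labeling are more than repaid by the shrunken residual, so its description length beats the single-cluster code.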

David, do you have a moment to talk about this with me? I can't free up the time to chase these final references and come up with a nice formula for this. I think you could do it in 10-20 minutes.

On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Do you have any references on any of the above approaches?

--

Ted Dunning, CTO

DeepDyve

111 West Evelyn Ave. Ste. 202

Sunnyvale, CA 94086

http://www.deepdyve.com

858-414-0013 (m)

408-773-0220 (fax)


Brown and Della Pietra's algorithm for clustering based on entropy is somewhat infamous for the difficulty of achieving usable performance. Mike Collins was responsible for a famously speedy version. Having built one that is just barely fast enough in C++, I wouldn't recommend trying it in Java. Of course, you aren't proposing that, just recommending the bigram entropy metric or something like it.

On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> For evaluating clustering for symbol sequences:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275


On Mon, Jul 27, 2009 at 6:51 PM, Benson Margulies <[EMAIL PROTECTED]> wrote:

> [brown and mercer did hard stuff] Of course, you aren't proposing that, just recommending the bigram entropy metric or something like it.

Peter Brown and Bob Mercer were very sharp dudes, and when they did this work it was 100 times more amazing than it is now. They had the advantage of working for a company that understood that the resources you give researchers now should be 20 times more than you would expect a user to have in 5 years, but even so, their achievements were quite something.

Frankly, that record of achievement leads back beyond them to Fred Jelinek, Lalit Bahl, and Selim Roukos and all the other early guys who worked on speech back then. That work (along with the BBN team under Jim and Janet Baker) gave us the entire framework of HMMs and entropy-based evaluation that is core to speech systems today. It leads forward to some of the really fabulous work that the Della Pietra brothers did as well.

I owe the IBM team my interest in statistical approaches to AI and symbolic sequences. It was on a visit to IBM in 1990 or so that Stephen (or Vincent) dP mentioned off-handedly to me that mutual information was "trivially known to be chi-squared distributed asymptotically". That was news to me and formed the basis of a LOT of the work that I have done in the intervening 19 years.
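The "trivially known" fact is the identity behind the log-likelihood-ratio (G) test: for a contingency table with N observations, 2*N*MI (in nats) is exactly the G statistic, which is asymptotically chi-squared with (rows-1)*(cols-1) degrees of freedom. A small sketch, with helper names chosen here for illustration:

```python
import numpy as np

def g_statistic(table):
    """G = 2 * sum(O * ln(O/E)) over a contingency table; asymptotically
    chi-squared with (rows-1)*(cols-1) degrees of freedom."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence of rows and columns.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = table > 0                      # treat 0 * ln(0) as 0
    return 2.0 * (table[mask] * np.log(table[mask] / expected[mask])).sum()

def mutual_information_nats(table):
    """Empirical mutual information (in nats) of the table's joint counts."""
    table = np.asarray(table, dtype=float)
    p = table / table.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return (p[mask] * np.log(p[mask] / (px * py)[mask])).sum()
```

The relation G = 2*N*MI holds exactly term by term; the chi-squared part is the asymptotic distribution of G under independence.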

--

Ted Dunning, CTO

DeepDyve


On Jul 28, 2009, at 12:48 AM, Ted Dunning wrote:

> I owe the IBM team my interest in statistical approaches to AI and symbolic sequences. It was on a visit to IBM in 1990 or so that Stephen (or Vincent) dP mentioned off-handedly to me that mutual information was "trivially known to be chi-squared distributed asymptotically".

I love statements like these! Takes me back to the good old Math days of "We'll leave it as an exercise to the reader", or proofs that start off by saying "It is trivial to prove ..., so we'll proceed to the main part of the proof", and, as a 20-year-old Math student, you spend the next day beating your head against the wall because it is anything but trivial to you!

-Grant

On Tue, Jul 28, 2009 at 6:55 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> I love statements like these!

And, indeed, the paper that started this thread is a shining example of that sort of thing from the point of view of actual programming. The 'description' of how to get from the O(5) obvious to something usable is largely notable for what it does not say.



To be fair, it was a trivial result. If you start from some very deep theorems. :-)

On Tue, Jul 28, 2009 at 3:55 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> I love statements like these!

--

Ted Dunning, CTO

DeepDyve


On Jul 27, 2009, at 9:42 PM, Ted Dunning wrote:

> The other reference I am looking for may be in David MacKay's book. The idea is that you measure the quality of the approximation by looking at the entropy in the cluster assignment relative to the residual required to precisely specify the original data relative to the quantized value.

Is the W.M. Rand paper in JSTOR ("Objective Criteria for the Evaluation of Clustering Methods") worthwhile on this topic? Basic searches for "evaluating clustering" or "cluster evaluation" on Google Scholar turn up very little. The Rand paper is from 1971, but who knows...

Of course, I'd like something that doesn't require purchase (sigh.)

Also found: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

On Aug 18, 2009, at 9:55 AM, Grant Ingersoll wrote:

> Is the W.M. Rand paper in JSTOR ("Objective Criteria for the Evaluation of Clustering Methods") worthwhile on this topic?


These all depend on gold standards. If you have those, then it is easy to evaluate a clustering.

What is hard is to evaluate a clustering without a standard. I have done this, somewhat, in the past by looking at stability over time in terms of cluster size and membership. I have also looked at the utility of cluster membership in predicting objective attributes not used in the clustering.

The stability criterion might apply to some of our data sets. The utility measure only works in a modeling setting.
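The stability check described here (cluster-size fractions staying steady over time) can be sketched as a total-variation distance between the per-cluster fractions of two runs. Assuming clusters are already matched up by index is a simplification made for this sketch; real runs would first need a cluster-alignment step:

```python
import numpy as np

def size_fraction_drift(labels_a, labels_b):
    """Total variation distance between per-cluster size fractions of two
    clusterings (e.g. two time slices). 0.0 means identical fractions,
    1.0 means completely disjoint mass.

    Illustrative sketch: assumes cluster indices already correspond
    across the two runs."""
    k = int(max(labels_a.max(), labels_b.max())) + 1
    fa = np.bincount(labels_a, minlength=k) / len(labels_a)
    fb = np.bincount(labels_b, minlength=k) / len(labels_b)
    return np.abs(fa - fb).sum() / 2.0
```

A drift near zero on held-out or later data is the kind of stability the paragraph above asks for; a large drift suggests the clustering is not capturing durable structure.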

On Tue, Aug 18, 2009 at 7:32 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Also found:
> http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

--

Ted Dunning, CTO

DeepDyve
