Dan Filimon

2013-06-20, 07:53

Sean Owen

2013-06-20, 08:03

Dan Filimon

2013-06-20, 08:52

Sean Owen

2013-06-20, 09:16

Dan Filimon

2013-06-20, 09:25

Sean Owen

2013-06-20, 09:28

Dan Filimon

2013-06-20, 09:41

Ted Dunning

2013-06-20, 22:10

Dan Filimon

2013-06-21, 07:25

Ted Dunning

2013-06-21, 08:45

Dan Filimon

2013-06-21, 09:13

Ted Dunning

2013-06-21, 09:35

Dan Filimon

2013-06-21, 09:59

Ted Dunning

2013-06-21, 10:15

Sebastian Schelter

2013-06-21, 10:23

Ted Dunning

2013-06-21, 10:52

When computing item-item similarity using the log-likelihood similarity

[1], can I simply apply a sigmoid to the resulting values to get the

probability that two items are similar?

Is there any other processing I need to do?

Thanks!

[1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

Someone can check my facts here, but the log-likelihood ratio follows

a chi-square distribution. You can figure an actual probability from

that in the usual way, from its CDF. You would need to tweak the code

you see in the project to compute an actual LLR by normalizing the

input.

You could use 1-p then as a similarity metric.

This also isn't how the test statistic is turned into a similarity

metric in the project now. But 1-p sounds nicer. Maybe the historical

reason was speed, or, ignorance.
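[Editor's note: a minimal sketch of the 1-p mapping Sean describes. The chi-squared CDF with one degree of freedom has a closed form in terms of erf, so no stats library is needed. This is an illustration, not Mahout's current code.]

```python
import math

def chi2_cdf_df1(x):
    # Chi-squared CDF with 1 degree of freedom: P(X <= x) = erf(sqrt(x/2)).
    return math.erf(math.sqrt(x / 2.0))

def similarity_from_llr(llr):
    # p is the upper-tail probability of chi-squared(1) at the observed
    # LLR; 1 - p (i.e. the CDF value itself) is the proposed similarity.
    p = 1.0 - chi2_cdf_df1(llr)
    return 1.0 - p

print(similarity_from_llr(3.841))  # ~0.95, the classic 5% cutoff
```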

On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon

<[EMAIL PROTECTED]> wrote:

> When computing item-item similarity using the log-likelihood similarity

> [1], can I simply apply a sigmoid do the resulting values to get the

> probability that two items are similar?

>

> Is there any other processing I need to do?

>

> Thanks!

>

> [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

My understanding:

Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared

distribution with 1 degree of freedom in the 2x2 table case.

        A    ~A
  B
 ~B

We're testing to see if p(A | B) = p(A | ~B). That's the null hypothesis. I

compute the LLR. The larger that is, the more unlikely the null hypothesis

is to be true.

I can then look at a table with df=1. And I'd get p, the probability of

seeing that result or something worse (the upper tail).

So, the probability of them being similar is 1 - p (which is exactly the

CDF for that value of X).

Now, my question is: in the contingency table case, why would I normalize?

It's a ratio already, isn't it?
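[Editor's note: Dan's reading can be checked numerically. The statistic below is the standard G^2 form, 2 * sum of k * ln(k / E) with E the expected cell count under independence; it is zero when the table matches independence exactly. A sketch for illustration, not the project's implementation.]

```python
import math

def llr_2x2(k11, k12, k21, k22):
    # G^2 = 2 * sum over cells of k * ln(k / E), where
    # E = row_total * col_total / N is the expected count under the
    # null hypothesis of independence.
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    n = row1 + row2
    g2 = 0.0
    for k, r, c in ((k11, row1, col1), (k12, row1, col2),
                    (k21, row2, col1), (k22, row2, col2)):
        if k > 0:
            g2 += k * math.log(k * n / (r * c))
    return 2.0 * g2

print(llr_2x2(10, 10, 10, 10))     # 0.0: table matches independence exactly
print(llr_2x2(100, 10, 10, 1000))  # large: strong association
```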

On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> someone can check my facts here, but the log-likelihood ratio follows

> a chi-square distribution. You can figure an actual probability from

> that in the usual way, from its CDF. You would need to tweak the code

> you see in the project to compute an actual LLR by normalizing the

> input.

>

> You could use 1-p then as a similarity metric.

>

> This also isn't how the test statistic is turned into a similarity

> metric in the project now. But 1-p sounds nicer. Maybe the historical

> reason was speed, or, ignorance.

>

> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon

> <[EMAIL PROTECTED]> wrote:

> > When computing item-item similarity using the log-likelihood similarity

> > [1], can I simply apply a sigmoid do the resulting values to get the

> > probability that two items are similar?

> >

> > Is there any other processing I need to do?

> >

> > Thanks!

> >

> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

>

I think the quickest answer is: the formula computes the test

statistic as a difference of log values, rather than log of ratio of

values. By not normalizing, the entropy is multiplied by a factor (sum

of the counts) vs normalized. So you do end up with a statistic N

times larger when counts are N times larger.
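[Editor's note: Sean's scaling point can be verified directly. Writing the statistic as a difference of unnormalized entropies (if memory serves, close to how the project's LogLikelihood class computes it), multiplying every count by N multiplies the result by N.]

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy: N * H(counts / N), computed from raw counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

base = llr(10, 5, 5, 100)
scaled = llr(100, 50, 50, 1000)
print(scaled / base)  # ~10.0: scaling every count by 10 scales the statistic by 10
```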

On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon

<[EMAIL PROTECTED]> wrote:

> My understanding:

>

> Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared

> distribution with 1 degree of freedom in the 2x2 table case.

> A ~A

> B

> ~B

>

> We're testing to see if p(A | B) = p(A | ~B). That's the null hypothesis. I

> compute the LLR. The larger that is, the more unlikely the null hypothesis

> is to be true.

> I can then look at a table with df=1. And I'd get p, the probability of

> seeing that result or something worse (the upper tail).

> So, the probability of them being similar is 1 - p (which is exactly the

> CDF for that value of X).

>

> Now, my question is: in the contingency table case, why would I normalize?

> It's a ratio already, isn't it?

>

>

> On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

>

>> someone can check my facts here, but the log-likelihood ratio follows

>> a chi-square distribution. You can figure an actual probability from

>> that in the usual way, from its CDF. You would need to tweak the code

>> you see in the project to compute an actual LLR by normalizing the

>> input.

>>

>> You could use 1-p then as a similarity metric.

>>

>> This also isn't how the test statistic is turned into a similarity

>> metric in the project now. But 1-p sounds nicer. Maybe the historical

>> reason was speed, or, ignorance.

>>

>> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon

>> <[EMAIL PROTECTED]> wrote:

>> > When computing item-item similarity using the log-likelihood similarity

>> > [1], can I simply apply a sigmoid do the resulting values to get the

>> > probability that two items are similar?

>> >

>> > Is there any other processing I need to do?

>> >

>> > Thanks!

>> >

>> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

>>

Right, makes sense. So, by normalize, I need to replace the counts in the

matrix with probabilities.

So, I would divide everything by the sum of all the counts in the matrix?

On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> I think the quickest answer is: the formula computes the test

> statistic as a difference of log values, rather than log of ratio of

> values. By not normalizing, the entropy is multiplied by a factor (sum

> of the counts) vs normalized. So you do end up with a statistic N

> times larger when counts are N times larger.

>

> On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon

> <[EMAIL PROTECTED]> wrote:

> > My understanding:

> >

> > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared

> > distribution with 1 degree of freedom in the 2x2 table case.

> > A ~A

> > B

> > ~B

> >

> > We're testing to see if p(A | B) = p(A | ~B). That's the null

> hypothesis. I

> > compute the LLR. The larger that is, the more unlikely the null

> hypothesis

> > is to be true.

> > I can then look at a table with df=1. And I'd get p, the probability of

> > seeing that result or something worse (the upper tail).

> > So, the probability of them being similar is 1 - p (which is exactly the

> > CDF for that value of X).

> >

> > Now, my question is: in the contingency table case, why would I

> normalize?

> > It's a ratio already, isn't it?

> >

> >

> > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> >

> >> someone can check my facts here, but the log-likelihood ratio follows

> >> a chi-square distribution. You can figure an actual probability from

> >> that in the usual way, from its CDF. You would need to tweak the code

> >> you see in the project to compute an actual LLR by normalizing the

> >> input.

> >>

> >> You could use 1-p then as a similarity metric.

> >>

> >> This also isn't how the test statistic is turned into a similarity

> >> metric in the project now. But 1-p sounds nicer. Maybe the historical

> >> reason was speed, or, ignorance.

> >>

> >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon

> >> <[EMAIL PROTECTED]> wrote:

> >> > When computing item-item similarity using the log-likelihood

> similarity

> >> > [1], can I simply apply a sigmoid do the resulting values to get the

> >> > probability that two items are similar?

> >> >

> >> > Is there any other processing I need to do?

> >> >

> >> > Thanks!

> >> >

> >> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

> >>

>

Yes that should be all that's needed.

On Jun 20, 2013 10:27 AM, "Dan Filimon" <[EMAIL PROTECTED]> wrote:

> Right, makes sense. So, by normalize, I need to replace the counts in the

> matrix with probabilities.

> So, I would divide everything by the sum of all the counts in the matrix?

>

>

> On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

>

> > I think the quickest answer is: the formula computes the test

> > statistic as a difference of log values, rather than log of ratio of

> > values. By not normalizing, the entropy is multiplied by a factor (sum

> > of the counts) vs normalized. So you do end up with a statistic N

> > times larger when counts are N times larger.

> >

> > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon

> > <[EMAIL PROTECTED]> wrote:

> > > My understanding:

> > >

> > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared

> > > distribution with 1 degree of freedom in the 2x2 table case.

> > > A ~A

> > > B

> > > ~B

> > >

> > > We're testing to see if p(A | B) = p(A | ~B). That's the null

> > hypothesis. I

> > > compute the LLR. The larger that is, the more unlikely the null

> > hypothesis

> > > is to be true.

> > > I can then look at a table with df=1. And I'd get p, the probability of

> > > seeing that result or something worse (the upper tail).

> > > So, the probability of them being similar is 1 - p (which is exactly

> the

> > > CDF for that value of X).

> > >

> > > Now, my question is: in the contingency table case, why would I

> > normalize?

> > > It's a ratio already, isn't it?

> > >

> > >

> > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> > >

> > >> someone can check my facts here, but the log-likelihood ratio follows

> > >> a chi-square distribution. You can figure an actual probability from

> > >> that in the usual way, from its CDF. You would need to tweak the code

> > >> you see in the project to compute an actual LLR by normalizing the

> > >> input.

> > >>

> > >> You could use 1-p then as a similarity metric.

> > >>

> > >> This also isn't how the test statistic is turned into a similarity

> > >> metric in the project now. But 1-p sounds nicer. Maybe the historical

> > >> reason was speed, or, ignorance.

> > >>

> > >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon

> > >> <[EMAIL PROTECTED]> wrote:

> > >> > When computing item-item similarity using the log-likelihood

> > similarity

> > >> > [1], can I simply apply a sigmoid do the resulting values to get the

> > >> > probability that two items are similar?

> > >> >

> > >> > Is there any other processing I need to do?

> > >> >

> > >> > Thanks!

> > >> >

> > >> > [1]

> http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

> > >>

> >

>

Awesome! Thanks for clarifying! :)

On Thu, Jun 20, 2013 at 12:28 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> Yes that should be all that's needed.

> On Jun 20, 2013 10:27 AM, "Dan Filimon" <[EMAIL PROTECTED]>

> wrote:

>

> > Right, makes sense. So, by normalize, I need to replace the counts in the

> > matrix with probabilities.

> > So, I would divide everything by the sum of all the counts in the matrix?

> >

> >

> > On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> >

> > > I think the quickest answer is: the formula computes the test

> > > statistic as a difference of log values, rather than log of ratio of

> > > values. By not normalizing, the entropy is multiplied by a factor (sum

> > > of the counts) vs normalized. So you do end up with a statistic N

> > > times larger when counts are N times larger.

> > >

> > > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon

> > > <[EMAIL PROTECTED]> wrote:

> > > > My understanding:

> > > >

> > > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared

> > > > distribution with 1 degree of freedom in the 2x2 table case.

> > > > A ~A

> > > > B

> > > > ~B

> > > >

> > > > We're testing to see if p(A | B) = p(A | ~B). That's the null

> > > hypothesis. I

> > > > compute the LLR. The larger that is, the more unlikely the null

> > > hypothesis

> > > > is to be true.

> > > > I can then look at a table with df=1. And I'd get p, the probability

> of

> > > > seeing that result or something worse (the upper tail).

> > > > So, the probability of them being similar is 1 - p (which is exactly

> > the

> > > > CDF for that value of X).

> > > >

> > > > Now, my question is: in the contingency table case, why would I

> > > normalize?

> > > > It's a ratio already, isn't it?

> > > >

> > > >

> > > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[EMAIL PROTECTED]>

> wrote:

> > > >

> > > >> someone can check my facts here, but the log-likelihood ratio

> follows

> > > >> a chi-square distribution. You can figure an actual probability from

> > > >> that in the usual way, from its CDF. You would need to tweak the

> code

> > > >> you see in the project to compute an actual LLR by normalizing the

> > > >> input.

> > > >>

> > > >> You could use 1-p then as a similarity metric.

> > > >>

> > > >> This also isn't how the test statistic is turned into a similarity

> > > >> metric in the project now. But 1-p sounds nicer. Maybe the

> historical

> > > >> reason was speed, or, ignorance.

> > > >>

> > > >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon

> > > >> <[EMAIL PROTECTED]> wrote:

> > > >> > When computing item-item similarity using the log-likelihood

> > > similarity

> > > >> > [1], can I simply apply a sigmoid do the resulting values to get

> the

> > > >> > probability that two items are similar?

> > > >> >

> > > >> > Is there any other processing I need to do?

> > > >> >

> > > >> > Thanks!

> > > >> >

> > > >> > [1]

> > http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html

> > > >>

> > >

> >

>

I think that this is a really bad thing to do.

The LLR is really good to find interesting things. Once you have done

that, directly using the LLR in any form to produce a weight reduces the

method to something akin to Naive Bayes. This is bad generally and very,

very bad in the cases of small counts.

Typically LLR works extremely well when you use it as a filter only and

then use some global measure to compute a weight. See the Luduan method [1]

for an example. The use of a text retrieval engine to implement a search

engine such as I have been lately nattering about much too much is another

example. A major reason that such methods work so unreasonably well is

that they don't make silly weighting decisions based on very small counts.

It is slightly paradoxical that looking at global counts rather than

counts specific to the cases of interest produces much better weights, but

the empirical evidence is pretty overwhelming.

Aside from such practical considerations, there is the fact that converting

a massive number of frequentist p-values into weights is either outright

heresy (from the frequentist point of view) or simply nutty (from the

Bayesian point of view).

In any case, I have never been able to get more than one bit of useful

information from an LLR score. That one bit is extremely powerful, but

getting more seems to be a very bad idea.

[1] http://arxiv.org/abs/1207.1847, especially chapter 7
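[Editor's note: Ted's filter-then-weight recipe in outline: use the LLR only as a yes/no gate and take the weight from some global statistic. Everything here beyond the G^2 formula itself (the threshold value, the idf dictionary) is a made-up placeholder for illustration.]

```python
import math

def llr_2x2(k11, k12, k21, k22):
    # G^2 for a 2x2 co-occurrence table; 0 when the cells match
    # independence exactly.
    n = k11 + k12 + k21 + k22
    cells = ((k11, k11 + k12, k11 + k21), (k12, k11 + k12, k12 + k22),
             (k21, k21 + k22, k11 + k21), (k22, k21 + k22, k12 + k22))
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)

THRESHOLD = 10.0  # arbitrary cutoff, for illustration only

def filtered_weights(pair_tables, idf):
    # Keep a pair only if its LLR clears the threshold (the one useful
    # bit), then discard the score and weight by a global measure.
    weights = {}
    for pair, (k11, k12, k21, k22) in pair_tables.items():
        if llr_2x2(k11, k12, k21, k22) > THRESHOLD:
            weights[pair] = idf[pair[1]]  # weight from global stats only
    return weights
```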

On Thu, Jun 20, 2013 at 10:41 AM, Dan Filimon

<[EMAIL PROTECTED]>wrote:

> Awesome! Thanks for clarifying! :)

>

>

> On Thu, Jun 20, 2013 at 12:28 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

>

> > Yes that should be all that's needed.

> > On Jun 20, 2013 10:27 AM, "Dan Filimon" <[EMAIL PROTECTED]>

> > wrote:

> >

> > > Right, makes sense. So, by normalize, I need to replace the counts in

> the

> > > matrix with probabilities.

> > > So, I would divide everything by the sum of all the counts in the

> matrix?

> > >

> > >

> > > On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> > >

> > > > I think the quickest answer is: the formula computes the test

> > > > statistic as a difference of log values, rather than log of ratio of

> > > > values. By not normalizing, the entropy is multiplied by a factor

> (sum

> > > > of the counts) vs normalized. So you do end up with a statistic N

> > > > times larger when counts are N times larger.

> > > >

> > > > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon

> > > > <[EMAIL PROTECTED]> wrote:

> > > > > My understanding:

> > > > >

> > > > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared

> > > > > distribution with 1 degree of freedom in the 2x2 table case.

> > > > > A ~A

> > > > > B

> > > > > ~B

> > > > >

> > > > > We're testing to see if p(A | B) = p(A | ~B). That's the null

> > > > hypothesis. I

> > > > > compute the LLR. The larger that is, the more unlikely the null

> > > > hypothesis

> > > > > is to be true.

> > > > > I can then look at a table with df=1. And I'd get p, the

> probability

> > of

> > > > > seeing that result or something worse (the upper tail).

> > > > > So, the probability of them being similar is 1 - p (which is

> exactly

> > > the

> > > > > CDF for that value of X).

> > > > >

> > > > > Now, my question is: in the contingency table case, why would I

> > > > normalize?

> > > > > It's a ratio already, isn't it?

> > > > >

> > > > >

> > > > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen <[EMAIL PROTECTED]>

> > wrote:

> > > > >

> > > > >> someone can check my facts here, but the log-likelihood ratio

> > follows

> > > > >> a chi-square distribution. You can figure an actual probability

> from

> > > > >> that in the usual way, from its CDF. You would need to tweak the

> > code

> > > > >> you see in the project to compute an actual LLR by normalizing the

> > > > >> input.

> > > > >>

> > > > >> You could use 1-p then as a similarity metric.

> > > > >>

> > > > >> This also isn't how the test statistic is turned into a similarity

The LLR is really good for finding interesting things. Once you have done that, directly using the LLR in any form to produce a weight reduces the method to something akin to Naive Bayes. This is bad generally and very, very bad in the case of small counts.

Typically LLR works extremely well when you use it as a filter only and then use some global measure to compute a weight. See the Luduan method [1] for an example. The use of a text retrieval engine to implement a search engine, such as I have been lately nattering about much too much, is another example. A major reason that such methods work so unreasonably well is that they don't make silly weighting decisions based on very small counts. It is slightly paradoxical that looking at global counts rather than counts specific to the cases of interest produces much better weights, but the empirical evidence is pretty overwhelming.

Aside from such practical considerations, there is the fact that converting a massive number of frequentist p-values into weights is either outright heresy (from the frequentist point of view) or simply nutty (from the Bayesian point of view).

In any case, I have never been able to get more than one bit of useful information from an LLR score. That one bit is extremely powerful, but getting more seems to be a very bad idea.

[1] http://arxiv.org/abs/1207.1847, chapter 7 especially
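For concreteness, the 2x2 LLR score under discussion can be sketched as follows (an illustrative rendering of the entropy formulation from the surprise-and-coincidence post; not code from Mahout):

```python
import math

def entropy(*counts):
    # Unnormalized "entropy" term: sum of k * log(k / N) over the nonzero counts.
    total = sum(counts)
    return sum(k * math.log(k / total) for k in counts if k > 0)

def llr_2x2(k11, k12, k21, k22):
    # -2 log lambda for a 2x2 contingency table:
    # k11 = both events, k12 = first only, k21 = second only, k22 = neither.
    rows = entropy(k11 + k12, k21 + k22)
    cols = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (mat - rows - cols)
```

An independent table (e.g. all counts equal) scores near zero; strong association gives a large score.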

On Thu, Jun 20, 2013 at 10:41 AM, Dan Filimon

<[EMAIL PROTECTED]>wrote:

> Awesome! Thanks for clarifying! :)


Thanks for the reference! I'll take a look at chapter 7, but let me first describe what I'm trying to achieve.

I'm trying to identify interesting pairs, the anomalous co-occurrences, with the LLR. I'm doing this for a day's data, and I want to keep the p-values. I then want to use the p-values to compute some overall probability over the course of multiple days to increase confidence in what I think are the interesting pairs.
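To be concrete, the LLR-to-p-value conversion I have in mind (assuming, as discussed earlier in the thread, that the LLR follows a chi-squared distribution with one degree of freedom) would be roughly:

```python
import math

def llr_to_p(llr):
    # Upper-tail p-value of a chi-squared statistic with df=1.
    # For df=1 the survival function reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(llr / 2.0))
```

For example, llr_to_p(3.841) is about 0.05, the familiar critical value; 1 - p would then be the "similarity".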

On Fri, Jun 21, 2013 at 1:10 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> I think that this is a really bad thing to do.



On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon <[EMAIL PROTECTED]>wrote:

> Thanks for the reference! I'll take a look at chapter 7, but let me first

> describe what I'm trying to achieve.


You can't reliably combine p-values this way (repeated comparisons and all that).

Also, in practice, if you take the top 50-100 indicators of this sort, the p-values will be so astronomically small that frequentist tests of significance are ludicrous.

That said, the assumptions underlying the tests are really a much bigger problem. The interesting problems of the world are often highly non-stationary, which can lead to all kinds of problems in interpreting these results. What does it mean if something shows a 10^-20 p-value one day and a 0.2 value the next? Are you going to multiply them? Or just say that something isn't quite the same? But then how do you avoid comparing p-values, which is a famously bad practice?

To my mind, the real problem here is that we are simply asking the wrong question. We shouldn't be asking about individual features. We should be asking about overall model performance. You *can* measure real-world performance, you *can* put error bars around that performance, and you *can* see changes and degradation in that performance. All of those comparisons are well-founded and work great. Whether the model has selected too many or too few variables really is a diagnostic matter that has little to do with answering the question of whether the model is working well.


The thing is, there's no real model for which these are features.

I'm looking for pairs of similar items (and eventually groups). I'd like a probabilistic interpretation of how similar two items are, something like "what is the probability that a user who likes one will also like the other?".

Then, with these probabilities per day, I'd combine them over the course of multiple days by "pulling" the older probabilities towards 0.5: alpha * 0.5 + (1 - alpha) * p would be the linear approach to combining this, where alpha is 0 for the most recent day and larger for older ones. Then, I'd take the average of those estimates. The result would, in my mind, be a "smoothed" probability.

Then, I'd get the top k per item from these.
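In code, the smoothing I have in mind would look roughly like this (illustrative only; the per-day alpha schedule is whatever decay one chooses):

```python
def smooth(p_by_day, alphas):
    # Pull each day's probability toward 0.5 by that day's alpha, then average.
    # alphas[i] is 0 for the most recent day and grows for older days.
    pulled = [a * 0.5 + (1.0 - a) * p for p, a in zip(p_by_day, alphas)]
    return sum(pulled) / len(pulled)
```

For example, three days of p = 0.9 with alphas 0, 0.5, and 1 become 0.9, 0.7, and 0.5, which average to 0.7.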

On Fri, Jun 21, 2013 at 11:45 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:


> You can't reliably combine p-values this way (repeated comparisons and all

> that).



Well, you are still stuck with the problem that pulling more bits out of the small-count data is a bad idea.

Most of the models that I am partial to never even honestly estimate probabilities. They just include or exclude features and then weight rare features higher than common ones.

This is easy to do across days, and it is very easy to have different days contribute differently.
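A minimal sketch of the filter-then-weight idea (the threshold and the IDF-style weight here are hypothetical choices, not a prescription):

```python
import math

def indicator_weights(llr_scores, item_counts, total, threshold=10.0):
    # Use the LLR purely as a filter (the one useful bit), then weight each
    # surviving indicator by a global rarity measure: rarer items weigh more.
    weights = {}
    for (item, indicator), score in llr_scores.items():
        if score >= threshold:
            weights[(item, indicator)] = math.log(total / item_counts[indicator])
    return weights
```

Pairs that fail the LLR filter simply vanish; the weights of the survivors depend only on global counts, not on the LLR magnitude.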

On Fri, Jun 21, 2013 at 10:13 AM, Dan Filimon

<[EMAIL PROTECTED]>wrote:

> The thing is there's no real model for which these are features.



Could you be more explicit? What models are these, and how do I use them to track how similar two items are?

I'm essentially working with a custom-tailored RowSimilarityJob after first filtering out users with too many items.

On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Well, you are still stuck with the problem that pulling more bits out of

> the small count data is a bad idea.



> > >

> > > To my mind, the real problem here is that we are simply asking the

> wrong

> > > question. We shouldn't be asking about individual features. We should

> > be

> > > asking about overall model performance. You *can* measure real-world

> > > performance and you *can* put error bars around that performance and

> you

> > > *can* see changes and degradation in that performance. All of those

> > > comparisons are well-founded and work great. Whether the model has

> > > selected too many or too few variables really is a diagnostic matter

> that

> > > has little to do with answering the question of whether the model is

> > > working well.

> > >

> >

>
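Dan's per-day combination scheme quoted above can be sketched in a few lines. The linear alpha schedule (0 for the most recent day, growing with age) and the final averaging are assumptions filled in from his description, not an established method:

```python
def smoothed_probability(daily_ps):
    """daily_ps: per-day probability estimates for a pair, most recent first.

    Each day's estimate p is pulled toward the uninformative value 0.5
    by an age-dependent alpha (alpha * 0.5 + (1 - alpha) * p), then the
    shrunken estimates are averaged into one "smoothed" probability.
    """
    n = len(daily_ps)
    shrunk = []
    for age, p in enumerate(daily_ps):
        alpha = age / n  # assumed schedule: 0 today, rising for older days
        shrunk.append(alpha * 0.5 + (1 - alpha) * p)
    return sum(shrunk) / n

# Example: a strong signal today, weaker signals on the two previous days.
print(smoothed_probability([0.9, 0.8, 0.6]))
```

Note that this inherits the problems Ted raises below: the shrunken values are still averaged as if the per-day estimates were comparable, which is exactly the kind of cross-day comparison he warns against.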

On Fri, Jun 21, 2013 at 10:59 AM, Dan Filimon

<[EMAIL PROTECTED]>wrote:

> Could you be more explicit?

> What models are these, how do I use them to track how similar two items

> are?

>

Luduan document classification.

Recommendation systems.

Adaptive search engines.

The question of how similar items are is much harder to attack than the

question of roughly which items are very similar. You can deal with the

most related, but in the mid-range even the ordering is very fuzzy.

I'm essentially working with a custom-tailored RowSimilarityJob after

> filtering out users with too many items first.

>

Not that it much matters, but I tend to filter out user × item entries based on

the item *and* the user prevalence. This gives me a nicely bounded number

of occurrences for every user and every item.

If you don't want to count the item frequency in advance, then just

down-sampling crazy users is fine.

The reason that it doesn't much matter is that very few elements are

filtered out.
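The filtering Ted describes can be sketched as a single pass that caps the number of interactions kept per user and per item. The cap values and the random order in which entries are considered are illustrative assumptions, not Mahout's actual code:

```python
import random

def downsample(interactions, max_per_user=300, max_per_item=300, seed=42):
    """interactions: list of (user, item) pairs.

    Returns a subset in which no user and no item contributes more than
    the given cap, so that prolific users and very popular items cannot
    dominate the cooccurrence counts.
    """
    entries = list(interactions)
    random.Random(seed).shuffle(entries)  # drop a random subset, not a biased one
    user_counts, item_counts, kept = {}, {}, []
    for user, item in entries:
        if (user_counts.get(user, 0) < max_per_user
                and item_counts.get(item, 0) < max_per_item):
            user_counts[user] = user_counts.get(user, 0) + 1
            item_counts[item] = item_counts.get(item, 0) + 1
            kept.append((user, item))
    return kept
```

If only the user-side cap is applied, this reduces to the "down-sampling crazy users" variant, which needs no advance pass over item frequencies.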

>

> On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning <[EMAIL PROTECTED]>

> wrote:

>

> > Well, you are still stuck with the problem that pulling more bits out of

> > the small count data is a bad idea.

> >

> > Most of the models that I am partial to never even honestly estimate

> > probabilities. They just include or exclude features and then weight

> rare

> > features higher than common.

> >

> > This is easy to do across days and very easy to have different days

> > contribute differently.

> >

> >

> >

> > On Fri, Jun 21, 2013 at 10:13 AM, Dan Filimon

> > <[EMAIL PROTECTED]>wrote:

> >

> > > The thing is there's no real model for which these are features.

> > > I'm looking for pairs of similar items (and eventually groups). I'd

> like

> > a

> > > probabilistic interpretation of how similar two items are. Something

> like

> > > "what is the probability that a user that likes one will also like the

> > > other?".

> > >

> > > Then, with these probabilities per day, I'd combine them over the

> course

> > of

> > > multiple days by "pulling" the older probabilities towards 0.5: alpha *

> > 0.5

> > > + (1 - alpha) * p would be the linear approach to combining this where

> > > alpha is 0 for the most recent day and larger for older ones. Then, I'd

> > > take the average of those estimates.

> > > The result would in my mind be a "smoothed" probability.

> > >

> > > Then, I'd get the top k per item from these.

> > >

> > >

> > >

> > > On Fri, Jun 21, 2013 at 11:45 AM, Ted Dunning <[EMAIL PROTECTED]>

> > > wrote:

> > >

> > > > On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon <

> > > [EMAIL PROTECTED]

> > > > >wrote:

> > > >

> > > > > Thanks for the reference! I'll take a look at chapter 7, but let me

> > > first

> > > > > describe what I'm trying to achieve.

> > > > >

> > > > > I'm trying to identify interesting pairs, the anomalous

> > co-occurrences

> > > > with

> > > > > the LLR. I'm doing this for a day's data and I want to keep the

> > > p-values.

> > > > > I then want to use the p-values to compute some overall probability

> > > over

> > > > > the course of multiple days to increase confidence in what I think

> > are

> > > > the

> > > > > interesting pairs.

> > > > >

> > > >

> > > > You can't reliably combine p-values this way (repeated comparisons

> and

> > > all

> > > > that).

> > > >

> > > > Also, in practice if you take the top 50-100 indicators of this sort

> > the

> > > > p-values will be so astronomically small that frequentist tests of

> > > > significance are ludicrous.

> > > >

> > > > That said, the assumptions underlying the tests are really a much

> > bigger

> > > > problem. The interesting problems of the world are often highly

> > > > non-stationary which can lead to all kinds of problems in

> interpreting

> > > > these results. What does it mean if something shows a 10^-20 p value

> > one

> > > > day and a 0.2 value the next? Are you going to multiply them? Or

> just

> > > say

> > > > that something isn't quite the same? But how do you avoid comparing

> > > > p-values in this case which is a famously bad practice.


> Not that it much matters, I tend to filter out user x item entries based on

> the item *and* the user prevalence. This gives me a nicely bounded number

> of occurrences for every user and every item.

I'd be interested in implementing this. Can you share a few more

details? Having another pass that counts item frequencies shouldn't hurt

much.

Best,

Sebastian


See https://github.com/tdunning/in-memory-cooccurrence for an in-memory

implementation.

Should just require three or so lines of code.

On Fri, Jun 21, 2013 at 11:23 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> > Not that it much matters, I tend to filter out user x item entries based

> on

> > the item *and* the user prevalence. This gives me a nicely bounded

> number

> > of occurrences for every user and every item.

>

> I'd be interested in implementing this. Can you share a few more

> details? Having another pass that counts item frequencies shouldn't hurt

> much.

>

> Best,

> Sebastian

>

>
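For readers following along, the raw LLR score discussed throughout this thread, and the chi-square p-value Dan originally asked about, can be sketched from a 2×2 contingency table using the entropy formulation from Ted's blog post linked earlier in the thread. The helper names here are ours, not Mahout's API:

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy: N * H(p), as used in the LLR decomposition.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11 = co-occurrences, k12/k21 = one-sided counts, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

def p_value(llr_score):
    # Survival function of chi-square with 1 degree of freedom: P(X >= score).
    return math.erfc(math.sqrt(llr_score / 2.0))
```

With this, the "1 - p as similarity" idea from the start of the thread is `1 - p_value(llr(...))`; as Ted points out, for the top indicator pairs the p-values underflow to astronomically small numbers almost immediately, which is one practical reason to rank by the raw LLR score instead.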
