Steven Raemaekers

2011-06-15, 14:51

Svetlomir Kasabov

2011-06-15, 16:56

Ted Dunning

2011-06-15, 18:44

Svetlomir Kasabov

2011-06-16, 09:22

Ted Dunning

2011-06-16, 22:19

Ted Dunning

2011-06-16, 22:20

Hello,

Currently I'm working on a classifier that assigns documents written in different programming languages to the correct category. I created a training set and a test set, and I get a confusion table as a result. This is nice, but the program does not supply any probability or uncertainty that a particular file belongs to a certain category; it only returns whether or not a single file belongs to a category. Because it is a Bayesian algorithm, probabilities must be involved somehow.

What I would like to have, for a single input file, is the probability of that file belonging to each category, for instance like this:

C: 25%

C++: 50%

Java: 25%

The classifyDocument method in the BayesAlgorithm class does return numbers, but these are not really probabilities, since they do not add up to 1.

According to the Javadoc, these numbers are dot products between the vector of this document and the training set.

So my question is: is it possible to convert the numbers as stored in ClassifierResult and calculated in BayesAlgorithm.classifyDocument into some kind of probability?
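[Editor's note: since the scores are on a log-likelihood-like scale, one rough way to turn them into a per-category distribution is a softmax normalization. The sketch below is not Mahout API; the class name is made up for illustration. Softmax outputs add up to 1, but they are only relative weights, not calibrated probabilities.]

```java
// Sketch: normalize per-class log-scores into a distribution that sums to 1.
// NOTE: this yields *relative* weights, not calibrated probabilities.
public class ScoreSoftmax {

    /** Softmax with max-subtraction for numerical stability. */
    public static double[] softmax(double[] logScores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        double[] p = new double[logScores.length];
        for (int i = 0; i < logScores.length; i++) {
            p[i] = Math.exp(logScores[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }
}
```

Feeding the three per-category scores through this would give numbers that add up to 1, but whether they match empirical frequencies depends entirely on how well the model's assumptions hold.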

Regards,

Steven

--

Software Improvement Group

www.sig.eu

We would like to invite you to complete our survey on the Awareness of Green Software.

It will take you less than 10 minutes.

Link to survey: http://bit.ly/kfWGZM


Hello Steven,

I've asked this question too:

http://mail-archives.apache.org/mod_mbox/mahout-user/201105.mbox/%[EMAIL PROTECTED]%3E

Unfortunately, Mahout's Naive Bayes implementation can't calculate probabilities. You are probably really astonished now - I couldn't believe it either when I read that (I find this somewhat strange, since the main concept of Bayes is probability calculation). It's a pity that such a great framework like Mahout has restricted the Bayesian concept in this way. In addition, Naive Bayes is (as far as I know) text-oriented only; you can apply it only to documents. Mahout is still wonderful, though, because it lets us calculate probabilities using Logistic Regression.

That's why I switched to Mahout's Logistic Regression implementation: OnlineLogisticRegression.java#classifyScalar() returns a probability. Logistic Regression also has the advantage that it can handle continuous values directly, while for the Bayes classifier you have to discretize the data first.

You can try the class TrainLogisticTest.java from mahout-examples to see how it works. See also the calculation of the probability in TrainLogistic.java:

double p = lr.classifyScalar(input);
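[Editor's note: for readers wondering why classifyScalar can return a probability at all: logistic regression passes a linear score through the logistic (sigmoid) function, which maps any real number into (0, 1). A minimal standalone sketch of that idea, not Mahout code:]

```java
// Sketch of the core of binary logistic regression:
// a linear score w.x is squashed into (0, 1) by the sigmoid.
public class LogisticSketch {

    public static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * x[i];
        }
        return s;
    }

    /** Sigmoid: maps any real score to a value strictly between 0 and 1. */
    public static double sigmoid(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** P(class = 1 | x) under the logistic model. */
    public static double classifyScalar(double[] w, double[] x) {
        return sigmoid(dot(w, x));
    }
}
```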

On 15.06.2011 16:51, Steven Raemaekers wrote:



This is why the term Naive is in the name. The scores for this kind of algorithm are between 0 and 1, or are logarithms of such numbers, but they are not at all calibrated probabilities.

And, frankly, it is rare in practice for the output of logistic regression

to be calibrated either. Those outputs are much more like probabilities,

but they still have some issues.

On Wed, Jun 15, 2011 at 6:56 PM, Svetlomir Kasabov <

[EMAIL PROTECTED]> wrote:



Hello Ted,

what are the main issues with the probability estimates of logistic regression? I am developing a medical application that makes probability estimates using time series, and that's why I would like to know whether they are critical for me.

Many thanks and best regards,

Svetlomir.

On 15.06.2011 20:44, Ted Dunning wrote:



The problem is that logistic regression makes some assumptions that are

unrealistic in practice. That leads to uncalibrated probabilities in

certain cases.

This is particularly true where variable interactions are strong.

The good news is that logistic regression gives you the best answer that is

possible with a generalized linear model.
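[Editor's note: whether a model's outputs behave like probabilities can be checked empirically with a reliability (calibration) table: bucket predictions by predicted probability and compare against the observed fraction of positives in each bucket. A small sketch under my own naming, not from any library:]

```java
// Sketch: compare mean predicted probability vs. observed positive rate per bin.
// A well-calibrated model has these two numbers close in every bin.
public class CalibrationCheck {

    /**
     * Returns, for each of nBins equal-width probability bins:
     * {mean predicted probability, observed positive fraction, count}.
     */
    public static double[][] reliabilityBins(double[] predicted, int[] actual, int nBins) {
        double[] sumPred = new double[nBins];
        double[] sumPos = new double[nBins];
        double[] count = new double[nBins];
        for (int i = 0; i < predicted.length; i++) {
            int b = Math.min((int) (predicted[i] * nBins), nBins - 1);
            sumPred[b] += predicted[i];
            sumPos[b] += actual[i];
            count[b] += 1;
        }
        double[][] bins = new double[nBins][3];
        for (int b = 0; b < nBins; b++) {
            bins[b][0] = count[b] > 0 ? sumPred[b] / count[b] : Double.NaN;
            bins[b][1] = count[b] > 0 ? sumPos[b] / count[b] : Double.NaN;
            bins[b][2] = count[b];
        }
        return bins;
    }
}
```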

On Thu, Jun 16, 2011 at 11:22 AM, Svetlomir Kasabov <

[EMAIL PROTECTED]> wrote:



I should add that the regularization will also make the logistic regression

classifier a little bit conservative about estimating probabilities near 0

or near 1.
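[Editor's note: the effect described here can be seen directly: shrinking the weight vector toward zero, which is what L2 regularization does, pulls the sigmoid output toward 0.5 and away from the extremes. A toy sketch of my own:]

```java
// Sketch: scaling weights toward zero (the effect of L2 regularization)
// moves logistic predictions toward 0.5, i.e. away from 0 and 1.
public class ShrinkageEffect {

    public static double sigmoid(double s) {
        return 1.0 / (1.0 + Math.exp(-s));
    }

    /** Prediction with the linear score scaled by shrink (0 < shrink <= 1). */
    public static double shrunkPrediction(double score, double shrink) {
        return sigmoid(shrink * score);
    }
}
```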

On Fri, Jun 17, 2011 at 12:19 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

