Subject: Relevance score - Classification

You have both a training problem and a scalability problem here.

The training problem is that to build a classifier, you have to make
fairly gross assumptions about which features matter and what weights they
should have.  For most text retrieval systems, this is done by taking the
user's query, (possibly) adding a few extra terms, and then assuming that
this set of terms is the feature set for your classifier.  Weights are
generally either derived heuristically using something like IDF or simply
ignored in favor of some other relevance score like PageRank or other
document quality measures.  A middle road can be taken in which these
different scores are combined.
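As a toy sketch of that middle road (all data and the 0.8/0.2 blend below
are invented for illustration, not taken from any real system): query terms
act as the features, IDF supplies heuristic weights, and a per-document
quality score stands in for something like PageRank.

```python
import math
from collections import Counter

# Tiny invented corpus with hypothetical per-document quality scores.
docs = {
    "d1": "mahout naive bayes classifier for text",
    "d2": "text retrieval with an inverted index",
    "d3": "cooking recipes and kitchen tips",
}
quality = {"d1": 0.9, "d2": 0.7, "d3": 0.5}

def idf(term):
    # Smoothed inverse document frequency over the toy corpus.
    n = sum(1 for text in docs.values() if term in text.split())
    return math.log((1 + len(docs)) / (1 + n))

def score(query, doc_id, alpha=0.8):
    # Linear combination of a TF-IDF text score and the quality score;
    # alpha is an arbitrary illustrative mixing weight.
    tf = Counter(docs[doc_id].split())
    text_score = sum(tf[t] * idf(t) for t in query.split())
    return alpha * text_score + (1 - alpha) * quality[doc_id]

ranked = sorted(docs, key=lambda d: score("text classifier", d), reverse=True)
```

Here `ranked` puts the document matching both query terms first, the
partial match second, and the non-match (carried only by its quality
score) last.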

In contrast, for most document classifiers, relevant terms are derived by
examining a set of training documents that are labeled as positive and
negative relative to the question of interest.
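A minimal sketch of that second kind of classifier, assuming a hand-rolled
multinomial Naive Bayes with Laplace smoothing over an invented two-class
training set (the corpus and labels are made up for illustration):

```python
import math
from collections import Counter, defaultdict

# Invented labeled training documents: term weights are learned from these
# rather than taken from a user's query.
train = [
    ("graphics rendering opengl shader", "pos"),
    ("opengl texture graphics pipeline", "pos"),
    ("stock market trading prices", "neg"),
    ("market economy trading news", "neg"),
]

class_counts = Counter(label for _, label in train)
term_counts = defaultdict(Counter)
for text, label in train:
    term_counts[label].update(text.split())
vocab = {t for counts in term_counts.values() for t in counts}

def log_posterior(text, label):
    # log P(label) + sum of log P(term | label) with add-one smoothing.
    prior = math.log(class_counts[label] / len(train))
    total = sum(term_counts[label].values())
    likelihood = sum(
        math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        for t in text.split()
    )
    return prior + likelihood

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(text, c))
```

The learned per-term log probabilities play the role that the heuristic
IDF weights played in the retrieval setting.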

In your application, you don't have training data of the sort needed to
build the second kind of classifier, so you need to build the first kind.
But that is just the same as saying you should use a normal text retrieval
system.

The second issue is one of how the computations are arranged.  With both
kinds of systems, the computational problem is shaped just like approximate
sparse matrix multiplication, but in the text retrieval system,
considerable knowledge is used to avoid computations that cannot affect the
final retrieval result.  With a straightforward implementation using text
classifiers, you need to evaluate the classifier for every document.  This
cannot scale as well as text retrieval, simply because you have to read
data for far more documents.

It is possible to combine these two approaches and only evaluate the
classifier on documents that contain the terms with non-zero weights in
the classifier, but the number of terms involved makes the inverted
index much less effective at avoiding work.
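The combined approach above can be sketched like this (toy data; a real
system would also prune postings by weight magnitude rather than touch
every posting list in full):

```python
from collections import defaultdict

# Tiny invented corpus.
docs = {
    "d1": "mahout naive bayes classifier",
    "d2": "inverted index retrieval",
    "d3": "kitchen recipes",
}

# Build an inverted index: term -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Hypothetical linear classifier: only these terms have non-zero weight.
weights = {"classifier": 2.0, "index": 1.0}

# Candidate set: union of the posting lists for the non-zero features.
# Documents outside this set cannot affect the ranking, so they are
# never read -- the work saved shrinks as the feature set grows.
candidates = set().union(*(index[t] for t in weights if t in index))

scores = {
    d: sum(w for t, w in weights.items() if t in docs[d].split())
    for d in candidates
}
```

With only two weighted terms, one of the three documents is skipped
entirely; with thousands of weighted terms, nearly every document lands in
the candidate set and the index stops saving work.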

So how did you plan to derive the features and weights for the classifiers
you mention?
On Wed, Nov 23, 2011 at 10:42 PM, Faizan(Aroha) <[EMAIL PROTECTED]> wrote:

> We are trying to implement relevant search (using machine learning) at a
> website where we have 3 million visitors a week and 150k blog posts a
> single day.
> We are currently in the planning phase, so we are trying several different
> approaches.
> I will take the newsgroup dataset example to explain my situation:
> Let's say we apply the classifier on a new document X that may belong to
> "", and we know that 397 documents in our collection have been correctly
> identified by the classifier.
> When we apply the classifier on X, it should bring back a list of
> documents sorted so that the topmost document is the most relevant to the
> query (document X) and the last document is the least relevant one.
> In order to do the above, we need to devise a way to use these
> classifiers for information retrieval.
> The classifier should be used as a retrieval algorithm: it will first
> compute relevance scores for all the documents and produce a ranking. When
> that retrieval algorithm is applied to an individual query document, it
> will bring back a set of documents sorted so that the topmost documents
> are the most relevant to the blog post and the last document is the least
> relevant one.
> This is a little background.
> Thanks.
> -----Original Message-----
> From: Isabel Drost [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, November 24, 2011 1:47 AM
> Subject: Re: Relevance score - Classification
> On 23.11.2011 Faizan(Aroha) wrote:
> > We are working on using Classification as a Search.
> >
> > I want to compute the relevance score of the output which is generated
> > by the Naive Bayes Classifier or some other classifier.
> >
> > Please give any guideline/hint!
> Can you please provide some more background to your use case? Which