Subject: Multidimensional log-likelihood similarity


Yes.  You can turn the normal item-item relationships around to get this.

What you have is an item x feature matrix.  Normally, cooccurrence analysis
starts from a user x item matrix and produces an item x item matrix.

If you consider the features to be "users" in the computation, then the
resulting indicator matrix would be just what you want.

The basic idea is that items are related if they share features.  Two
items that have the same feature are said to co-occur on that feature.
Finding anomalous cooccurrence is then how you find items that co-occur
on unusually many features.
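
As a rough illustration of what "co-occur on a feature" means, here is a
minimal Python sketch.  It assumes the item x feature data is held in memory
as a dict of sets; the item names, feature names and the function
cooccurrence_counts are made up for the example.

```python
from itertools import combinations


def cooccurrence_counts(item_features):
    """Count shared features for every pair of items.

    item_features maps an item id to the set of features it has
    (a hypothetical in-memory form of the item x feature matrix).
    """
    counts = {}
    for a, b in combinations(sorted(item_features), 2):
        shared = len(item_features[a] & item_features[b])
        if shared:
            counts[(a, b)] = shared
    return counts


# Toy data, invented for illustration only.
items = {
    "apple": {"red", "round", "sweet"},
    "cherry": {"red", "round", "sweet", "small"},
    "brick": {"red", "heavy"},
}
print(cooccurrence_counts(items))
```

Raw counts like these favor items that simply have many features; the
anomaly test is what separates the interesting pairs from the rest.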

This works by building a small 2x2 matrix that relates item A and item B.
The entries are feature counts.  The upper left entry of the matrix is the
number of features that A and B both have, the upper right is the number of
features that B has that A does not, the lower left is the number that A
has that B does not, and the lower right is the number that neither has.
Put another way, the columns represent features that A has or does not
have respectively and the rows represent the features that B has or does
not have respectively.
Items that give high root log-likelihood ratio values should be considered
connected.  Those that have small values for root LLR should be considered
not connected.  The value of the root LLR should be discarded after
thresholding and should not be considered a measure of the strength of the
relationship.
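
The 2x2 table and the thresholding step can be sketched in Python as follows.
This is a minimal sketch using the usual entropy form of the log-likelihood
ratio (G^2) statistic, not a copy of any particular library's code.  The
function names, the n_features argument (total number of distinct features)
and the threshold of 3.0 are assumptions chosen for the example.

```python
import math


def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized entropy: N log N - sum(k log k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Signed square root of the LLR.

    Negative when A and B share fewer features than independence predicts.
    """
    root = math.sqrt(llr(k11, k12, k21, k22))
    return -root if k11 * k22 < k12 * k21 else root


def table(features_a, features_b, n_features):
    """Build the 2x2 counts: rows are B has / lacks, columns are A has / lacks."""
    k11 = len(features_a & features_b)   # both have
    k12 = len(features_b - features_a)   # B has, A does not
    k21 = len(features_a - features_b)   # A has, B does not
    k22 = n_features - k11 - k12 - k21   # neither has
    return k11, k12, k21, k22


def connected(features_a, features_b, n_features, threshold=3.0):
    """Keep only the yes/no answer; the score itself is thrown away."""
    return root_llr(*table(features_a, features_b, n_features)) >= threshold
```

Note that connected returns only a boolean, which matches the advice to
discard the root LLR value once the threshold has been applied.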

I would recommend the same down-sampling that the rowSimilarityJob already
does.
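
The exact down-sampling rule is not spelled out in this message, so the
following is only a generic sketch of the idea: cap how many items are kept
for any one feature (the feature plays the role of the "user" here), so that
very common features do not dominate the cost of pair counting.  The function
name downsample, the cap of 500 and the seed are assumptions for the example,
not values taken from rowSimilarityJob.

```python
import random


def downsample(feature_items, max_per_feature=500, seed=0):
    """Randomly cap the number of items kept for each feature.

    feature_items maps a feature to the set of items that have it.
    """
    rng = random.Random(seed)
    sampled = {}
    for feature, items in feature_items.items():
        ordered = sorted(items)  # fixed order so a given seed is repeatable
        if len(ordered) > max_per_feature:
            ordered = rng.sample(ordered, max_per_feature)
        sampled[feature] = set(ordered)
    return sampled
```

Features below the cap pass through unchanged, so rare features, which tend
to carry the most information, are not affected.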

On Sun, Sep 29, 2013 at 3:40 AM, Mridul Kapoor <[EMAIL PROTECTED]> wrote: