Subject: Setting up a recommender


Not following so…

Here is what I've done, in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout type IDs, keeping a map of IDs
3) run the Mahout Item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity, computed with matrix math
5) given the two similarity matrices and a user history matrix, I am writing them to csv files with the Mahout IDs replaced by the original string external IDs for users and items

input log file before splitting:
u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet

Input user history DRM after ID translation to Mahout IDs and splitting for action "purchase":

B user/item iphone ipad nexus-tablet galaxy surface
u1 1 1 0 0 0
u2 0 0 1 1 0
u3 0 0 0 0 1
u4 1 1 0 0 0

Map of IDs Mahout to Original/External
0 -> iphone
1 -> ipad
2 -> nexus-tablet
3 -> galaxy
4 -> surface
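
Steps 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration, not the actual job: the function names (split_by_action, build_id_map, history_matrix) are made up for the example, the log is assumed to be whitespace-delimited "user action item" triples, and IDs are assumed to be handed out in order of first appearance.

```python
from collections import defaultdict

# The sample log from above, one "user action item" triple per line.
LOG = """\
u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet
"""


def split_by_action(log_text):
    """Group (user, item) pairs by action name."""
    by_action = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        user, action, item = line.split()
        by_action[action].append((user, item))
    return by_action


def build_id_map(values):
    """Assign sequential integer IDs in order of first appearance (an assumption)."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids))
    return ids


def history_matrix(pairs, user_ids, item_ids):
    """Build a dense 0/1 user-by-item matrix for one action."""
    rows = [[0] * len(item_ids) for _ in user_ids]
    for user, item in pairs:
        rows[user_ids[user]][item_ids[item]] = 1
    return rows


by_action = split_by_action(LOG)
triples = [line.split() for line in LOG.splitlines() if line.strip()]
user_ids = build_id_map(t[0] for t in triples)
item_ids = build_id_map(t[2] for t in triples)

# B is the "purchase" history matrix shown above.
B = history_matrix(by_action["purchase"], user_ids, item_ids)
for user, row in zip(user_ids, B):
    print(user, *row)
```

The same history_matrix call with by_action["view"] gives the second action's matrix.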

To be specific the DRM from the RecommenderJob with item-item similarities using LLR looks like this:
Input Path: out/p-recs/sims/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}

This will be written to a directory for later Solr indexing as a csv of the form:
item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,
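
Roughly, the translation from the DRM rows to those csv lines looks like the sketch below. It assumes the rows have already been read out of the sequence file into plain dicts (SIMS and ID_TO_ITEM just restate the dump and the ID map above), and the helper names are hypothetical. The cross-action column is left empty, as in the sample output.

```python
# Mahout ID -> external ID, as in the map above.
ID_TO_ITEM = {0: "iphone", 1: "ipad", 2: "nexus-tablet", 3: "galaxy", 4: "surface"}

# Item-item LLR similarities, copied from the DRM dump above.
SIMS = {
    0: {1: 0.8472157541208549},
    1: {0: 0.8472157541208549},
    2: {3: 0.8181382096075936},
    3: {2: 0.8181382096075936},
    4: {},
}


def field(row, id_to_item):
    """Space-delimited external IDs, strongest similarity first."""
    ordered = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(id_to_item[item_id] for item_id, _ in ordered)


def drm_to_csv_lines(sims, cross_sims, id_to_item):
    """One csv line per item: item_id,similar_items,cross_action_similar_items."""
    lines = ["item_id,similar_items,cross_action_similar_items"]
    for item_id in sorted(sims):
        similar = field(sims[item_id], id_to_item)
        cross = field(cross_sims.get(item_id, {}), id_to_item)
        lines.append(f"{id_to_item[item_id]},{similar},{cross}")
    return lines


csv_lines = drm_to_csv_lines(SIMS, {}, ID_TO_ITEM)
print("\n".join(csv_lines))
```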

Using a user's history vector as the query, the results you get back are the recommendations.
So if the user is u1, the history vector is:
"iphone ipad"

The Solr results for the query "iphone ipad" against the field "similar_items" will be:
1. Doc ID, ipad
2. Doc ID, iphone
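
To make the matching concrete, here is a toy stand-in for that query in plain Python. It is not Solr and uses no Solr API: DOCS mirrors the csv above, and recommend is a hypothetical helper that returns every doc whose "similar_items" field mentions any term of the history. Real Solr would also rank the hits by its relevance score, which this sketch does not attempt.

```python
# One "doc" per item; the value is its similar_items field from the csv.
DOCS = {
    "iphone": ["ipad"],
    "ipad": ["iphone"],
    "nexus-tablet": ["galaxy"],
    "galaxy": ["nexus-tablet"],
    "surface": [],
}


def recommend(history, docs):
    """Return IDs of docs whose similar_items field shares a term with the history."""
    terms = set(history.split())
    return {doc_id for doc_id, similar in docs.items() if terms & set(similar)}


u1_recs = recommend("iphone ipad", DOCS)
print(sorted(u1_recs))
```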

If you want item similarities, for instance when a user is anonymous with no history and is looking at the iphone product page, you would fetch the doc for id = "iphone" and get:
"ipad"

Perhaps a bad example for ordering, since there is only one ID in the doc, but the items in the "similar_items" field would be ordered by similarity strength.

Likewise for the cross-action similarities, though there the matrix will have cooccurrence [B'A] values in the DRM.
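
As a worked example of [B'A] on the sample data, assuming B is the "purchase" history matrix and A is the "view" history matrix (both users by items, built from the log above), the raw cooccurrence counts come out as below. This is a plain-Python sketch of the matrix product only; it does no thresholding or LLR weighting, and the helper names are illustrative.

```python
ITEMS = ["iphone", "ipad", "nexus-tablet", "galaxy", "surface"]

# Rows are u1..u4, columns follow ITEMS.
B = [  # purchases
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
A = [  # views
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


# Entry [i][j] counts users who purchased item i and viewed item j.
BtA = matmul(transpose(B), A)
for item, row in zip(ITEMS, BtA):
    print(item, row)
```

On this data the iphone row is [2, 2, 2, 1, 0]: the two iphone purchasers (u1, u4) both viewed iphone, ipad and nexus-tablet, and only u1 viewed galaxy.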

For item similarities there is no need to do more than fetch one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender, but please correct me if something above is wrong.

On Jul 31, 2013, at 4:52 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

Pat,

See inline
On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel <[EMAIL PROTECTED]> wrote:

Right.  Doesn't matter what format.  Might want quotes around
space-delimited lists, but anything will do.

I always say "dither" so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.
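
The two access patterns can be compared on the toy data with the sketch below. It only counts index round trips in an in-memory dict (DOCS mirrors the csv earlier in the thread; the function names are made up) and says nothing about real Solr timings, which will depend on the deployment. The two approaches return the same items here because the similarity rows are symmetric.

```python
# One "doc" per item; the value is its similar_items field.
DOCS = {
    "iphone": ["ipad"],
    "ipad": ["iphone"],
    "nexus-tablet": ["galaxy"],
    "galaxy": ["nexus-tablet"],
    "surface": [],
}


def similar_by_fetch(center, docs):
    """Fetch the center doc, then fetch each doc it references."""
    lookups = 1  # fetching the center doc
    fetched = []
    for item in docs[center]:
        _ = docs[item]  # a second round trip per referenced item
        lookups += 1
        fetched.append(item)
    return fetched, lookups


def similar_by_search(center, docs):
    """One search for docs whose similar_items field mentions the center item."""
    hits = [doc_id for doc_id, similar in docs.items() if center in similar]
    return hits, 1


fetch_result = similar_by_fetch("iphone", DOCS)
search_result = similar_by_search("iphone", DOCS)
print(fetch_result, search_result)
```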