The issue of offline tests is often misunderstood, I suspect. While I agree with Ted, it might help to explain a bit.
For myself, I'd say offline testing is a requirement, but not for comparing two disparate recommenders. Companies like Amazon and Netflix, as well as others on record, have a workflow that includes offline testing and comparison against previous versions of their own code and their own gold data set. These comparisons can be quite useful, if only in pointing to otherwise obscure bugs. If they see a difference between two offline tests, they ask why. Then, when they think they have an optimal solution, they run A/B tests as challenger/champion competitions, and it's these that are the only reliable measure of goodness.
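To make the challenger/champion idea concrete, here is a minimal sketch of how one might check whether a challenger's result in an A/B test is more than noise. It uses a standard two-proportion z-test. The function name, the choice of "conversions" (say, clicks on a recommendation) as the metric, and the counts in the usage lines are all assumptions for illustration, not anything taken from the workflows mentioned above.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of champion (a) and challenger (b).

    Returns (z, p_value) for a two-sided pooled two-proportion z-test.
    All names and the conversion metric are hypothetical.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one user")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (everyone or no one converted): nothing to tell apart.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: champion converts 100 of 1000 users, challenger 130 of 1000.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A positive z with a small p-value would suggest the challenger is really ahead; with low traffic the test will rarely get there, which is one reason a new site may have to wait before A/B results mean much.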
I do agree that comparing two recommenders with offline tests is dubious at best, as the paper points out. But put yourself in the place of a company new to recommenders that has several to choose from, maybe even versions of the same recommender with different tuning parameters. Run the offline tests on a standard set of your own data and pick the best one to start with. What other choice do you have? Maybe flexibility or architecture trumps the offline tests; if not, using them is better than a random choice. Take this result with a grain of salt, though, and get ready to A/B test later challengers when or if you have time.
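As a rough sketch of what "do the offline tests and pick the best to start with" could look like, the snippet below scores each candidate by mean precision@k against held-out data and picks the top scorer. Precision@k is only one possible metric and is my assumption here; the function names, the shape of the held-out data (user id mapped to the set of items the user actually interacted with), and the toy candidates are all hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the held-out set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def mean_precision_at_k(recommend, held_out, k=10):
    """Average precision@k over all users in the held-out data.

    recommend: callable taking a user id and returning a ranked list of items.
    held_out:  dict mapping user id -> set of items held back for testing.
    """
    if not held_out:
        return 0.0
    scores = [precision_at_k(recommend(user), relevant, k)
              for user, relevant in held_out.items()]
    return sum(scores) / len(scores)


def pick_best(candidates, held_out, k=10):
    """Score every candidate on the same data; return (best_name, all_scores)."""
    scores = {name: mean_precision_at_k(fn, held_out, k)
              for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: two users and two hypothetical tunings of the same recommender.
held_out = {"u1": {"a", "b"}, "u2": {"c"}}
tuned_recs = {"u1": ["a", "b"], "u2": ["c", "y"]}
candidates = {
    "default": lambda user: ["a", "x"],       # same list for everyone
    "tuned": lambda user: tuned_recs[user],   # per-user lists
}
best, scores = pick_best(candidates, held_out, k=2)
print(best, scores)
```

The score only says which candidate best reproduces past behavior in your own data, so it is a way to pick a starting champion, not a verdict on which recommender is better.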
The Solr recommender is extremely flexible and online (realtime results). For me, these features trump any offline tests against alternatives. But the demo site will include offline Mahout recommendations for comparison and, in the unlikely event that it gets any traffic, will incorporate A/B tests.
On Oct 9, 2013, at 4:29 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov <[EMAIL PROTECTED]> wrote:
BTW, lest we forget, this does not imply the Solr-recommender is better than Myrrix or the Mahout-only recommenders. There needs to be some careful comparison of results. Michael, did you do offline or A/B tests during your implementation?
I ran some offline tests using our historical data, but I don't have a lot of faith in these beyond the fact they indicate we didn't make any obvious implementation errors. We haven't attempted A/B testing yet since our site is so new, and we need to get a meaningful baseline going and sort out a lot of other more pressing issues on the site - recommendations are only one piece, albeit an important one.
Actually there was an interesting idea for an article posted recently about the difficulty of comparing results across systems in this field: http://www.docear.org/2013/09/23/research-paper-recommender-system-evaluation-a-quantitative-literature-survey/
but that's no excuse not to do better. I'll certainly share when I know more :)
I tend to be a pessimist with regard to off-line evaluation. It is fine to do, but if a system is anywhere near best, I think that it should be considered for A/B testing.