On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <[EMAIL PROTECTED]> wrote:
> But because I'm talking about very large scales, I guess that I want to
Actually, you aren't talking about all that large a scale. At Veoh, we
built our models from several billion interactions on a tiny cluster.
Repeating what I said earlier, the offline part produces item-item
information only. It does not produce KNN data for any users. There is no
reference to a user in the result.
All that happens here is that item => item* lists are combined.
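To make that concrete, here is a minimal sketch (assumed names, toy data; the real offline output would come from the Hadoop job) of how item => item* lists can be merged at query time to score recommendations for a user, with no per-user data ever computed offline:

```python
from collections import defaultdict

# Offline output: for each item, a list of (related_item, score) pairs.
# Toy example -- in practice this is the item-item model from the cluster.
similar_items = {
    "a": [("b", 3.0), ("c", 1.0)],
    "b": [("a", 3.0), ("d", 2.0)],
}

def recommend(recent_items, similar_items, top_n=10):
    """Merge the item => item* lists for a user's recent items.

    The user's history arrives at query time and is only used to
    look up and combine lists; the offline result never mentions a user.
    """
    scores = defaultdict(float)
    seen = set(recent_items)
    for item in recent_items:
        for related, score in similar_items.get(item, []):
            if related not in seen:  # don't recommend items already seen
                scores[related] += score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
```

For a user whose recent items are a and b, this merges the two lists, drops the already-seen items, and ranks d and c by combined score.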
This is correct.
Yes. You will need off-line and on-line machines if you want to have
serious guarantees about response times. And yes, you will need to do some
copying if you use standard Hadoop. If you use MapR's version of Hadoop,
you can serve data directly out of the cluster with no copying because you
can access files via NFS.
You don't need to store your data ONLY in an SQL database, and storing logs
in SQL is generally a mistake.
100x, roughly. SQL is generally not usable as the source for parallel