Hello everyone, I am working on a schema design for a time-series database. It is something very similar to Twitter, where people can follow each other and see each other's posts. I've looked at OpenTSDB, but I think my problem is more complicated because I don't have the leading "metricid" in the row key. I've made several attempts so far, but I am not happy with the performance.
1. Md5(user)+timestamp. The problem with this is that when I want to query the feed, I have to do a scan from the lowest user to the highest (in alphabetical order) and then add a column filter. Getting the next batch is hard.
2. Md5(user)+day, and then put the posts of the day in the columns, with the timestamp in the qualifier name. Not optimal; getting the next batch is hard.
So... What do you guys think? Any ideas for making this efficient or possible?
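To make attempt 1 concrete, here is a minimal sketch of the key layout and of what a feed read looks like with it. It is an in-memory stand-in, not HBase client code: the sorted list plays the role of the table, and `row_key`, `user_range` and `feed` are hypothetical helper names. Reversing the timestamp (so the newest post sorts first) is an added assumption, not something stated in the question.

```python
import hashlib
import heapq
import struct
from bisect import bisect_left
from itertools import islice

MAX_TS = 2**63 - 1  # largest signed 64-bit value, used to reverse timestamps


def row_key(user, ts_millis):
    """md5(user) (16 bytes) followed by a reversed big-endian timestamp (8 bytes)."""
    prefix = hashlib.md5(user.encode("utf-8")).digest()
    # Assumption: MAX_TS - ts makes newer posts sort before older ones.
    return prefix + struct.pack(">q", MAX_TS - ts_millis)


def user_range(user):
    """Start (inclusive) and stop (exclusive) keys covering one user's posts."""
    prefix = hashlib.md5(user.encode("utf-8")).digest()
    return prefix + b"\x00" * 8, prefix + b"\xff" * 8


def feed(sorted_rows, followed, limit):
    """sorted_rows: list of (row_key, post) sorted by key, standing in for the table."""
    keys = [key for key, _ in sorted_rows]
    per_user = []
    for user in followed:
        start, stop = user_range(user)
        lo, hi = bisect_left(keys, start), bisect_left(keys, stop)
        per_user.append(sorted_rows[lo:hi])  # one range "scan" per followed user
    # Each slice is already newest-first; merge on the reversed-timestamp suffix.
    merged = heapq.merge(*per_user, key=lambda row: row[0][16:])
    return [post for _, post in islice(merged, limit)]
```

With this layout each followed user is a contiguous key range, so a feed for k followed users costs k range scans plus a merge, which is why fetching the next batch gets awkward as k grows.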
Hi Amandeep, thanks for that. I've read it already, as well as your book. But I believe it doesn't discuss the "feed" problem, i.e. getting the latest posts from the 2k people I follow. That, for me, is the challenge. I would really appreciate any ideas. Thank you. On Wed, 1 Jul 2015 at 7:57 pm Amandeep Khurana <[EMAIL PROTECTED]> wrote:
Thanks Stack, looks like a good read. Vladimir, I called it time-series because ordering by time and filtering by the tweet owner is the goal. To answer your questions, let's for now assume that it's not as massive as Twitter, because otherwise it will be very complicated, as you mentioned. So:
1. How many updates per second in the system? We never mutate data; we write 500 tweets/sec.
2. How many users? 10,000.
3. Average # of followers per user? 250 users.
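For a sense of scale, the three figures above can be turned into a few derived rates. This is back-of-the-envelope arithmetic only; it assumes every tweet is fanned out to the average number of followers, and the variable names are made up for illustration.

```python
TWEETS_PER_SEC = 500
USERS = 10_000
AVG_FOLLOWERS = 250

# Every follow edge is counted once as "follower" and once as "following",
# so the average number of accounts followed equals the average follower count.
follow_edges = USERS * AVG_FOLLOWERS                     # 2,500,000 edges
avg_following = follow_edges / USERS                     # 250.0 accounts per reader

# Fan-out on write: each tweet is copied into every follower's feed index.
fanout_writes_per_sec = TWEETS_PER_SEC * AVG_FOLLOWERS   # 125,000 writes/sec

# Fan-out on read: each feed request touches one range per followed account.
ranges_per_feed_read = avg_following                     # 250 ranges on average

tweets_per_day = TWEETS_PER_SEC * 86_400                 # 43,200,000 tweets/day
```

So even at this size, a pure fan-out-on-write design multiplies the write rate by 250, while a pure fan-out-on-read design needs about 250 range reads per feed request on average.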
Even with these modest numbers, it is still tricky to design a schema that is highly optimised for reads. Any thoughts? Thanks. On Wed, Jul 1, 2015 at 11:36 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
As I said already, this is not a "how to design time-series data in HBase" kind of question.
Usually, to fight the skew in follower and following counts (some users have 1M followers, others follow 1M users), one might identify the top X persons who have more than N followers and the top Y persons who follow more than M users.
For X: we keep all updates in memory for the last N days and distribute/replicate them across the cluster.
For Y: for every user who tweets, we check his followers, and if some of them belong to group Y, we do an in-place update for that follower.
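As a rough illustration of the grouping step, the two groups can be derived from the follow graph with a pair of thresholds. This is only a sketch: `classify` and the edge-list input are hypothetical, and it applies the thresholds N and M directly without capping the groups at a "top X" or "top Y" size.

```python
from collections import Counter


def classify(follow_edges, n, m):
    """follow_edges: iterable of (follower, followee) pairs.

    Returns (group_x, group_y): users with more than n followers,
    and users who follow more than m others.
    """
    follow_edges = list(follow_edges)
    follower_counts = Counter(followee for _, followee in follow_edges)
    following_counts = Counter(follower for follower, _ in follow_edges)
    group_x = {user for user, count in follower_counts.items() if count > n}
    group_y = {user for user, count in following_counts.items() if count > m}
    return group_x, group_y
```

For example, `classify([("a", "star"), ("b", "star"), ("c", "star"), ("a", "b"), ("a", "c")], n=2, m=2)` puts "star" in group X and "a" in group Y.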
1. All users not from Y will always read updates one-by-one from the data store.
2. Users from group Y always have all updates indexed and read them using one scan operation (in HBase lingo).
3. Users from group X cache their tweets in a fast memory store with replication: they have an extreme # of followers and their tweets are the hottest.
4. Users not from group X (99%) store tweets directly into HBase. Besides this, if some of their followers are from group Y, their index is updated (tweets are stored directly into the Y user's record).
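The four points above can be sketched as a small in-memory model of the write and read paths. Everything here is a stand-in: the dictionaries play the roles of the replicated memory store, the HBase table and the per-reader index, and `HybridFeedStore` is a hypothetical name. One assumption goes beyond the text: since tweets from group X authors are not written into the Y index, a group Y reader in this sketch also consults the memory store for any X authors they follow.

```python
import heapq
from collections import defaultdict
from itertools import islice


class HybridFeedStore:
    def __init__(self, group_x, group_y, followers):
        self.group_x = set(group_x)          # authors with very many followers
        self.group_y = set(group_y)          # readers who follow very many users
        self.followers = followers           # author -> set of follower names
        self.hot_cache = defaultdict(list)   # stand-in for the replicated memory store
        self.store = defaultdict(list)       # stand-in for per-author rows in HBase
        self.y_index = defaultdict(list)     # stand-in for the per-reader index of Y users

    def post(self, author, ts, text):
        entry = (ts, author, text)
        if author in self.group_x:
            self.hot_cache[author].append(entry)      # point 3
            return
        self.store[author].append(entry)              # point 4
        for follower in self.followers.get(author, ()):
            if follower in self.group_y:
                self.y_index[follower].append(entry)  # in-place index update

    def feed(self, reader, following, limit):
        hot = [self.hot_cache[a] for a in following if a in self.group_x]
        if reader in self.group_y:
            # Point 2: one read of the reader's own index (plus hot authors, by assumption).
            sources = [self.y_index[reader]] + hot
        else:
            # Point 1: one read per followed author.
            sources = [self.store[a] for a in following if a not in self.group_x] + hot
        newest_first = (sorted(source, reverse=True) for source in sources)
        merged = heapq.merge(*newest_first, reverse=True)
        return [text for _, _, text in islice(merged, limit)]
```

The trade-off this makes visible: only group Y readers pay the extra write amplification, and only group X authors occupy memory, while everyone else uses the plain per-author layout.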
If you can implement something using only HBase - great, let us know ;)
-Vlad On Wed, Jul 1, 2015 at 4:09 PM, Sleiman Jneidi <[EMAIL PROTECTED]> wrote: