Thanks for the pointers. I am also thinking along similar lines.
I have a doubt about one point:
I will have a separate data file for every interval. For example, suppose I have a 5-minute interval file that contains data for 2 hours and 10 minutes. In this scenario I want to process the 2 hours of data with the hours job and the remaining 10 minutes of data with the minutes job. Since I will provide my data file as input to the MR jobs, I think the original file needs to be split into 2 files: HourFile and MinsFile. HourFile will contain the data for 2 hours and MinsFile will contain the data for 10 minutes.
I have implemented the file splitting with a simple Java class, but I think it involves too many I/O operations. If I can do this in MR as well, or in some other efficient way, that would be good, because the original data files can be huge and the initial splitting of the files will itself take too much time.
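
To make the split concrete, here is a minimal sketch of the routing decision only, in plain Java with no Hadoop dependency. The class name IntervalRouter, the labels "hour" and "mins", and the assumption that the start and end of the period covered by the file are known up front (for example, passed in through the job configuration) are all hypothetical. In a map-only job the same check could run once per record, with each record written to a named output (for instance through Hadoop's MultipleOutputs, if your version provides it), so the file would be read only once.

```java
/** Hypothetical helper: decides whether a record goes to HourFile or MinsFile. */
final class IntervalRouter {
    static final long MINUTE_MS = 60_000L;
    static final long HOUR_MS = 60 * MINUTE_MS;

    /**
     * startMs  - start of the period covered by the file (inclusive)
     * endMs    - end of the period covered by the file (exclusive)
     * recordMs - timestamp of the record being routed
     * Returns "hour" for records inside the whole hours, "mins" for the remainder.
     */
    static String route(long startMs, long endMs, long recordMs) {
        long fullHours = (endMs - startMs) / HOUR_MS;      // e.g. 130 min -> 2
        long hourBoundary = startMs + fullHours * HOUR_MS; // end of the last whole hour
        return recordMs < hourBoundary ? "hour" : "mins";
    }

    public static void main(String[] args) {
        long start = 0L;
        long end = 130 * MINUTE_MS; // 2 hours and 10 minutes of data
        int hour = 0;
        int mins = 0;
        // One record every 5 minutes, as in the example above.
        for (long t = start; t < end; t += 5 * MINUTE_MS) {
            if (route(start, end, t).equals("hour")) {
                hour++;
            } else {
                mins++;
            }
        }
        // Prints: hour records: 24, mins records: 2
        System.out.println("hour records: " + hour + ", mins records: " + mins);
    }
}
```

This only shows the boundary arithmetic; how the timestamp is parsed from each record and how the two outputs are named would depend on the actual file format.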
From: Marcos Ortiz [mailto:[EMAIL PROTECTED]]
Sent: Sunday, February 26, 2012 7:40 PM
To: [EMAIL PROTECTED]
Cc: Stuti Awasthi
Subject: Re: Query Regarding design MR job for Billing
Well, first, you can design 6 MR jobs:
1- for 5 mins interval
2- for 1 hour
3- for 1 day
4- for 1 month
5- for 1 year
6- and a last one for any interval
If, as you say, each interval requires a different calculation, this could be a solution (at least I think so).
You can read the MapReduce "design patterns" proposed by Jimmy Lin and Chris Dyer in their book "Data-Intensive Text Processing with MapReduce".
On 02/27/2012 05:39 AM, Stuti Awasthi wrote:
Marcos Luis Ortíz Valmaseda
Senior Software Engineer (UCI) http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186
End the injustice, FREEDOM NOW FOR OUR FIVE COMPATRIOTS WHO ARE UNJUSTLY HELD IN U.S. PRISONS! http://www.antiterroristas.cu http://justiciaparaloscinco.wordpress.com