praveenesh kumar

2013-02-02, 05:05

Eugene Kirpichov

2013-02-02, 06:23

praveenesh kumar

2013-02-02, 07:17

Russell Jurney

2013-02-02, 07:30

praveenesh kumar

2013-02-02, 07:37

Russell Jurney

2013-02-02, 08:10

praveenesh kumar

2013-02-02, 11:07

Niels Basjes

2013-02-02, 12:44

- Hadoop
- mail # user
- How to find top N values using map-reduce?

I am looking for a better solution for this.

One way to do this would be to find the top N values in each mapper and
then pick the top N out of those in a single reducer. I am afraid that
this won't work effectively if my N is larger than the number of values in
my input split (or mapper input).

The other way is to just sort all of them in one reducer and then read off the top N.

Is there a better approach?

Regards

Praveenesh
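For reference, the first approach described above can be sketched in plain Python. This is a simulation of the two phases, not Hadoop API code; the function names `map_top_n` and `reduce_top_n` and the sample splits are made up for illustration. Each mapper keeps a bounded min-heap of its N largest values, and the single reducer merges the candidates.

```python
import heapq

def map_top_n(split, n):
    """Mapper side: keep only the n largest values seen in this split."""
    if n <= 0:
        return []
    heap = []  # min-heap holding at most n values
    for value in split:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # New value beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, value)
    return heap  # at most n values; the whole split if it has fewer than n

def reduce_top_n(mapper_outputs, n):
    """Single reducer: merge every mapper's candidates, keep the global top n."""
    candidates = (v for out in mapper_outputs for v in out)
    return heapq.nlargest(n, candidates)

# Hypothetical input splits, one list per mapper.
splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]
top3 = reduce_top_n([map_top_n(s, 3) for s in splits], 3)
```

Note that the result stays correct when N exceeds the number of values in a split: the mapper then simply emits everything it has, so the concern is about efficiency, not correctness.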

Hi,

Can you tell us more about:

* How big is N?
* How big is the input dataset?
* How many mappers do you have?
* Do input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.

Actually, what I am trying to find is the top n% of the whole data.
This n could be very large if my data is large.

Assuming I have uniform rows of equal size and the total data size
is 10 GB, then to take the top 10% of the whole data set with the
approach above, each mapper would need to hold 10% of 10 GB, i.e. roughly
1 GB worth of rows.

I think that would not be possible given that my input splits are
64/128/512 MB (based on my block size), or am I making wrong
assumptions? I can increase the input split size, but is there a better
way to find the top n%?

My whole actual problem is to give ranks to some values and then find
out the top 10 ranks.

I think this context gives a better idea of the problem.

Regards

Praveenesh
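The arithmetic behind this concern can be checked with a few lines of Python. The 10 GB total and 128 MB split come from the message above; the 128-byte row size and the function name `split_budget` are assumptions chosen only to make the numbers round.

```python
def split_budget(total_bytes, split_bytes, row_bytes, pct):
    """Compare the global N for a top-pct% query with what one split can hold."""
    total_rows = total_bytes // row_bytes
    rows_per_split = split_bytes // row_bytes
    n = (total_rows * pct + 99) // 100         # ceil(total_rows * pct / 100)
    num_splits = -(-total_bytes // split_bytes)  # ceiling division
    # A mapper cannot emit more rows than its split contains.
    per_mapper_emit = min(n, rows_per_split)
    return total_rows, n, rows_per_split, num_splits, per_mapper_emit

GIB, MIB = 1024 ** 3, 1024 ** 2
total_rows, n, rows_per_split, num_splits, per_mapper_emit = split_budget(
    10 * GIB, 128 * MIB, 128, 10  # row size of 128 bytes is hypothetical
)
# Upper bound on what the single reducer receives.
reducer_input = num_splits * per_mapper_emit
```

Under these assumptions N (about 8.4 million rows) is eight times larger than one split (about 1 million rows), so every mapper emits its entire split and the reducer may receive the whole data set. The per-mapper pruning still gives a correct answer, but it no longer saves any work.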

Pig. Datafu. 7 lines of code.

https://gist.github.com/4696443

https://github.com/linkedin/datafu

--

Russell Jurney

Thanks for that, Russell. Unfortunately I can't use Pig; I need to write
my own MR job. I was wondering how it's usually done in the best way
possible.

Regards

Praveenesh

Maybe look at the pig source to see how it does it?

Russell Jurney

My actual problem is to rank all values, then apply logic 1 to the top n%
of values and logic 2 to the rest.

1st - Ranking? (I need major suggestions here.)

2nd - Find the top n% out of them.

The rest is then covered.

Regards

Praveenesh
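To make the two steps concrete, here is a small single-process Python sketch of the intended result, not a distributed implementation. The names `rank_values` and `apply_by_rank`, and the two placeholder lambdas standing in for "logic 1" and "logic 2", are hypothetical.

```python
def rank_values(values):
    """Step 1: rank values in descending order; rank 1 is the largest."""
    ranked = sorted(values, reverse=True)
    return list(enumerate(ranked, start=1))  # (rank, value) pairs

def apply_by_rank(values, pct, logic_1, logic_2):
    """Step 2: apply logic_1 to the top pct% of ranks and logic_2 to the rest."""
    ranked = rank_values(values)
    cutoff = (len(ranked) * pct + 99) // 100  # ceil(len * pct / 100)
    return [logic_1(v) if rank <= cutoff else logic_2(v) for rank, v in ranked]

# Placeholder logic: tag each value with the branch that handled it.
result = apply_by_rank(
    [40, 10, 30, 20, 50], 40,
    lambda v: ("top", v),
    lambda v: ("rest", v),
)
```

This sketch breaks ties arbitrarily by sort order; whether equal values should share a rank depends on the actual requirements.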

My suggestion is to use secondary sort with a single reducer. That way you
can easily extract the top N. If you want to get the top N%, you'll need an
additional phase to determine how many records this N% really is.

--

Kind regards,

Niels Basjes

(Sent from mobile)
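The two-phase idea in this reply might look roughly like the following Python simulation. It is only a sketch under stated assumptions: `count_records` and `first_n_percent` are illustrative names, and `heapq.merge` stands in for the framework's shuffle delivering values to the single reducer in descending order.

```python
import heapq
import itertools

def count_records(splits):
    """Phase 1: each mapper reports its record count; the counts are summed."""
    return sum(len(split) for split in splits)

def first_n_percent(sorted_stream, total, pct):
    """Phase 2: the single reducer reads a descending stream and stops after N."""
    n = (total * pct + 99) // 100  # ceil(total * pct / 100)
    return list(itertools.islice(sorted_stream, n))

# Hypothetical input splits, one list per mapper.
splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4, 10]]
total = count_records(splits)

# Stand-in for the shuffle: merge per-split sorted runs into one descending stream.
stream = heapq.merge(*(sorted(s, reverse=True) for s in splits), reverse=True)
top_20_percent = first_n_percent(stream, total, 20)
```

Because the reducer only counts how many records it has passed, it never needs to hold the top N% in memory; it can write each record out as it arrives and stop (or switch to different handling) once the cutoff is reached.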

