praveenesh kumar

2013-02-02, 05:05

Eugene Kirpichov

2013-02-02, 06:23

praveenesh kumar

2013-02-02, 07:17

Russell Jurney

2013-02-02, 07:30

praveenesh kumar

2013-02-02, 07:37

Russell Jurney

2013-02-02, 08:10

praveenesh kumar

2013-02-02, 11:07

Niels Basjes

2013-02-02, 12:44

- Hadoop
- mail # user
- how to find top N values using map-reduce?

I am looking for a better solution for this.

One way to do this would be to find the top N values from each mapper and then find the top N out of those in one reducer. I am afraid this won't work well if my N is larger than the number of values in my input split (i.e. the mapper's input).

The other way is to just sort all of them in one reducer and then take the top N.

Is there a better approach to do this?

Regards
Praveenesh
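The first approach described above (per-mapper top-N, merged in a single reducer) can be sketched as follows. This is a plain-Python simulation of the MapReduce pattern, not a real Hadoop job, and the function names are illustrative. Note that correctness survives even when N exceeds a split's record count: the mapper simply emits everything it has, so the worry is the volume reaching the lone reducer, not wrong answers.

```python
import heapq

def mapper_top_n(values, n):
    # Each "mapper" keeps only the local top-N of its input split.
    # If n exceeds the split size, it just emits the whole split.
    return heapq.nlargest(n, values)

def reducer_top_n(per_mapper_tops, n):
    # The single "reducer" merges the per-mapper candidate lists.
    merged = [v for tops in per_mapper_tops for v in tops]
    return heapq.nlargest(n, merged)

# Two "input splits", one per mapper.
splits = [[5, 1, 9, 3], [8, 2, 7, 6]]
local_tops = [mapper_top_n(split, 3) for split in splits]
print(reducer_top_n(local_tops, 3))  # -> [9, 8, 7]
```

Any element of the global top N must be in the top N of its own split, which is why the per-mapper candidate lists are guaranteed to contain the answer.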


Hi,

Can you tell more about:

* How big is N

* How big is the input dataset

* How many mappers you have

* Do input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.

On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> I am looking for a better solution for this.

>

> 1 way to do this would be to find top N values from each mappers and

> then find out the top N out of them in 1 reducer. I am afraid that

> this won't work effectively if my N is larger than number of values in

> my inputsplit (or mapper input).

>

> Otherway is to just sort all of them in 1 reducer and then do the cat of

> top-N.

>

> Wondering if there is any better approach to do this ?

>

> Regards

> Praveenesh

>

--

Eugene Kirpichov

http://www.linkedin.com/in/eugenekirpichov

http://jkff.info/software/timeplotters - my performance visualization tools


Actually, what I am trying to find is the top n% of the whole data. This n could be very large if my data is large.

Assuming I have uniform rows of equal size and the total data size is 10 GB: using the above-mentioned approach, if I have to take the top 10% of the whole data set, I need 10% of 10 GB, which could be roughly 1 GB worth of rows in my mappers. I think that would not be possible given my input splits are 64/128/512 MB (based on my block size), or am I making wrong assumptions? I can increase the input split size, but is there a better way to find the top n%?

My actual underlying problem is to assign ranks to some values and then find the top 10 ranks.

I think this context gives a better idea of the problem.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]> wrote:

> Hi,

>

> Can you tell more about:

> * How big is N

> * How big is the input dataset

> * How many mappers you have

> * Do input splits correlate with the sorting criterion for top N?

>

> Depending on the answers, very different strategies will be optimal.

>

>

>

> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:

>

>> I am looking for a better solution for this.

>>

>> 1 way to do this would be to find top N values from each mappers and

>> then find out the top N out of them in 1 reducer. I am afraid that

>> this won't work effectively if my N is larger than number of values in

>> my inputsplit (or mapper input).

>>

>> Otherway is to just sort all of them in 1 reducer and then do the cat of

>> top-N.

>>

>> Wondering if there is any better approach to do this ?

>>

>> Regards

>> Praveenesh

>>

>

>

>

> --

> Eugene Kirpichov

> http://www.linkedin.com/in/eugenekirpichov

> http://jkff.info/software/timeplotters - my performance visualization tools
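One common way to handle the "top n%" variant raised above is to run a cheap counting job first, so the percentage becomes a concrete K for an ordinary top-K job. A plain-Python sketch (not a real Hadoop job; the function name is illustrative); note that for n = 10% the reducer still receives up to ~10% of the data, so this removes the unknown K, not the single-reducer bottleneck:

```python
import heapq
import math

def top_percent(splits, pct):
    # Job 1 (counting pass): total record count turns the percentage
    # into a concrete K.
    total = sum(len(s) for s in splits)
    k = math.ceil(total * pct / 100)
    # Job 2: the usual per-mapper top-K, merged in one reducer.
    local_tops = [heapq.nlargest(k, s) for s in splits]
    return heapq.nlargest(k, [v for tops in local_tops for v in tops])

data = [[10, 40, 30], [20, 60, 50], [5, 15, 25]]
print(top_percent(data, 20))  # 20% of 9 records -> top 2: [60, 50]
```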


Pig. Datafu. 7 lines of code.

https://gist.github.com/4696443

https://github.com/linkedin/datafu

On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Actually what I am trying to find to top n% of the whole data.

> This n could be very large if my data is large.

>

> Assuming I have uniform rows of equal size and if the total data size

> is 10 GB, using the above mentioned approach, if I have to take top

> 10% of the whole data set, I need 10% of 10GB which could be rows

> worth of 1 GB (roughly) in my mappers.

> I think that would not be possible given my input splits are of

> 64/128/512 MB (based on my block size) or am I making wrong

> assumptions. I can increase the inputsplit size, but is there a better

> way to find top n%.

>

>

> My whole actual problem is to give ranks to some values and then find

> out the top 10 ranks.

>

> I think this context can give more idea about the problem ?

>

> Regards

> Praveenesh

>

> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

> wrote:

> > Hi,

> >

> > Can you tell more about:

> > * How big is N

> > * How big is the input dataset

> > * How many mappers you have

> > * Do input splits correlate with the sorting criterion for top N?

> >

> > Depending on the answers, very different strategies will be optimal.

> >

> >

> >

> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]

> >wrote:

> >

> >> I am looking for a better solution for this.

> >>

> >> 1 way to do this would be to find top N values from each mappers and

> >> then find out the top N out of them in 1 reducer. I am afraid that

> >> this won't work effectively if my N is larger than number of values in

> >> my inputsplit (or mapper input).

> >>

> >> Otherway is to just sort all of them in 1 reducer and then do the cat of

> >> top-N.

> >>

> >> Wondering if there is any better approach to do this ?

> >>

> >> Regards

> >> Praveenesh

> >>

> >

> >

> >

> > --

> > Eugene Kirpichov

> > http://www.linkedin.com/in/eugenekirpichov

> > http://jkff.info/software/timeplotters - my performance visualization

> tools

>

--

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com


Thanks for that, Russell. Unfortunately I can't use Pig; I need to write my own MR job. I was wondering how it's usually done in the best way possible.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Pig. Datafu. 7 lines of code.

>

> https://gist.github.com/4696443

> https://github.com/linkedin/datafu

>

>

> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:

>

>> Actually what I am trying to find to top n% of the whole data.

>> This n could be very large if my data is large.

>>

>> Assuming I have uniform rows of equal size and if the total data size

>> is 10 GB, using the above mentioned approach, if I have to take top

>> 10% of the whole data set, I need 10% of 10GB which could be rows

>> worth of 1 GB (roughly) in my mappers.

>> I think that would not be possible given my input splits are of

>> 64/128/512 MB (based on my block size) or am I making wrong

>> assumptions. I can increase the inputsplit size, but is there a better

>> way to find top n%.

>>

>>

>> My whole actual problem is to give ranks to some values and then find

>> out the top 10 ranks.

>>

>> I think this context can give more idea about the problem ?

>>

>> Regards

>> Praveenesh

>>

>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

>> wrote:

>> > Hi,

>> >

>> > Can you tell more about:

>> > * How big is N

>> > * How big is the input dataset

>> > * How many mappers you have

>> > * Do input splits correlate with the sorting criterion for top N?

>> >

>> > Depending on the answers, very different strategies will be optimal.

>> >

>> >

>> >

>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]

>> >wrote:

>> >

>> >> I am looking for a better solution for this.

>> >>

>> >> 1 way to do this would be to find top N values from each mappers and

>> >> then find out the top N out of them in 1 reducer. I am afraid that

>> >> this won't work effectively if my N is larger than number of values in

>> >> my inputsplit (or mapper input).

>> >>

>> >> Otherway is to just sort all of them in 1 reducer and then do the cat of

>> >> top-N.

>> >>

>> >> Wondering if there is any better approach to do this ?

>> >>

>> >> Regards

>> >> Praveenesh

>> >>

>> >

>> >

>> >

>> > --

>> > Eugene Kirpichov

>> > http://www.linkedin.com/in/eugenekirpichov

>> > http://jkff.info/software/timeplotters - my performance visualization

>> tools

>>

>

>

>

> --

> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com


Maybe look at the Pig source to see how it does it?

Russell Jurney http://datasyndrome.com

On Feb 1, 2013, at 11:37 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:

> Thanks for that Russell. Unfortunately I can't use Pig. Need to write

> my own MR job. I was wondering how its usually done in the best way

> possible.

>

> Regards

> Praveenesh

>

> On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

>> Pig. Datafu. 7 lines of code.

>>

>> https://gist.github.com/4696443

>> https://github.com/linkedin/datafu

>>

>>

>> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar <[EMAIL PROTECTED]>wrote:

>>

>>> Actually what I am trying to find to top n% of the whole data.

>>> This n could be very large if my data is large.

>>>

>>> Assuming I have uniform rows of equal size and if the total data size

>>> is 10 GB, using the above mentioned approach, if I have to take top

>>> 10% of the whole data set, I need 10% of 10GB which could be rows

>>> worth of 1 GB (roughly) in my mappers.

>>> I think that would not be possible given my input splits are of

>>> 64/128/512 MB (based on my block size) or am I making wrong

>>> assumptions. I can increase the inputsplit size, but is there a better

>>> way to find top n%.

>>>

>>>

>>> My whole actual problem is to give ranks to some values and then find

>>> out the top 10 ranks.

>>>

>>> I think this context can give more idea about the problem ?

>>>

>>> Regards

>>> Praveenesh

>>>

>>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

>>> wrote:

>>>> Hi,

>>>>

>>>> Can you tell more about:

>>>> * How big is N

>>>> * How big is the input dataset

>>>> * How many mappers you have

>>>> * Do input splits correlate with the sorting criterion for top N?

>>>>

>>>> Depending on the answers, very different strategies will be optimal.

>>>>

>>>>

>>>>

>>>> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <[EMAIL PROTECTED]

>>>> wrote:

>>>>

>>>>> I am looking for a better solution for this.

>>>>>

>>>>> 1 way to do this would be to find top N values from each mappers and

>>>>> then find out the top N out of them in 1 reducer. I am afraid that

>>>>> this won't work effectively if my N is larger than number of values in

>>>>> my inputsplit (or mapper input).

>>>>>

>>>>> Otherway is to just sort all of them in 1 reducer and then do the cat of

>>>>> top-N.

>>>>>

>>>>> Wondering if there is any better approach to do this ?

>>>>>

>>>>> Regards

>>>>> Praveenesh

>>>>>

>>>>

>>>>

>>>>

>>>> --

>>>> Eugene Kirpichov

>>>> http://www.linkedin.com/in/eugenekirpichov

>>>> http://jkff.info/software/timeplotters - my performance visualization

>>> tools

>>>

>>

>>

>>

>> --

>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com


My actual problem is to rank all the values, then apply logic 1 to the top n% of values and logic 2 to the rest.

1st - Ranking (this is where I need suggestions the most)
2nd - Finding the top n% of them.

The rest is covered.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <[EMAIL PROTECTED]> wrote:

> there's one thing i want to clarify that you can use multi-reducers to sort

> the data globally and then cat all the parts to get the top n records. The

> data in all parts are globally in order.

> Then you may find the problem is much easier.

>

> On 2013-2-2 at 3:18 PM, "praveenesh kumar" <[EMAIL PROTECTED]> wrote:

>

>> Actually what I am trying to find to top n% of the whole data.

>> This n could be very large if my data is large.

>>

>> Assuming I have uniform rows of equal size and if the total data size

>> is 10 GB, using the above mentioned approach, if I have to take top

>> 10% of the whole data set, I need 10% of 10GB which could be rows

>> worth of 1 GB (roughly) in my mappers.

>> I think that would not be possible given my input splits are of

>> 64/128/512 MB (based on my block size) or am I making wrong

>> assumptions. I can increase the inputsplit size, but is there a better

>> way to find top n%.

>>

>>

>> My whole actual problem is to give ranks to some values and then find

>> out the top 10 ranks.

>>

>> I think this context can give more idea about the problem ?

>>

>> Regards

>> Praveenesh

>>

>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]>

>> wrote:

>> > Hi,

>> >

>> > Can you tell more about:

>> > * How big is N

>> > * How big is the input dataset

>> > * How many mappers you have

>> > * Do input splits correlate with the sorting criterion for top N?

>> >

>> > Depending on the answers, very different strategies will be optimal.

>> >

>> >

>> >

>> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar

>> > <[EMAIL PROTECTED]>wrote:

>> >

>> >> I am looking for a better solution for this.

>> >>

>> >> 1 way to do this would be to find top N values from each mappers and

>> >> then find out the top N out of them in 1 reducer. I am afraid that

>> >> this won't work effectively if my N is larger than number of values in

>> >> my inputsplit (or mapper input).

>> >>

>> >> Otherway is to just sort all of them in 1 reducer and then do the cat

>> >> of

>> >> top-N.

>> >>

>> >> Wondering if there is any better approach to do this ?

>> >>

>> >> Regards

>> >> Praveenesh

>> >>

>> >

>> >

>> >

>> > --

>> > Eugene Kirpichov

>> > http://www.linkedin.com/in/eugenekirpichov

>> > http://jkff.info/software/timeplotters - my performance visualization

>> > tools
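Lake Chang's multi-reducer suggestion quoted above relies on a total-order partition of the key space, so that each reducer's sorted output covers a disjoint range and the part files concatenate into one globally sorted sequence. In Hadoop this is typically done with TotalOrderPartitioner, with cut points chosen by sampling (InputSampler). A plain-Python simulation of the idea, with hand-picked cut points standing in for sampled ones:

```python
import bisect

def range_partition(values, cut_points):
    # Route each value to the "reducer" whose key range contains it,
    # as a total-order partitioner does with sampled cut points.
    parts = [[] for _ in range(len(cut_points) + 1)]
    for v in values:
        parts[bisect.bisect_left(cut_points, v)].append(v)
    return parts

def global_sort(values, cut_points):
    # Each reducer sorts only its own partition; concatenating the
    # sorted part files yields a single globally sorted sequence.
    return [v for part in range_partition(values, cut_points)
            for v in sorted(part)]

data = [7, 1, 9, 4, 6, 2, 8]
print(global_sort(data, [3, 6]))  # -> [1, 2, 4, 6, 7, 8, 9]
```

With the output globally ordered, the top N records are simply the tail (or head, with a descending comparator) of the concatenated parts.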


My suggestion is to use a secondary sort with a single reducer. That way you can easily extract the top N. If you want the top N%, you'll need an additional phase to determine how many records that N% really is.

--
Kind regards,
Niels Basjes
(Sent from mobile)

On 2 Feb 2013 12:08, "praveenesh kumar" <[EMAIL PROTECTED]> wrote:

> My actual problem is to rank all values and then run logic 1 to top n%

> values and logic 2 to rest values.

> 1st - Ranking ? (need major suggestions here)

> 2nd - Find top n% out of them.

> Then rest is covered.

>

> Regards

> Praveenesh

>

> On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <[EMAIL PROTECTED]> wrote:

> > there's one thing i want to clarify that you can use multi-reducers to

> sort

> > the data globally and then cat all the parts to get the top n records.

> The

> > data in all parts are globally in order.

> > Then you may find the problem is much easier.

> >

> > On 2013-2-2 at 3:18 PM, "praveenesh kumar" <[EMAIL PROTECTED]> wrote:

> >

> >> Actually what I am trying to find to top n% of the whole data.

> >> This n could be very large if my data is large.

> >>

> >> Assuming I have uniform rows of equal size and if the total data size

> >> is 10 GB, using the above mentioned approach, if I have to take top

> >> 10% of the whole data set, I need 10% of 10GB which could be rows

> >> worth of 1 GB (roughly) in my mappers.

> >> I think that would not be possible given my input splits are of

> >> 64/128/512 MB (based on my block size) or am I making wrong

> >> assumptions. I can increase the inputsplit size, but is there a better

> >> way to find top n%.

> >>

> >>

> >> My whole actual problem is to give ranks to some values and then find

> >> out the top 10 ranks.

> >>

> >> I think this context can give more idea about the problem ?

> >>

> >> Regards

> >> Praveenesh

> >>

> >> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <[EMAIL PROTECTED]

> >

> >> wrote:

> >> > Hi,

> >> >

> >> > Can you tell more about:

> >> > * How big is N

> >> > * How big is the input dataset

> >> > * How many mappers you have

> >> > * Do input splits correlate with the sorting criterion for top N?

> >> >

> >> > Depending on the answers, very different strategies will be optimal.

> >> >

> >> >

> >> >

> >> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar

> >> > <[EMAIL PROTECTED]>wrote:

> >> >

> >> >> I am looking for a better solution for this.

> >> >>

> >> >> 1 way to do this would be to find top N values from each mappers and

> >> >> then find out the top N out of them in 1 reducer. I am afraid that

> >> >> this won't work effectively if my N is larger than number of values

> in

> >> >> my inputsplit (or mapper input).

> >> >>

> >> >> Otherway is to just sort all of them in 1 reducer and then do the cat

> >> >> of

> >> >> top-N.

> >> >>

> >> >> Wondering if there is any better approach to do this ?

> >> >>

> >> >> Regards

> >> >> Praveenesh

> >> >>

> >> >

> >> >

> >> >

> >> > --

> >> > Eugene Kirpichov

> >> > http://www.linkedin.com/in/eugenekirpichov

> >> > http://jkff.info/software/timeplotters - my performance visualization

> >> > tools

>
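Niels's single-reducer approach above can be simulated as follows (plain Python, with `sorted()` standing in for the shuffle; in a real job a descending sort comparator on the value-carrying key would deliver the records to the lone reducer already ordered, so the reducer emits the first N and stops reading):

```python
def single_reducer_top_n(mapper_outputs, n):
    # The framework's sort/shuffle delivers all mapper output to the
    # single reducer in descending order; sorted() simulates that here.
    shuffled = sorted((v for out in mapper_outputs for v in out),
                      reverse=True)
    # The reducer emits the first N records and ignores the rest.
    return shuffled[:n]

print(single_reducer_top_n([[3, 9, 1], [7, 5, 2]], 2))  # -> [9, 7]
```

For the top-N% variant, N would come from the extra counting phase Niels mentions.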
