Greetings Nutchlings,
I would like to make my generate jobs go faster, and I see that the reducer spills a lot of records.
Here are the counters for a typical long-running reduce task of the generate-select job: 100 million spilled records, 255K input records, 90K output records, 13 GB of file bytes written, and only 3 GB of committed heap usage, i.e. roughly 400 spilled records for every input record. mapreduce.reduce.java.opts is -Xmx8000m and mapreduce.reduce.memory.mb is 12000.
Do I have to increase mapreduce.reduce.java.opts and mapreduce.reduce.memory.mb? If so, how do I work out how big they should be? Also, are there other settings I should change?
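For reference, the current ratio between the reducer heap and its container is

    8000 MB (-Xmx) / 12000 MB (mapreduce.reduce.memory.mb) ≈ 0.67

but I am not sure whether that ratio, the spilled-records counter, or something else entirely is the right starting point for the calculation.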
My actual command line is:

apache-nutch-1.12/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=16 -D mapreduce.input.fileinputformat.split.minsize=536870912 -D mapreduce.reduce.memory.mb=12000 -D mapreduce.reduce.java.opts=-Xmx8000m -D db.fetch.interval.default=5184000 -D db.fetch.schedule.adaptive.min_interval=3888000 -D generate.update.crawldb=true -D generate.max.count=25 /crawls/popular/data/crawldb /crawls/popular/data/segments/ -topN 60000 -numFetchers 2 -noFilter -maxNumSegments 24
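In case it is relevant: besides a bigger heap, the change I was considering is also passing the shuffle/merge properties that I believe influence reduce-side spilling. The values below are guesses on my part, not something I have tested:

    apache-nutch-1.12/runtime/deploy/bin/nutch generate \
      -D mapreduce.job.reduces=16 \
      -D mapreduce.reduce.memory.mb=12000 \
      -D mapreduce.reduce.java.opts=-Xmx8000m \
      -D mapreduce.task.io.sort.factor=50 \
      -D mapreduce.reduce.shuffle.input.buffer.percent=0.70 \
      -D mapreduce.reduce.input.buffer.percent=0.50 \
      ... (remaining arguments as above)

Does tuning any of these make sense here, or is a bigger heap the only real lever?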