That is a very interesting note. I have been wanting something like that. I use the Python-based "newspaper" package, but it is not directly compatible with the Nutch/Hadoop infrastructure.
From: Jorge Betancourt <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, November 14, 2017 5:35 AM
Subject: Re: Removing header,Footer and left menus while crawling
Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
could use the Tika boilerpipe implementation. In nutch-site.xml you
need to enable this feature with:
Which text extraction algorithm to use. Valid values are: boilerpipe or
And configure the proper extractor with
the tika.extractor.boilerpipe.algorithm setting.
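
A minimal sketch of what the two nutch-site.xml entries described above might look like. The property name tika.extractor and the value ArticleExtractor are assumptions based on the defaults shipped in nutch-default.xml for Nutch 1.13, so check them against your own copy before relying on them. The setting should only affect documents that are actually parsed by the parse-tika plugin.

```xml
<!-- nutch-site.xml (fragment): place inside the <configuration> element -->
<property>
  <name>tika.extractor</name>
  <!-- assumed property name and value: switches parse-tika to boilerpipe text extraction -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- assumed choice of extractor; pick the one that suits your pages -->
  <value>ArticleExtractor</value>
</property>
```

After changing these values, pages need to be re-parsed for the extracted text to change, since the extractor runs at parse time.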
This is not a perfect solution, but I've used it successfully in the past.
Of course, your results will depend on the structure (markup of the
Another option would be to implement your own parser if you need more
control over what to include/exclude from the HTML. You can take a look at
this issue https://issues.apache.org/jira/browse/NUTCH-585 for
some info and old patches.
On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[EMAIL PROTECTED]>