The interfaces for extending the Nutch parser/indexer are actually very
simple. However, finding up-to-date documented samples is not. Luckily,
Nutch comes with plenty of built-in plugins, so my suggestion would be to
pick one and dive into its implementation. Then just copy its folder and use
it as a skeleton, replacing the specific logic (and plugin metadata).
The first question you need to ask yourself is whether you really want to
write a Parser/Indexer or just an HtmlParseFilter/IndexingFilter. I suspect
that the default behaviour of the Nutch Parser and Indexer is useful for you,
and you just want to add more functionality (that is what Any23 is doing).
You can chain Filters, so your code could also leverage the Any23 logic.
The documentation starting point is the Wiki. For your specific question,
this is the most relevant page: https://wiki.apache.org/nutch/AboutPlugins
One (old) example of writing a custom parser can be found here:
http://www.treselle.com/blog/apache-nutch-with-custom-parser/
I suggest you Google for more information as needed, but always keep in mind
that things may have changed over time.
I think the best approach for domain-specific parsers is to have a custom
parser that maps from the URL to the specific code. This can be just one big
if/else, or a Map of domain->code (possibly using functional programming),
or you can even have this map configurable in some file.
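As a rough illustration of the Map of domain->code idea, here is a minimal sketch in plain Java. Everything in it is hypothetical and not part of the Nutch API: the class name DomainDispatcher, the choice of a Function from raw page content to a map of extracted fields, and the decision to strip a leading "www." from the host are all assumptions made for the example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not a Nutch class: routes a page to domain-specific code.
final class DomainDispatcher {
    // host -> handler that turns page content into extracted fields
    private final Map<String, Function<String, Map<String, String>>> handlers = new HashMap<>();
    // used when no handler is registered for the host
    private final Function<String, Map<String, String>> fallback;

    DomainDispatcher(Function<String, Map<String, String>> fallback) {
        this.fallback = fallback;
    }

    void register(String host, Function<String, Map<String, String>> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    // Looks up the handler for the URL's host and applies it to the content.
    Map<String, String> extract(String url, String content) {
        return handlers.getOrDefault(hostOf(url), fallback).apply(content);
    }

    // Returns the lower-cased host without a leading "www.", or "" if the
    // URL cannot be parsed or has no host component.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return "";
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (IllegalArgumentException e) {
            return "";
        }
    }
}
```

Your filter would build one such dispatcher at setup time (from code or from a config file) and call extract with the page URL and content from whichever filter method your plugin implements, then copy the returned fields into the parse or index metadata.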
Once you have more specific questions/problems, I suggest you email
[EMAIL PROTECTED]. [EMAIL PROTECTED] is intended for discussing code
contributions to Nutch, as far as I understand, and I think fewer people see
your messages here. (Also, more people will benefit from your questions
there.)
In summary, from my experience, writing any one of these plugins is really
easy (discounting your own complex logic, of course): you implement one or a
few methods, edit a plugin XML file, and add your extension to the global
build (Ant) files. But to really understand what the passed data looks like,
and what you can do with it, debugging (in local mode) is the ultimate tool,
and in the end it is much more time-efficient than looking for information on
the web. This is partly because a lot of the data is passed in Map-like form,
so even the JavaDoc doesn't really tell you what will be there (it depends on
which plugins you have configured, and how you configured those plugins...).
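If you prefer logging over stepping through a debugger, a small helper along these lines can show what a Map-like structure actually contains at runtime. This is only a sketch: MetadataDump is a hypothetical class, and a plain Map of String to List of String stands in for whatever metadata object your filter receives, so you would first have to copy the keys and values out of the real object.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical debugging helper, not a Nutch class.
final class MetadataDump {
    // Renders one "key = [values]" line per entry, sorted by key so that
    // dumps from different pages are easy to compare.
    static String describe(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : new TreeMap<>(metadata).entrySet()) {
            sb.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

Calling describe from inside your filter while running in local mode, and logging the result, gives you a quick view of which keys the configured plugins have populated.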
> am still not (yet) excited. Bottom line is website data is not well
> not super friendly to algorithmic consumption (but you already knew that).
> that end, I am interested to developer custom parsers per internet domain
> effort to capture specific domain data. It currently looks like the
> does not allow a per domain-based approach for parser / indexer. I wonder
> someone could guide me toward a high level view of the Nutch data
> then guide me towards where to get started for creating custom parsers
> might support a per-domain approach?