I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to Nutch 1.14 and Solr 7.2, and I have come across a few serious issues that you should be aware of:
1. NUTCH-2071 is still an issue in 1.14, because the returned parseResult is never null. If a parser fails to parse a document, it returns an empty result rather than null. As a result, only the first parser in a chain of candidates ever gets a chance to parse the document.
2. Nutch adopted Tika as a general parsing tool and stopped supporting the "legacy" parsing plugins (OO, MS). I kept using them and had hoped to drop them in the next release of Arch, which I am preparing now, but I still can't, because Tika fails to parse too many documents on our site. When I back Tika up with the legacy parsers, however, I achieve an almost 100% parsing success rate. This is why NUTCH-2071 is important for Arch. I think you should bring the legacy parsers back to Nutch, because the quality of parsing of "real life" data, such as ours, is not great without them.
3. The lines defining the fall-back (*) plugin in parse-plugins.xml have no effect: they are ignored as long as at least one plugin claims * in its plugin.xml file. In some cases, Nutch assigns the * capability to plugins that don't even claim it. For example, I can't understand why the Arch content-blocking plugin gets it.
4. In earlier versions of Nutch, using the native libraries really helped: it reduced the time to crawl our site from a couple of days to 6-7 hours. In Nutch 1.14, I don't see this effect. I've obtained the Hadoop native libraries, placed them where they are expected, and even inserted an explicit load-library call in my code, but I still don't notice any significant time savings.
5. The Feed plugin seems to have a major problem. Line 102 in FeedIndexingFilter.java threw a NumberFormatException (which caused the entire crawling process to fail!) because it was trying to parse as a number a date that was stored in string format. Given that this piece of metadata was generated by the feed parser (same plugin), it seems that the plugin is in disagreement with itself.
6. This is less important, but when Tika fails to parse a document, it generates a scary error message and an ugly stack trace. I think this should be a one-line warning, because other parsers may still parse the document successfully.
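
To make issue 1 concrete, here is a minimal sketch of the fallback behaviour that item asks for: an empty result is treated the same way as null, so the next candidate in the chain gets its turn. The names below (ParserChainSketch, Result, Candidate, parseWithFallback) are hypothetical stand-ins for illustration, not the actual Nutch classes, and what should count as an "empty" parse is an assumption here.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical stand-in types, NOT the Nutch parser API.
final class ParserChainSketch {

    /** Simplified parse result that carries only the extracted text. */
    static final class Result {
        final String text;

        Result(String text) {
            this.text = text;
        }

        /** Assumption: a result with no text at all counts as a failed parse. */
        boolean isEmpty() {
            return text == null || text.trim().isEmpty();
        }
    }

    /** One parser candidate in the chain. */
    interface Candidate {
        Result parse(String content);
    }

    /**
     * Tries each candidate in order and returns the first non-empty result.
     * Null results, empty results and runtime failures all fall through
     * to the next candidate instead of ending the chain.
     */
    static Result parseWithFallback(List<Candidate> candidates, String content) {
        for (Candidate candidate : candidates) {
            Result result;
            try {
                result = candidate.parse(content);
            } catch (RuntimeException e) {
                continue; // this parser failed, try the next one
            }
            if (result != null && !result.isEmpty()) {
                return result;
            }
        }
        return new Result(""); // every candidate failed
    }

    public static void main(String[] args) {
        Candidate firstParser = content -> new Result("");            // "fails" with an empty result
        Candidate legacyParser = content -> new Result("text from legacy parser");
        Result result = parseWithFallback(Arrays.asList(firstParser, legacyParser), "raw bytes");
        System.out.println(result.text);
    }
}
```

With a check like this, the order of candidates in the chain still matters, but a failing first parser no longer prevents the others from being tried.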
Hope this helps.
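
P.S. For issue 5, a tolerant way of reading the date could look like the sketch below. It is not the FeedIndexingFilter code: the class and method names (LenientDateSketch, parseDate) are hypothetical, and the accepted formats (epoch milliseconds, RFC 1123, ISO-8601) are assumptions, since I have not checked which format the feed parser actually writes.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Illustration only: hypothetical helper, NOT part of the Nutch feed plugin.
final class LenientDateSketch {

    /**
     * Accepts a metadata value that is either epoch milliseconds or a
     * formatted date, and returns empty instead of throwing when the value
     * cannot be understood.
     */
    static Optional<Instant> parseDate(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        String v = value.trim();
        try {
            // Case 1: the value is a plain number of milliseconds since the epoch.
            return Optional.of(Instant.ofEpochMilli(Long.parseLong(v)));
        } catch (NumberFormatException notANumber) {
            // fall through to the string formats
        }
        try {
            // Case 2 (assumed): RFC 1123, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
            return Optional.of(
                ZonedDateTime.parse(v, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
        } catch (DateTimeParseException notRfc1123) {
            // fall through
        }
        try {
            // Case 3 (assumed): ISO-8601 instant, e.g. "1970-01-01T00:00:00Z".
            return Optional.of(Instant.parse(v));
        } catch (DateTimeParseException notIso) {
            return Optional.empty(); // unknown format: skip the field, don't abort the crawl
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDate("0"));
        System.out.println(parseDate("Thu, 01 Jan 1970 00:00:00 GMT"));
        System.out.println(parseDate("not a date"));
    }
}
```

The point is only that an unparseable date should cost one field of one document, not the whole crawl.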