[NUTCH-1993] Nutch does not use backup parsers - Nutch - [issue]
...From reading the code it is clear that it is designed to allow using several parsers to parse a document in a sequence, until it is successfully parsed. In practice, this does not work becau...    Author: Arkadi Kosmynin , 2018-07-19, 13:57
[NUTCH-2071] A parser failure on a single document may fail crawling job if parser.timeout=-1 - Nutch - [issue]
Job failed! at org.apache.hadoop.mapred.JobClient.runJob( at org.apache.nutch.parse.ParseSegmen...    Author: Arkadi Kosmynin , 2018-07-17, 12:09
[NUTCH-2605] The Feed plugin causes a NumberFormatException - Nutch - [issue]
...The Feed plugin seems to have a major problem. The line 102 in generated a NumberFormatException (which caused the failure of the entire crawling process!) because i...    Author: Arkadi Kosmynin , 2018-06-28, 09:48
[NUTCH-2603] Bring back legacy pre-Tika parsers and use them as back up parsers - Nutch - [issue]
...There are cases when legacy parsers successfully parse documents on which Tika fails. I am attaching a list of examples of such documents. Nutch allows use of more than one parser on a docum...    Author: Arkadi Kosmynin , 2018-06-20, 15:06
[NUTCH-2604] The lines defining catch-all (*) parser in parse-plugins.xml are ignored - Nutch - [issue]
...The lines defining catch-all plugin in parse-plugins.xml are not effective, because they are ignored, as long as there is at least one plugin claiming * in its plugin.xml file. In some...    Author: Arkadi Kosmynin , 2018-06-20, 06:32
[NUTCH-1251] SolrDedup to use proper Lucene catch-all query - Nutch - [issue]
...Deletion of duplicates fails. This happens because the "get all" query used to get Solr index size is "id:[* TO *]", which is a range query. Lucene is trying to expand it to a Boolea...    Author: Arkadi Kosmynin , 2013-05-22, 03:53