[NUTCH-2634] Some links marked as "nofollow" are followed anyway. - Nutch - [issue]
...In order to check if an outlink in an <a> tag can be followed, Nutch checks whether the value of its rel attribute is the exact string "nofollow". However, the rel attribute can ...    Author: Gerard Bouchar , 2018-08-14, 07:49
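To illustrate the difference between an exact-string comparison and a token-based one, here is a minimal Python sketch (not Nutch's Java code). It assumes the HTML convention that rel holds a whitespace-separated list of tokens compared case-insensitively; the function name `is_nofollow` is hypothetical.

```python
def is_nofollow(rel_value):
    """Return True if the rel attribute value contains the nofollow token."""
    if rel_value is None:
        # No rel attribute at all: nothing forbids following the link.
        return False
    # split() with no arguments splits on runs of whitespace,
    # so "noopener  nofollow" yields ["noopener", "nofollow"].
    tokens = rel_value.lower().split()
    return "nofollow" in tokens
```

An exact comparison such as `rel_value == "nofollow"` would miss values like "noopener nofollow" or "NoFollow", which this token check catches.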
[NUTCH-2567] parse-metatags writes all meta tags twice - Nutch - [issue]
...Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:    <property>        <name>plugin.include...    Author: Gerard Bouchar , 2018-07-19, 14:20
[NUTCH-2586] Add a fallback mechanism for missing meta tags - Nutch - [issue]
...While using Nutch, we faced the following issue: some web pages lack a "description" meta tag, but include an "og:description" meta tag (using the Open Graph protocol). Here are two example...    Author: Gerard Bouchar , 2018-07-13, 11:17
[NUTCH-2599] charset detection issue with parse-tika - Nutch - [issue]
...Here is an example page that is displayed correctly in web browsers, but is decoded with the wrong charset in Nutch: This pa...    Author: Gerard Bouchar , 2018-06-15, 17:16
[NUTCH-2549] protocol-http does not behave the same as browsers - Nutch - [issue]
...We identified the following issues in protocol-http (a plugin implementing the HTTP protocol): It fails if a URL's path does not start with '/' Example: (browser...    Author: Gerard Bouchar , 2018-06-12, 19:15
[NUTCH-2564] protocol-http throws an error when the content-length header is not a number - Nutch - [issue]
...When a server sends an invalid Content-Length header (one that is not a valid number) with a plain-text http body, browsers simply ignore it, but protocol-http has a strange approach: if the...    Author: Gerard Bouchar , 2018-06-12, 19:14
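The browser behaviour described above (ignoring a Content-Length value that is not a valid number) could be sketched as follows. This is an illustrative Python sketch, not the protocol-http implementation; the helper name `parse_content_length` is hypothetical, and treating negative or signed values as invalid is an assumption.

```python
def parse_content_length(value):
    """Return the Content-Length as an int, or None if it is missing or invalid."""
    if value is None:
        return None
    value = value.strip()
    # isdigit() alone accepts some non-ASCII digit characters,
    # so also require the value to be pure ASCII.
    if value.isascii() and value.isdigit():
        return int(value)
    # Invalid header: signal "unknown length" instead of raising an error.
    return None
```

A caller that receives None would then read the body until the connection closes or a configured size limit is reached, rather than aborting the fetch.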
[NUTCH-2563] HTTP header spellchecking issues - Nutch - [issue]
...When reading HTTP headers, for each header, the SpellCheckedMetadata class computes a Levenshtein distance between it and every known header in the HttpHeaders interface. Not only is that s...    Author: Gerard Bouchar , 2018-06-12, 19:14
[NUTCH-2561] protocol-http can be made to read arbitrarily large HTTP responses - Nutch - [issue]
...protocol-http limits the size of the HTTP response body. However, there is no limit on the size of the HTTP headers it reads. A bogus server could send an infinite stream of different HTTP ...    Author: Gerard Bouchar , 2018-06-12, 19:13
[NUTCH-2560] protocol-http throws an error when an http header spans over multiple lines - Nutch - [issue]
...Some servers invalidly send headers that span multiple lines. In that case, browsers simply ignore the subsequent lines, but protocol-http throws an error, thus preventing us from fetch...    Author: Gerard Bouchar , 2018-06-12, 19:12
[NUTCH-2559] protocol-http cannot handle colons after the HTTP status code - Nutch - [issue]
...Some servers invalidly add colons after the HTTP status code in the status line (they can send HTTP/1.1 404: Not found instead of HTTP/1.1 404 Not found for instance). Browsers can handle th...    Author: Gerard Bouchar , 2018-06-12, 19:08
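A lenient status-line parser that tolerates the stray colon described above might look like the following. This is a minimal Python sketch under stated assumptions, not the protocol-http code; the names `STATUS_LINE` and `parse_status_line` are hypothetical, and the pattern only covers the colon case named in the issue.

```python
import re

# HTTP version, whitespace, three-digit status code, an optional stray
# colon, then an optional reason phrase.
STATUS_LINE = re.compile(r"^HTTP/(\d+\.\d+)\s+(\d{3}):?\s*(.*)$")

def parse_status_line(line):
    """Parse a status line into (version, code, reason)."""
    match = STATUS_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"malformed status line: {line!r}")
    version, code, reason = match.groups()
    return version, int(code), reason
```

With this pattern, "HTTP/1.1 404: Not found" and "HTTP/1.1 404 Not found" parse to the same result.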