[NUTCH-2623] Fetcher to guarantee delay for same host/domain/ip independent of http/https protocol - Nutch - [issue]
... Fetcher uses a combination of protocol and host/domain/ip as ID for fetch item queues, see This inhibits a guaranteed delay, in case both http:// and https:// URLs are fetch...    Author: Sebastian Nagel , 2018-10-07, 20:56
[NUTCH-2645] Webgraph tools ignore command-line options - Nutch - [issue]
...Some webgraph jobs/tools do not properly set command-line options in the job configuration (see NUTCH-2644 for a similar problem in CrawlDbReader)....    Author: Sebastian Nagel , 2018-10-07, 20:56
[NUTCH-2644] CrawlDbReader -dump ignores filter options - Nutch - [issue]
...The CrawlDbReader ignores the filter options -status and -expr when dumping a crawldb: % bin/nutch readdb crawldb/ -dump cdb.dump -status 'db_fetched' -expr 'status == "db_fetched"' ... % grep ...    Author: Sebastian Nagel , 2018-10-07, 20:56
[NUTCH-2635] Generator writes unneeded temporary output - Nutch - [issue]
...Generator writes the temporary output of the Selector job/step twice (see line 516). Not a big issue when generating small fetch lists but may be when working on large data. The temporary ou...    Author: Sebastian Nagel , 2018-10-07, 20:56
[NUTCH-2643] ant target "resolve-default" to depend on "init" - Nutch - [issue]
...If ant resolve-default (resolve library dependencies) is called on a clean Nutch source tree, it fails because the ant ivy library is not installed (it's installed by "ivy-init" or "init"). ...    Author: Sebastian Nagel , 2018-10-07, 20:56
[NUTCH-2641] ClassCastException in webui - Nutch - [issue]
...webui 2.x constantly logs this exception whenever the status of a crawl changes: java.lang.ClassCastException: org.apache.nutch.webui.client.model.JobInfo$State cannot be cast to [Ljava.lang....    Author: Rustam Abdullaev , 2018-10-07, 19:43
[NUTCH-2642] MoreIndexingFilter parses ISO 8601 UTC dates in local time zone - Nutch - [issue]
...The ISO 8601 pattern in MoreIndexingFilter.getTime is "yyyy-MM-dd'T'HH:mm:ss'Z'". Note the literal Z.    Author: John Lacey , 2018-10-07, 17:54
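The effect of the quoted literal Z can be reproduced in isolation. The following is a minimal sketch, assuming the pattern is used with java.text.SimpleDateFormat; the class IsoUtcParseDemo and the method parseInZone are hypothetical names for illustration and are not part of Nutch. Because 'Z' is quoted, it is matched as plain text and carries no offset, so the parsed instant depends on the formatter's time zone.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical demo class, not Nutch code.
public class IsoUtcParseDemo {
    // Pattern quoted in the issue: the trailing Z is a quoted literal.
    static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss'Z'";

    // Parse `text` with the pattern above, interpreting the fields in `zone`.
    // Returns milliseconds since the epoch.
    static long parseInZone(String text, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN);
        fmt.setTimeZone(zone);
        try {
            return fmt.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        String text = "2018-10-07T17:54:00Z";
        long asUtc = parseInZone(text, TimeZone.getTimeZone("UTC"));
        // A fixed-offset zone stands in for a non-UTC default time zone.
        long asShifted = parseInZone(text, TimeZone.getTimeZone("GMT-07:00"));
        // The same string yields two different instants.
        System.out.println("difference in hours: " + (asShifted - asUtc) / 3_600_000L);
    }
}
```

A formatter left on the JVM default time zone behaves like the second call whenever that default is not UTC. Setting the formatter's time zone to UTC explicitly is one possible way to avoid the shift; whether the issue was resolved that way is not stated here.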
Regex to block some patterns - Nutch - [mail # user]
...Hi Sebastian, Thanks for the update, here is my regex pattern to block my use case after long spent time .*-.*(modal[-_a-zA-Z0-9]*[\.]html|exit.html[\/]?\??.*|model[-_a-zA-Z0-9]*[\.]html|exitpa...
   Author: Amarnatha Reddy , govind nitk , ... , 2018-10-06, 02:46
Alternatives to Solr - Nutch - [mail # user]
...Thank you so very much 😊 On Fri, Oct 5, 2018, 3:41 PM Yash Thenuan Thenuan wrote: > You can use elasticsearch. >> On Sat, 6 Oct 2018, 00:58 Timeka Cobb wrote: >> > Hel...
   Author: Timeka Cobb , Yash Thenuan Thenuan , ... , 2018-10-05, 19:44
Connect Solr and Nutch in Ubuntu 18 - Nutch - [mail # user]
...No problem and sorry about that! On Fri, Oct 5, 2018 at 11:50 AM Sebastian Nagel wrote: > Hi Timeka, >> > because Solr is missing the > > files from its packet for it to w...
   Author: Timeka Cobb , Sebastian Nagel , ... , 2018-10-05, 17:30