IndexWriter interface in 1.15 - Nutch - [mail # user]
...Hi Lewis, First of all, I must say that I can't reproduce my claim regarding getConf/setConf. I was getting a compilation error for their @Override, but not anymore, and it's being called, so ...
   Author: Yossi Tamari , 2018-09-06, 16:28
Issues while crawling pagination - Nutch - [mail # user]
...Hi Shiva, Having looked at the specific site, I have to amend my recommended max-depth from 1 to 2, since I assume you want to fetch the stories themselves, not just the hubpages. If you want ...
   Author: Yossi Tamari , 2018-07-28, 23:02
[MASSMAIL]RE: Events out-of-the-box - Nutch - [mail # user]
...Hi Roannel, I am not using, and was not even aware of, Nutch's ability to emit events. I just read where they basically ha...
   Author: Yossi Tamari , 2018-07-05, 07:04
[NUTCH-2449] Usage of Tika LanguageIdentifier in language-identifier plugin - Nutch - [issue]
...The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier for extracting the language from the document text. There are two issues with that: LanguageIdentifier is depr...
   Author: Yossi Tamari , 2018-06-28, 10:05
[NUTCH-2611] Add line-breaks when parsing HTML block-level elements - Nutch - [issue]
...Currently, the HTML and Tika parsers only add newlines following text nodes that contain only whitespace (e.g. </span> <span>), but not based on what the tags are, so for example ...
   Author: Yossi Tamari , 2018-06-28, 09:08
Sitemap URL's concatenated, causing status 14 not found - Nutch - [mail # user]
...Hi Markus, I don’t believe this is a valid sitemapindex. Each <sitemap> should include exactly one <loc>. See also and
   Author: Yossi Tamari , 2018-05-26, 00:57
Problems starting crawl from sitemaps - Nutch - [mail # user]
...Hi Chris, In order to inject sitemaps, you should use the "nutch sitemap" command. After you inject those sitemaps into the crawl DB, you can proceed as normal with the crawl command, without t...
   Author: Yossi Tamari , 2018-05-24, 11:19
random sampling of crawlDb urls - Nutch - [mail # user]
...Hi Michael, If you are using 1.14, there is a parameter -sample that allows you to request a random sample. See Yossi.> -----Original Mess...
   Author: Yossi Tamari , 2018-05-01, 22:06
No internet connection in Nutch crawler: Proxy configuration -PAC file - Nutch - [mail # user]
...To add to what Lewis said, PAC files are mostly used by browsers, not so much by servers (like Nutch). It is possible your IT department has another proxy configuration that you can use in a...
   Author: Yossi Tamari , 2018-04-23, 15:33
[NUTCH-2456] Allow to index pages/URLs not contained in CrawlDb - Nutch - [issue]
...If http.redirect.max is set to a positive value, the Fetcher will follow redirects, creating a new CrawlDatum. If the redirected URL is fetched and parsed, during indexing for it we have a sp...
   Author: Yossi Tamari , 2018-04-23, 08:29