Search results 1 to 10 of 83.
Blacklisting TLDs - Nutch - [mail # user]
...I want to blacklist certain top-level domains for a very large web crawl. I tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem to work. My domainblacklist-urlfilte...
   Author: Michael Coffey , 2018-06-14, 17:46
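As a rough illustration of what TLD blacklisting amounts to (a sketch of the intent only; Nutch applies such rules through its urlfilter plugins, and `BLOCKED_TLDS`/`is_blocked` below are hypothetical names, not Nutch APIs):

```python
from urllib.parse import urlparse

# Hypothetical blacklist of top-level domains; illustrative values only.
BLOCKED_TLDS = {"example", "invalid"}

def is_blocked(url):
    """Return True if the URL's host ends in a blacklisted TLD."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else host
    return tld.lower() in BLOCKED_TLDS
```

A regex-based urlfilter can express the same suffix test, which is why the list sometimes suggests it as a fallback when the domainblacklist filter misbehaves.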
[NUTCH-2574] Generator: hostCount >= maxCount comparison wrong - Nutch - [issue]
...In the Generator.Selector.reduce function, there is a comparison of hostCount[1] to maxCount, to determine whether or not to push the current URL to the next segment. The purpose is ...
http://issues.apache.org/jira/browse/NUTCH-2574    Author: Michael Coffey , 2018-06-13, 08:53
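The intended per-host limit can be pictured with a simplified counter (a sketch of the Generator's goal, not Nutch's actual reduce code): a URL should be admitted only while its host's count is still below maxCount, so the direction of the comparison decides whether a host gets one URL too many.

```python
from collections import defaultdict

def select(urls, max_count):
    """Keep at most max_count URLs per host.
    Simplified sketch: urls is a list of (host, url) pairs."""
    host_count = defaultdict(int)
    selected = []
    for host, url in urls:
        if host_count[host] >= max_count:
            continue  # host already at its quota
        host_count[host] += 1
        selected.append(url)
    return selected
```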
[NUTCH-2468] should filter out invalid URLs by default - Nutch - [issue]
...Some Nutch components, by default, should reject invalid URLs. This was recently discussed in the users mailing list and has affected my work for a while. Although there may be some special-...
http://issues.apache.org/jira/browse/NUTCH-2468    Author: Michael Coffey , 2018-05-11, 14:36
RE: random sampling of crawlDb urls - Nutch - [mail # user]
...Just to clarify: .99 does NOT work fine. It should have rejected most of the records when I specified "((Math.random())>=.99)". I have used expressions not involving Math.random. For...
   Author: Michael Coffey , 2018-05-01, 21:18
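For reference, the sampling intent quoted above — an expression like `Math.random() >= .99` should keep roughly 1% of records — can be checked with a quick simulation (a sketch only, not the Jexl expression filter itself):

```python
import random

def sample(records, keep_fraction, rng=random.random):
    """Keep each record independently with probability keep_fraction,
    mirroring a filter expression of the form random() >= (1 - keep_fraction)."""
    return [r for r in records if rng() >= 1.0 - keep_fraction]
```

If such an expression keeps nearly everything instead, the expression is likely not being evaluated per record, which matches the complaint in the thread.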
spilled records from reducer - Nutch - [mail # user]
...Hi Sebastian, thanks for the response. The numbers I gave were for a single reduce task, not a whole job. I'll try to give a better picture. crawldb/current has 161.4 gbytes of data, on about ...
   Author: Michael Coffey , 2018-04-13, 15:31
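Reduce-side spills of this kind are usually tuned through the shuffle buffers. A hedged example for mapred-site.xml (property names from Hadoop 2.x; the values are illustrative only, not a recommendation from the thread):

```xml
<configuration>
  <!-- Fraction of reducer heap used to buffer shuffled map outputs. -->
  <property>
    <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
    <value>0.70</value>
  </property>
  <!-- Buffer fill level that triggers an in-memory merge (and spill). -->
  <property>
    <name>mapreduce.reduce.shuffle.merge.percent</name>
    <value>0.66</value>
  </property>
</configuration>
```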
how could I identify obsolete segments? - Nutch - [mail # user]
...But all the old segment data is still sitting there in hdfs. On Friday, March 23, 2018, 1:34:21 PM PDT, Sebastian Nagel wrote: Hi Michael, when segments are merged only the most re...
   Author: Michael Coffey , 2018-03-23, 21:56
Is there any way to block the hubpages while crawling - Nutch - [mail # user]
...I think you will find that you need different rules for each website and that some amount of maintenance will be needed as the websites change their practices....
   Author: Michael Coffey , 2018-03-20, 15:15
dealing with redirects from http to https - Nutch - [mail # user]
...Thanks for the suggestion. On closer inspection, I see that redirection targets do show up in the crawldb. One problem is that the target urls all have scores equal to zero, because no other ...
   Author: Michael Coffey , 2018-03-09, 22:06
readseg dump and non-ASCII characters - Nutch - [mail # user]
...Not sure it's practical to go around to all the hadoop machines and change their default encoding settings. Not sure it wouldn't break something else! I'm wondering if there's a simple fix I ...
   Author: Michael Coffey , 2017-12-14, 18:30
purging low-scoring urls - Nutch - [mail # user]
...Is it possible to purge low-scoring urls from the crawldb? My news crawl has many thousands of zero-scoring urls and also many thousands of urls with scores less than 0.03. These urls will n...
   Author: Michael Coffey , 2017-12-04, 18:39
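The purge being asked about amounts to a score-threshold filter over crawldb entries. A minimal sketch of that logic (simplified pairs, not real Nutch CrawlDatum records; in practice one would dump, filter, and re-load the crawldb):

```python
def purge_low_scores(entries, min_score=0.03):
    """Drop entries whose score is below min_score.
    entries is an iterable of (url, score) pairs."""
    return {url: score for url, score in entries if score >= min_score}
```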