My previous note on Nutch in this blog was too simple, so I have decided to write more, as a series.

The series will focus on:

  1. Inject
  2. Generate
  3. Fetch

Nutch is a search engine implemented in Java. Its crawler is distributed and based on Hadoop and Lucene.
Its architecture is open and pluggable.
You can write your own plugins for special tasks; plugins are decoupled from the Nutch core.
Several tasks in Nutch, such as Inject, Generate, Fetch, and Index, are designed as Hadoop MapReduce jobs.
The final result (the indexed pages) is stored in a Lucene index.
All data lives in HDFS.

You can find more information on the official Nutch wiki: http://wiki.apache.org/nutch/

I am writing this post to share some points I have realized through experience. It is mostly about the Nutch crawler and does not cover the Nutch searcher.

It is not easy to build a real-time search engine on top of Nutch. In other words, it is not easy to get a quick reaction from Nutch, because Nutch launches each task as a Hadoop MapReduce job, and job setup and initialization take a long time.
A Fetch task fetches the list of URLs in one segment, and a segment is generated from the crawldb. All of this data is stored as Map/Sequence files. Because they live in HDFS, they are all “write once”. So if you want Nutch to crawl some URLs quickly, you will need to implement some extra work yourself.
It is not easy, but it is possible, and it requires substantial changes to Nutch’s original architecture.
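
To illustrate why urgent URLs are awkward, here is a minimal in-memory sketch of the batch cycle described above. It is not Nutch code: `CrawlCycleSketch`, `inject`, `generate`, and `updateDb` are hypothetical names, and plain Java collections stand in for the crawldb and segment files. The assumption it models is that every step writes a new copy instead of editing data in place, and that a segment is a frozen snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the crawl cycle; not Nutch's real classes.
class CrawlCycleSketch {
    // Stand-in for the crawldb: URL -> status ("unfetched" or "fetched").
    private Map<String, String> crawlDb = new TreeMap<>();

    // Inject: seed URLs enter the crawldb as unfetched entries.
    void inject(List<String> urls) {
        // Write a new copy instead of editing in place ("write once").
        Map<String, String> next = new TreeMap<>(crawlDb);
        for (String url : urls) {
            next.putIfAbsent(url, "unfetched");
        }
        crawlDb = next;
    }

    // Generate: snapshot all unfetched URLs into an immutable "segment".
    List<String> generate() {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            if (e.getValue().equals("unfetched")) {
                fetchList.add(e.getKey());
            }
        }
        return Collections.unmodifiableList(fetchList);
    }

    // Update: after a segment is fetched, mark its URLs as fetched in a new copy.
    void updateDb(List<String> fetchedSegment) {
        Map<String, String> next = new TreeMap<>(crawlDb);
        for (String url : fetchedSegment) {
            next.put(url, "fetched");
        }
        crawlDb = next;
    }

    public static void main(String[] args) {
        CrawlCycleSketch crawl = new CrawlCycleSketch();
        crawl.inject(List.of("http://a.example/", "http://b.example/"));
        List<String> segment1 = crawl.generate();

        // An urgent URL arrives after the segment was generated.
        crawl.inject(List.of("http://urgent.example/"));
        System.out.println("segment 1: " + segment1);

        crawl.updateDb(segment1);
        System.out.println("segment 2: " + crawl.generate());
    }
}
```

The urgent URL does not appear in segment 1, only in segment 2, so in this model it waits for a full generate/fetch/update round. In a real deployment each of those steps would also pay the job startup cost.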

The HDFS API, on the other hand, is very friendly and easy to use.

In the Inject step, you can store custom metadata in the CrawlDatum object, but this metadata is not carried over to later objects such as org.apache.nutch.protocol.Content. Maybe Nutch’s designers assume you can read the metadata back from crawl_generate or the crawldb. In my opinion, it would be better to let the later objects carry this metadata.
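
A minimal sketch of what carrying the metadata along could look like. Plain Java maps stand in for the metadata attached to CrawlDatum and Content; `MetadataCarrySketch`, `carryOver`, and the `inject.` key prefix are hypothetical and not part of Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for CrawlDatum and Content metadata.
class MetadataCarrySketch {
    // Copy inject-time metadata into the content metadata under a prefix,
    // without overwriting anything that was set during the fetch.
    static Map<String, String> carryOver(Map<String, String> datumMeta,
                                         Map<String, String> contentMeta) {
        Map<String, String> merged = new HashMap<>(contentMeta);
        for (Map.Entry<String, String> e : datumMeta.entrySet()) {
            merged.putIfAbsent("inject." + e.getKey(), e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put("category", "news");            // set by the user at inject time

        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put("Content-Type", "text/html"); // set during fetch

        Map<String, String> merged = carryOver(datumMeta, contentMeta);
        System.out.println(merged.get("inject.category"));
        System.out.println(merged.get("Content-Type"));
    }
}
```

The prefix is only there to keep user-supplied keys from colliding with keys written by the fetcher; any other naming scheme would do.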

The URL filter is not flexible. The stock URL filter reads a configuration file once and then filters every URL with the rules it read. But I want to filter URLs with custom rules that depend on context. For example, URL c might be rejected when it is found in page a, but accepted when it is found in page b.
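
The page a / page b example can be sketched as a filter that also receives the page on which the link was found. Everything here is hypothetical: `ContextualUrlFilter`, `PerSourceFilter`, and `rejectFrom` are names made up for illustration, and the convention of returning null to drop a URL is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical interface: unlike a filter that sees only the URL,
// this one also receives the page on which the link was found.
interface ContextualUrlFilter {
    // Returns the URL to keep it, or null to drop it.
    String filter(String sourcePageUrl, String targetUrl);
}

// Applies a per-source-page rule; links from pages without a rule are kept.
class PerSourceFilter implements ContextualUrlFilter {
    private final Map<String, Pattern> rejectBySource = new HashMap<>();

    // Register a regex for links that should be dropped when found on this page.
    void rejectFrom(String sourcePageUrl, String targetRegex) {
        rejectBySource.put(sourcePageUrl, Pattern.compile(targetRegex));
    }

    @Override
    public String filter(String sourcePageUrl, String targetUrl) {
        Pattern reject = rejectBySource.get(sourcePageUrl);
        if (reject != null && reject.matcher(targetUrl).find()) {
            return null;
        }
        return targetUrl;
    }
}

class ContextualFilterDemo {
    public static void main(String[] args) {
        PerSourceFilter filter = new PerSourceFilter();
        filter.rejectFrom("http://example.com/a", "/c$");

        String c = "http://example.com/c";
        System.out.println(filter.filter("http://example.com/a", c)); // dropped
        System.out.println(filter.filter("http://example.com/b", c)); // kept
    }
}
```

The same URL is dropped when it comes from page a and kept when it comes from page b, which a filter that sees only the target URL cannot express.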

From a quick look at the source code, Nutch currently does not seem to cache resolved hostname-to-IP mappings to minimize DNS lookups.
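
A minimal sketch of such a cache, assuming a simple in-process map is enough. `DnsCacheSketch` is a hypothetical class, and the resolver is passed in as a function so the example runs without touching the network; in a real fetcher it would wrap something like `InetAddress.getByName`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical host -> IP cache; the resolver is injected so the sketch
// can be exercised without real DNS traffic.
class DnsCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;

    DnsCacheSketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Resolve a host, calling the underlying resolver at most once per host.
    // Host names are case-insensitive, so the key is lower-cased first.
    String resolve(String host) {
        return cache.computeIfAbsent(host.toLowerCase(Locale.ROOT), resolver);
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        DnsCacheSketch dns = new DnsCacheSketch(host -> {
            lookups[0]++;        // count calls to the underlying resolver
            return "192.0.2.1";  // fake answer from the documentation address range
        });

        dns.resolve("example.com");
        dns.resolve("EXAMPLE.com"); // served from the cache
        System.out.println("lookups: " + lookups[0]);
    }
}
```

This sketch never expires entries, so a long-running crawler would need a TTL on top of it. The JVM also keeps its own cache of `InetAddress` lookups (controlled by the `networkaddress.cache.ttl` security property), so it is worth measuring before adding another layer.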

In the Fetch step, the MapReduce framework divides the fetch list (in one segment) into several parts and dispatches them to the worker nodes. The Fetch step finishes only after every node has finished its work.
For politeness, all URLs of one domain are fetched on one node. So if one domain is slower than the others (due to bandwidth), the whole step is delayed. This is an example of the Law of the Minimum.
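
The effect can be shown with a small model. This is a sketch under stated assumptions, not Nutch's real partitioner: `FetchStragglerSketch`, `nodeFor`, and `jobSeconds` are hypothetical, URLs are assigned to nodes by a hash of the host, each node fetches its pages one after another, and the per-page costs are made-up numbers.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of host-based partitioning; not Nutch's real partitioner.
class FetchStragglerSketch {
    // All URLs of one host map to the same node, which keeps politeness local.
    static int nodeFor(String url, int nodes) {
        String host = URI.create(url).getHost();
        return Math.floorMod(host.hashCode(), nodes);
    }

    // Each node works through its URLs sequentially;
    // the job ends when the slowest node ends.
    static long jobSeconds(List<String> urls, int nodes,
                           Map<String, Long> secondsPerPageByHost) {
        long[] busy = new long[nodes];
        for (String url : urls) {
            String host = URI.create(url).getHost();
            busy[nodeFor(url, nodes)] += secondsPerPageByHost.get(host);
        }
        long slowest = 0;
        for (long b : busy) {
            slowest = Math.max(slowest, b);
        }
        return slowest;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://fast.example/page" + i);
            urls.add("http://slow.example/page" + i);
        }
        // Made-up costs: the slow host takes ten times longer per page.
        Map<String, Long> cost = Map.of("fast.example", 1L, "slow.example", 10L);
        System.out.println("job takes " + jobSeconds(urls, 2, cost) + "s");
    }
}
```

The total work here is 110 seconds, so two perfectly balanced nodes would need 55 seconds. Because the slow host cannot be split across nodes, the job needs at least 100 seconds however the hosts happen to hash.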

———————

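Come on, can the websites here in China please be a bit more rigorous about scholarship!
I wore myself out hunting for the correct English translation of 木桶理论 (literally “wooden barrel theory”, the idea I called the Law of the Minimum above). These crappy domestic sites all just make things up, and it is damn misleading!!!
Every one of them is a word-for-word translation: cask principle, short-board effect, barrel theory, cask theory, pail theory, cannikin law, and even the inverse slogan “Build your performance on strength, not weakness”.
Damn, I could not find an original English source for any of these...

In the end, the search terms were ones I cobbled together myself:
carry water shortest wooden
--> http://www.theforgottenways.org/mpulse/
Then I switched to image search:
minimum factor -> the minimum barrel
And only then did I finally find:
Law of the Minimum
--> http://en.wikipedia.org/wiki/Liebig’s_law_of_the_minimum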
