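I have been looking into this lately: I need to do secondary development on Nutch to build a crawler.
Below are my notes from reading the book and the code.
Both sources are in English. Part I comes from the Hadoop book and is paraphrased briefly here.
Part II covers the apache-nutch-1.2 source code.
My rule for myself is that specific technical articles get written in English, but these reading notes were first taken in Chinese, so I did not force the issue back then.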

Part I:

Hadoop.The.Definitive.Guide(2009) Page 431/453
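Constraints on generating fetch-lists
1, URLs from the same domain must go into the same partition, so that the load on that domain can be controlled.
2, URLs from the same domain should be spread as far apart as possible, so the crawler does not get blocked. The strategy is to mix them with URLs from other domains.
3, The number of URLs from any one domain must not exceed a limit x, so that big sites do not crowd out small ones.
4, URLs with high scores should come first.
5, The total number of URLs in a fetch-list is also capped, at a limit y.
6, The number of output partitions should match the number of fetching map tasks.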

Steps:
1, Select, sort by score, limit by URL count per host
2, Invert, partition by host, sort randomly
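
To make step 1 and the limits x and y concrete, here is a minimal sketch in plain Java, without Hadoop. Everything in it (the class FetchListSketch, the Candidate type, the select method and its parameters) is a hypothetical name made up for these notes, not Nutch or Hadoop API. It only illustrates "sort by score, cap per host, cap the total".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the Nutch source.
final class FetchListSketch {

    /** A candidate URL with its score (made-up type for this sketch). */
    static final class Candidate {
        final String url;
        final double score;

        Candidate(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Sort by score (highest first), keep at most maxPerHost URLs per host
     * (limit x) and at most maxTotal URLs overall (limit y).
     */
    static List<Candidate> select(List<Candidate> in, int maxPerHost, int maxTotal) {
        List<Candidate> sorted = new ArrayList<>(in);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());

        Map<String, Integer> perHost = new HashMap<>();
        List<Candidate> out = new ArrayList<>();
        for (Candidate c : sorted) {
            if (out.size() >= maxTotal) {
                break; // limit y reached
            }
            String host = URI.create(c.url).getHost();
            int seen = perHost.getOrDefault(host, 0);
            if (seen >= maxPerHost) {
                continue; // limit x reached for this host
            }
            perHost.put(host, seen + 1);
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Candidate> in = new ArrayList<>();
        in.add(new Candidate("http://big.example/a", 0.9));
        in.add(new Candidate("http://big.example/b", 0.8));
        in.add(new Candidate("http://big.example/c", 0.7));
        in.add(new Candidate("http://small.example/a", 0.5));
        for (Candidate c : select(in, 2, 10)) {
            System.out.println(c.score + " " + c.url);
        }
    }
}
```

In the real job the sorting and the per-host counting are spread over a MapReduce map and reduce phase; the sketch collapses them into one in-memory loop.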


Part II:
Basic flow. This list of steps can be seen clearly in the org.apache.nutch.crawl.Crawl class.

1. inject url
2. generate fetch list from crawldb
3. fetch
4. parse
5. update crawldb, then go back to step 2 until the crawl depth is reached
6. invert links, update linkdb
7. index, generate indexes folders
8. delete duplicate indexes, merge indexes, generate index folder
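
The control flow of the list above, in particular the depth loop around steps 2 to 5, can be sketched like this. It is a plain Java toy that only records step names; CrawlLoopSketch and plan are made-up names, and none of the real Nutch tool classes are called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the step order only; no Nutch classes involved.
final class CrawlLoopSketch {

    /** Returns the sequence of steps executed for a given crawl depth. */
    static List<String> plan(int depth) {
        List<String> steps = new ArrayList<>();
        steps.add("inject");                 // step 1, runs once
        for (int round = 0; round < depth; round++) {
            steps.add("generate");           // step 2
            steps.add("fetch");              // step 3
            steps.add("parse");              // step 4
            steps.add("updatedb");           // step 5, then loop back to step 2
        }
        steps.add("invertlinks");            // step 6
        steps.add("index");                  // step 7
        steps.add("dedup");                  // step 8, delete duplicates
        steps.add("merge");                  // step 8, merge indexes
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(plan(2));
    }
}
```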

About the source code: where the notes below say "skip" or "do nothing", the step simply passes its input through without any complex processing.

Injector.inject
two mapred jobs:
1. sort job: only a Mapper (InjectMapper), no Reducer; normalizes/filters URLs and converts them to CrawlDatum objects.
2. merge job: Mapper CrawlDbFilter, Reducer InjectReducer
- Mapper: CrawlDbFilter, does similar work to the sort job's Mapper
- Reducer: InjectReducer, sets CrawlDatum.status, removes duplicate items
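
A rough sketch of what "set status, remove duplicates" in the merge reducer could look like, in plain Java. This is my guess at the logic and not copied from InjectReducer: it assumes that a record already in the crawldb wins over a freshly injected one, and that a new URL ends up as "unfetched". The Status enum, the Datum type and the reduce method are all made-up names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the real InjectReducer works on CrawlDatum objects.
final class InjectMergeSketch {

    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    /** Made-up stand-in for a crawldb record. */
    static final class Datum {
        final Status status;
        final float score;

        Datum(Status status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /**
     * Collapses all records seen for one URL into a single record.
     * Assumption: an existing crawldb record is kept; otherwise the
     * injected record is kept with its status changed to DB_UNFETCHED.
     */
    static Datum reduce(List<Datum> valuesForOneUrl) {
        Datum existing = null;
        Datum injected = null;
        for (Datum d : valuesForOneUrl) {
            if (d.status == Status.INJECTED) {
                injected = new Datum(Status.DB_UNFETCHED, d.score);
            } else {
                existing = d;
            }
        }
        return existing != null ? existing : injected;
    }

    public static void main(String[] args) {
        Datum merged = reduce(Arrays.asList(
            new Datum(Status.INJECTED, 1.0f),
            new Datum(Status.DB_FETCHED, 0.3f)));
        System.out.println(merged.status);
    }
}
```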

Generator.generate
1. select job, Mapper/Partitioner/Reducer: Selector
- Mapper: filter(url, fetch time, generate time) -> sort(score)
- Partitioner: URLPartitioner, by host hashcode (partition by host)
- Reducer: enforces the max URL count per host
2. partition job
- Mapper: SelectorInverseMapper, skip
- Partitioner: URLPartitioner, uses HashComparator, sorts in random order
- Reducer: PartitionReducer, skip
3. update job, Mapper/Reducer: CrawlDbUpdater
- Mapper: do nothing
- Reducer: set generate time?
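
"Partition by host hashcode" boils down to the usual hash-partitioning trick, sketched below in plain Java. HostPartitionSketch and partitionFor are made-up names; the real URLPartitioner may differ in details, so treat this only as the general idea: every URL of one host lands in the same partition.

```java
import java.net.URI;

// Hypothetical illustration of hash partitioning by host; not the Nutch class.
final class HostPartitionSketch {

    /** Maps a URL to a partition in [0, numPartitions) based on its host. */
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Fall back to the whole URL if no host can be extracted.
        int hash = (host == null ? url : host).hashCode();
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("http://example.com/a", 4));
        System.out.println(partitionFor("http://example.com/b?x=1", 4));
    }
}
```

Both calls in main print the same number, because only the host part takes part in the hash.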

Fetcher.fetch, MapRunner: Fetcher
- fetches and stores content, but storing is only controlled by a "store flag"; there is no related ExtensionPoint.

ParseSegment, Mapper/Reducer: ParseSegment
- Mapper: parse content, plugins: kinds of parsers
- Reducer: skip

CrawlDb.update
- Mapper: CrawlDbFilter, normalize, filter
- Reducer: CrawlDbReducer, skip
