Nutch – Generator
From the command line, the simplest way to invoke the Generator is:
hostname:${NUTCH_HOME}$ bin/nutch generate <crawldb> <segments_dir>
Generator selects some URLs from the crawldb and generates a segment. After this, the Fetcher starts its fetch task, working on one segment at a time.
Generator has 3 MapReduce jobs.
- Select job
- Partition job
- Update job (optional)
Select job: selects some URLs from the crawldb for fetching.
- Mapper: Selector, applies URL filters, the fetch schedule, and scoring filters, and generates the score that the key comparator uses
- Partitioner: Selector, partitions by host/domain/IP, see URLPartitioner.
- Reducer: Selector, normalizes URLs and limits the URL count per host (generate.max.count, if set).
- Input Path: crawldb/CURRENT
- Output Path: temp dir
- Input format:
- Output format: GeneratorOutputFormat
- Output key class: FloatWritable
- Output key comparator: DecreasingFloatComparator
- Output value class: SelectorEntry (Text, CrawlDatum, IntWritable segnum)
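The net effect of the Select job can be sketched in plain Java, without Hadoop. This is only an illustration under stated assumptions, not Nutch's code: the class SelectSketch, its Candidate type, and the select and hostOf methods are hypothetical names, and treating a non-positive limit as "no limit" is an assumption of the sketch. It orders candidates by decreasing score (the role DecreasingFloatComparator plays for the output key) and caps the number of URLs kept per host (the role of generate.max.count in the reducer).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch source: plain Java that mimics the Select job's effect.
class SelectSketch {

    // Stand-in for a crawldb entry: a URL and the score the mapper assigned to it.
    static final class Candidate {
        final String url;
        final float score;

        Candidate(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Returns the host part of a URL, or the URL itself when no host can be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null ? host : url;
        } catch (IllegalArgumentException e) {
            return url;
        }
    }

    // Sorts by decreasing score, then keeps at most maxPerHost URLs for each host.
    // Assumption of this sketch: maxPerHost <= 0 means "no per-host limit".
    static List<String> select(List<Candidate> candidates, int maxPerHost) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));

        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Candidate c : sorted) {
            String host = hostOf(c.url);
            int count = perHost.getOrDefault(host, 0);
            if (maxPerHost > 0 && count >= maxPerHost) {
                continue; // this host already reached its quota
            }
            perHost.put(host, count + 1);
            selected.add(c.url);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Candidate> in = new ArrayList<>();
        in.add(new Candidate("http://a.example/3", 0.7f));
        in.add(new Candidate("http://b.example/1", 0.5f));
        in.add(new Candidate("http://a.example/1", 0.9f));
        in.add(new Candidate("http://a.example/2", 0.8f));
        System.out.println(select(in, 2));
    }
}
```

In the real job the sort happens in the MapReduce shuffle rather than in memory; the sketch only shows why the highest-scoring URLs of each host survive the limit.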
Partition job: partitions the selected URLs for politeness.
- Mapper: SelectorInverseMapper, which just collects data.
- Partitioner: URLPartitioner, which determines the partition mode. The default is nominally by domain, but in practice it is by host, roughly: url.host.hashCode() % task_number
- Reducer: PartitionReducer, which just collects data.
- Input path: temp dir (the Select job's output path)
- Input format: SequenceFileInputFormat
- Map output key class: Text
- Map output value class: SelectorEntry
- Output path: segments/${current-segment}/crawl_generate
- Output Format: SequenceFileOutputFormat
- Output key comparator: HashComparator
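The host-based rule quoted above, url.host.hashCode() % task_number, can be sketched as a standalone function. This is a minimal illustration, not the URLPartitioner source: HostPartitionSketch and partitionFor are hypothetical names, and masking the hash with Integer.MAX_VALUE is added here so that a negative hashCode cannot yield a negative partition index.

```java
import java.net.URI;

// Hypothetical sketch, not Nutch source: assigns a URL to a partition by its host.
class HostPartitionSketch {

    // All URLs of the same host map to the same partition, so a single fetch
    // task handles that host and can apply per-host politeness delays.
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        String key = (host != null) ? host : url; // fall back to the whole URL
        // Mask the sign bit so the result is always in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int tasks = 4;
        String[] urls = {
            "http://a.example/page1",
            "http://a.example/other/page2",
            "http://b.example/index.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> partition " + partitionFor(url, tasks));
        }
    }
}
```

Because the partition depends only on the host, every URL of one host lands in the same part of crawl_generate, which is what lets the Fetcher stay polite toward that host.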
Update job: runs only if generate.update.crawldb is set to true. It updates each CrawlDatum's generate time, using the latest time.
- Mapper: CrawlDbUpdater, which just collects data.
- Reducer: CrawlDbUpdater, which sets the generate time on the CrawlDatum object
- Input path: crawldb/CURRENT, segments/${current-segment}/crawl_generate
- Output path: temp dir2
- Output format: MapFileOutputFormat
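The "use latest time" rule of the Update job's reducer can be illustrated with a small function. This is a sketch under assumptions, not the CrawlDbUpdater code: UpdateSketch and mergeGenerateTime are hypothetical names, and timestamps are modeled as plain long values in milliseconds. For one URL, the reducer sees the time already stored in the crawldb plus any times coming from crawl_generate, and keeps the newest one.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Nutch source: keeps the latest generate time for one URL.
class UpdateSketch {

    // storedTime:     generate time currently recorded in the crawldb entry
    // generatedTimes: generate times seen for the same URL in crawl_generate
    // Returns the newest of all of them.
    static long mergeGenerateTime(long storedTime, List<Long> generatedTimes) {
        long latest = storedTime;
        for (long t : generatedTimes) {
            if (t > latest) {
                latest = t;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println(mergeGenerateTime(100L, Arrays.asList(150L, 120L)));
        System.out.println(mergeGenerateTime(100L, Collections.<Long>emptyList()));
    }
}
```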
After the Update job, Generator calls CrawlDb.install(job, crawldbPath), which renames the latest version of the crawldb to CURRENT, so the whole crawldb is updated.
If there is one thing you should understand exactly, it is probably the Partitioner, since it determines how fetch tasks are dispatched.
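Nutch's Generator takes the crawldb as its input path and produces, under the segments directory, a segment named by timestamp for the Fetcher to work on; the Fetcher works one segment at a time.
There is not much else to say about the Generator's overall flow.
Its Partitioner deserves attention, though, because this class determines how the subsequent Fetcher tasks are distributed.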
Between the two, it is probably a URLFilter, but maybe I missed another filter that could be used. With a custom URLFilter, can I limit its use to, say, the Generator? Or is it going to be called in every class that uses URLFilters? I see URLFilters used in the map function in Generator if crawl.generate.filter is set; it is not found in the nutch-default.xml file but defaults to TRUE. It is also found in Injector, in LinkDB and LinkDBFilter (both if linkdb.url.filters is set, which defaults to TRUE), in ParseOutputFormat, in SegmentMerger if segment.merger.filter is set, and probably somewhere else. As I understand it, it filters URLs that are about to enter the CrawlDB during UpdateDB as well as URLs read from the CrawlDB by the Generator.