Nutch – Generator

From the command line, the simplest way to call the Generator is:

hostname:${NUTCH_HOME}$ bin/nutch generate <crawldb> <segments_dir>

The Generator selects some URLs from the crawldb and generates a segment from them. The Fetcher then starts its fetch task, working on one segment at a time.

Generator has 3 MapReduce jobs.

  • Select job
  • Partition job
  • Update job (optional)


Select job: selects some URLs from the crawldb for fetching.

  • Mapper: Selector. Applies the URL filters, the fetch schedule, and the scoring filters, and generates a score that is used by the key comparator.
  • Partitioner: Selector. Partitions by host, domain, or IP; see URLPartitioner.
  • Reducer: Selector. Normalizes URLs and limits the URL count per host (generate.max.count, if you set it).
  • Input Path: crawldb/CURRENT
  • Output Path: temp dir
  • Input format:
  • Output format: GeneratorOutputFormat
  • Output key class: FloatWritable
  • Output key comparator: DecreasingFloatComparator
  • Output value class: SelectorEntry (Text, CrawlDatum, IntWritable segnum)
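
The Select job's behaviour can be pictured as "highest score first, with a cap per host". The block below is only a minimal sketch under that reading, not Nutch's implementation: the class `SelectSketch`, the type `Candidate`, and the parameters `topN` and `maxPerHost` (standing in for generate.max.count) are hypothetical names introduced for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SelectSketch {

    /** Hypothetical (url, score) pair; a stand-in for SelectorEntry, not a Nutch class. */
    public static final class Candidate {
        final String url;
        final float score;

        public Candidate(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Returns at most topN URLs, highest score first,
     * keeping no more than maxPerHost URLs for any single host.
     */
    public static List<String> select(List<Candidate> candidates, int topN, int maxPerHost) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        // Highest score first, in the spirit of a decreasing float comparator.
        sorted.sort((a, b) -> Float.compare(b.score, a.score));

        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Candidate c : sorted) {
            if (selected.size() >= topN) {
                break;
            }
            String host = URI.create(c.url).getHost();
            int seen = perHost.getOrDefault(host, 0);
            if (seen >= maxPerHost) {
                continue; // this host has already reached its cap
            }
            perHost.put(host, seen + 1);
            selected.add(c.url);
        }
        return selected;
    }
}
```

With a cap of 1 per host, a second high-scoring URL from the same host is skipped in favour of lower-scoring URLs from other hosts.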

Partition job: partitions the selected URLs for politeness.

  • Mapper: SelectorInverseMapper. Simply collects the data.
  • Partitioner: URLPartitioner. Determines the partition mode (host, domain, or IP). The default is by host, roughly: url.host.hashCode() % task_number
  • Reducer: PartitionReducer. Simply collects the data.
  • Input path: temp dir (the Select job's output path)
  • Input format: SequenceFileInputFormat
  • Map output key class: Text
  • Map output value class: SelectorEntry
  • Output path: segments/${current-segment}/crawl_generate
  • Output Format: SequenceFileOutputFormat
  • Output key comparator: HashComparator
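
The host-based rule quoted above (url.host.hashCode() % task_number) can be sketched as follows. This is an illustration of the idea only: the class `HostPartitionSketch` and the method `partitionFor` are hypothetical names, and the real URLPartitioner may differ in its details.

```java
import java.net.URI;

public class HostPartitionSketch {

    /** Returns a partition index in [0, numPartitions) derived from the URL's host. */
    public static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = url; // fall back to the whole string when no host can be parsed
        }
        // Mask the sign bit so that a negative hashCode still yields a valid index.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the index depends only on the host, every URL of a given host lands in the same partition, and therefore in the same fetch task, which is what makes per-host politeness possible.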

Update job: runs only if you set generate.update.crawldb to true. It updates the generate time of each CrawlDatum, using the latest time.

  • Mapper: CrawlDbUpdater. Simply collects the data.
  • Reducer: CrawlDbUpdater. Sets the generate time on the CrawlDatum object.
  • Input path: crawldb/CURRENT, segments/${current-segment}/crawl_generate
  • Output path: temp dir2
  • Output format: MapFileOutputFormat
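
The "use latest time" rule of the Update job amounts to a per-URL maximum over the generate times seen in its two inputs. A minimal sketch, assuming generate times are plain millisecond timestamps keyed by URL; `UpdateSketch` and `applyGenerateTimes` are hypothetical names, not part of Nutch:

```java
import java.util.HashMap;
import java.util.Map;

public class UpdateSketch {

    /**
     * Merges the generate times recorded in a newly generated segment into the
     * crawldb view, keeping the latest (largest) timestamp for each URL.
     */
    public static Map<String, Long> applyGenerateTimes(Map<String, Long> crawlDb,
                                                       Map<String, Long> generated) {
        Map<String, Long> updated = new HashMap<>(crawlDb);
        for (Map.Entry<String, Long> e : generated.entrySet()) {
            // Long::max keeps whichever timestamp is more recent;
            // a URL missing from crawlDb is simply added by merge.
            updated.merge(e.getKey(), e.getValue(), Long::max);
        }
        return updated;
    }
}
```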

After the Update job, the Generator calls CrawlDb.install(job, crawldbPath), which renames the latest version of the crawldb to CURRENT, so that the whole crawldb is updated.

If there is one part you need to understand exactly, it is probably the Partitioner, because it determines how the fetch tasks are dispatched.


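In short, the Generator takes the crawldb as its input path and creates a timestamp-named segment under the segments directory for the Fetcher to work on; the Fetcher works one segment at a time.
There is not much more to say about the Generator's overall flow.
Its Partitioner does deserve attention, though, because that class governs how the subsequent Fetcher tasks are allocated.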
