Nutch – Injector
Injector
From the command line, call Injector like this:
hostname:${NUTCH_HOME}$ bin/nutch Injector <crawldb> <url-dir>
This shows that Injector’s input path is url-dir
and its output path is crawldb.
Injector runs two Map-Reduce jobs: a sort job and a merge job.
Sort job
The sort job has only a Mapper, which converts each URL string into <Text, CrawlDatum> format.
If you want to set other properties on the CrawlDatum object, this is the place to modify.
The Mapper also runs the URL normalizers, URL filters and scoring filters.
CrawlDatum holds a MapWritable object that stores metadata.
However, this metadata is not carried over to later objects (such as Content); you have to write that part yourself.
In your URL file, you can set metadata with text like this (fields are separated by \t):
http://www.somesite.com/ metaKey1=value1 metaKey2=anotherValue
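To make the line format concrete, here is a minimal sketch of how one seed line could be split into a URL and its metadata pairs. SeedLine and parse are hypothetical names made up for this illustration and are not Nutch classes; the real Mapper stores the pairs in the CrawlDatum's MapWritable rather than in a plain Java map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch.
final class SeedLine {
    final String url;
    final Map<String, String> metadata;

    SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    // Splits one line of the URL file on tabs: the first field is the URL,
    // every following field is expected to look like key=value.
    static SeedLine parse(String line) {
        String[] fields = line.trim().split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // no key before '=' (or no '=' at all): skip the field
            }
            meta.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
        }
        return new SeedLine(fields[0], meta);
    }

    public static void main(String[] args) {
        SeedLine s = parse("http://www.somesite.com/\tmetaKey1=value1\tmetaKey2=anotherValue");
        System.out.println(s.url + " " + s.metadata);
    }
}
```

Only the first '=' in a field is treated as the separator, so a value may itself contain '=' characters.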
- Mapper: InjectorMapper
- Input Path: url-dir
- Input format: string line
- Output Path: temp-dir
- Output format: SequenceFileOutputFormat
- Output key class: Text
- Output value class: CrawlDatum
Merge job
- Mapper: CrawlDbFilter. Like the sort job’s Mapper, it runs URL-Normalize and URL-Filter.
- Reducer: InjectReducer, which filters out duplicate links and sets or updates some properties of CrawlDatum.
- Input path: temp-dir, the sort job’s output path (if crawldb/CURRENT exists, that path is added to the input paths as well)
- Output path: a directory with a random-number name inside crawldb-dir
- Input format: SequenceFileInputFormat
- Output format: MapFileOutputFormat
- Output key class: Text
- Output value class: CrawlDatum
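The duplicate filtering in the Reducer can be pictured roughly as follows. This is a sketch under an assumption, not the actual InjectReducer code: it assumes that an entry already in the crawldb wins over a freshly injected one, and that a new URL is stored as "unfetched". Status, Datum and merge are hypothetical stand-ins, and the real behavior may differ between Nutch versions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not Nutch classes.
final class InjectMergeSketch {
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    static final class Datum {
        final Status status;
        final float score;

        Datum(Status status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    // Receives every value the reducer sees for ONE url and returns the
    // single entry that is written to the new crawldb version.
    static Datum merge(List<Datum> valuesForOneUrl) {
        Datum existing = null;
        Datum injected = null;
        for (Datum d : valuesForOneUrl) {
            if (d.status == Status.INJECTED) {
                // Assumption: a newly injected url starts out as "unfetched".
                injected = new Datum(Status.DB_UNFETCHED, d.score);
            } else {
                existing = d;
            }
        }
        // Assumption: an entry already in the crawldb is kept as-is.
        return existing != null ? existing : injected;
    }

    public static void main(String[] args) {
        Datum fromDb = new Datum(Status.DB_FETCHED, 2.0f);
        Datum fromSeed = new Datum(Status.INJECTED, 1.0f);
        System.out.println(merge(Arrays.asList(fromSeed, fromDb)).status); // DB_FETCHED
        System.out.println(merge(Arrays.asList(fromSeed)).status);         // DB_UNFETCHED
    }
}
```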
After the sort job and the merge job, Injector calls CrawlDb.install(job, crawldbPath), which renames the latest crawldb version to CURRENT.
The injection is then done.
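The renaming step can be sketched on a local file system with java.nio.file. This is only an illustration of the idea, not the Nutch implementation, which works on Hadoop's FileSystem. The class InstallSketch, the literal directory name "current", and the backup directory "old" for the previous version are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem sketch; not the Nutch CrawlDb class.
final class InstallSketch {
    // Makes newVersion (the merge job's output dir) the current crawldb.
    static void install(Path crawlDb, Path newVersion) throws IOException {
        Path current = crawlDb.resolve("current"); // assumed directory name
        Path old = crawlDb.resolve("old");         // assumed backup name
        if (Files.exists(current)) {
            deleteRecursively(old);     // drop the previous backup, if any
            Files.move(current, old);   // keep the last version as backup
        }
        Files.move(newVersion, current);
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Keeping the previous version around as a backup is a design choice of this sketch; it means a failed injection can be rolled back by renaming the backup directory again.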