

Total Configuration of Nutch with Solr Server


mkdir -p $NUTCH_HOME/urls
We'll also create a seed.txt file in the folder we just created. It will hold the root URLs, one per line.
touch $NUTCH_HOME/urls/seed.txt
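An empty seed.txt gives the crawler nothing to start from, so the file needs at least one root URL. A minimal sketch of filling it, assuming one URL per line; the URL below is only a placeholder, and the fallback to a scratch directory when NUTCH_HOME is unset is an assumption added for a safe dry run:

```shell
# Assumption: NUTCH_HOME points at the Nutch install; use a scratch directory if it is unset.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"

# Create the urls folder (no error if it already exists).
mkdir -p "$NUTCH_HOME/urls"

# Append one root URL per line; replace the placeholder with the sites to crawl.
printf '%s\n' 'https://nutch.apache.org/' >> "$NUTCH_HOME/urls/seed.txt"

# Show what the inject step will read.
cat "$NUTCH_HOME/urls/seed.txt"
```

Appending with >> rather than overwriting keeps any URLs already listed in the file.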
1 Inject root URLs into the WebDB (inject).
bin/nutch inject crawl/crawldb urls
2 Generate a fetchlist from the WebDB in a new segment (generate).
bin/nutch generate crawl/crawldb crawl/segments
3 Fetch content from URLs in the fetchlist (fetch). First store the path of the newest segment in a variable:
s=$(ls -d crawl/segments/2* | tail -1)
bin/nutch fetch $s
Then parse the fetched content to strip the markup and extract the text and outlinks:
bin/nutch parse $s
4 Update the WebDB with links from fetched pages (updatedb).
bin/nutch updatedb crawl/crawldb $s
5 Repeat steps 2-4 until the required depth is reached.
6 Invert the links (invertlinks). Because the segments contain so many links, we need a way to tell which ones are important; this step builds a link database that records the inbound links and anchor text for each URL.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
7 Index the fetched pages in the Solr server (index).
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s -filter -normalize
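8 Eliminate duplicate content (and duplicate URLs) from the index (dedup). First mark the duplicates in the crawldb with dedup, then re-run the index command with -deleteGone:
bin/nutch dedup crawl/crawldb/
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s -filter -normalize -deleteGone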
9 Open the Solr admin interface in a browser to check the indexed documents: localhost:8983/solr
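The repeated rounds of generate, fetch, parse and updatedb lend themselves to a small loop. The following is only a sketch of steps 1 to 6, not an official Nutch script: DEPTH, DRY_RUN, PLAN and the run and latest_segment helpers are illustrative names that do not come from Nutch. With DRY_RUN=1 (the default here) the commands are printed rather than executed, so the plan can be reviewed before a real crawl.

```shell
# Sketch: inject once, then repeat generate -> fetch -> parse -> updatedb DEPTH times.
# Assumption: run from the Nutch home directory so that bin/nutch resolves.
DEPTH="${DEPTH:-2}"       # number of crawl rounds (hypothetical setting)
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
PLAN=""                   # every command line issued, kept for review

# Record a command, then print it (dry run) or execute it.
run() {
  PLAN="${PLAN}$*
"
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Path of the newest segment; a placeholder during a dry run,
# because no segment directory exists yet.
latest_segment() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "crawl/segments/SEGMENT"
  else
    ls -d crawl/segments/2* | tail -1
  fi
}

run bin/nutch inject crawl/crawldb urls

round=1
while [ "$round" -le "$DEPTH" ]; do
  run bin/nutch generate crawl/crawldb crawl/segments
  s=$(latest_segment)
  run bin/nutch fetch "$s"
  run bin/nutch parse "$s"
  run bin/nutch updatedb crawl/crawldb "$s"
  round=$((round + 1))
done

run bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```

Setting DRY_RUN=0 and DEPTH to the required depth before running the block performs the actual crawl; indexing into Solr is left as a separate manual step.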