Nutch web crawler tutorial, Assignments of Natural Language Processing (NLP)

Complete configuration of Nutch with a Solr server

Typology: Assignments

2019/2020

Uploaded on 10/30/2020

arjun_pg

1. Install Solr by downloading and extracting it to a folder.
2. Install Nutch: download the Nutch v1.17 source package, extract it to a folder, then compile it by running ant in that folder:
cd /Home/Username/Apache-Nutch-1.17/
ant
3. Create the resources that our Solr core will use.
All resources for Solr cores are placed in the $SOLR_HOME/server/solr/configsets
directory.
Create a folder for Nutch in this directory:
mkdir -p $SOLR_HOME/server/solr/configsets/nutch/
4. Nutch needs a list of root URLs to start from, so create the urls folder in $NUTCH_HOME:
mkdir -p $NUTCH_HOME/urls
Then create the seed.txt file in that folder and list the root URLs in it, one per line:
touch $NUTCH_HOME/urls/seed.txt
5. To control which links get crawled, edit the rules in $NUTCH_HOME/conf/regex-urlfilter.txt.
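As a minimal sketch of steps 4 and 5, the commands below write one seed URL and a pair of filter rules that restrict the crawl to that site. The host nutch.apache.org, the fallback directory ./nutch-demo and the file name regex-urlfilter.demo.txt are assumptions made only for illustration. The rule syntax (a leading + accepts, a leading - rejects, the first matching rule wins) follows the comments in the regex-urlfilter.txt shipped with Nutch; in a real installation you would edit that file and replace its final catch-all rule instead of writing a new file.

```shell
# Assumption: fall back to a scratch directory when NUTCH_HOME is not set.
NUTCH_HOME="${NUTCH_HOME:-$PWD/nutch-demo}"
mkdir -p "$NUTCH_HOME/urls" "$NUTCH_HOME/conf"

# One root URL per line (the site is only an example).
echo 'https://nutch.apache.org/' > "$NUTCH_HOME/urls/seed.txt"

# Accept only URLs under the example host and reject everything else.
# Written to a separate demo file so an existing regex-urlfilter.txt is untouched.
cat > "$NUTCH_HOME/conf/regex-urlfilter.demo.txt" <<'EOF'
+^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
-.
EOF
```

With rules like these, links that lead off the seed site are dropped before they reach the fetchlist, which keeps a test crawl small.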
1 Inject root URLs into the WebDB (inject).
bin/nutch inject crawl/crawldb urls
2 Generate a fetchlist from the WebDB in a new segment (generate).
bin/nutch generate crawl/crawldb crawl/segments
3 Fetch content from URLs in the fetchlist (fetch).
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
Then parse the fetched content to strip out everything that is not needed:
bin/nutch parse $s1
4 Update the WebDB with links from fetched pages (updatedb).
bin/nutch updatedb crawl/crawldb $s1
5 Repeat steps 2-4 until the required depth is reached.
6 Invert links (invertlinks).
The segments contain a large number of links, so we need a way to tell which ones are
important. Inverting the links records, for each page, the pages that link to it.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
7 Index the fetched pages in Solr server (index).
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1 -filter -normalize
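8 Eliminate duplicate content (and duplicate URLs) from the index (dedup).
bin/nutch dedup crawl/crawldb/
Then index again with -deleteGone so that pages marked as duplicates or gone are removed from Solr:
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1 -filter -normalize -deleteGone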
9 Open localhost:8983/solr in a browser to check the indexed pages in the Solr admin UI.
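The crawl steps above can be strung together in a small script. This is only a sketch under stated assumptions: crawl_rounds, NUTCH and the <segment> placeholder are illustrative names, not part of Nutch; the script is meant to be run from $NUTCH_HOME; and it assumes generate returns a non-zero status when it creates no new segment. It indexes every segment with -dir crawl/segments instead of a single segment variable. When no executable launcher is found, it only prints the commands it would run.

```shell
# Sketch only: crawl for a given number of rounds, then invert links and index into Solr.
# Assumptions: run from $NUTCH_HOME; crawl_rounds and NUTCH are illustrative names.
crawl_rounds() {
  depth="$1"
  nutch="${NUTCH:-bin/nutch}"
  # Without an executable launcher, only print the commands (dry run).
  if [ -x "$nutch" ]; then run() { "$@"; }; else run() { echo "$@"; }; fi

  run "$nutch" inject crawl/crawldb urls
  round=1
  while [ "$round" -le "$depth" ]; do
    # Stop early when generate reports that nothing is left to fetch (assumed exit status).
    run "$nutch" generate crawl/crawldb crawl/segments || break
    # Segment names are timestamps, so the newest one sorts last.
    segment=$(ls -d crawl/segments/2* 2>/dev/null | tail -1)
    segment="${segment:-<segment>}"   # placeholder used only in the dry run
    run "$nutch" fetch "$segment"
    run "$nutch" parse "$segment"
    run "$nutch" updatedb crawl/crawldb "$segment"
    round=$((round + 1))
  done
  run "$nutch" invertlinks crawl/linkdb -dir crawl/segments
  run "$nutch" index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments -filter -normalize
}

crawl_rounds 2
```

Calling crawl_rounds 2 outside a Nutch installation prints the planned command sequence, which is a cheap way to check the order of the steps before starting a real crawl.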