












This presentation is a final-year project submitted toward a degree in Computer Science. It emphasizes applications of computer science. It was supervised by Prof. Ambuja Viral at Bengal Engineering and Science University. Topics include: a PageRank-based prototype search engine, the inverted index, the operational model of the crawler, HTML indexing, and Boolean queries.













Example: We have three documents with ids 1, 2, and 3. The numbers under each document are word positions:
id 1: Web mining is useful.
1 2 3 4
id 2: Usage mining applications.
1 2 3
id 3: Web structure mining studies the Web hyperlink structure.
1 2 3 4 5 6 7 8
Vocabulary (the common words "is" and "the" are left out):
{Web, mining, useful, applications, usage, structure, studies, hyperlink}
Inverted index:
Applications: id 2
Hyperlink: id 3
Mining: id 1, id 2, id 3
Structure: id 3
Studies: id 3
Usage: id 2
Useful: id 1
Web: id 1, id 3
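
The index above can be reproduced with a short Python sketch. It is only an illustration: the tokenizer (lower-cased alphabetic runs) and the stop-word set {"is", "the"} are assumptions chosen so that the output matches the vocabulary shown, not part of the original slides.

```python
import re
from collections import defaultdict

# Assumed stop words: they explain why "is" and "the" are missing
# from the vocabulary in the example.
STOP_WORDS = {"is", "the"}

DOCS = {
    1: "Web mining is useful.",
    2: "Usage mining applications.",
    3: "Web structure mining studies the Web hyperlink structure.",
}

def build_inverted_index(docs):
    """Map each lower-cased term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    # Sort terms alphabetically, as in the listing above.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(DOCS)
for term, ids in index.items():
    print(f"{term}: {ids}")
```

Terms come out lower-cased and in alphabetical order, so "mining" maps to documents 1, 2, and 3 and "web" to documents 1 and 3.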
Swish-e (Simple Web Indexing System for Humans - Enhanced)
Swish-e is a fast, flexible, and free open-source system for indexing collections of Web pages or other files. It is fast at both indexing and searching, and it is highly configurable. Swish-e is designed to index small to medium-sized collections of documents. Since we are building a prototype of a search engine, it meets our need to index about a million documents.
Swish-e CONFIG File
A configuration file controls which files Swish-e indexes, how they are indexed, and where the index is written. The configuration file is a text file composed of comments, blank lines, and configuration directives. Any line whose first non-whitespace character is a # is ignored by Swish-e as a comment.
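
The line rules just described (comments, blank lines, directives) can be illustrated with a small Python sketch. This is not Swish-e's own parser; `parse_config` is a hypothetical helper, the paths in the sample are made up, and quoting of arguments is deliberately ignored.

```python
def parse_config(text):
    """Return (directive, arguments) pairs, skipping blank lines and comments."""
    directives = []
    for line in text.splitlines():
        stripped = line.strip()
        # A line whose first non-whitespace character is '#' is a comment.
        if not stripped or stripped.startswith("#"):
            continue
        name, *args = stripped.split()
        directives.append((name, args))
    return directives

# Sample configuration text; the directive names are Swish-e directives,
# the paths are hypothetical.
SAMPLE = """
# index the HTML files under ./docs
IndexDir ./docs
   # an indented comment is ignored as well
IndexFile ./site.index
IndexOnly .html .htm
"""

for name, args in parse_config(SAMPLE):
    print(name, args)
```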
Indexing HTML on file system with SWISH-E
Operational model of an incremental crawler
The conceptual operation of the crawler can be shown with the following pseudo-code.
Procedure
While (true)
    // select the next page to crawl
    url ← selectToCrawl(AllUrls)
    // crawl the page
    page ← crawl(url)
    if (url is in CollUrls) then
        update(url, page)
    else
        // discard an existing page from the collection
        tmpurl ← selectToDiscard(CollUrls)
        discard(tmpurl)
        // compress the crawled page and save the compressed copy
        compressed ← compress(page)
        save(url, compressed)
        // update CollUrls
        CollUrls ← (CollUrls - {tmpurl}) U {url}
    End if
    // extract links from the (uncompressed) page
    newurls ← extractUrls(page)
    AllUrls ← AllUrls U newurls
End while
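
A minimal runnable sketch of this loop in Python follows. Several things are assumptions made only for illustration: the in-memory `WEB` dictionary stands in for real HTTP fetching, a FIFO queue stands in for selectToCrawl, "discard the oldest stored page" stands in for selectToDiscard, the collection starts empty and discards only once it reaches `capacity`, and the endless loop is bounded by `steps`.

```python
import re
import zlib

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
WEB = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">a</a>',
    "c": '<a href="d">d</a>',
    "d": "no links here",
}

def crawl(url):
    """Fetch a page (here: look it up in the in-memory web)."""
    return WEB.get(url, "")

def extract_urls(page):
    """Extract link targets from the uncompressed page text."""
    return set(re.findall(r'href="([^"]+)"', page))

def run_crawler(seed_urls, capacity, steps):
    all_urls = set(seed_urls)   # AllUrls: every URL known so far
    collection = {}             # CollUrls: url -> compressed page
    queue = list(seed_urls)     # FIFO order stands in for selectToCrawl
    for _ in range(steps):      # bounded stand-in for "while (true)"
        if not queue:
            break
        url = queue.pop(0)
        page = crawl(url)
        if url in collection:
            # update(url, page): refresh the stored copy
            collection[url] = zlib.compress(page.encode())
        else:
            if len(collection) >= capacity:
                # selectToDiscard: drop the oldest stored page (assumption)
                tmpurl = next(iter(collection))
                del collection[tmpurl]
            collection[url] = zlib.compress(page.encode())
        new_urls = extract_urls(page) - all_urls
        all_urls |= new_urls
        queue.extend(sorted(new_urls))
        queue.append(url)       # schedule a later revisit
    return all_urls, collection

all_urls, collection = run_crawler(["a"], capacity=2, steps=6)
print(sorted(all_urls), sorted(collection))
```

The collection never grows beyond `capacity`, while the set of known URLs keeps growing as new links are discovered.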
Architecture of an incremental crawler
The user can use the Boolean operators AND, OR, and NOT to construct complex queries. Such queries consist of terms and Boolean operators. For example, "data OR Web" is a Boolean query that requests documents containing the word "data" or the word "Web". A page is returned for a Boolean query if the query is logically true for the page (i.e., an exact match).
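
Over an inverted index, Boolean operators map naturally onto set operations. The sketch below assumes the small three-document index from the earlier example; the helper names `postings`, `AND`, `OR`, and `NOT` are made up for illustration and no query parser is included.

```python
# Inverted index for the three example documents: term -> set of document ids.
INDEX = {
    "applications": {2},
    "hyperlink": {3},
    "mining": {1, 2, 3},
    "structure": {3},
    "studies": {3},
    "usage": {2},
    "useful": {1},
    "web": {1, 3},
}
ALL_DOCS = {1, 2, 3}

def postings(term):
    """Documents containing the term; unknown terms match nothing."""
    return INDEX.get(term.lower(), set())

def AND(a, b):
    return a & b          # both operands must be true for the page

def OR(a, b):
    return a | b          # either operand may be true

def NOT(a):
    return ALL_DOCS - a   # complement relative to the whole collection

# "data OR Web": "data" is not in the index, so only the "Web" pages match.
print(sorted(OR(postings("data"), postings("Web"))))

# "mining AND NOT Web"
print(sorted(AND(postings("mining"), NOT(postings("Web")))))
```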
A phrase query consists of a sequence of words that makes up a phrase. Each returned document must contain at least one instance of the phrase. In a search engine, a phrase query is normally enclosed in double quotes. For example, one can issue the following phrase query (including the double quotes), "Web mining techniques and applications", to find documents that contain the exact phrase.
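
One common way to answer phrase queries is a positional index, which records where each word occurs in each document. The sketch below is an assumption about how this could be done for the three example documents, not a description of any particular engine; stop words are kept so that positions match the word numbering used in the earlier example.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Web mining is useful.",
    2: "Usage mining applications.",
    3: "Web structure mining studies the Web hyperlink structure.",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_positional_index(docs):
    """term -> {doc_id -> [positions]}, with positions starting at 1."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text), start=1):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents containing the words of the phrase consecutively."""
    terms = tokenize(phrase.strip('"'))
    if not terms:
        return []
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # The k-th word of the phrase must sit at position start + k.
            if all(start + k in index.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

pos_index = build_positional_index(DOCS)
print(phrase_search(pos_index, '"Web mining"'))
print(phrase_search(pos_index, '"Web hyperlink structure"'))
```

Document 3 contains both "Web" and "mining" but not next to each other, so only document 1 matches the first query.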
Boolean Operators
You can use the Boolean operators and, or, near, or not in searching. Without these Boolean operators, Swish-e will assume you are AND-ing the words together.
Wildcards
Two different wildcard characters are available, each with different behavior. The * means "match zero or more characters." The ? means "match exactly one character."
Phrase Searching
To search for a phrase in a document, use double quotes to delimit your search terms.
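
The two wildcard meanings can be tried out against a vocabulary with Python's standard `fnmatch` module, whose `*` and `?` have the same meaning as described above. This only illustrates the matching semantics and is not how Swish-e implements wildcards; note that `fnmatch` additionally treats `[...]` specially, which the sketch does not use. The helper name `expand_wildcard` is hypothetical.

```python
from fnmatch import fnmatchcase

# Vocabulary of the three-document example, lower-cased.
VOCABULARY = [
    "applications", "hyperlink", "mining", "structure",
    "studies", "usage", "useful", "web",
]

def expand_wildcard(pattern, vocabulary=VOCABULARY):
    """Return the vocabulary terms matched by a pattern using * and ?."""
    pattern = pattern.lower()
    return [term for term in vocabulary if fnmatchcase(term, pattern)]

print(expand_wildcard("us*"))   # * matches zero or more characters
print(expand_wildcard("?eb"))   # ? matches exactly one character
print(expand_wildcard("web*"))  # * may also match nothing
```

A search engine would then look up each matched term in the inverted index and combine the resulting document lists.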
Thanks for your patience