Web Searching & Indexing: Inverted Lists, Signature Files, and Ranking Result Pages, Slides of Introduction to Database Management Systems

Duke University Introduction to Database Management Systems

These slides cover techniques used in web searching and indexing: inverted lists, signature files, and the ranking of result pages. Inverted lists store the keyword-document matrix by rows, one list of document ids per keyword, and it helps to keep each list sorted. Signature files store the matrix by columns and compress them into one signature per document, which allows efficient bit-wise comparisons. Result pages need to be ranked, based on content and link structure, because users will not look at all of them.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold 🇺🇸


Web Searching & Indexing

CPS 116

Introduction to Database Systems

Announcements

Homework #4 due on Thursday (Dec. 2)

Homework #3 graded

Available for pick up in my office tomorrow

Course project demo signup begins tomorrow via email

Final exam on Friday, Dec. 10

More info and a brief review this Thursday

Keyword search

(Figure: a Google search page for the query "database AND search", showing three sample result snippets)

Association for Computing Machinery: Founded in 1947, ACM is the world’s first educational and scientific computing society. Today, our members …

CPS 216: Advanced Database Systems (Fall 2001): Course Information, Course Description / Time and Place / Books, Resources: Staff…

The Internet Movie Database (IMDb)… … Search the Internet Movie Database. For more search options, please visit Search central…

What are the documents containing both “database” and “search”?

Keywords × documents

Inverted lists: store the matrix by rows

Signature files: store the matrix by columns

Rows are all keywords; columns are all documents:

             Document 1  Document 2  Document 3  …  Document n
“a”               1           1           1      …       1
“database”        1           1           0      …       0
“cat”             0           0           1      …       0
“dog”             0           1           0      …       1
“search”          0           0           1      …       0
…                 …           …           …      …       …

1 means keyword appears in the document

0 means otherwise

Inverted lists

Store the matrix by rows

For each keyword, store an inverted list

⟨keyword, doc-id-list⟩

⟨“database”, {3, 7, 142, 857, …}⟩

⟨“search”, {3, 9, 192, 512, …}⟩

It helps to sort doc-id-list (why?)

Vocabulary index on keywords

B+-tree or hash-based

How large is an inverted list index?
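A minimal sketch of building such an index may make the structure concrete. It assumes documents arrive as a dictionary from doc id to text and that whitespace splitting is an adequate tokenizer; the function name `build_inverted_index` and the three sample documents are made up for illustration. A Python `dict` stands in for the hash-based vocabulary index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive whitespace tokenizer (assumption)
            postings[word].add(doc_id)
    # The returned dict plays the role of a hash-based vocabulary index.
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical sample documents, not taken from the slides.
docs = {
    3: "database search engines",
    7: "database systems",
    9: "web search",
}
index = build_inverted_index(docs)
print(index["database"])  # [3, 7]
print(index["search"])    # [3, 9]
```

The total size of the index is proportional to the number of distinct (keyword, document) pairs, which is one way to approach the question of how large the index gets.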

Using inverted lists

Documents containing “database”

Use the vocabulary index to find the inverted list for “database”

Return documents in the inverted list

Documents containing “database” AND “search”

Return documents in the intersection of the two inverted lists

OR? NOT?

Union and difference, respectively
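The three Boolean operators can be sketched as list operations. The example below uses only the doc ids that the slides show before the "…", and the function names are illustrative. The merge in `intersect` also suggests one reason sorted lists help: two sorted lists can be intersected in a single linear pass.

```python
def intersect(a, b):
    """AND: merge two sorted doc-id lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: every doc id that appears in either list, sorted."""
    return sorted(set(a) | set(b))

def difference(a, b):
    """a AND NOT b: doc ids in a that do not appear in b."""
    excluded = set(b)
    return [x for x in a if x not in excluded]

# Visible prefixes of the two lists from the slides.
database = [3, 7, 142, 857]
search = [3, 9, 192, 512]
print(intersect(database, search))   # [3]
print(union(database, search))       # [3, 7, 9, 142, 192, 512, 857]
print(difference(database, search))  # [7, 142, 857]
```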


What are “all” the keywords?

All sequences of letters (up to a given length)?

… that actually appear in documents!

All words in English?

Plus all phrases?

Alternative: approximate phrase search by proximity

Minus all stop words

They appear in nearly every document, e.g., a, of, the, it

Not useful in search

Combine words with common stems

Example: database, databases

They can be treated as the same for the purpose of search
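As a rough illustration of the last two points, the sketch below drops stop words and folds plurals together before indexing. The stop-word set contains only the four examples from the slide, and `naive_stem` is a deliberately crude stand-in (it strips a trailing "s"), not a real stemming algorithm.

```python
STOP_WORDS = {"a", "of", "the", "it"}  # the slide's examples; real lists are longer

def naive_stem(word):
    """Crude illustration only: strip a trailing 's'. Not a real stemmer."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def keywords(text):
    """Lowercase, strip punctuation, drop stop words, and stem what remains."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]

print(keywords("The databases of the Web"))  # ['database', 'web']
```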

Frequency and proximity

Frequency

⟨keyword, { ⟨doc-id, number-of-occurrences⟩, ⟨doc-id, number-of-occurrences⟩, … }⟩

Proximity (and frequency)

⟨keyword, { ⟨doc-id, ⟨position-of-occurrence-1, position-of-occurrence-2, …⟩⟩, ⟨doc-id, ⟨position-of-occurrence-1, …⟩⟩, … }⟩

When doing AND, check for positions that are near
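A positional index along these lines could look like the following sketch. Positions are word offsets under a whitespace tokenizer, the frequency is simply the length of a position list, and `near` is a hypothetical helper that treats two words as "near" when some pair of occurrences is at most `max_gap` positions apart. The sample documents are invented.

```python
from collections import defaultdict

def build_positional_index(docs):
    """keyword -> {doc_id: [positions]}; len(positions) gives the frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def near(index, w1, w2, max_gap):
    """Docs where some occurrence of w1 is within max_gap positions of w2."""
    p1, p2 = index.get(w1, {}), index.get(w2, {})
    hits = []
    for doc_id in sorted(p1.keys() & p2.keys()):  # plain AND on doc ids first
        if any(abs(p - q) <= max_gap for p in p1[doc_id] for q in p2[doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical sample documents.
docs = {
    1: "database search is fun",
    2: "database systems rarely mention web search",
}
index = build_positional_index(docs)
print(near(index, "database", "search", 1))  # [1]
print(near(index, "database", "search", 5))  # [1, 2]
```

The nested `any` compares every pair of positions, which is fine for a sketch; since position lists are built in increasing order, a merge-style scan would avoid the quadratic comparison.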

Signature files

Store the matrix by columns and compress them

For each document, store a w-bit signature

Each word is hashed into a w-bit value, with only s < w bits turned on

Signature is computed by taking the bit-wise OR of the hash values of all words in the document

⇒ Some false positives; no false negatives

Example hash values:
hash(“database”) = 0110
hash(“dog”) = 1100
hash(“cat”) = 0010

Resulting signatures:
doc 1 contains “database”: 0110
doc 2 contains “dog”: 1100
doc 3 contains “cat” and “dog”: 1110

Does doc 3 contain “database”?
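The example can be replayed in a few lines. The hash values are the ones given on the slide, hard-coded in a lookup table rather than computed by a real hash function; `signature` and `may_contain` are illustrative names.

```python
# Hash values from the slide (w = 4 bits), hard-coded for illustration.
HASH = {"database": 0b0110, "dog": 0b1100, "cat": 0b0010}

def signature(words):
    """Bit-wise OR of the hash values of all words in the document."""
    sig = 0
    for word in words:
        sig |= HASH[word]
    return sig

def may_contain(sig, word):
    """True if every bit set in the word's hash is also set in the signature."""
    h = HASH[word]
    return (sig & h) == h

doc1 = signature(["database"])
doc3 = signature(["cat", "dog"])
print(format(doc3, "04b"))            # 1110
print(may_contain(doc3, "database"))  # True, although doc 3 lacks the word
print(may_contain(doc1, "dog"))       # False
```

The test on doc 3 succeeds because the bits of "database" (0110) are covered by the bits of "cat" and "dog" together, which is exactly a false positive. A word that is really present always passes the test, so there are no false negatives.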

Bit-sliced signature files

Motivation

To check if a document contains a word, we only need to check the bits that are set in the word’s hash value

So why bother retrieving all w bits of the signature?

Instead of storing n signature files, store w bit slices

Only check the slices that correspond to the set bits in the word’s hash value

Start from the sparse slices

doc   signature (Slice 7 … Slice 0)
1     0 0 0 0 1 0 0 0
2     0 0 0 0 1 0 0 0
3     0 0 0 1 1 0 1 0
4     0 1 1 0 1 1 0 0
…     …
n     0 0 0 0 1 0 1 0

Starting to look like an inverted list again!
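The lookup over slices might be sketched as follows, using the signatures of docs 1 to 4 from the table. Each slice is stored as the set of doc ids whose signature has that bit turned on, which makes the resemblance to an inverted list explicit. The query hash `0b00011000` is a made-up value, and the sketch assumes every word hash has at least one bit set.

```python
# Signatures of docs 1-4 from the slide; bit 7 is the leftmost column.
signatures = {1: 0b00001000, 2: 0b00001000, 3: 0b00011010, 4: 0b01101100}
W = 8

def build_slices(signatures, w):
    """slices[k] = set of doc ids whose signature has bit k turned on."""
    return [{d for d, sig in signatures.items() if (sig >> k) & 1}
            for k in range(w)]

def candidates(slices, word_hash):
    """Docs whose signature covers word_hash, reading only the needed slices."""
    needed = [k for k in range(len(slices)) if (word_hash >> k) & 1]
    # Start from the sparsest slice so the running intersection stays small.
    needed.sort(key=lambda k: len(slices[k]))
    result = None
    for k in needed:
        result = slices[k] if result is None else result & slices[k]
        if not result:
            break  # no document can match any more
    return sorted(result or [])

slices = build_slices(signatures, W)
print(sorted(slices[3]))                # [1, 2, 3, 4]
print(sorted(slices[4]))                # [3]
print(candidates(slices, 0b00011000))   # [3]
```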

Inverted lists versus signatures

Inverted lists better for most purposes (TODS, 1998)

Problems of signature files

False positives

Hard to use because s, w, and the hash function need tuning to work well

Long documents will likely have mostly 1’s in signatures

Common words will create mostly 1’s for their slices

Difficult to extend with features such as frequency, proximity

Saving grace of signature files

Sizes are tunable

Good for lots of search terms

Good for computing similarity of documents

Ranking result pages

A single search may return many pages

A user will not look at all result pages

Complete result may be unnecessary

⇒ Result pages need to be ranked

Possible ranking criteria

Based on content

Number of occurrences of the search terms

Similarity to the query text

Based on link structure

Backlink count

PageRank

And more…

Random surfer model

A random surfer

Starts with a random page

Randomly selects a link on the page to visit next

Never uses the “back” button

PageRank(p) measures the probability that a random surfer visits page p
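One way to compute these visit probabilities is to push probability mass along the links repeatedly until it settles. The sketch below does that on a hypothetical three-page graph (A links to B and C, B links to C, C links to A); the graph, the page names, and the fixed iteration count are all assumptions for illustration.

```python
def naive_pagerank(links, iterations=100):
    """Repeatedly redistribute each page's probability evenly over its links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # surfer starts at a random page
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for q, outs in links.items():
            for p in outs:
                nxt[p] += rank[q] / len(outs)  # each link is equally likely
        rank = nxt
    return rank

# Hypothetical graph: every page has at least one outgoing link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = naive_pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
# approximately A: 0.4, B: 0.2, C: 0.4
```

On this graph the values settle because every page can reach every other page and the cycles have different lengths; the probabilities keep summing to 1.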

Problems with the naïve PageRank

Dead end: a page with no outgoing links

A dead end causes all importance to “leak” eventually out of the Web

Spider trap: a group of pages with no links out of the group

A spider trap will eventually accumulate all importance of the Web

(Figures: two example link graphs over the pages Netscape, Amazon, and Microsoft)
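Both problems show up quickly in a small experiment. The sketch below runs a naive iteration, in which each page splits its importance evenly over its outgoing links, on two hypothetical three-page graphs. The page names A, B, C and the edges are invented and are not the graphs from the figures.

```python
def naive_pagerank(links, iterations=50):
    """Each page splits its current importance evenly over its outgoing links."""
    rank = {p: 1.0 / len(links) for p in links}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in links}
        for q, outs in links.items():
            for p in outs:  # a page with no links passes nothing on
                nxt[p] += rank[q] / len(outs)
        rank = nxt
    return rank

# Dead end: C has no outgoing links, so importance reaching C disappears.
dead_end = naive_pagerank({"A": ["B", "C"], "B": ["C"], "C": []})
print(sum(dead_end.values()))  # 0.0

# Spider trap: C links only to itself, so importance reaching C never leaves.
trap = naive_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["C"]})
print(round(trap["C"], 6), trap["A"], trap["B"])  # 1.0 0.0 0.0
```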

Practical PageRank

d: decay factor

PageRank(p) = d · Σ_{q ∈ B(p)} (PageRank(q) / N(q)) + (1 − d)

Intuition in the random surfer model

A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link on the current page
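The formula can be iterated directly. The sketch below assumes that B(p) denotes the pages linking to p and N(q) the number of outgoing links of q (the usual reading, though this preview does not define them), and it picks d = 0.85 and a fixed iteration count as illustrative values. The graph is a hypothetical spider trap in which C links only to itself.

```python
def practical_pagerank(links, d=0.85, iterations=100):
    """Iterate PageRank(p) = d * sum(PageRank(q) / N(q) for q in B(p)) + (1 - d)."""
    rank = {p: 1.0 for p in links}
    for _ in range(iterations):
        nxt = {p: 1.0 - d for p in links}  # the "bored surfer" term
        for q, outs in links.items():      # q contributes to every p it links to
            for p in outs:
                nxt[p] += d * rank[q] / len(outs)
        rank = nxt
    return rank

# Hypothetical spider trap: C links only to itself.
links = {"A": ["B", "C"], "B": ["C"], "C": ["C"]}
ranks = practical_pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
# approximately A: 0.15, B: 0.214, C: 2.636
```

With the (1 − d) term, the trap still collects the largest share, but A and B keep a nonzero rank instead of draining to zero. In this unnormalized form the ranks sum to the number of pages when there are no dead ends.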

Google (1998)

Inverted lists in practice contain a lot of context information

(Figure labels: Capitalization; Relative font size; In URL/title/meta tag; In anchor text; Within the page; Within the page; Within the anchor; URL associated with the anchor)

PageRank is not the final ranking

Type-weight: depends on the type of the occurrence

For example, a large font is weighted more than a small font

Count-weight: depends on the number of occurrences

Increases linearly first but then tapers off

For multiple search terms, nearby occurrences are matched together and a proximity measure is computed

Closer proximity is weighted more


A tree with edges labeled by characters

A node represents the string obtained by

Compact trie: replace a path without branches by a

Internal nodes have fan-out ≥ 2 (except the root)

No two edges out of the same node can share the

Instead of inlining the string labels, store pointers to

Instead of labeling each edge by a string, only label by the

Leaves point to strings

A Pat tree indexes all suffixes of a string in a Patricia trie

A String B-tree uses a Patricia trie to store and compare

General tree-based string indexing tricks

Trie, Patricia trie, String B-tree

Two general ways to index for substring queries

Index words: inverted lists, signature files

Index all suffixes: suffix tree, Pat tree, suffix array (not

Web search and information retrieval go beyond

IDF, PageRank, …