Advanced Database Systems-Lecture 10 Slides-Computer Science, Slides of Database Management Systems (DBMS)

Duke University Database Management Systems (DBMS)

Indexing, Keywords × Documents, Keyword Search, Inverted Lists, Frequency and Proximity, Signature Files, Bit-sliced Signature Files, Motivation, Ranking Result Pages, Textual Similarity, Content-based Ranking, Backlink, Intuition, Google’s PageRank, Naïve PageRank, Random Surfer Model, Dead End, Spider Trap, Practical PageRank, Suffix Arrays, Trie: a String Index, Suffix Tree, Patricia Trie, Pat Tree, String B-tree

Typology: Slides

2011/2012

Uploaded on 01/28/2012

arold 🇺🇸


Indexing: Part IV

CPS 216

Advanced Database Systems

Announcements (February 17)

Homework #2 due in two weeks

Reading assignments for this and next week

“The” query processing survey by Graefe

Due next Wednesday

Midterm and course project proposal in three weeks

Keyword search

Example documents:

Google… Web | Images | Groups | Directory. Google Search | I’m Feeling Lucky. Advanced Search | Preferences | Language Tools…

Association for Computing Machinery. Founded in 1947, ACM is the world’s first educational and scientific computing society. Today, our members…

CPS 216: Advanced Database Systems (Fall 2001). Course Information: Course Description / Time and Place / Books. Resources: Staff…

The Internet Movie Database (IMDb)… … Search the Internet Movie Database. For more search options, please visit Search central…

Query: database AND search

What are the documents containing both “database” and “search”?


Keywords × documents

Inverted lists: store the matrix by rows

Signature files: store the matrix by columns

With compression, of course!

[Figure: keyword × document matrix. Rows range over all keywords (“a”, “cat”, “database”, “dog”, “search”, …); columns range over all documents (Document 1, Document 2, Document 3, …). 1 means the keyword appears in the document; 0 means otherwise.]

Inverted lists

Store the matrix by rows

For each keyword, store an inverted list

⟨keyword, doc-id-list⟩

⟨“database”, {3, 7, 142, 857, …}⟩

⟨“search”, {3, 9, 192, 512, …}⟩

It helps to sort doc-id-list (why?)

Vocabulary index on keywords

B+-tree or hash-based

How large is an inverted list index?
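
The construction described on this slide can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: the function name `build_inverted_index`, the whitespace tokenizer, and the sample documents are all assumptions, and a Python dict stands in for the hash-based vocabulary index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a sorted list of ids of the documents containing it.

    docs: dict mapping a document id to its text (hypothetical input format).
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            postings[word].add(doc_id)
    # The dict plays the role of a hash-based vocabulary index;
    # each doc-id-list is stored sorted.
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical sample documents.
docs = {
    3: "database search",
    7: "database systems",
    9: "search engine",
}
index = build_inverted_index(docs)
```

Each (keyword, document) pair contributes one entry, so in this sketch the index grows with the number of distinct keyword occurrences per document rather than with the full keyword × document matrix.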

Using inverted lists

Documents containing “database”

Use the vocabulary index to find the inverted list for “database”

Return documents in the inverted list

Documents containing “database” AND “search”

OR? NOT?
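
One possible way to answer AND, OR, and NOT over sorted doc-id-lists is a linear merge, which is also one reason sorting the lists helps. The sketch below is illustrative only; the function names are assumptions, and the two lists are the visible prefixes of the example lists above (the elided tails are left out).

```python
def intersect(a, b):
    """AND: doc ids present in both sorted lists, found by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: doc ids present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # equal ids: emit once
            out.append(a[i])
            i += 1
            j += 1
    return out

def difference(a, b):
    """a AND NOT b: doc ids in sorted list a that are absent from sorted list b."""
    j = 0
    out = []
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

database = [3, 7, 142, 857]   # visible prefix of the "database" list
search = [3, 9, 192, 512]     # visible prefix of the "search" list
```

A standalone NOT would need the list of all document ids as the left operand of `difference`, which is why NOT is usually combined with another term.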

Bit-sliced signature files

Motivation

To check if a document contains a word, we only need to check the bits that are set in the word’s hash value

So why bother retrieving all w bits of the signature?

Instead of storing n signature files, store w bit slices

Only check the slices that correspond to the set bits in the word’s hash value

Start from the sparse slices

doc   signature (Slice 7 … Slice 0)
1     0 0 0 0 1 0 0 0
2     0 0 0 0 1 0 0 0
3     0 0 0 1 1 0 1 0
4     0 1 1 0 1 1 0 0
…     …
n     0 0 0 0 1 0 1 0

Starting to look like an inverted list again!
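
A rough sketch of the bit-sliced idea follows. Several details are assumptions made for illustration, since the slides do not fix them: a word's hash value is derived from SHA-256 with up to s = 3 of w = 64 bits set, a document signature is the bitwise OR of its words' hash values, and each slice is kept as a Python integer used as a bitmap over documents.

```python
import hashlib

W = 64  # signature width in bits (assumed value)
S = 3   # number of bit positions drawn per word (assumed value)

def word_hash(word, w=W, s=S):
    """Return a w-bit value with at most s bits set (hypothetical hash scheme)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    value = 0
    for k in range(s):
        value |= 1 << (digest[k] % w)
    return value

def doc_signature(words):
    """OR together the hash values of all words in the document."""
    sig = 0
    for word in words:
        sig |= word_hash(word)
    return sig

def build_slices(signatures, w=W):
    """slices[b] has bit d set iff document d's signature has bit b set."""
    slices = [0] * w
    for d, sig in enumerate(signatures):
        for b in range(w):
            if (sig >> b) & 1:
                slices[b] |= 1 << d
    return slices

def candidates(slices, word, n_docs):
    """Documents that may contain word (false positives are possible)."""
    h = word_hash(word)
    bits = [b for b in range(len(slices)) if (h >> b) & 1]
    # Start from the sparse slices so the candidate set shrinks quickly.
    bits.sort(key=lambda b: bin(slices[b]).count("1"))
    result = (1 << n_docs) - 1
    for b in bits:
        result &= slices[b]
        if result == 0:
            break
    return [d for d in range(n_docs) if (result >> d) & 1]

# Hypothetical documents, given as word lists.
docs = [["database", "search"], ["cat", "dog"], ["database", "systems"]]
signatures = [doc_signature(words) for words in docs]
slices = build_slices(signatures)
```

Only the slices for the set bits of the query word are read, so a lookup touches at most S of the W slices in this sketch.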

Inverted lists versus signatures

Inverted lists better for most purposes (TODS, 1998)

Problems of signature files

False positives

Hard to use because s, w, and the hash function need tuning to work well

Long documents will likely have mostly 1’s in signatures

Common words will create mostly 1’s for their slices

Difficult to extend with features such as frequency, proximity

Saving grace of signature files

Sizes are tunable

Good for lots of search terms

Good for computing similarity of documents

Ranking result pages

A single search may return many pages

A user will not look at all result pages

Complete result may be unnecessary

⇒ Result pages need to be ranked

Possible ranking criteria

Based on content

Number of occurrences of the search terms

Similarity to the query text

Based on link structure

Backlink count

PageRank

And more…

Textual similarity

Vocabulary: $[w_1, \ldots, w_n]$

IDF (Inverse Document Frequency): $[f_1, \ldots, f_n]$

$f_i = \log_2(\text{total \# of docs} / \text{\# of docs containing } w_i)$

TF (Term Frequency): $[p_1, \ldots, p_n]$

$p_i$ = # of times $w_i$ appears on $p$

Significance of words on page $p$: $[p_1 f_1, \ldots, p_n f_n]$

Textual similarity between two pages $p$ and $q$ is defined to be $[p_1 f_1, \ldots, p_n f_n] \cdot [q_1 f_1, \ldots, q_n f_n] = p_1 q_1 f_1^2 + \cdots + p_n q_n f_n^2$

q could be the query text

Why weight significance by IDF?
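
The similarity measure above can be tried out with a small sketch like the following. The function names and the four sample documents are assumptions for illustration; documents and queries are given as word lists, and words outside the vocabulary are treated as having IDF 0.

```python
import math
from collections import Counter

def idf(docs):
    """f_i = log2(total # of docs / # of docs containing w_i) for each word."""
    n = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    return {w: math.log2(n / df) for w, df in doc_freq.items()}

def similarity(p_words, q_words, f):
    """Dot product of the IDF-weighted term-frequency vectors of p and q."""
    p = Counter(p_words)  # p_i: # of times w_i appears on p
    q = Counter(q_words)
    return sum(p[w] * q[w] * f.get(w, 0.0) ** 2 for w in p)

# Hypothetical corpus: "a" appears in every document.
docs = [
    ["a", "database", "search"],
    ["a", "cat"],
    ["a", "dog", "search"],
    ["a", "database", "database"],
]
f = idf(docs)
```

In this corpus the word "a" occurs in every document, so its IDF is 0 and it contributes nothing to any similarity score, which hints at why significance is weighted by IDF.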

Problems with content-based ranking

Many pages containing search terms may be of poor quality or irrelevant

Example: a page with just a line “search engine”

Many high-quality or relevant pages do not even contain the search terms

Example: Google homepage

Pages containing more occurrences of the search terms are ranked higher; spamming is easy

Example: a page with the line “search engine” repeated many times

Random surfer model

A random surfer

Starts with a random page

Randomly selects a link on the page to visit next

Never uses the “back” button

PageRank(p) measures the probability that a random surfer visits page p

Problems with the naïve PageRank

Dead end: a page with no outgoing links

A dead end causes all importance to “leak” eventually out of the Web

Spider trap: a group of pages with no links out of the group

A spider trap will eventually accumulate all importance of the Web

[Figures: two example Web graphs over the pages Netscape, Amazon, and Microsoft]
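
Both problems can be observed numerically with a small sketch. It assumes the naive update is the formula given under Practical PageRank with d = 1, that is, each page's new rank is the sum of PageRank(q) / N(q) over the pages q linking to it; the naive formula itself is not shown in this preview. The three-page graphs and the function name are hypothetical and are not the graphs from the figures.

```python
def naive_pagerank(links, iterations=50):
    """Iterate the naive update; links maps each page to the pages it links to."""
    ranks = {page: 1.0 for page in links}  # assumed starting rank of 1 per page
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for q, targets in links.items():
            for p in targets:
                # q passes an equal share of its rank along each outgoing link.
                new_ranks[p] += ranks[q] / len(targets)
        ranks = new_ranks
    return ranks

# Hypothetical graphs: in the first, "c" has no outgoing links (dead end);
# in the second, "c" links only to itself (a one-page spider trap).
dead_end = {"a": ["b", "c"], "b": ["a"], "c": []}
spider_trap = {"a": ["b", "c"], "b": ["a"], "c": ["c"]}

leaked = naive_pagerank(dead_end)
trapped = naive_pagerank(spider_trap)
```

In the dead-end graph the total rank shrinks toward zero because whatever reaches "c" is never passed on; in the spider-trap graph the total is preserved but nearly all of it ends up on "c".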

Practical PageRank

d: decay factor

$\text{PageRank}(p) = d \cdot \sum_{q \in B(p)} \frac{\text{PageRank}(q)}{N(q)} + (1 - d)$

where B(p) is the set of pages that link to p and N(q) is the number of outgoing links on q

Intuition in the random surfer model

A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link on the current page
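
A minimal sketch of iterating the formula above is shown below. The decay factor d = 0.85, the iteration count, the starting rank of 1 per page, and the three-page graph are all assumptions chosen for illustration; the slides do not specify them.

```python
def practical_pagerank(links, d=0.85, iterations=50):
    """Iterate PageRank(p) = d * sum(PageRank(q) / N(q) for q in B(p)) + (1 - d).

    links maps each page q to the list of pages it links to, so N(q) is
    len(links[q]) and B(p) is the set of pages whose list contains p.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Every page receives the (1 - d) term regardless of its backlinks.
        new_ranks = {page: 1.0 - d for page in links}
        for q, targets in links.items():
            for p in targets:
                new_ranks[p] += d * ranks[q] / len(targets)
        ranks = new_ranks
    return ranks

# Hypothetical graph in which "c" links only to itself (a spider trap).
spider_trap = {"a": ["b", "c"], "b": ["a"], "c": ["c"]}
ranks = practical_pagerank(spider_trap)
```

With d < 1 every page keeps at least 1 − d of rank, so the trap page "c" still ranks highest in this graph but no longer absorbs all importance.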

Google (1998)

Inverted lists in practice contain a lot of context information

PageRank is not the final ranking

Type-weight: depends on the type of the occurrence

For example, large font weighs more than small font

Count-weight: depends on the number of occurrences

Increases linearly first but then tapers off

For multiple search terms, nearby occurrences are matched together and a proximity measure is computed

Closer proximity weighs more

[Figure: context information stored with each occurrence: capitalization; relative font size; in URL/title/meta tag; in anchor text; within the page; within the anchor; URL associated with the anchor]

Suffix arrays (SODA, 1990)

Another index for searching text

Conceptually, to construct a suffix array for string S

Enumerate all |S| suffixes of S

Sort these suffixes in lexicographical order

To search for occurrences of a substring

Do a binary search on the suffix array

Suffix array example

S = mississippi, q = sip

Suffixes: mississippi, ississippi, ssissippi, sissippi, issippi, ssippi, sippi, ippi, ppi, pi, i

Sorted suffixes: i, ippi, issippi, ississippi, mississippi, pi, ppi, sippi, sissippi, ssippi, ssissippi

No need to store the suffix strings; just store where they start

Suffix array: 10 7 4 1 0 9 8 6 3 5 2

Search cost: O(|q| · log |S|)
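
The example can be reproduced with a short sketch. It follows the conceptual construction (enumerate and sort all suffixes), which is simple but not the efficient construction algorithm; the function names are assumptions.

```python
def build_suffix_array(s):
    """Sort the starting positions of all suffixes of s by the suffix they start."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, q):
    """Return the sorted start positions of q in s using two binary searches."""
    def prefix(k):
        # First |q| characters of the k-th smallest suffix.
        return s[sa[k]:sa[k] + len(q)]

    # Lower bound: first suffix whose |q|-prefix is >= q.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < q:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose |q|-prefix is > q.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= q:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

S = "mississippi"
suffix_array = build_suffix_array(S)
```

Each binary-search step compares at most |q| characters, and there are O(log |S|) steps, which matches the stated search cost.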

Trie: a string index

A tree with edges labeled by characters

A node represents the string obtained by concatenating all characters along the path from the root

Compact trie: replace a path without branches by a single edge labeled by a string

[Figure: an example trie and its compact trie]

What’s the max fan-out?
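
A trie and its compaction can be sketched as below. The nested-dict node layout, the function names, and the three sample words are assumptions for illustration and are not the example from the figure; nodes that end a stored word are kept during compaction so that lookups stay exact.

```python
def new_node():
    """A trie node: a flag for 'a stored word ends here' plus labeled edges."""
    return {"is_key": False, "children": {}}

def insert(root, word):
    """Follow or create one edge per character of word."""
    node = root
    for ch in word:
        node = node["children"].setdefault(ch, new_node())
    node["is_key"] = True

def contains(root, word):
    """True if word was inserted; the node reached spells word along its path."""
    node = root
    for ch in word:
        node = node["children"].get(ch)
        if node is None:
            return False
    return node["is_key"]

def compact(node):
    """Replace each path without branches by a single edge labeled by a string."""
    result = {"is_key": node["is_key"], "children": {}}
    for ch, child in node["children"].items():
        label = ch
        # Absorb single-child nodes that do not themselves end a stored word.
        while len(child["children"]) == 1 and not child["is_key"]:
            (next_ch, next_child), = child["children"].items()
            label += next_ch
            child = next_child
        result["children"][label] = compact(child)
    return result

root = new_node()
for w in ["apple", "bee", "bell"]:  # hypothetical sample words
    insert(root, w)
compact_root = compact(root)
```

In the uncompacted trie each edge carries one character, so a node's fan-out is bounded by the alphabet size.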


Internal nodes have fan-out ≥ 2 (except the root)

No two edges out of the same node can share the…

Instead of inlining the string labels, store pointers to…

Instead of labeling each edge by a string, only label by the…

Leaves point to strings

A Pat tree indexes all suffixes of a large string in a Patricia trie

A String B-tree uses a Patricia trie to store and compare…

General tree-based string indexing tricks

Trie, Patricia trie, String B-tree

Good exercise: put them in a GiST! ☺

Two general ways to index for substring queries

Index words: inverted lists, signature files