Advanced Database Systems-Lecture 10 Slides-Computer Science, Slides of Database Management Systems (DBMS)

Indexing, Keywords × Documents, Keyword Search, Inverted Lists, Frequency and Proximity, Signature Files, Bit-sliced Signature Files, Motivation, Ranking Result Pages, Textual Similarity, Content-based Ranking, Backlink, Intuition, Google’s PageRank, Naïve PageRank, Random Surfer Model, Dead End, Spider Trap, Practical PageRank, Suffix Arrays, Trie: a String Index, Suffix Tree, Patricia Trie, Pat Tree, String B-tree

Typology: Slides

2011/2012

Uploaded on 01/28/2012


Indexing: Part IV

CPS 216

Advanced Database Systems

Announcements (February 17)

• Homework #2 due in two weeks
• Reading assignments for this and next week
  ◦ “The” query processing survey by Graefe
  ◦ Due next Wednesday
• Midterm and course project proposal in three weeks

Keyword search

[Figure: a Google search for “database AND search”, with sample results from the Association for Computing Machinery, the CPS 216: Advanced Database Systems (Fall 2001) course page, and the Internet Movie Database (IMDb)]

What are the documents containing both “database” and “search”?

Keywords × documents

• Inverted lists: store the matrix by rows
• Signature files: store the matrix by columns
• With compression, of course!

Rows are all keywords; columns are all documents.

              Document 1   Document 2   Document 3   …   Document n
“a”               1            1            1        …       1
“cat”             1            1            0        …       0
“database”        0            0            1        …       0
“dog”             0            1            0        …       1
“search”          0            0            1        …       0
…                 …            …            …        …       …

1 means the keyword appears in the document; 0 means otherwise
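As a rough illustration of the two storage orders, the sketch below builds the keyword × document matrix for three made-up documents and reads it once by row and once by column. The documents and the names `docs`, `vocabulary`, `matrix`, `row`, and `column` are assumptions for this example, not part of the lecture.

```python
# Three tiny made-up documents, keyed by document id (illustrative only).
docs = {
    1: "a cat",
    2: "a cat a dog",
    3: "a database search",
}

doc_ids = sorted(docs)
vocabulary = sorted({word for text in docs.values() for word in text.split()})

# matrix[keyword][doc_id] is 1 if the keyword appears in the document, else 0.
matrix = {
    word: {d: int(word in docs[d].split()) for d in doc_ids}
    for word in vocabulary
}

def row(word):
    """One row of the matrix: ids of the documents containing `word`."""
    return [d for d in doc_ids if matrix[word][d]]

def column(doc_id):
    """One column of the matrix: the keywords present in one document."""
    return [w for w in vocabulary if matrix[w][doc_id]]

print(row("cat"))   # [1, 2]
print(column(3))    # ['a', 'database', 'search']
```

A row answers "which documents contain this keyword?", while a column answers "which keywords does this document contain?".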

Inverted lists

• Store the matrix by rows
• For each keyword, store an inverted list
  ◦ ⟨keyword, doc-id-list⟩
  ◦ ⟨“database”, {3, 7, 142, 857, …}⟩
  ◦ ⟨“search”, {3, 9, 192, 512, …}⟩
  ◦ It helps to sort doc-id-list (why?)
• Vocabulary index on keywords
  ◦ B+-tree or hash-based
• How large is an inverted list index?
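A minimal sketch of building inverted lists, assuming documents are plain strings split on whitespace. The function name `build_inverted_index` and the sample documents are made up for illustration; a Python dict stands in for a hash-based vocabulary index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a sorted list of the ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # The dict plays the role of a hash-based vocabulary index;
    # each value is a sorted doc-id-list.
    return {word: sorted(ids) for word, ids in postings.items()}

# Made-up documents, keyed by document id.
docs = {
    3: "database search engine",
    7: "database systems",
    9: "web search",
}
index = build_inverted_index(docs)
print(index["database"])  # [3, 7]
print(index["search"])    # [3, 9]
```

Keeping each doc-id-list sorted is what later allows two lists to be combined in a single linear pass.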

Using inverted lists

• Documents containing “database”
  ◦ Use the vocabulary index to find the inverted list for “database”
  ◦ Return documents in the inverted list
• Documents containing “database” AND “search”
• OR? NOT?
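One common way to answer the AND, OR, and NOT questions, assuming the doc-id-lists are sorted, is a linear merge of the two lists. The sketch below is an illustration rather than the lecture's own answer. The function names are made up, the example lists keep only the first four ids shown earlier (the elided tails are dropped), and NOT is handled in the form "a AND NOT b", since a bare NOT would need the list of all documents.

```python
def intersect(a, b):
    """AND: doc ids present in both sorted lists, found by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: doc ids present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

def difference(a, b):
    """a AND NOT b: doc ids in sorted list a that do not occur in sorted list b."""
    j = 0
    out = []
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

database = [3, 7, 142, 857]
search = [3, 9, 192, 512]
print(intersect(database, search))   # [3]
print(union(database, search))       # [3, 7, 9, 142, 192, 512, 857]
print(difference(database, search))  # [7, 142, 857]
```

Each operation touches every list entry at most once, which is one reason sorted doc-id-lists help.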

Bit-sliced signature files

• Motivation
  ◦ To check if a document contains a word, we only need to check the bits that are set in the word’s hash value
  ◦ So why bother retrieving all w bits of the signature?
• Instead of storing n signature files, store w bit slices
• Only check the slices that correspond to the set bits in the word’s hash value
• Start from the sparse slices

doc   signature (Slice 7 … Slice 0)
1     0 0 0 0 1 0 0 0
2     0 0 0 0 1 0 0 0
3     0 0 0 1 1 0 1 0
4     0 1 1 0 1 1 0 0
…     …
n     0 0 0 0 1 0 1 0

Starting to look like an inverted list again!
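The following sketch illustrates the bit-slice idea under stated assumptions: each document signature is the bitwise OR of its words' hash values, each word hash sets at most `S` of `W` bits, and the bit positions are taken from a SHA-256 digest. The values of `W` and `S`, the hash construction, and all function names are choices made for this example only.

```python
import hashlib

W = 16  # signature width in bits (assumed value)
S = 3   # number of bits set per word (assumed value)

def word_bits(word):
    """Positions of the bits set in the word's hash value (at most S of them)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return {digest[k] % W for k in range(S)}

def build_slices(docs):
    """Return W bit slices; bit i of slice j is bit j of document i's signature."""
    slices = [0] * W
    for i, text in enumerate(docs):
        for word in text.split():
            for j in word_bits(word):
                slices[j] |= 1 << i
    return slices

def candidates(slices, n_docs, word):
    """Documents whose signature has every bit of the word set.

    The result may contain false positives, so candidates still need checking.
    """
    result = (1 << n_docs) - 1  # start with all documents
    # Visit the sparsest slices first so the running result shrinks quickly.
    for j in sorted(word_bits(word), key=lambda j: bin(slices[j]).count("1")):
        result &= slices[j]
        if result == 0:
            break
    return [i for i in range(n_docs) if (result >> i) & 1]

# Made-up documents; positions in the list serve as document numbers.
docs = ["database search engine", "web search", "database systems", "cat dog"]
slices = build_slices(docs)
print(candidates(slices, len(docs), "database"))
```

Only the slices named by the word's hash are read, and a document that really contains the word can never be filtered out.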

Inverted lists versus signatures

• Inverted lists better for most purposes (TODS, 1998)
• Problems of signature files
  ◦ False positives
  ◦ Hard to use because s, w, and the hash function need tuning to work well
  ◦ Long documents will likely have mostly 1’s in signatures
  ◦ Common words will create mostly 1’s for their slices
  ◦ Difficult to extend with features such as frequency, proximity
• Saving grace of signature files
  ◦ Sizes are tunable
  ◦ Good for lots of search terms
  ◦ Good for computing similarity of documents

Ranking result pages

• A single search may return many pages
  ◦ A user will not look at all result pages
  ◦ Complete result may be unnecessary
  ⇒ Result pages need to be ranked
• Possible ranking criteria
  ◦ Based on content
    ▪ Number of occurrences of the search terms
    ▪ Similarity to the query text
  ◦ Based on link structure
    ▪ Backlink count
    ▪ PageRank
  ◦ And more…

Textual similarity

• Vocabulary: [w_1, …, w_n]
• IDF (Inverse Document Frequency): [f_1, …, f_n]
  ◦ f_i = log_2(total # of docs / # of docs containing w_i)
• TF (Term Frequency): [p_1, …, p_n]
  ◦ p_i = # of times w_i appears on p
• Significance of words on page p: [p_1 f_1, …, p_n f_n]
• Textual similarity between two pages p and q is defined to be
  [p_1 f_1, …, p_n f_n] · [q_1 f_1, …, q_n f_n] = p_1 q_1 f_1^2 + … + p_n q_n f_n^2
  ◦ q could be the query text
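A small sketch of the similarity computation defined above, assuming whitespace tokenization. The pages, the query, and the function names `idf_weights` and `similarity` are made up for illustration.

```python
import math
from collections import Counter

def idf_weights(pages):
    """f_i = log2(total # of docs / # of docs containing w_i) for every word seen."""
    n = len(pages)
    doc_freq = Counter(word for text in pages for word in set(text.split()))
    return {word: math.log2(n / df) for word, df in doc_freq.items()}

def similarity(p, q, idf):
    """Dot product of the IDF-weighted term-frequency vectors of texts p and q."""
    tf_p, tf_q = Counter(p.split()), Counter(q.split())
    # Each shared word w_i contributes p_i * q_i * f_i^2.
    return sum(tf_p[w] * tf_q[w] * idf.get(w, 0.0) ** 2 for w in tf_p)

# Made-up pages and query.
pages = [
    "a database search engine",
    "a search for a cat",
    "a dog and a cat",
    "a database of a dog",
]
idf = idf_weights(pages)
query = "a database search"
scores = [similarity(page, query, idf) for page in pages]
print(scores)  # [2.0, 1.0, 0.0, 1.0]
```

In this toy collection the word "a" appears on every page, so its IDF is log_2(4/4) = 0 and it contributes nothing to any score, however often it occurs.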


Why weight significance by IDF?

Problems with content-based ranking

• Many pages containing search terms may be of poor quality or irrelevant
  ◦ Example: a page with just a line “search engine”
• Many high-quality or relevant pages do not even contain the search terms
  ◦ Example: Google homepage
• Pages containing more occurrences of the search terms are ranked higher; spamming is easy
  ◦ Example: a page with line “search engine” repeated many times

Random surfer model

• A random surfer
  ◦ Starts with a random page
  ◦ Randomly selects a link on the page to visit next
  ◦ Never uses the “back” button
• PageRank(p) measures the probability that a random surfer visits page p
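The random surfer can be simulated directly. The sketch below is a Monte Carlo illustration, not the way PageRank is computed in practice. The three-page link graph is hypothetical, and the name `surf`, the step count, and the seed are choices made for this example.

```python
import random
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Amazon"],
}

def surf(links, steps, seed=0):
    """Return the fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))    # start with a random page
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # follow a random link; there is no "back" move
    return {p: visits[p] / steps for p in links}

freq = surf(links, 100_000)
print(freq)
```

For this particular graph the visit frequencies should settle near 0.4 for Netscape, 0.4 for Amazon, and 0.2 for Microsoft, which is the distribution left unchanged by one more random step.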

Problems with the naïve PageRank

• Dead end: a page with no outgoing links
  ◦ A dead end causes all importance to “leak” eventually out of the Web
• Spider trap: a group of pages with no links out of the group
  ◦ A spider trap will eventually accumulate all importance of the Web

[Figures: two example link graphs over the pages Netscape, Amazon, and Microsoft]
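Both problems can be observed numerically. The sketch assumes the naïve PageRank in which every page repeatedly divides its importance evenly among its outgoing links. The two link graphs are hypothetical, since the edges of the original figures are not recoverable here, and `naive_pagerank` is a name chosen for this example.

```python
def naive_pagerank(links, iterations=100):
    """Repeatedly let each page split its importance evenly among its out-links."""
    rank = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in links}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph with a dead end: Microsoft has no outgoing links.
dead_end = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": [],
}

# Hypothetical graph with a spider trap: Microsoft links only to itself.
spider_trap = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Microsoft"],
}

leaked = naive_pagerank(dead_end)
trapped = naive_pagerank(spider_trap)
print(sum(leaked.values()))   # total importance shrinks toward 0
print(trapped["Microsoft"])   # approaches 1
```

In the first graph whatever reaches Microsoft is never passed on, so the total shrinks at every step. In the second, Microsoft keeps everything it receives and ends up with nearly all of it.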

Practical PageRank

• d: decay factor
• PageRank(p) = d · Σ_{q ∈ B(p)} (PageRank(q) / N(q)) + (1 - d)
  ◦ B(p): the set of pages that link to p; N(q): the number of outgoing links on q
• Intuition in the random surfer model
  ◦ A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link on the current page
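The formula can be evaluated by simple fixed-point iteration. This is a minimal sketch, assuming d = 0.8 and a hypothetical three-page graph in which Microsoft is a spider trap. The function name, the starting values, and the iteration count are choices made for this example.

```python
def pagerank(links, d=0.8, iterations=100):
    """Iterate PageRank(p) = d * sum(PageRank(q) / N(q) for q in B(p)) + (1 - d)."""
    rank = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in links}
        for q, targets in links.items():
            for p in targets:  # q is in B(p), and N(q) = len(targets)
                new_rank[p] += d * rank[q] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: Microsoft links only to itself (a spider trap).
links = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Microsoft"],
}
rank = pagerank(links)
print(rank)
```

Solving the three equations by hand for this graph gives 7/11 for Netscape, 5/11 for Amazon, and 21/11 for Microsoft. The trap still ranks highest, but it no longer absorbs all of the importance.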

Google (1998)

• Inverted lists in practice contain a lot of context information
  [Figure: context stored with each occurrence: capitalization; relative font size; whether it is in the URL, title, or meta tag, or in anchor text; its position within the page or within the anchor; and the URL associated with the anchor]
• PageRank is not the final ranking
  ◦ Type-weight: depends on the type of the occurrence
    ▪ For example, large font weights more than small font
  ◦ Count-weight: depends on the number of occurrences
    ▪ Increases linearly first but then tapers off
  ◦ For multiple search terms, nearby occurrences are matched together and a proximity measure is computed
    ▪ Closer proximity weights more

Suffix arrays (SODA, 1990)

• Another index for searching text
• Conceptually, to construct a suffix array for string S
  ◦ Enumerate all |S| suffixes of S
  ◦ Sort these suffixes in lexicographical order
• To search for occurrences of a substring
  ◦ Do a binary search on the suffix array

Suffix array example

S = mississippi, q = sip

Suffixes      Sorted suffixes   Suffix array
mississippi   i                 10
ississippi    ippi               7
ssissippi     issippi            4
sissippi      ississippi         1
issippi       mississippi        0
ssippi        pi                 9
sippi         ppi                8
ippi          sippi              6
ppi           sissippi           3
pi            ssippi             5
i             ssissippi          2

No need to store the suffix strings; just store where they start

Binary search for q: O(|q| · log |S|)
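The construction and search can be sketched in a few lines. This version follows the conceptual recipe of sorting all suffixes, which is simple but not an efficient construction algorithm. The function names are made up for this example.

```python
def build_suffix_array(s):
    """Start positions of all suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, q):
    """Start positions of q in s, found by two binary searches on the suffix array."""
    m = len(q)
    # First suffix whose length-m prefix is >= q.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < q:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose length-m prefix is > q.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= q:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

s = "mississippi"
sa = build_suffix_array(s)
print(sa)                             # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(find_occurrences(s, sa, "sip")) # [6]
```

Each binary-search step compares at most |q| characters, which is where the O(|q| · log |S|) bound comes from.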

Trie: a string index

• A tree with edges labeled by characters
• A node represents the string obtained by concatenating all characters along the path from the root
• Compact trie: replace a path without branches by a single edge labeled by a string

[Figure: an example trie and the corresponding compact trie, in which unbranched paths are merged into single edges such as “pp” and “le”]

What’s the max fan-out?
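A minimal sketch of a trie and its compact form. The class and function names and the sample words are made up for illustration and are not the words from the original figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label -> TrieNode
        self.is_word = False  # True if a stored string ends at this node

def insert(root, word):
    """Add `word` to the trie, one character per edge."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    """Return True if `word` was inserted into the (non-compact) trie."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word

def compact(node):
    """Return a copy in which each unbranched path becomes one string-labeled edge."""
    out = TrieNode()
    out.is_word = node.is_word
    for label, child in node.children.items():
        # Extend the edge while the path has no branch and no word ends on it.
        while len(child.children) == 1 and not child.is_word:
            (next_label, next_child), = child.children.items()
            label += next_label
            child = next_child
        out.children[label] = compact(child)
    return out

root = TrieNode()
for w in ["apple", "apply", "ace", "able"]:  # made-up words
    insert(root, w)

c = compact(root)
print(sorted(c.children["a"].children))  # ['ble', 'ce', 'ppl']
```

In the compact form the chain p, p, l collapses into the single edge "ppl", while the branch into "e" and "y" below it is kept.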

Suffix tree

Index all suffixes of a large string in a compact trie
⇒ Can support the same queries as a suffix array
• Internal nodes have fan-out ≥ 2 (except the root)
• No two edges out of the same node can share the same first character

To get linear space
• Instead of inlining the string labels, store pointers to them in the original string
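The sketch below builds a suffix tree by inserting the suffixes one at a time, splitting an edge wherever a new suffix diverges. It is a naive quadratic-time illustration, not one of the linear-time construction algorithms. It assumes the text ends with a unique terminator character (here `$`), and the names `Node`, `build_suffix_tree`, and `find` are made up. Edge labels are stored as (start, end) positions in the original string rather than as copied strings.

```python
class Node:
    def __init__(self):
        self.edges = {}           # first character -> (start, end, child); label is text[start:end]
        self.suffix_start = None  # set on leaves: where the suffix begins

def build_suffix_tree(text):
    """Naive construction; `text` must end with a character that occurs nowhere else."""
    root = Node()
    for i in range(len(text)):
        node, pos = root, i
        while True:
            first = text[pos]
            if first not in node.edges:
                leaf = Node()
                leaf.suffix_start = i
                node.edges[first] = (pos, len(text), leaf)
                break
            start, end, child = node.edges[first]
            k = 0  # length of the match between the edge label and the rest of the suffix
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k < end:
                # The suffix diverges inside the edge: split the edge at the mismatch.
                mid = Node()
                node.edges[first] = (start, start + k, mid)
                mid.edges[text[start + k]] = (start + k, end, child)
                child = mid
            node, pos = child, pos + k
    return root

def find(root, text, q):
    """Sorted start positions of q in text."""
    node, i = root, 0
    while i < len(q):
        if q[i] not in node.edges:
            return []
        start, end, child = node.edges[q[i]]
        n = min(end - start, len(q) - i)
        if text[start:start + n] != q[i:i + n]:
            return []
        i += n
        node = child
    # Every leaf below this point is a suffix that starts with q.
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.suffix_start is not None:
            found.append(cur.suffix_start)
        stack.extend(child for _, _, child in cur.edges.values())
    return sorted(found)

text = "mississippi$"
tree = build_suffix_tree(text)
print(find(tree, text, "sip"))  # [6]
print(find(tree, text, "ssi"))  # [2, 5]
```

Because each edge holds only two integers, the space per edge is constant regardless of how long its label is.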

Patricia trie, Pat tree, String B-tree

A Patricia trie is just like a compact trie, but
• Instead of labeling each edge by a string, only label by the first character and the string length
• Leaves point to strings
⇒ Faster search (especially for external memory) because of inlining of the first character
⇒ But must validate answer at leaves for skipped characters

• A Pat tree indexes all suffixes of a large string in a Patricia trie
• A String B-tree uses a Patricia trie to store and compare strings in B-tree nodes

Summary

• General tree-based string indexing tricks
  ◦ Trie, Patricia trie, String B-tree
  ◦ Good exercise: put them in a GiST! ☺
• Two general ways to index for substring queries
  ◦ Index words: inverted lists, signature files
  ◦ Index all suffixes: suffix array, suffix tree, Pat tree
• Web search and information retrieval go beyond substring queries
  ◦ TF/IDF, PageRank, …