Indexing: Part IV
CPS 216
Advanced Database Systems
Announcements (February 17)
Homework #2 due in two weeks
Reading assignments for this and next week
“The” query processing survey by Graefe
Due next Wednesday
Midterm and course project proposal in three weeks
Keyword search
[Figure: Google results page for the query "database AND search", listing pages such as the Association for Computing Machinery, CPS 216: Advanced Database Systems (Fall 2001), and The Internet Movie Database (IMDb)]
What are the documents containing both “database” and “search”?
Keywords × documents
Inverted lists: store the matrix by rows
Signature files: store the matrix by columns
With compression, of course!
All keywords (rows) × all documents (columns); 1 means keyword appears in the document, 0 means otherwise
Keyword      Document 1   Document 2   Document 3   …   Document n
"a"          1            1            1            …   1
"cat"        1            1            0            …   0
"database"   0            0            1            …   0
"dog"        0            1            0            …   1
"search"     0            0            1            …   0
…            …            …            …            …   …
Inverted lists
Store the matrix by rows
For each keyword, store an inverted list
⟨keyword, doc-id-list⟩
⟨"database", {3, 7, 142, 857, …}⟩
⟨"search", {3, 9, 192, 512, …}⟩
It helps to sort doc-id-list (why?)
Vocabulary index on keywords
B+-tree or hash-based
How large is an inverted list index?
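As a minimal sketch of the structure described above, the following builds one sorted doc-id list per keyword. The `docs` dictionary, the whitespace tokenizer, and the function name are assumptions made for illustration; a Python dict stands in for the hash-based vocabulary index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the sorted list of ids of documents containing it.

    `docs` is a hypothetical dict of doc-id -> text. Tokenization is a plain
    lowercase whitespace split, which a real system would refine.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # The resulting dict plays the role of a hash-based vocabulary index.
    return {word: sorted(ids) for word, ids in postings.items()}

docs = {
    3: "a database search engine",
    7: "a database system",
    9: "a search for a cat",
}
index = build_inverted_index(docs)
print(index["database"])  # [3, 7]
print(index["search"])    # [3, 9]
```

Note that a very common word such as "a" gets a list as long as the collection itself, which is one reason the size question on this slide matters.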
Using inverted lists
Documents containing “database”
Use the vocabulary index to find the inverted list for "database"
Return documents in the inverted list
Documents containing “database” AND “search”
OR? NOT?
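One possible way to evaluate AND, OR, and NOT is to merge the sorted doc-id lists, which is also a reason sorting them helps: each operation takes a single linear pass. The function names below are illustrative, and NOT is sketched as "a AND NOT b", since a bare NOT would need the list of all documents.

```python
def intersect(a, b):
    """AND: merge two sorted doc-id lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two sorted lists, keeping each doc id once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def difference(a, b):
    """a AND NOT b: keep ids of sorted list a that are absent from sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out

database = [3, 7, 142, 857]
search = [3, 9, 192, 512]
print(intersect(database, search))   # [3]
print(union(database, search))       # [3, 7, 9, 142, 192, 512, 857]
print(difference(database, search))  # [7, 142, 857]
```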
Bit-sliced signature files
Motivation
To check if a document contains a word, we only need to check the bits that are set in the word's hash value
So why bother retrieving all w bits of the signature?
Instead of storing n signatures (one per document), store w bit slices
Only check the slices that correspond to the set bits in the word's hash value
Start from the sparse slices
doc   signature (slice 7 … slice 0)
1     0 0 0 0 1 0 0 0
2     0 0 0 0 1 0 0 0
3     0 0 0 1 1 0 1 0
4     0 1 1 0 1 1 0 0
…     …
n     0 0 0 0 1 0 1 0
Each column of bits is one slice, from slice 7 (leftmost) to slice 0 (rightmost)
Starting to look like an inverted list again!
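The bit-slicing idea can be sketched as follows, under assumptions that this excerpt does not spell out: each word hashes to at most `S` positions in a `W`-bit signature, and a document's signature is the OR of its words' hashes. The constants, the SHA-256-based hash, and all function names are hypothetical choices for illustration.

```python
import hashlib

W = 16  # signature width in bits (the w of the slides); the value is arbitrary
S = 3   # hash positions drawn per word (an assumed meaning of s)

def word_bits(word):
    """Return the set of bit positions (at most S of them) set for a word."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return {digest[k] % W for k in range(S)}

def build_slices(docs):
    """slices[j] is an integer bitmap: bit i is set iff document i has bit j set."""
    slices = [0] * W
    for i, text in enumerate(docs):
        for word in text.lower().split():
            for j in word_bits(word):
                slices[j] |= 1 << i
    return slices

def candidates(slices, word, n_docs):
    """AND together only the slices for the word's set bits."""
    result = (1 << n_docs) - 1
    # Visit sparse slices first so the running result shrinks quickly.
    order = sorted(word_bits(word), key=lambda j: bin(slices[j]).count("1"))
    for j in order:
        result &= slices[j]
        if result == 0:
            break
    return [i for i in range(n_docs) if (result >> i) & 1]

docs = ["a database search engine", "a database system", "a search for a cat"]
slices = build_slices(docs)
# Contains documents 0 and 1; it may also contain false positives.
print(candidates(slices, "database", len(docs)))
```

The answer is only a candidate set: every document that contains the word is returned, but a document whose other words happen to set the same bits is returned as well and must be checked.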
Inverted lists versus signatures
Inverted lists better for most purposes (TODS, 1998)
Problems of signature files
- False positives
- Hard to use because s, w, and the hash function need tuning to work well
- Long documents will likely have mostly 1's in signatures
- Common words will create mostly 1's for their slices
- Difficult to extend with features such as frequency, proximity
Saving grace of signature files
- Sizes are tunable
- Good for lots of search terms
- Good for computing similarity of documents
Ranking result pages
A single search may return many pages
A user will not look at all result pages
Complete result may be unnecessary
⇒ Result pages need to be ranked
Possible ranking criteria
Based on content
- Number of occurrences of the search terms
- Similarity to the query text
Based on link structure
- Backlink count
- PageRank
And more…
Textual similarity
Vocabulary: $[w_1, \ldots, w_n]$
IDF (Inverse Document Frequency): $[f_1, \ldots, f_n]$
$f_i = \log_2(\text{total \# of docs} / \text{\# of docs containing } w_i)$
TF (Term Frequency): $[p_1, \ldots, p_n]$
$p_i$ = # of times $w_i$ appears on $p$
Significance of words on page $p$: $[p_1 f_1, \ldots, p_n f_n]$
Textual similarity between two pages $p$ and $q$ is defined to be
$[p_1 f_1, \ldots, p_n f_n] \cdot [q_1 f_1, \ldots, q_n f_n] = p_1 q_1 f_1^2 + \cdots + p_n q_n f_n^2$
q could be the query text
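A small sketch of the similarity computation defined above, with pages represented as lists of words. The toy collection and the function names are assumptions for illustration; the formulas are the ones on this slide.

```python
import math
from collections import Counter

def idf(docs):
    """f_i = log2(total # of docs / # of docs containing w_i)."""
    n = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    return {w: math.log2(n / count) for w, count in doc_freq.items()}

def similarity(p, q, f):
    """Sum over the vocabulary of p_i * q_i * f_i^2 (term counts times IDF squared)."""
    tf_p, tf_q = Counter(p), Counter(q)
    return sum(tf_p[w] * tf_q[w] * f.get(w, 0.0) ** 2 for w in tf_p)

docs = [
    ["a", "database", "search"],
    ["a", "database", "a"],
    ["a", "cat"],
    ["a", "search", "search"],
]
f = idf(docs)
query = ["database", "search"]
print([similarity(d, query, f) for d in docs])  # [2.0, 1.0, 0.0, 2.0]
```

In this toy collection "a" appears in every document, so its IDF is log2(4/4) = 0 and it contributes nothing to any score, however often it occurs.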
Why weight significance by IDF?
Problems with content-based ranking
Many pages containing search terms may be of poor quality or irrelevant
Example: a page with just a line "search engine"
Many high-quality or relevant pages do not even contain the search terms
Example: Google homepage
Pages containing more occurrences of the search terms are ranked higher; spamming is easy
Example: a page with line "search engine" repeated many times
Random surfer model
A random surfer
Starts with a random page
Randomly selects a link on the page to visit next
Never uses the “back” button
PageRank(p) measures the probability that a random surfer visits page p
Problems with the naïve PageRank
Dead end: a page with no outgoing links
A dead end causes all importance to "leak" eventually out of the Web
Spider trap: a group of pages with no links out of the group
A spider trap will eventually accumulate all importance of the Web
[Figure: two example link graphs over the pages Netscape, Amazon, and Microsoft]
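Both problems can be observed with a short simulation. The sketch assumes the naïve update in which every page splits its current rank evenly among its outgoing links. The edges of the two graphs are assumptions chosen for illustration, since the figure's edges are not recoverable here.

```python
def naive_step(rank, links):
    """One round: each page q gives rank[q] / N(q) to every page it links to."""
    new = {p: 0.0 for p in rank}
    for q, outs in links.items():
        for p in outs:
            new[p] += rank[q] / len(outs)
    return new

def iterate(links, rounds=100):
    """Start every page at rank 1.0 and apply the naive update repeatedly."""
    rank = {p: 1.0 for p in links}
    for _ in range(rounds):
        rank = naive_step(rank, links)
    return rank

# Hypothetical edges: Microsoft has no outgoing links (a dead end).
dead_end = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": [],
}
# Hypothetical edges: Microsoft links only to itself (a spider trap).
spider_trap = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Microsoft"],
}

leaked = iterate(dead_end)
trapped = iterate(spider_trap)
print(sum(leaked.values()))   # close to 0: the importance has leaked away
print(trapped["Microsoft"])   # close to 3: the trap holds all the importance
```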
Practical PageRank
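d: decay factor
$\mathrm{PageRank}(p) = d \cdot \sum_{q \in B(p)} \frac{\mathrm{PageRank}(q)}{N(q)} + (1 - d)$
where B(p) is the set of pages that link to p and N(q) is the number of outgoing links on q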
Intuition in the random surfer model
A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link on the current page
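The formula above can be iterated to a fixed point. The sketch below assumes d = 0.8 and a hypothetical three-page graph in which Microsoft links only to itself; neither the value of d nor the edges come from the slides.

```python
def pagerank(links, d=0.8, rounds=100):
    """Iterate PageRank(p) = d * sum over q in B(p) of PageRank(q) / N(q) + (1 - d)."""
    rank = {p: 1.0 for p in links}
    for _ in range(rounds):
        new = {p: 1.0 - d for p in links}  # the "bored surfer" share
        for q, outs in links.items():
            for p in outs:
                new[p] += d * rank[q] / len(outs)
        rank = new
    return rank

# Hypothetical edges: Microsoft is a spider trap (links only to itself).
links = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Microsoft"],
}
rank = pagerank(links)
print({p: round(r, 3) for p, r in rank.items()})
# {'Netscape': 0.636, 'Amazon': 0.455, 'Microsoft': 1.909}
```

With the decay factor, the trap still ranks highest but no longer absorbs everything: the fixed point here is 7/11, 5/11, and 21/11.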
Google (1998)
Inverted lists in practice contain a lot of context information
PageRank is not the final ranking
Type-weight: depends on the type of the occurrence
- For example, large font weighs more than small font
Count-weight: depends on the number of occurrences
- Increases linearly first but then tapers off
For multiple search terms, nearby occurrences are matched together and a proximity measure is computed
- Closer proximity weighs more
[Figure labels: capitalization; relative font size; in URL/title/meta tag; in anchor text; within the page; within the anchor; URL associated with the anchor]
Suffix arrays (SODA, 1990)
Another index for searching text
Conceptually, to construct a suffix array for string S
Enumerate all |S| suffixes of S
Sort these suffixes in lexicographical order
To search for occurrences of a substring
Do a binary search on the suffix array
Suffix array example
Suffixes: mississippi, ississippi, ssissippi, sissippi, issippi, ssippi, sippi, ippi, ppi, pi, i
Sorted suffixes: i, ippi, issippi, ississippi, mississippi, pi, ppi, sippi, sissippi, ssippi, ssissippi
No need to store the suffix strings; just store where they start
Suffix array: 10 7 4 1 0 9 8 6 3 5 2
Example search: S = mississippi, q = sip
Search cost: O(|q| · log |S|)
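The example can be reproduced with a short sketch. The construction follows the conceptual recipe (sort all suffixes), which is simple but far from the fastest way to build a suffix array. The function names are illustrative.

```python
def build_suffix_array(s):
    """Conceptual construction: sort the suffix start positions by suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, q):
    """Binary search for the block of suffixes that start with q."""
    def prefix(k):
        # First |q| characters of the k-th smallest suffix.
        return s[sa[k]:sa[k] + len(q)]

    # Lower bound: first suffix whose |q|-prefix is >= q.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < q:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose |q|-prefix is > q.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= q:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

S = "mississippi"
sa = build_suffix_array(S)
print(sa)                              # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(find_occurrences(S, sa, "sip"))  # [6]
print(find_occurrences(S, sa, "ssi"))  # [2, 5]
```

Each of the O(log |S|) probes compares at most |q| characters, which gives the search cost stated above.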
Trie: a string index
A tree with edges labeled by characters
A node represents the string obtained by concatenating all characters along the path from the root
Compact trie: replace a path without branches by a single edge labeled by a string
[Figure: an example trie and the corresponding compact trie]
What’s the max fan-out?
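A minimal sketch of both structures, using nested dictionaries. The word list is hypothetical, and the "$" end-of-word marker is an assumption added so that a word which is a prefix of another word stays distinguishable.

```python
def build_trie(words):
    """Nested dicts: each key is one character; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def compact(node):
    """Replace every path without branches by a single edge labeled by a string."""
    out = {}
    for label, child in node.items():
        # Extend the label while the path has exactly one way to continue
        # and no word ends at the current node.
        while len(child) == 1 and "$" not in child:
            ch, grandchild = next(iter(child.items()))
            label += ch
            child = grandchild
        out[label] = compact(child)
    return out

trie = build_trie(["apple", "able", "ace"])
print(compact(trie))
# {'a': {'pple': {'$': {}}, 'ble': {'$': {}}, 'ce': {'$': {}}}}
```

In the plain trie a node can have one child per character of the alphabet (plus the marker here), which bears on the fan-out question on this slide.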
Suffix tree
Index all suffixes of a large string in a compact trie
⇒ Can support the same queries as a suffix array
Internal nodes have fan-out ≥ 2 (except the root)
No two edges out of the same node can share the same first character
To get linear space
Instead of inlining the string labels, store pointers to them in the original string
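The pointer idea can be sketched with a naive quadratic-time construction, which is much simpler than the linear-time algorithms used in practice. The node layout, the "$" terminator, and the function names are assumptions for illustration. Each edge stores only a (start, end) pair into the original string.

```python
def build_suffix_tree(s):
    """Naive O(|s|^2) construction; s must end with a unique terminator.

    A node is a dict: first character -> [start, end, child]. The edge label
    is s[start:end], stored as two pointers rather than a copy. A child is
    another dict, or for a leaf, the start position of its suffix.
    """
    root = {}
    n = len(s)
    for i in range(n):
        node, pos = root, i
        while True:
            ch = s[pos]
            if ch not in node:
                node[ch] = [pos, n, i]      # new leaf for suffix i
                break
            start, end, child = node[ch]
            k = 0
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:            # whole edge matched: descend
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and hang the new leaf there.
            mid = {s[start + k]: [start + k, end, child],
                   s[pos + k]: [pos + k, n, i]}
            node[ch] = [start, start + k, mid]
            break
    return root

def occurrences(s, root, q):
    """Start positions of q in s: walk down along q, then collect the leaves below."""
    node, depth = root, 0
    while depth < len(q):
        if not isinstance(node, dict) or q[depth] not in node:
            return []
        start, end, child = node[q[depth]]
        label = s[start:end]
        if not label.startswith(q[depth:depth + len(label)]):
            return []
        node, depth = child, depth + len(label)
    stack, found = [node], []
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(child for _, _, child in node.values())
        else:
            found.append(node)
    return sorted(found)

S = "mississippi$"
tree = build_suffix_tree(S)
print(occurrences(S, tree, "sip"))  # [6]
print(occurrences(S, tree, "ssi"))  # [2, 5]
```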
Patricia trie, Pat tree, String B-tree
A Patricia trie is just like a compact trie, but
Instead of labeling each edge by a string, only label by the first character and the string length
Leaves point to strings
⇒ Faster search (especially for external memory) because of inlining of the first character
⇒ But must validate answer at leaves for skipped characters
A Pat tree indexes all suffixes of a large string in a Patricia trie
A String B-tree uses a Patricia trie to store and compare strings in B-tree nodes
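A rough sketch of a Patricia trie over a small set of strings, showing why the answer must be validated at the leaf. The node layout, the "$" terminator (assumed not to occur inside any indexed string), and the function names are illustrative assumptions.

```python
TERMINATOR = "$"  # assumed not to occur inside any indexed string

def build_patricia(strings):
    """Return (root, keys).

    An internal node is a dict: first character -> (edge length, child).
    A leaf is an integer that points into keys, the sorted terminated strings.
    """
    keys = sorted({s + TERMINATOR for s in strings})

    def build(idxs, depth):
        # All keys in idxs agree on their first `depth` characters.
        if len(idxs) == 1:
            return idxs[0]
        groups = {}
        for i in idxs:
            groups.setdefault(keys[i][depth], []).append(i)
        node = {}
        for ch, group in groups.items():
            if len(group) == 1:
                length = len(keys[group[0]]) - depth
            else:
                # The edge covers the characters the whole group has in common.
                length = 1
                while len({keys[i][depth + length] for i in group}) == 1:
                    length += 1
            node[ch] = (length, build(group, depth + length))
        return node

    return build(list(range(len(keys))), 0), keys

def contains(root, keys, q):
    """Blind descent on first characters only, then validate at the leaf."""
    q += TERMINATOR
    node, depth = root, 0
    while isinstance(node, dict):
        if depth >= len(q) or q[depth] not in node:
            return False
        length, node = node[q[depth]]
        depth += length               # the skipped characters are not compared
    return keys[node] == q            # validation catches mismatches in skipped parts

root, keys = build_patricia(["apple", "able", "ace", "app"])
print(contains(root, keys, "apple"))  # True
print(contains(root, keys, "apqle"))  # False, rejected only by the leaf check
```

The query "apqle" follows the same edges as "apple" because its third character is skipped during the descent, so only the final comparison against the stored string rejects it.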
Summary
General tree-based string indexing tricks
Trie, Patricia trie, String B-tree
Good exercise: put them in a GiST! ☺
Two general ways to index for substring queries
Index words: inverted lists, signature files
Index all suffixes: suffix array, suffix tree, Pat tree
Web search and information retrieval go beyond substring queries
TF/IDF, PageRank, …