Web Searching & Indexing
CPS 116
Introduction to Database Systems
Announcements
Homework #4 due on Thursday (Dec. 2)
Homework #3 graded
Available for pick up in my office tomorrow
Course project demo signup begins tomorrow via
email
Final exam on Friday, Dec. 10
More info and a brief review this Thursday
Keyword search
[Screenshot: Google results page for the query “database AND search”]
What are the documents containing both “database” and “search”?
Keywords × documents
Inverted lists: store the matrix by rows
Signature files: store the matrix by columns
              Document 1  Document 2  Document 3  …  Document n
“a”               1           1           1       …      1
“cat”             1           1           0       …      0
“database”        0           0           1       …      0
“dog”             0           1           0       …      1
“search”          0           0           1       …      0
…                 …           …           …       …      …
Rows: all keywords; columns: all documents
1 means keyword appears in the document; 0 means otherwise
Inverted lists
Store the matrix by rows
For each keyword, store an inverted list
⟨keyword, doc-id-list⟩
⟨“database”, {3, 7, 142, 857, …}⟩
⟨“search”, {3, 9, 192, 512, …}⟩
It helps to sort doc-id-list (why?)
Vocabulary index on keywords
B+-tree or hash-based
How large is an inverted list index?
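A minimal sketch of building inverted lists, assuming whitespace tokenization and a plain Python dict standing in for the hash-based vocabulary index. The names `build_inverted_index` and `docs` and the sample documents are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the sorted list of ids of documents containing it."""
    lists = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            lists[word].add(doc_id)
    # The dict plays the role of the vocabulary index (hash-based lookup);
    # each value is one inverted list, kept sorted by doc id.
    return {word: sorted(ids) for word, ids in lists.items()}

# Hypothetical documents, keyed by doc id
docs = {
    3: "database search engine",
    7: "database systems",
    9: "web search",
}
index = build_inverted_index(docs)
```

Here `index["database"]` is the inverted list for “database”.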
Using inverted lists
Documents containing “database”
Use the vocabulary index to find the inverted list for
“database”
Return documents in the inverted list
Documents containing “database” AND “search”
Return documents in the intersection of the two inverted
lists
OR? NOT?
Union and difference, respectively
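A sketch of the three operations on sorted doc-id lists, using the sample lists for “database” and “search”. Because both lists are sorted, each operation is a single merge pass. The function names are illustrative assumptions.

```python
def intersect(a, b):
    """AND: doc ids present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: doc ids present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def difference(a, b):
    """AND NOT: doc ids in sorted list a that are not in sorted list b."""
    j = 0
    out = []
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

database = [3, 7, 142, 857]
search = [3, 9, 192, 512]
both = intersect(database, search)
either = union(database, search)
only_database = difference(database, search)
```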
What are “all” the keywords?
All sequences of letters (up to a given length)?
… that actually appear in documents!
All words in English?
Plus all phrases?
Alternative: approximate phrase search by proximity
Minus all stop words
They appear in nearly every document, e.g., a, of, the, it
Not useful in search
Combine words with common stems
Example: database, databases
They can be treated as the same for the purpose of search
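A rough sketch of turning text into keywords by dropping stop words and merging common stems. The stop-word set is just the four examples above, and `naive_stem` is a deliberately crude assumption (it only strips a plural “s”); a real system would use a proper stemming algorithm.

```python
STOP_WORDS = {"a", "of", "the", "it"}  # only the examples from the slide

def naive_stem(word):
    """Crude stand-in for a stemmer: strip one trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def keywords(text):
    """Lowercase, strip surrounding punctuation, drop stop words, stem."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]

result = keywords("The databases of a search engine")
```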
Frequency and proximity
Frequency
⟨keyword, { ⟨doc-id, number-of-occurrences⟩,
            ⟨doc-id, number-of-occurrences⟩,
            … }⟩
Proximity (and frequency)
⟨keyword, { ⟨doc-id, ⟨position-of-occurrence 1,
                      position-of-occurrence 2, …⟩⟩,
            ⟨doc-id, ⟨position-of-occurrence 1, …⟩⟩,
            … }⟩
When doing AND, check for positions that are near each other
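A sketch of AND with a proximity check, assuming each inverted list is a dict from doc id to a sorted list of word positions. The `max_gap` threshold and the sample lists are illustrative assumptions.

```python
def near(positions_a, positions_b, max_gap):
    """True if some position in a is within max_gap of some position in b.

    Both lists must be sorted; the two-pointer scan always advances the
    pointer at the smaller position, so it visits each entry at most once.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= max_gap:
            return True
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

def and_near(list_a, list_b, max_gap):
    """Doc ids containing both keywords with two occurrences close together."""
    common = list_a.keys() & list_b.keys()
    return sorted(d for d in common if near(list_a[d], list_b[d], max_gap))

# Hypothetical positional inverted lists: doc id -> positions of occurrence
database = {3: [0, 40], 7: [5]}
search = {3: [1], 7: [90], 9: [2]}
close = and_near(database, search, max_gap=3)
loose = and_near(database, search, max_gap=100)
```

The frequency of a keyword in a document is simply the length of its position list.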
Signature files
Store the matrix by columns and compress them
For each document, store a w-bit signature
Each word is hashed into a w-bit value, with only s < w bits turned on
Signature is computed by taking the bit-wise OR of the hash values of all words in the document
→ Some false positives; no false negatives
hash(“database”) = 0110
hash(“dog”) = 1100
hash(“cat”) = 0010
doc 1 contains “database”: 0110
doc 2 contains “dog”: 1100
doc 3 contains “cat” and “dog”: 1110
Does doc 3 contain “database”?
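The example can be checked with a few lines of code. This sketch hard-codes the three hash values from the example in a table (an assumption standing in for a real hash function) and tests whether every bit set in the word's hash is also set in the signature.

```python
# Hash values taken from the example; a real system would compute them
HASH = {"database": 0b0110, "dog": 0b1100, "cat": 0b0010}

def signature(words):
    """Bit-wise OR of the hash values of all words in the document."""
    sig = 0
    for w in words:
        sig |= HASH[w]
    return sig

def may_contain(sig, word):
    """True if all bits of the word's hash are set in the signature.

    True only means "maybe"; False means "definitely not".
    """
    h = HASH[word]
    return (sig & h) == h

doc1 = signature(["database"])
doc2 = signature(["dog"])
doc3 = signature(["cat", "dog"])
```

`may_contain(doc3, "database")` is true even though doc 3 holds only “cat” and “dog”: a false positive. `may_contain(doc2, "database")` is false, which is always a correct answer.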
Bit-sliced signature files
Motivation
To check if a document contains a word, we only need to check the bits that are set in the word’s hash value
So why bother retrieving all w bits of the signature?
Instead of storing n signatures, store w bit slices
Only check the slices that correspond to the set bits in the word’s hash value
Start from the sparse slices
Start from the sparse slices
doc   signature (Slice 7 … Slice 0)
1     0 0 0 0 1 0 0 0
2     0 0 0 0 1 0 0 0
3     0 0 0 1 1 0 1 0
4     0 1 1 0 1 1 0 0
…     …
n     0 0 0 0 1 0 1 0
Starting to look like an inverted list again!
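A sketch of querying bit slices, using the four fully listed signatures from the table (docs 1 to 4). The word hash values passed to `candidates` are hypothetical, and each slice is stored as a plain list of 0/1 values for readability rather than as a packed bit vector.

```python
def build_slices(signatures, w):
    """slices[b][k] is bit b of the k-th document's signature."""
    return [[(sig >> b) & 1 for sig in signatures] for b in range(w)]

def candidates(slices, word_hash, doc_ids):
    """Documents whose signature has every set bit of word_hash turned on."""
    set_bits = [b for b in range(len(slices)) if (word_hash >> b) & 1]
    # Start from the sparse slices: fewest 1s first, so the candidate
    # set shrinks as early as possible.
    set_bits.sort(key=lambda b: sum(slices[b]))
    alive = [1] * len(doc_ids)
    for b in set_bits:
        alive = [x & y for x, y in zip(alive, slices[b])]
        if not any(alive):
            break  # no candidates left; remaining slices need not be read
    return [d for d, x in zip(doc_ids, alive) if x]

doc_ids = [1, 2, 3, 4]
signatures = [0b00001000, 0b00001000, 0b00011010, 0b01101100]
slices = build_slices(signatures, w=8)

hits = candidates(slices, 0b00011000, doc_ids)    # hypothetical hash: bits 4 and 3
nothing = candidates(slices, 0b01000010, doc_ids)  # hypothetical hash: bits 6 and 1
```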
Inverted lists versus signatures
Inverted lists are better for most purposes (TODS, 1998)
Problems of signature files
False positives
Hard to use because s, w, and the hash function need tuning to work well
Long documents will likely have mostly 1’s in signatures
Common words will create mostly 1’s for their slices
Difficult to extend with features such as frequency, proximity
Saving grace of signature files
Sizes are tunable
Good for lots of search terms
Good for computing similarity of documents
Ranking result pages
A single search may return many pages
A user will not look at all result pages
Complete result may be unnecessary
→ Result pages need to be ranked
Possible ranking criteria
Based on content
- Number of occurrences of the search terms
- Similarity to the query text
Based on link structure
- Backlink count
- PageRank
And more…
Random surfer model
A random surfer
Starts with a random page
Randomly selects a link on the page to visit next
Never uses the “back” button
PageRank(p) measures the probability that a random surfer visits page p
Problems with the naïve PageRank
Dead end: a page with no outgoing links
A dead end causes all importance to eventually “leak” out of the Web
Spider trap: a group of pages with no links out of the group
A spider trap will eventually accumulate all importance of the Web
[Figures: two example link graphs over the pages Netscape, Amazon, and Microsoft]
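Both problems can be reproduced numerically. The sketch assumes the naive update PageRank(p) = Σ over pages q linking to p of PageRank(q) / N(q), where N(q) is the number of outgoing links of q and there is no decay factor. The two three-page link graphs are hypothetical; their edges are illustrative assumptions.

```python
def naive_pagerank(links, iterations):
    """Repeatedly push each page's importance evenly along its out-links."""
    rank = {p: 1.0 for p in links}
    for _ in range(iterations):
        new = {p: 0.0 for p in links}
        for q, outs in links.items():
            for p in outs:
                new[p] += rank[q] / len(outs)
        # A page with no out-links passes nothing on: its importance is lost
        rank = new
    return rank

# Hypothetical graph where Microsoft is a dead end (no outgoing links)
dead_end = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": [],
}
# Hypothetical graph where Microsoft is a spider trap (links only to itself)
spider_trap = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Microsoft"],
}

leaked = naive_pagerank(dead_end, 100)
trapped = naive_pagerank(spider_trap, 100)
```

After 100 iterations every value in `leaked` is close to 0, while in `trapped` Microsoft holds nearly all of the total importance of 3.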
Practical PageRank
d: decay factor
PageRank(p) = d · Σ_{q ∈ B(p)} (PageRank(q) / N(q)) + (1 − d)
Intuition in the random surfer model
A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link on the current page
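A sketch of iterating the formula above, assuming B(p) is the set of pages linking to p and N(q) is the number of outgoing links of q. The value d = 0.85, the iteration count, and the three-page graph (in which Microsoft links only to itself) are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PageRank(p) = d * sum(PageRank(q) / N(q)) + (1 - d)."""
    rank = {p: 1.0 for p in links}
    for _ in range(iterations):
        # Every page gets the (1 - d) share from bored surfers jumping at random
        new = {p: 1.0 - d for p in links}
        for q, outs in links.items():
            for p in outs:
                new[p] += d * rank[q] / len(outs)
        rank = new
    return rank

# Hypothetical graph; Microsoft is a spider trap
links = {
    "Netscape": ["Netscape", "Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
    "Microsoft": ["Microsoft"],
}
rank = pagerank(links)
```

With the decay factor, Microsoft still ranks highest but no longer absorbs all importance: every page keeps at least 1 − d.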
Google (1998)
Inverted lists in practice contain a lot of context information
[Figure: context recorded for each occurrence: capitalization; relative font size; in URL/title/meta tag; in anchor text; within the page; within the anchor; URL associated with the anchor]
PageRank is not the final ranking
Type-weight: depends on the type of the occurrence
- For example, large font weighs more than small font
Count-weight: depends on the number of occurrences
- Increases linearly first but then tapers off
For multiple search terms, nearby occurrences are matched together and a proximity measure is computed
- Closer proximity weighs more
Trie: a string index
A tree with edges labeled by characters
A node represents the string obtained by
concatenating all characters along the path from the
root
Compact trie: replace a path without branches by a
single edge labeled by a string
[Figure: an example trie and the corresponding compact trie]
What’s the max fan-out?
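A minimal sketch of a (non-compact) trie, with one dict entry per outgoing edge. The class and function names and the sample words are illustrative assumptions. Each node's `children` dict holds at most one entry per character of the alphabet.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label (one character) -> child node
        self.is_word = False  # True if the path from the root spells a stored string

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word

root = TrieNode()
for w in ["apple", "able", "cell"]:  # hypothetical words
    insert(root, w)
```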
Suffix tree
Index all suffixes of a large string in a compact trie
→ Can support arbitrary substring matching
Internal nodes have fan-out ≥ 2 (except the root)
No two edges out of the same node can share the
same first character
To get linear space
Instead of inlining the string labels, store pointers to
them in the original string
→ Bad for external memory
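A sketch of why indexing all suffixes supports substring matching: a pattern occurs in the string exactly when it is a prefix of some suffix. For brevity this builds an uncompacted suffix trie out of nested dicts, which takes quadratic space; it is not the linear-space suffix tree described above. The function names are illustrative assumptions.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def has_substring(root, pattern):
    """Walk down from the root; the pattern occurs iff the walk never fails."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
```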
Patricia trie, Pat tree, String B-tree
A Patricia trie is just like a compact trie, but
Instead of labeling each edge by a string, only label by the
first character and the string length
Leaves point to strings
→ Faster search (especially for external memory) because of inlining of the first character
→ But must validate answer at leaves for skipped characters
A Pat tree indexes all suffixes of a string in a Patricia trie
A String B-tree uses a Patricia trie to store and compare
strings in B-tree nodes
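A sketch of lookup in a Patricia trie, showing why the answer must be validated at the leaf. The trie is built by hand for three hypothetical strings; an internal node is a dict mapping an edge's first character to (edge length, child), and a leaf is the stored string itself. This representation is an assumption made for illustration.

```python
# Hand-built Patricia trie for "able", "apple", "apply".
# The corresponding compact trie has edges "a", "ble", "ppl", "e", "y";
# here each edge keeps only its first character and its length.
n2 = {"e": (1, "apple"), "y": (1, "apply")}
n1 = {"b": (3, "able"), "p": (3, n2)}
root = {"a": (1, n1)}

def patricia_lookup(root, word):
    node, pos = root, 0
    while isinstance(node, dict):
        if pos >= len(word):
            return False
        edge = node.get(word[pos])
        if edge is None:
            return False
        length, node = edge
        pos += length  # skip the rest of the edge label without comparing it
    # Skipped characters were never checked, so validate against the leaf
    return node == word

found = patricia_lookup(root, "apple")
misspelled = patricia_lookup(root, "apqle")
```

The lookup for “apqle” follows the same edges as “apple” because the mismatching character is skipped; only the final comparison at the leaf rejects it.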
Summary
General tree-based string indexing tricks
Trie, Patricia trie, String B-tree
Two general ways to index for substring queries
Index words: inverted lists, signature files
Index all suffixes: suffix tree, Pat tree, suffix array (not
covered)
Web search and information retrieval go beyond
substring queries
IDF, PageRank, …