Web Searching & Indexing: Inverted Lists, Signature Files, and Ranking Result Pages, Slides of Introduction to Database Management Systems

These slides cover techniques for web searching and indexing: inverted lists, signature files, and the ranking of result pages. Inverted lists store the keyword × document matrix by rows, keeping one list of document ids per keyword, ideally sorted. Signature files store the matrix by columns, compressing each document's column into a fixed-width signature that can be tested with bit-wise operations. Because users will not look at every result page, results must be ranked by content and link structure.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

Web Searching & Indexing
CPS 116
Introduction to Database Systems

Announcements
• Homework #4 due on Thursday (Dec. 2)
• Homework #3 graded
  ◦ Available for pick up in my office tomorrow
• Course project demo signup begins tomorrow via email
• Final exam on Friday, Dec. 10
  ◦ More info and a brief review this Thursday

Keyword search
[Figure: a Google search for “database AND search”, with result snippets for the Association for Computing Machinery, CPS 216: Advanced Database Systems (Fall 2001), and The Internet Movie Database (IMDb)]
What are the documents containing both “database” and “search”?

Keywords × documents
• Inverted lists: store the matrix by rows
• Signature files: store the matrix by columns

              Document 1   Document 2   Document 3   …   Document n
“a”               1            1            1        …       1
“database”        1            1            0        …       0
“cat”             0            0            1        …       0
“dog”             0            1            0        …       1
“search”          0            0            1        …       0
…                 …            …            …        …       …

Rows range over all keywords; columns range over all documents.
1 means the keyword appears in the document; 0 means otherwise.

Inverted lists
• Store the matrix by rows
• For each keyword, store an inverted list
  ◦ ⟨keyword, doc-id-list⟩
  ◦ ⟨“database”, {3, 7, 142, 857, …}⟩
  ◦ ⟨“search”, {3, 9, 192, 512, …}⟩
  ◦ It helps to sort doc-id-list (why?)
• Vocabulary index on keywords
  ◦ B+-tree or hash-based
• How large is an inverted list index?

Using inverted lists
• Documents containing “database”
  ◦ Use the vocabulary index to find the inverted list for “database”
  ◦ Return documents in the inverted list
• Documents containing “database” AND “search”
  ◦ Return documents in the intersection of the two inverted lists
• OR? NOT?
  ◦ Union and difference, respectively
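
The lookups above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy documents, the whitespace tokenizer, and the function names are all hypothetical, and a plain dict stands in for the B+-tree or hash-based vocabulary index. It also suggests one answer to the "why sort?" question: two sorted doc-id lists can be intersected in a single merge pass.

```python
from collections import defaultdict


def build_inverted_lists(docs):
    """Map each keyword to a sorted list of the ids of documents containing it."""
    lists = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            lists[word].add(doc_id)
    return {word: sorted(ids) for word, ids in lists.items()}


def intersect(a, b):
    """AND: merge two sorted doc-id lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: every doc id that appears in either list, kept sorted."""
    return sorted(set(a) | set(b))


def difference(a, b):
    """NOT: doc ids in a that do not appear in b, in a's order."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


# Hypothetical toy collection, for illustration only.
docs = {
    1: "database systems",
    2: "web search engines",
    3: "database search",
}
index = build_inverted_lists(docs)
both = intersect(index["database"], index["search"])
print(both)
```

In this toy collection only document 3 contains both words, so it is the only id the AND query returns.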

What are “all” the keywords?
• All sequences of letters (up to a given length)?
  ◦ … that actually appear in documents!
• All words in English?
• Plus all phrases?
  ◦ Alternative: approximate phrase search by proximity
• Minus all stop words
  ◦ They appear in nearly every document, e.g., a, of, the, it
  ◦ Not useful in search
• Combine words with common stems
  ◦ Example: database, databases
  ◦ They can be treated as the same for the purpose of search
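
A rough sketch of how stop-word removal and stemming might be applied before indexing. The stop-word set contains only the four examples from the slide, and `naive_stem` is a deliberately crude, hypothetical stand-in for a real stemming algorithm: it only strips a single trailing "s".

```python
import re

# Assumed stop-word list; the slide names only "a", "of", "the", "it".
STOP_WORDS = {"a", "of", "the", "it"}


def naive_stem(word):
    """Crude stand-in for a real stemmer: drop one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keywords(text):
    """Lowercase, split on non-letters, drop stop words, and stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(keywords("The design of databases and a database"))
```

With this sketch, "databases" and "database" end up as the same keyword, so a search for either one matches both.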

Frequency and proximity
• Frequency
  ◦ ⟨keyword, { ⟨doc-id, number-of-occurrences⟩,
                ⟨doc-id, number-of-occurrences⟩,
                … }⟩
• Proximity (and frequency)
  ◦ ⟨keyword, { ⟨doc-id, ⟨position-of-occurrence 1, position-of-occurrence 2, …⟩⟩,
                ⟨doc-id, ⟨position-of-occurrence 1, …⟩⟩,
                … }⟩
  ◦ When doing AND, check for positions that are near each other
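
One way to picture the positional variant is the sketch below. The nested-dict layout, the word-offset positions, and the `near` helper with its `max_gap` parameter are assumptions made for illustration; the frequency of a keyword in a document is simply the length of its position list.

```python
from collections import defaultdict


def build_positional_index(docs):
    """keyword -> {doc_id: [positions]}; frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def near(index, w1, w2, max_gap):
    """Doc ids where some occurrence of w1 is within max_gap words of w2."""
    hits = []
    p1, p2 = index.get(w1, {}), index.get(w2, {})
    for doc_id in sorted(p1.keys() & p2.keys()):
        if any(abs(a - b) <= max_gap for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical documents, for illustration only.
docs = {
    1: "database search is fun",
    2: "a database can grow large before anyone runs a search",
}
index = build_positional_index(docs)
print(near(index, "database", "search", 1))
```

Both documents satisfy a plain AND, but only document 1 has the two words adjacent, which is how proximity can approximate a phrase search.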

Signature files
• Store the matrix by columns and compress them
• For each document, store a w-bit signature
• Each word is hashed into a w-bit value, with only s < w bits turned on
• Signature is computed by taking the bit-wise OR of the hash values of all words in the document
⇒ Some false positives; no false negatives

hash(“database”) = 0110
hash(“dog”) = 1100
hash(“cat”) = 0010

doc 1 contains “database”: 0110
doc 2 contains “dog”: 1100
doc 3 contains “cat” and “dog”: 1110

Does doc 3 contain “database”?
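
The question can be checked mechanically. The sketch below hard-codes the three hash values from the slide in a lookup table (an assumption made for brevity; a real system would compute them with a hash function) and tests whether every bit of a word's hash is set in a document's signature.

```python
# Hash values copied from the slide (w = 4 bits); a real system would compute them.
HASH = {"database": 0b0110, "dog": 0b1100, "cat": 0b0010}


def signature(words):
    """Bit-wise OR of the hash values of all words in the document."""
    sig = 0
    for word in words:
        sig |= HASH[word]
    return sig


def may_contain(sig, word):
    """True if every bit set in the word's hash is also set in the signature."""
    h = HASH[word]
    return (sig & h) == h


doc3 = signature(["cat", "dog"])
print(format(doc3, "04b"))
print(may_contain(doc3, "database"))
```

The signature of doc 3 covers both bits of hash(“database”), so the test answers "maybe" even though the word is absent: a false positive. A word that really is in the document can never fail the test, which is why there are no false negatives.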

Bit-sliced signature files
• Motivation
  ◦ To check if a document contains a word, we only need to check the bits that are set in the word’s hash value
  ◦ So why bother retrieving all w bits of the signature?
• Instead of storing n signatures, store w bit slices
• Only check the slices that correspond to the set bits in the word’s hash value
• Start from the sparse slices

doc   signature (Slice 7 … Slice 0)
1     0 0 0 0 1 0 0 0
2     0 0 0 0 1 0 0 0
3     0 0 0 1 1 0 1 0
4     0 1 1 0 1 1 0 0
…     …
n     0 0 0 0 1 0 1 0

Starting to look like an inverted list again!
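
A small sketch of the slicing idea, using the first four signatures from the table. The function names, the use of Python integers as bit vectors, zero-based document numbering, and the query hash `0b00001010` are all assumptions for illustration.

```python
def build_slices(signatures, w):
    """Transpose n w-bit signatures into w slices; slice b has bit i set
    when document i's signature has bit b set."""
    slices = [0] * w
    for i, sig in enumerate(signatures):
        for b in range(w):
            if (sig >> b) & 1:
                slices[b] |= 1 << i
    return slices


def candidates(slices, word_hash, n):
    """AND together only the slices for bits set in word_hash, sparsest first."""
    needed = [slices[b] for b in range(len(slices)) if (word_hash >> b) & 1]
    needed.sort(key=lambda s: bin(s).count("1"))
    result = (1 << n) - 1  # start with every document as a candidate
    for s in needed:
        result &= s
        if result == 0:
            break  # no document can match; skip the remaining slices
    return [i for i in range(n) if (result >> i) & 1]


# The first four signatures from the table; documents are numbered from 0 here.
sigs = [0b00001000, 0b00001000, 0b00011010, 0b01101100]
slices = build_slices(sigs, w=8)
print(candidates(slices, 0b00001010, n=4))
```

Only slices 1 and 3 are read for this hypothetical query, and slice 1 (the sparser one) already narrows the candidates to a single document.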

Inverted lists versus signatures
• Inverted lists better for most purposes (TODS, 1998)
• Problems of signature files
  ◦ False positives
  ◦ Hard to use because s, w, and the hash function need tuning to work well
  ◦ Long documents will likely have mostly 1’s in signatures
  ◦ Common words will create mostly 1’s for their slices
  ◦ Difficult to extend with features such as frequency, proximity
• Saving grace of signature files
  ◦ Sizes are tunable
  ◦ Good for lots of search terms
  ◦ Good for computing similarity of documents

Ranking result pages
• A single search may return many pages
  ◦ A user will not look at all result pages
  ◦ Complete result may be unnecessary
  ⇒ Result pages need to be ranked
• Possible ranking criteria
  ◦ Based on content
    ▪ Number of occurrences of the search terms
    ▪ Similarity to the query text
  ◦ Based on link structure
    ▪ Backlink count
    ▪ PageRank
  ◦ And more…

Random surfer model
• A random surfer
  ◦ Starts with a random page
  ◦ Randomly selects a link on the page to visit next
  ◦ Never uses the “back” button
• PageRank(p) measures the probability that a random surfer visits page p

Problems with the naïve PageRank
• Dead end: a page with no outgoing links
  ◦ A dead end causes all importance to “leak” eventually out of the Web
• Spider trap: a group of pages with no links out of the group
  ◦ A spider trap will eventually accumulate all importance of the Web
[Figures: two example graphs over the pages Netscape, Amazon, and Microsoft]
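
Both failure modes can be reproduced with a tiny simulation. The sketch assumes the naïve update in which each page splits its current rank evenly among the pages it links to; the three-page graphs and their edges are hypothetical and are not the graphs from the figures.

```python
def naive_step(rank, links):
    """One naive PageRank update: each page splits its rank evenly
    among the pages it links to."""
    new_rank = {page: 0.0 for page in rank}
    for page, targets in links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank


def iterate(links, steps):
    """Start from a uniform distribution and apply naive_step repeatedly."""
    rank = {page: 1.0 / len(links) for page in links}
    for _ in range(steps):
        rank = naive_step(rank, links)
    return rank


# Hypothetical three-page graphs; the edges are assumptions for illustration.
dead_end = {"A": ["B", "C"], "B": ["A"], "C": []}        # C has no out-links
spider_trap = {"A": ["B", "C"], "B": ["A"], "C": ["C"]}  # C links only to itself

leaked = iterate(dead_end, 50)
trapped = iterate(spider_trap, 50)
print(sum(leaked.values()))
print(trapped["C"])
```

After 50 steps the total rank in the dead-end graph has shrunk to nearly zero, while in the spider-trap graph page C holds nearly all of it.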

Practical PageRank
• d: decay factor
• PageRank(p) = d · Σ_{q ∈ B(p)} (PageRank(q) ⁄ N(q)) + (1 − d)
• Intuition in the random surfer model
  ◦ A surfer occasionally gets bored and jumps to a random page on the Web instead of following a random link on the current page
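
The formula can be iterated directly. This sketch assumes the usual PageRank notation, since the slides defining it are not part of this excerpt: B(p) is the set of pages linking to p, and N(q) is the number of links out of q. The value d = 0.85, the fixed iteration count, and the example graph are likewise assumptions.

```python
def pagerank(links, d=0.85, steps=50):
    """Iterate PageRank(p) = d * sum(PageRank(q) / N(q) for q in B(p)) + (1 - d).

    links maps each page q to the list of pages it links to, so N(q) is
    len(links[q]) and B(p) is every q whose list contains p.
    """
    rank = {page: 1.0 for page in links}
    for _ in range(steps):
        new_rank = {page: 1.0 - d for page in links}
        for q, targets in links.items():
            for p in targets:
                new_rank[p] += d * rank[q] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph with a spider trap at C; the edges are assumptions.
links = {"A": ["B", "C"], "B": ["A"], "C": ["C"]}
rank = pagerank(links)
print(rank)
```

Page C still ranks highest, but the (1 − d) term guarantees every page a minimum share, so the trap no longer absorbs all of the importance.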

Google (1998)
• Inverted lists in practice contain a lot of context information
• PageRank is not the final ranking
  ◦ Type-weight: depends on the type of the occurrence
    ▪ For example, large font weights more than small font
  ◦ Count-weight: depends on the number of occurrences
    ▪ Increases linearly first but then tapers off
  ◦ For multiple search terms, nearby occurrences are matched together and a proximity measure is computed
    ▪ Closer proximity weights more
[Figure labels: Capitalization; Relative font size; In URL/title/meta tag; In anchor text; Within the page; Within the anchor; URL associated with the anchor]

Trie: a string index
• A tree with edges labeled by characters
• A node represents the string obtained by concatenating all characters along the path from the root
• Compact trie: replace a path without branches by a single edge labeled by a string
[Figure: an example trie and the corresponding compact trie]
What’s the max fan-out?
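
A minimal sketch of a plain (uncompacted) trie, for illustration. The class and method names and the sample keys are assumptions; each node keeps a dict from edge character to child, so a node's fan-out is bounded by the size of the alphabet.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (one character) -> child node
        self.is_key = False  # True if the path from the root spells a stored string


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        """Walk down from the root, creating missing nodes along the way."""
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def _walk(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, key):
        node = self._walk(key)
        return node is not None and node.is_key

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


# Hypothetical keys, for illustration only.
trie = Trie()
for word in ["able", "ace", "apple"]:
    trie.insert(word)
print(trie.contains("ace"), trie.contains("ac"), trie.has_prefix("ap"))
```

A compact trie would merge the unbranched chains (for example, the path spelling "pple" below "a") into single edges labeled by strings.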

Suffix tree
Index all suffixes of a large string in a compact trie
⇒ Can support arbitrary substring matching
• Internal nodes have fan-out ≥ 2 (except the root)
• No two edges out of the same node can share the same first character
To get linear space
• Instead of inlining the string labels, store pointers to them in the original string
⇒ Bad for external memory
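
To see why indexing all suffixes supports arbitrary substring matching, the sketch below builds a plain, uncompacted suffix trie from nested dicts. This is a simplification and not the suffix tree itself: it takes quadratic space in the worst case, whereas the compact, pointer-labeled structure described above is what brings the space down to linear. The function names and the sample string are assumptions.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (uncompacted)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains_substring(root, pattern):
    """A pattern occurs in the text iff it is a prefix of some suffix."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


root = build_suffix_trie("banana")
print(contains_substring(root, "nan"), contains_substring(root, "nab"))
```

Each lookup walks at most len(pattern) edges, regardless of how long the indexed string is.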

Patricia trie, Pat tree, String B-tree
A Patricia trie is just like a compact trie, but
• Instead of labeling each edge by a string, only label by the first character and the string length
• Leaves point to strings
⇒ Faster search (especially for external memory) because of inlining of the first character
⇒ But must validate answer at leaves for skipped characters
• A Pat tree indexes all suffixes of a string in a Patricia trie
• A String B-tree uses a Patricia trie to store and compare strings in B-tree nodes

Summary
• General tree-based string indexing tricks
  ◦ Trie, Patricia trie, String B-tree
• Two general ways to index for substring queries
  ◦ Index words: inverted lists, signature files
  ◦ Index all suffixes: suffix tree, Pat tree, suffix array (not covered)
• Web search and information retrieval go beyond substring queries
  ◦ IDF, PageRank, …