Database Management Systems, R. Ramakrishnan 1
Computing Relevance, Similarity:
The Vector Space Model
Chapter 27, Part B
Based on Larson and Hearst’s slides at
UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/
Document Vectors
Documents are represented as “bags of
words”
Represented as vectors when used
computationally
- A vector is like an array of floating-point numbers
- Has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
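As a rough illustration of the bag-of-words idea above, the following sketch turns short texts into count vectors with one slot per term in the collection. The two toy documents and the helper names `build_vocabulary` and `to_vector` are assumptions made for this example; they are not taken from the slides.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for text in docs.values() for term in text.lower().split()})


def to_vector(text, vocabulary):
    """Return one count per vocabulary term; terms absent from the text get 0."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


# Hypothetical toy collection (not from the slides).
docs = {
    "A": "nova galaxy nova heat",
    "B": "film role film",
}

vocabulary = build_vocabulary(docs)
vectors = {doc_id: to_vector(text, vocabulary) for doc_id, text in docs.items()}

for doc_id, vector in vectors.items():
    print(doc_id, vector)
```

Each vector has a slot for every term in the collection, so most entries are 0 once the vocabulary grows. That is why real systems usually store such vectors in a sparse form.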
Document Vectors:
One location for each word.
Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids (rows): A, B, C, D, E, F, G, H, I
Counts in reading order, blank cells omitted: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3
- “Nova” occurs 10 times in text A
- “Galaxy” occurs 5 times in text A
- “Heat” occurs 3 times in text A
(Blank means 0 occurrences.)
We Can Plot the Vectors
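[Figure: documents plotted as points in a two-dimensional term space with axes “Star” and “Diet”. Labeled points: doc about astronomy, doc about movie stars, doc about mammal behavior.]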
Assumption: Documents that are “close” in space are similar.
Vector Space Model
Documents are represented as vectors in term
space
- Terms are usually stems
- Documents represented by binary vectors of terms
Queries represented the same as documents
A vector distance measure between the query
and documents is used to rank retrieved
documents
- Query and Document similarity is based on length and direction of their vectors
- Vector operations to capture boolean query conditions
Raw Term Weights
The frequency of occurrence for the term in
each document is included in the vector
docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
TF x IDF Weights
tf x idf measure:
- Term Frequency (tf)
- Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
Goal: Assign a tf * idf weight to each term in
each document
TF x IDF Calculation
$w_{ik} = tf_{ik} \cdot \log(N / n_k)$
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log(N / n_k)$
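To make the weight formula concrete, here is a minimal sketch that applies tf * log(N / n_k) to the raw counts from the Raw Term Weights table (11 documents, three terms). The base-10 logarithm and the function names are assumptions of this example; the slide does not state the base.

```python
import math


def document_frequencies(term_counts):
    """n_k: for each term column, the number of documents with a nonzero count."""
    n_terms = len(term_counts[0])
    return [sum(1 for row in term_counts if row[k] > 0) for k in range(n_terms)]


def tf_idf_weights(term_counts):
    """Return w_ik = tf_ik * log10(N / n_k) for every document i and term k."""
    n_docs = len(term_counts)
    doc_freq = document_frequencies(term_counts)
    weights = []
    for row in term_counts:
        weights.append([
            # A term that occurs in no document gets weight 0 (avoids dividing by 0).
            tf * math.log10(n_docs / doc_freq[k]) if doc_freq[k] else 0.0
            for k, tf in enumerate(row)
        ])
    return weights


# Rows D1..D11 from the Raw Term Weights table; columns are the three terms.
raw_counts = [
    [2, 0, 3], [1, 0, 0], [0, 4, 7], [3, 0, 0], [1, 6, 3], [3, 5, 0],
    [0, 8, 0], [0, 10, 0], [0, 0, 1], [0, 3, 5], [4, 0, 1],
]

weights = tf_idf_weights(raw_counts)
print(document_frequencies(raw_counts))
print([round(w, 3) for w in weights[0]])
```

In this table every term happens to appear in 6 of the 11 documents, so all three terms share the same idf and the weights stay proportional to the raw counts.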
Inverse Document Frequency
IDF provides high values for rare words and
low values for common words
$idf_k = \log(N / n_k)$
For a collection of 10000 documents ($N = 10000$):
$idf_k = \log(10000 / n_k)$
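A short sketch of how idf behaves in a 10000-document collection. The document frequencies tried below (1, 20, 5000, 10000) are hypothetical values chosen for illustration, and the base-10 logarithm is an assumption.

```python
import math

N = 10000  # collection size stated on the slide


def idf(n_k, n_docs=N):
    """Inverse document frequency of a term that appears in n_k documents."""
    return math.log10(n_docs / n_k)


# Hypothetical document frequencies, from very rare to present everywhere.
for n_k in (1, 20, 5000, 10000):
    print(f"n_k = {n_k:>5}  idf = {idf(n_k):.3f}")
```

A term found in a single document gets the largest value, and a term found in every document gets 0, which matches the claim that IDF is high for rare words and low for common ones.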
TF x IDF Normalization
Normalize the term weights (so longer
documents are not unfairly given more
weight)
- The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.
$w_{ik} = \frac{tf_{ik} \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 [\log(N / n_k)]^2}}$
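The normalization divides each weight by the Euclidean length of the document's weight vector. A minimal sketch follows, using two hypothetical weight vectors with the same term proportions but different lengths; the function name `normalize` is an assumption of this example.

```python
import math


def normalize(weights):
    """Scale a weight vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights))
    if length == 0.0:
        return list(weights)  # an all-zero vector stays as it is
    return [w / length for w in weights]


# Hypothetical tf x idf weights: the long document repeats the short one tenfold.
short_doc = [3.0, 4.0]
long_doc = [30.0, 40.0]

print(normalize(short_doc))
print(normalize(long_doc))
```

Both documents end up with the same normalized vector, so the longer one no longer gains weight from its length alone.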
Pair-wise Document Similarity
Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids (rows): A, B, C, D
Counts in reading order, blank cells omitted: 1 3 1 5 2 2 1 5 4 1
How to compute document similarity?
Computing Relevance Scores
Say we have query vector $Q = (0.4, 0.8)$
Also, document $D_2 = (0.2, 0.7)$
What does their similarity comparison yield?
$\mathrm{sim}(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}}$
Vector Space with Term Weights
and Cosine Matching
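[Figure: Q, D1, and D2 drawn as vectors in a two-dimensional term space; axes Term A and Term B, each running from 0 to 1.0; angles α1 and α2 marked.]
$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$
$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \cdot \sum_{j=1}^{t} (w_{d_{ij}})^2}}$
$Q = (0.4, 0.8)$, $D_1 = (0.8, 0.3)$, $D_2 = (0.2, 0.7)$
$\mathrm{sim}(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98$
$\mathrm{sim}(Q, D_1) = \frac{(0.4 \cdot 0.8) + (0.8 \cdot 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.58}} \approx 0.73$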
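The cosine match can be checked numerically. This sketch computes the similarity of the query Q = (0.4, 0.8) to D1 = (0.8, 0.3) and D2 = (0.2, 0.7) and ranks the two documents; the function name `cosine_similarity` is an assumption of this example.

```python
import math


def cosine_similarity(q, d):
    """Dot product of q and d divided by the product of their lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # similarity with an all-zero vector is treated as 0
    return dot / (norm_q * norm_d)


Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

scores = {"D1": cosine_similarity(Q, D1), "D2": cosine_similarity(Q, D2)}
# Rank documents by decreasing similarity to the query.
ranking = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(name, round(score, 2))
```

D2 ranks first because its direction is closer to the query's, even though D1 has the larger weight on the first term.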
Text Clustering
Finds overall similarities among groups of
documents
Finds overall similarities among groups of
tokens
Picks out some themes, ignores others
Text Clustering
[Figure: documents grouped into clusters in a two-dimensional term space with axes Term 1 and Term 2.]
Clustering is
“The art of finding groups in data.” -- Kaufman and Rousseeuw
Problems with Vector Space
There is no real theoretical basis for the
assumption of a term space
- It is more for visualization than having any real basis
- Most similarity measures work about the same
Terms are not really orthogonal dimensions
- Terms are not independent of all other terms; remember our discussion of correlated terms in text
Probabilistic Models
Rigorous formal model attempts to predict
the probability that a given document will be
relevant to a given query
Ranks retrieved documents according to this
probability of relevance (Probability Ranking
Principle)
Relies on accurate estimates of probabilities
Relevance Feedback
Main Idea:
- Modify existing query based on relevance judgements
  - Extract terms from relevant documents and add them to the query
  - AND/OR re-weight the terms already in the query
There are many variations:
- Usually positive weights for terms from relevant docs
- Sometimes negative weights for terms from non- relevant docs
Users, or the system, guide this process by
selecting terms from an automatically generated list
Rocchio Method
Rocchio automatically
- Re-weights terms
- Adds in new terms (from relevant docs)
- Have to be careful when using negative terms
- Rocchio is not a machine learning algorithm
Rocchio Method
$Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$
where
$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document $i$
$S_i$ = the vector for the non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
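A minimal sketch of the Rocchio update: the new query is the initial query, plus a weighted average of the relevant document vectors, minus a weighted average of the non-relevant ones. The parameter names `beta` and `gamma` are this example's names for the two tuning weights, with defaults of 0.75 and 0.25 as the slide suggests; the vectors are hypothetical.

```python
def rocchio(q0, relevant, non_relevant, beta=0.75, gamma=0.25):
    """Return Q1 = Q0 + beta * mean(relevant) - gamma * mean(non_relevant)."""
    q1 = list(q0)
    for docs, weight in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # no documents chosen in this group, so no contribution
        for k in range(len(q1)):
            q1[k] += weight * sum(doc[k] for doc in docs) / len(docs)
    return q1


# Hypothetical three-term example (not from the slides).
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]

q1 = rocchio(q0, relevant, non_relevant)
print(q1)
```

The second term enters the query only through the relevant documents, and the third term ends up with a negative weight. A common way to handle negative terms carefully is to clip such weights to 0, but that choice is not prescribed by the slide.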
Rocchio/Vector Illustration
[Figure: Q0, Q’, Q”, D1, and D2 plotted in a two-dimensional term space with axes “Retrieval” and “Information”.]
Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)
Q’ = ½ * Q0 + ½ * D1 = (0.45, 0.55)
Q” = ½ * Q0 + ½ * D2 = (0.80, 0.20)
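The two modified queries in this illustration are equal-weight averages of the original query and one document, which is a simplified form of the feedback update. The sketch below reproduces both values; the helper name `blend` and its `weight` parameter are assumptions of this example.

```python
def blend(query, doc, weight=0.5):
    """Move the query toward a document: (1 - weight) * query + weight * doc."""
    return tuple((1 - weight) * q + weight * d for q, d in zip(query, doc))


Q0 = (0.7, 0.3)  # retrieval of information
D1 = (0.2, 0.8)  # information science
D2 = (0.9, 0.1)  # retrieval systems

q_prime = blend(Q0, D1)
q_double_prime = blend(Q0, D2)

print(tuple(round(x, 2) for x in q_prime))
print(tuple(round(x, 2) for x in q_double_prime))
```

Feedback from D1 pulls the query toward the second coordinate, while feedback from D2 pulls it toward the first.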
Alternative Notions of Relevance Feedback
Find people whose taste is “similar” to yours.
- Will you like what they like?
Follow a user’s actions in the background.
- Can this be used to predict what the user will want to see next?
Track what lots of people are doing.
- Does this implicitly indicate what they think is good and not good?
Collaborative Filtering (Social Filtering)
If Pam liked the paper, I’ll like the paper
If you liked Star Wars, you’ll like
Independence Day
Rating based on ratings of similar people
- Ignores text, so also works on sound, pictures etc.
- But: Initial users can bias ratings of future users
                  Sally  Bob  Chris  Lynn  Karen
Star Wars           7     7     3     4     7
Jurassic Park       6     4     7     4     4
Terminator II       3     4     7     6     3
Independence Day    7     7     2     2     ?
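One possible way to fill in Karen's missing rating is to weight other people's ratings by how similar their taste is to hers. The sketch below uses Pearson correlation over commonly rated movies and averages the ratings of positively correlated users. Both choices are assumptions of this example; the slide does not name a specific method.

```python
import math

ratings = {
    "Sally": {"Star Wars": 7, "Jurassic Park": 6, "Terminator II": 3, "Independence Day": 7},
    "Bob": {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 4, "Independence Day": 7},
    "Chris": {"Star Wars": 3, "Jurassic Park": 7, "Terminator II": 7, "Independence Day": 2},
    "Lynn": {"Star Wars": 4, "Jurassic Park": 4, "Terminator II": 6, "Independence Day": 2},
    "Karen": {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 3},
}


def pearson(a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(user, item):
    """Similarity-weighted average of the item's ratings from like-minded users."""
    weighted_sum, total = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = pearson(ratings[user], their_ratings)
        if sim > 0:  # only users whose taste agrees with this user's
            weighted_sum += sim * their_ratings[item]
            total += sim
    return weighted_sum / total if total else None


print(round(predict("Karen", "Independence Day"), 2))
```

Karen's ratings correlate positively with Sally's and Bob's and negatively with Chris's and Lynn's, so the prediction follows Sally and Bob. No text from the movies is used, which is why the same approach carries over to sound or pictures.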