Database Management Systems: Vector Space Model and Document Similarity, Slides of Database Management Systems (DBMS)

These slides cover the vector space model in database management systems, in which documents are represented as vectors in term space and queries are represented the same way. The similarity between a query and a document is determined by a vector distance measure based on the length and direction of their vectors. The slides also cover inverse document frequency and its role in assigning weights to terms.

Typology: Slides

2011/2012

Uploaded on 02/15/2012

arien
Database Management Systems, R. Ramakrishnan

Computing Relevance, Similarity: The Vector Space Model

Chapter 27, Part B

Based on Larson and Hearst’s slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Document Vectors

• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  • A vector is like an array of floating-point numbers
  • Has direction and magnitude
  • Each vector holds a place for every term in the collection
  • Therefore, most vectors are sparse
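The bag-of-words idea above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the two document strings, the whitespace tokenizer, and the helper names `term_counts` and `to_dense` are all assumptions made for the example.

```python
from collections import Counter

def term_counts(text):
    """Bag of words: lower-cased whitespace tokens mapped to their counts."""
    return Counter(text.lower().split())

def to_dense(counts, vocabulary):
    """One slot per vocabulary term; terms absent from the document become 0.0."""
    return [float(counts.get(term, 0)) for term in vocabulary]

# Hypothetical two-document collection (not taken from the slides).
docs = {
    "A": "nova galaxy nova heat",
    "B": "film role film",
}

# Sparse form: only terms that actually occur are stored.
bags = {doc_id: term_counts(text) for doc_id, text in docs.items()}

# Dense form: every vector has a place for every term in the collection.
vocabulary = sorted(set().union(*bags.values()))
dense = {doc_id: to_dense(bag, vocabulary) for doc_id, bag in bags.items()}

print(vocabulary)
for doc_id, vector in dense.items():
    print(doc_id, vector)
```

Storing the `Counter` rather than the dense list is the usual way to exploit sparsity, since most slots in the dense vector are zero.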

Document Vectors:

One location for each word.

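Terms (one column each): nova, galaxy, heat, h’wood, film, role, diet, fur

Nonzero counts for each document id, left to right (the original column alignment is lost):

A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3

“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)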

We Can Plot the Vectors

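[Figure: documents plotted in a two-term space with axes “Star” and “Diet”. The plotted points are labeled “Doc about astronomy”, “Doc about movie stars”, and “Doc about mammal behavior”.]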

Assumption: Documents that are “close” in space are similar.

Vector Space Model

• Documents are represented as vectors in term space
  • Terms are usually stems
  • Documents represented by binary vectors of terms
• Queries represented the same as documents
• A vector distance measure between the query and documents is used to rank retrieved documents
  • Query and document similarity is based on length and direction of their vectors
  • Vector operations to capture boolean query conditions
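As a rough sketch of the points above, the following Python represents documents and a query as binary term vectors and uses a dot product both to rank documents and to express boolean conditions. The vocabulary, the three documents, and the query are hypothetical, and using the raw overlap count as the ranking score is an assumption chosen for simplicity, not a measure prescribed by the slides.

```python
def binary_vector(terms, vocabulary):
    """1 where the term is present, 0 where it is absent."""
    return [1 if term in terms else 0 for term in vocabulary]

def dot(u, v):
    """Dot product; for binary vectors this counts shared terms."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical vocabulary, documents, and query.
vocabulary = ["nova", "galaxy", "heat", "film", "role"]
docs = {
    "A": binary_vector({"nova", "galaxy", "heat"}, vocabulary),
    "B": binary_vector({"nova", "galaxy"}, vocabulary),
    "C": binary_vector({"film", "role"}, vocabulary),
}
query = binary_vector({"nova", "heat"}, vocabulary)  # same representation as a document

overlap = {doc_id: dot(query, vec) for doc_id, vec in docs.items()}

# Boolean conditions expressed through the same vector operation:
matches_or = [d for d, s in overlap.items() if s > 0]            # any query term
matches_and = [d for d, s in overlap.items() if s == sum(query)]  # all query terms

# Ranking by overlap, highest first.
ranked = sorted(docs, key=lambda d: overlap[d], reverse=True)
print(overlap, matches_or, matches_and, ranked)
```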

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1

TF x IDF Weights

• tf x idf measure:
  • Term Frequency (tf)
  • Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: Assign a tf * idf weight to each term in each document

TF x IDF Calculation

$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

where

  $T_k$ = term $k$ in document $D_i$
  $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
  $idf_k$ = inverse document frequency of term $T_k$ in $C$
  $N$ = total number of documents in the collection $C$
  $n_k$ = the number of documents in $C$ that contain $T_k$
  $idf_k = \log(N / n_k)$
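The weight w_ik = tf_ik * log(N / n_k) can be sketched in Python using the raw counts from the Raw Term Weights slide. Two things here are assumptions: the base of the logarithm (base 10 is used) and the label `t3` for the third term column, whose header is truncated in the source.

```python
import math

# Raw term frequencies from the "Raw Term Weights" slide: doc -> [t1, t2, t3].
raw = {
    "D1": [2, 0, 3], "D2": [1, 0, 0], "D3": [0, 4, 7], "D4": [3, 0, 0],
    "D5": [1, 6, 3], "D6": [3, 5, 0], "D7": [0, 8, 0], "D8": [0, 10, 0],
    "D9": [0, 0, 1], "D10": [0, 3, 5], "D11": [4, 0, 1],
}

N = len(raw)            # total number of documents in the collection
num_terms = 3

# n[k]: number of documents that contain term k at least once.
n = [sum(1 for row in raw.values() if row[k] > 0) for k in range(num_terms)]

# idf[k] = log(N / n_k); base 10 is an assumption.
idf = [math.log10(N / n_k) for n_k in n]

# w_ik = tf_ik * idf_k
weights = {doc: [tf * idf[k] for k, tf in enumerate(row)] for doc, row in raw.items()}

print("n_k:", n)
print("idf:", [round(x, 3) for x in idf])
print("D1 :", [round(w, 3) for w in weights["D1"]])
```

In this small collection every term happens to occur in 6 of the 11 documents, so all three idf values are equal and the weights differ only through tf.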

Inverse Document Frequency

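• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

$$idf_k = \log(10000 / n_k)$$

[The slide evaluates this expression for several values of $n_k$; the numeric values did not survive extraction.]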
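To make the "high for rare, low for common" behavior concrete, here is a small Python sketch for a collection of 10000 documents. The document frequencies 1, 20, 5000, and 10000 are illustrative values chosen for this example, and base-10 logarithms are an assumption.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency log10(N / n_k); base 10 is an assumption."""
    return math.log10(total_docs / docs_with_term)

N = 10_000
# Illustrative document frequencies, from very rare to present everywhere.
for n_k in (1, 20, 5_000, 10_000):
    print(f"n_k = {n_k:>6}  idf = {idf(N, n_k):.3f}")
```

A term found in a single document gets the largest weight, while a term found in every document gets an idf of zero and so contributes nothing to the tf x idf weight.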

TF x IDF Normalization

• Normalize the term weights (so longer documents are not unfairly given more weight)
  • The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.

$$w_{ik} = \frac{tf_{ik} \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$$
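A minimal Python sketch of this length normalization follows: each tf x idf weight is divided by the Euclidean length of the document's tf x idf vector. The idf values and the two documents are hypothetical; the long document is simply the short one repeated ten times, which shows that normalization removes the effect of length.

```python
import math

def normalized_tfidf(tf_row, idf):
    """Divide each tf*idf weight by the Euclidean length of the tf*idf vector."""
    raw = [tf * w for tf, w in zip(tf_row, idf)]
    length = math.sqrt(sum(x * x for x in raw))
    # A document with no weighted terms is left as all zeros.
    return [x / length for x in raw] if length > 0 else raw

# Hypothetical idf values for three terms.
idf = [1.0, 2.0, 0.5]

short_doc = [1, 2, 0]     # term frequencies
long_doc = [10, 20, 0]    # same proportions, ten times as long

w_short = normalized_tfidf(short_doc, idf)
w_long = normalized_tfidf(long_doc, idf)

print([round(w, 4) for w in w_short])
print([round(w, 4) for w in w_long])
```

Both documents end up with the same unit-length weight vector, so the longer one no longer receives larger weights just because its counts are larger.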

Pair-wise Document Similarity

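Terms (one column each): nova, galaxy, heat, h’wood, film, role, diet, fur

Documents (one row each): A, B, C, D

Nonzero counts in reading order (the original row and column alignment is lost): 1 3 1 5 2 2 1 5 4 1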

How to compute document similarity?

Computing Relevance Scores

Say we have query vector $Q = (0.4, 0.8)$

Also, document $D_2 = (0.2, 0.7)$

What does their similarity comparison yield?

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}}$$

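Vector Space with Term Weights and Cosine Matching

[Figure: Q, D1, and D2 plotted in a two-term space with axes “Term A” and “Term B”, each running from 0 to 1.0; the angles α1 and α2 are marked between the vectors.]

$$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \ldots;\ d_{it}, w_{d_{it}})$$

$$Q = (q_1, w_{q_1};\ q_2, w_{q_2};\ \ldots;\ q_t, w_{q_t})$$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \, \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$$

$$sim(Q, D_1) = \frac{(0.4 \cdot 0.8) + (0.8 \cdot 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.58}} \approx 0.73$$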
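The cosine measure can be checked with a short Python sketch using the slide's vectors Q = (0.4, 0.8), D1 = (0.8, 0.3), and D2 = (0.2, 0.7). The function name `cosine_similarity` is chosen for this example only.

```python
import math

def cosine_similarity(q, d):
    """Dot product of the two weight vectors divided by the product of their lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q) * sum(w * w for w in d))
    return dot / norm

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine_similarity(Q, D1)
sim_d2 = cosine_similarity(Q, D2)

print(f"sim(Q, D1) = {sim_d1:.2f}")
print(f"sim(Q, D2) = {sim_d2:.2f}")
```

D2 points in nearly the same direction as Q, so it scores higher than D1 even though D1 has the larger weight on the first term.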

Text Clustering

• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens
• Picks out some themes, ignores others

Text Clustering

[Figure: documents plotted against the axes “Term 1” and “Term 2”.]

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

Problems with Vector Space

• There is no real theoretical basis for the assumption of a term space
  • It is more for visualization than having any real basis
  • Most similarity measures work about the same
• Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms; remember our discussion of correlated terms in text

Probabilistic Models

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities

Relevance Feedback

• Main Idea:
  • Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • AND/OR re-weight the terms already in the query
• There are many variations:
  • Usually positive weights for terms from relevant docs
  • Sometimes negative weights for terms from non-relevant docs
• Users, or the system, guide this process by selecting terms from an automatically generated list

Rocchio Method

• Rocchio automatically
  • Re-weights terms
  • Adds in new terms (from relevant docs)
    • Have to be careful when using negative terms
    • Rocchio is not a machine learning algorithm

Rocchio Method

$$Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$$

where

  $Q_0$ = the vector for the initial query
  $R_i$ = the vector for the relevant document $i$
  $S_i$ = the vector for the non-relevant document $i$
  $n_1$ = the number of relevant documents chosen
  $n_2$ = the number of non-relevant documents chosen
  $\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
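The Rocchio update can be sketched in Python as follows: the new query is the initial query plus a weighted average of the relevant document vectors minus a weighted average of the non-relevant ones. The parameter names `beta` and `gamma` are the conventional symbols and are an assumption here, since the symbols themselves did not survive in the extracted text; the defaults 0.75 and 0.25 are the settings the slide mentions. The example vectors are hypothetical.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Return q0 + beta * mean(relevant) - gamma * mean(nonrelevant)."""
    def centroid(vectors):
        # Mean vector; an empty set of judgements contributes nothing.
        if not vectors:
            return [0.0] * len(q0)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    r_mean = centroid(relevant)
    s_mean = centroid(nonrelevant)
    return [q + beta * r - gamma * s for q, r, s in zip(q0, r_mean, s_mean)]

# Hypothetical two-term example.
q0 = [0.7, 0.3]
relevant = [[0.2, 0.8]]
nonrelevant = [[0.9, 0.1]]

q1 = rocchio(q0, relevant, nonrelevant)
print([round(w, 3) for w in q1])
```

This sketch does not clip weights, so a large `gamma` can push a term weight below zero, which is the situation the slide warns about when using negative terms.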

Rocchio/Vector Illustration

[Figure: the vectors D1, D2, Q0, Q’, and Q” plotted in a two-term space with axes “Retrieval” and “Information”.]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q’ = ½ Q0 + ½ D1 = (0.45, 0.55)
Q” = ½ Q0 + ½ D2 = (0.80, 0.20)
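The arithmetic in this illustration is easy to reproduce. The sketch below averages the initial query with one document vector using equal weights of ½, as on the slide; the helper name `blend` is an assumption made for the example.

```python
def blend(query, doc, weight=0.5):
    """Weighted average of a query and a document vector, rounded to 2 decimals."""
    return tuple(round((1 - weight) * q + weight * d, 2) for q, d in zip(query, doc))

Q0 = (0.7, 0.3)   # "retrieval of information"
D1 = (0.2, 0.8)   # "information science"
D2 = (0.9, 0.1)   # "retrieval systems"

Q_prime = blend(Q0, D1)    # feedback from D1 pulls the query toward "information"
Q_double = blend(Q0, D2)   # feedback from D2 pulls the query toward "retrieval"

print(Q_prime, Q_double)
```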

Alternative Notions of Relevance Feedback

• Find people whose taste is “similar” to yours.
  • Will you like what they like?
• Follow a user’s actions in the background.
  • Can this be used to predict what the user will want to see next?
• Track what lots of people are doing.
  • Does this implicitly indicate what they think is good and not good?

Collaborative Filtering (Social Filtering)

• If Pam liked the paper, I’ll like the paper
• If you liked Star Wars, you’ll like Independence Day
• Rating based on ratings of similar people
  • Ignores text, so also works on sound, pictures etc.
  • But: Initial users can bias ratings of future users

                  Sally  Bob  Chris  Lynn  Karen
Star Wars           7     7     3     4     7
Jurassic Park       6     4     7     4     4
Terminator II       3     4     7     6     3
Independence Day    7     7     2     2     ?
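One way to fill in Karen's missing rating is a similarity-weighted average of the other users' ratings. The slides do not say which method to use, so the choice below (cosine similarity over the films Karen has rated, then a weighted mean) is an assumption made for illustration; the function names are likewise hypothetical.

```python
import math

# Ratings from the table above; Karen has not rated "Independence Day".
ratings = {
    "Sally": {"Star Wars": 7, "Jurassic Park": 6, "Terminator II": 3, "Independence Day": 7},
    "Bob":   {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 4, "Independence Day": 7},
    "Chris": {"Star Wars": 3, "Jurassic Park": 7, "Terminator II": 7, "Independence Day": 2},
    "Lynn":  {"Star Wars": 4, "Jurassic Park": 4, "Terminator II": 6, "Independence Day": 2},
    "Karen": {"Star Wars": 7, "Jurassic Park": 4, "Terminator II": 3},
}

def cosine(u, v):
    """Cosine similarity of two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def predict(user, item):
    """Similarity-weighted mean of other users' ratings for `item`.

    Assumes every other user has rated all films that `user` has rated,
    which holds for this table.
    """
    seen = [film for film in ratings[user] if film != item]
    target = [ratings[user][film] for film in seen]
    total = weight_sum = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = cosine(target, [theirs[film] for film in seen])
        total += weight * theirs[item]
        weight_sum += weight
    return total / weight_sum

prediction = predict("Karen", "Independence Day")
print(f"Predicted rating for Karen: {prediction:.1f}")
```

Sally and Bob rate the first three films most like Karen does, so their 7s pull the prediction above the plain average of 4.5. Because ratings are all positive, cosine similarities stay close together; a mean-centered measure such as Pearson correlation would separate similar and dissimilar users more sharply.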