Database Management Systems: Vector Space Model and Document Similarity, Slides of Database Management Systems (DBMS)

University of Wisconsin (UW) - Madison Database Management Systems (DBMS)

These slides present the vector space model as covered in a database management systems course: documents are represented as vectors in term space, and queries are represented the same way. Retrieved documents are ranked by a vector distance measure between the query and each document, and similarity is based on the length and direction of the vectors. The slides also cover inverse document frequency and its role in assigning weights to terms.

Typology: Slides

2011/2012

Uploaded on 02/15/2012

arien 🇺🇸


Database Management Systems, R.Ramakrishnan 1

Computing Relevance, Similarity: The Vector Space Model

Chapter 27, Part B

Based on Larson and Hearst’s slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/


Document Vectors

Documents are represented as “bags of words”

Represented as vectors when used computationally

•A vector is like an array of floating-point numbers

•Has direction and magnitude

•Each vector holds a place for every term in the collection

•Therefore, most vectors are sparse
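The bag-of-words representation above can be sketched in a few lines of Python. This is only an illustration: the two tiny documents, the whitespace tokenizer, and the helper names `build_vocabulary`, `to_sparse_vector`, and `to_dense_vector` are assumptions, not part of the slides.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return a sorted list of every distinct term in the collection."""
    return sorted({term for text in docs.values() for term in text.lower().split()})

def to_sparse_vector(text):
    """Bag of words: map each term to its count, storing only nonzero entries."""
    return dict(Counter(text.lower().split()))

def to_dense_vector(sparse, vocabulary):
    """Expand a sparse vector to one floating-point slot per vocabulary term."""
    return [float(sparse.get(term, 0)) for term in vocabulary]

# Hypothetical two-document collection, for illustration only.
docs = {
    "A": "nova galaxy nova heat",
    "B": "film role film",
}

vocabulary = build_vocabulary(docs)
sparse_a = to_sparse_vector(docs["A"])
dense_a = to_dense_vector(sparse_a, vocabulary)

print(vocabulary)
print(sparse_a)
print(dense_a)
```

The dense form has a slot for every term in the collection, so most slots are zero; the dictionary form stores only the nonzero counts, which is one common way to exploit that sparsity.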


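Document Vectors: One location for each word.

Terms (one column each): nova, galaxy, heat, h’wood, film, role, diet, fur

Nonzero counts per document, left to right (blank cells omitted):

A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3

“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)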



Document Vectors

Document ids: A, B, C, D, E, F, G, H, I (the row labels of the term-count table over nova, galaxy, heat, h’wood, film, role, diet, fur)


We Can Plot the Vectors

[Figure: three documents plotted in term space with axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.]

Assumption: Documents that are “close” in space are similar.


Vector Space Model

Documents are represented as vectors in term space

•Terms are usually stems
•Documents represented by binary vectors of terms

Queries represented the same as documents

A vector distance measure between the query and documents is used to rank retrieved documents

•Query and Document similarity is based on length and direction of their vectors
•Vector operations to capture boolean query conditions


Raw Term Weights

The frequency of occurrence for the term in each document is included in the vector

docs   t1   t2   t3
D1      2    0    3
D2      1    0    0
D3      0    4    7
D4      3    0    0
D5      1    6    3
D6      3    5    0
D7      0    8    0
D8      0   10    0
D9      0    0    1
D10     0    3    5
D11     4    0    1


TF x IDF Weights

tf x idf measure:

•Term Frequency (tf)
•Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution

Goal: Assign a tf * idf weight to each term in each document

TF x IDF Calculation

$$w_{ik} = tf_{ik} \cdot \log(N / n_k)$$

where

$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log(N / n_k)$
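As a rough sketch of the calculation, the weight formula can be applied to the raw term frequencies listed for documents D1 to D11 on the Raw Term Weights slide. Two things here are assumptions: the logarithm is taken base 10 (the slides do not state the base), and the third column is treated as term t3. The function names are illustrative only.

```python
import math

# Raw term frequencies from the "Raw Term Weights" slide (columns t1, t2, t3).
raw_tf = {
    "D1": [2, 0, 3], "D2": [1, 0, 0], "D3": [0, 4, 7], "D4": [3, 0, 0],
    "D5": [1, 6, 3], "D6": [3, 5, 0], "D7": [0, 8, 0], "D8": [0, 10, 0],
    "D9": [0, 0, 1], "D10": [0, 3, 5], "D11": [4, 0, 1],
}

def document_frequencies(tf_table):
    """n_k: the number of documents that contain term k at least once."""
    num_terms = len(next(iter(tf_table.values())))
    return [sum(1 for row in tf_table.values() if row[k] > 0)
            for k in range(num_terms)]

def tf_idf(tf_table):
    """w_ik = tf_ik * log10(N / n_k); base 10 is an assumption."""
    n_docs = len(tf_table)
    n_k = document_frequencies(tf_table)
    idf = [math.log10(n_docs / n) if n else 0.0 for n in n_k]
    return {doc: [tf * idf[k] for k, tf in enumerate(row)]
            for doc, row in tf_table.items()}

n_k = document_frequencies(raw_tf)
weights = tf_idf(raw_tf)
print(n_k)
print(weights["D1"])
```

In this particular table every term occurs in 6 of the 11 documents, so all three terms get the same idf and the weights differ only through tf.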


Inverse Document Frequency

IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

$$idf_k = \log\left(\frac{10000}{n_k}\right)$$
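To make the rare-versus-common contrast concrete, here is a small sketch that evaluates idf for a 10000-document collection. The document frequencies 1, 20, 5000, and 10000 are illustrative values chosen for this example, and base-10 logarithms are assumed.

```python
import math

N = 10_000  # collection size used on the slide

def idf(n_k, total=N):
    """Inverse document frequency log10(N / n_k); base 10 is an assumption."""
    return math.log10(total / n_k)

# Illustrative document frequencies, from very rare to present everywhere.
for n_k in (1, 20, 5000, 10_000):
    print(f"n_k = {n_k:>5}: idf = {idf(n_k):.3f}")
```

A term found in a single document scores 4.000, while a term found in every document scores 0.000 and so contributes nothing to a tf x idf weight.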


TF x IDF Normalization

Normalize the term weights (so longer documents are not unfairly given more weight)

•The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.

$$w_{ik} = \frac{tf_{ik} \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 [\log(N / n_k)]^2}}$$
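A minimal sketch of this normalization, assuming base-10 logarithms and made-up idf values for a two-term vocabulary: each tf x idf weight is divided by the Euclidean length of the document's weight vector. The function name and the sample term frequencies are hypothetical.

```python
import math

def normalized_tf_idf(tf_row, idf):
    """Divide each tf * idf weight by the Euclidean length of the weight vector."""
    raw = [tf * w for tf, w in zip(tf_row, idf)]
    length = math.sqrt(sum(x * x for x in raw))
    if length == 0.0:
        return [0.0 for _ in raw]  # an all-zero document stays all zero
    return [x / length for x in raw]

# Hypothetical idf values: 10 documents, terms found in 2 and in 5 of them.
idf = [math.log10(10 / 2), math.log10(10 / 5)]

short_doc = normalized_tf_idf([1, 2], idf)    # a short document
long_doc = normalized_tf_idf([10, 20], idf)   # same proportions, ten times longer

print(short_doc)
print(long_doc)
```

Both documents end up with the same unit-length vector, so the longer one gains nothing from simply repeating its terms more often.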

Pair-wise Document Similarity

Terms: nova, galaxy, heat, h’wood, film, role, diet, fur

Documents: A, B, C, D

Nonzero counts, in reading order: 1 3 1 5 2 2 1 5 4 1

How to compute document similarity?


Computing Relevance Scores

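Say we have query vector $Q = (0.4, 0.8)$

Also, document $D_2 = (0.2, 0.7)$

What does their similarity comparison yield?

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98$$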


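Vector Space with Term Weights and Cosine Matching

[Figure: Q, D1, and D2 plotted against axes Term A and Term B, each running from 0 to 1.0, with angles α1 and α2 between the query and the documents.]

$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_1, w_{q_1}; q_2, w_{q_2}; \ldots; q_t, w_{q_t})$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98$$

$$sim(Q, D_1) = \frac{(0.4 \cdot 0.8) + (0.8 \cdot 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.584}} \approx 0.73$$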
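The cosine match for this example can be checked with a short sketch. It uses the slide's Q, D1, and D2; the function name `cosine_similarity` and the ranking step are illustrative additions.

```python
import math

def cosine_similarity(q, d):
    """Dot product of the weight vectors divided by the product of their lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q) * sum(w * w for w in d))
    return dot / norm if norm else 0.0

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine_similarity(Q, D1)
sim_d2 = cosine_similarity(Q, D2)

# Rank the documents by decreasing similarity to the query.
ranking = sorted({"D1": D1, "D2": D2}.items(),
                 key=lambda item: cosine_similarity(Q, item[1]),
                 reverse=True)

print(f"sim(Q, D1) = {sim_d1:.2f}")
print(f"sim(Q, D2) = {sim_d2:.2f}")
print([name for name, _ in ranking])
```

D2 ranks first because it points in nearly the same direction as Q, even though D1 has the larger single weight.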

Text Clustering

Finds overall similarities among groups of documents

Finds overall similarities among groups of tokens

Picks out some themes, ignores others


Text Clustering

[Figure: scatter plot with axes Term 1 and Term 2.]

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw


Problems with Vector Space

There is no real theoretical basis for the assumption of a term space

•It is more for visualization than having any real basis
•Most similarity measures work about the same

Terms are not really orthogonal dimensions

•Terms are not independent of all other terms; remember our discussion of correlated terms in text

Probabilistic Models

Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

Relies on accurate estimates of probabilities


Relevance Feedback

Main Idea:

•Modify existing query based on relevance judgements
 - Extract terms from relevant documents and add them to the query
 - AND/OR re-weight the terms already in the query

There are many variations:

•Usually positive weights for terms from relevant docs
•Sometimes negative weights for terms from non-relevant docs

Users, or the system, guide this process by selecting terms from an automatically generated list


Rocchio Method

Rocchio automatically

Re-weights terms
Adds in new terms (from relevant docs)
- have to be careful when using negative terms
- Rocchio is not a machine learning algorithm

Rocchio Method

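$$Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$$

where

$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document $i$
$S_i$ = the vector for the non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)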


Rocchio/Vector Illustration

[Figure: Q0, Q’, Q”, D1, and D2 plotted in a term space with axes “Retrieval” and “Information”.]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q’ = ½ Q0 + ½ D1 = (0.45, 0.55)
Q” = ½ Q0 + ½ D2 = (0.80, 0.20)
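The illustration can be reproduced with a small sketch of a Rocchio-style update. The function `rocchio` and its `alpha` weight on the original query are assumptions added here so that the ½ and ½ weighting of the illustration can be expressed; the defaults for `beta` and `gamma` follow the 0.75 and 0.25 settings mentioned for the Rocchio method.

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward the mean relevant vector and away from the
    mean non-relevant vector. alpha (the weight on q0) is an assumption."""
    dims = len(q0)

    def mean(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[j] for v in vectors) / len(vectors) for j in range(dims)]

    r_mean = mean(relevant)
    s_mean = mean(nonrelevant)
    return [alpha * q0[j] + beta * r_mean[j] - gamma * s_mean[j]
            for j in range(dims)]

Q0 = (0.7, 0.3)  # retrieval of information
D1 = (0.2, 0.8)  # information science
D2 = (0.9, 0.1)  # retrieval systems

# Weights of one half each, with no non-relevant documents, as in the illustration.
q_prime = rocchio(Q0, [D1], [], alpha=0.5, beta=0.5, gamma=0.0)
q_double_prime = rocchio(Q0, [D2], [], alpha=0.5, beta=0.5, gamma=0.0)

print([round(x, 2) for x in q_prime])
print([round(x, 2) for x in q_double_prime])
```

Feedback on D1 pulls the query toward the "information" axis, while feedback on D2 pulls it toward "retrieval". This sketch does not clip negative weights, which is one reason negative terms need care.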


Alternative Notions of Relevance Feedback

Find people whose taste is “similar” to yours.

Will you like what they like?

Follow a user’s actions in the background.

Can this be used to predict what the user will want to see next?

Track what lots of people are doing.

Does this implicitly indicate what they think is good and not good?

Collaborative Filtering (Social Filtering)

If Pam liked the paper, I’ll like the paper

If you liked Star Wars, you’ll like Independence Day

Rating based on ratings of similar people

Ignores text, so also works on sound, pictures etc.
But: Initial users can bias ratings of future users
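One way to read "rating based on ratings of similar people" is a similarity-weighted average, sketched below. Everything in it is hypothetical: the users, the ratings, the choice of cosine similarity over co-rated items, and the function names. It is not presented as the method the slides have in mind.

```python
import math

# Hypothetical ratings on a 1 to 5 scale, for illustration only.
ratings = {
    "me":  {"Star Wars": 5, "The Paper": 4},
    "pam": {"Star Wars": 5, "The Paper": 5, "Independence Day": 4},
    "lee": {"Star Wars": 1, "The Paper": 2, "Independence Day": 1},
}

def similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm = math.sqrt(sum(a[i] ** 2 for i in shared) * sum(b[i] ** 2 for i in shared))
    return dot / norm if norm else 0.0

def predict(user, item, all_ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_ratings in all_ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = similarity(all_ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

prediction = predict("me", "Independence Day", ratings)
print(prediction)
```

The prediction uses only the ratings, never the content of the items, which is why the same approach carries over to sound or pictures. It also shows the bias concern: whoever rated an item first determines what later users are told.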
