Inverse Document Frequency (IDF) and Document Similarity, Slides of Fundamentals of E-Commerce

These slides cover the concepts of inverse document frequency (IDF) and full weighting (tf-idf) in information retrieval. IDF measures how much a term helps to discriminate between documents, while tf-idf calculates the weight of a term in a document based on its frequency and IDF. Document similarity is measured using the cosine coefficient of the documents' vector representations. The slides also cover document retrieval and evaluation measures, probabilistic retrieval, and latent semantic analysis.

Typology: Slides

2012/2013

Uploaded on 07/29/2013

masti

Inverse document frequency (IDF)

• A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents
• $n_j$: number of documents which contain the term $\omega_j$
• $n$: total number of documents in the set
• Inverse document frequency:

$$\mathrm{IDF}_j = \log \frac{n}{n_j}$$
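
As a minimal sketch of the IDF definition above, the following computes log(n / n_j) for every term in a tiny hand-made collection. The function name `inverse_document_frequency`, the sample documents, and the use of the natural logarithm are assumptions made for illustration; the slides do not state a log base.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Return {term: log(n / n_j)} for a list of tokenized documents."""
    n = len(documents)  # total number of documents in the set
    # n_j: number of documents containing term j (each document counts once)
    doc_counts = Counter(term for doc in documents for term in set(doc))
    # Natural log is an assumption; another base only rescales the values.
    return {term: math.log(n / n_j) for term, n_j in doc_counts.items()}


# Hypothetical toy collection of three tokenized documents
docs = [
    ["online", "shop", "payment"],
    ["online", "auction", "bid"],
    ["online", "shop", "catalog"],
]
idf = inverse_document_frequency(docs)
```

Here "online" occurs in every document, so its IDF is log(3/3) = 0, while "auction" occurs in only one document and receives the largest value, which matches the idea that rarer terms discriminate better.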

Document Similarity

• Ranks documents by measuring the similarity between each document and the query
• Similarity between two documents $d$ and $d'$ is a function $s(d, d') \in \mathbb{R}$
• In a vector-space representation the cosine coefficient of two document vectors is a measure of similarity

Cosine Coefficient

• The cosine of the angle formed by two document vectors $x$ and $x'$ is

$$\cos(x, x') = \frac{x^{T} x'}{\|x\| \cdot \|x'\|}$$

• Documents with many common terms will have vectors closer to each other than documents with fewer overlapping terms
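
The cosine coefficient can be sketched in a few lines of plain Python. This is an illustrative version only: the function name `cosine_coefficient`, the example term-count vectors, and the choice to return 0.0 for an all-zero vector are assumptions, not part of the slides.

```python
import math


def cosine_coefficient(x, y):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(x, y))      # x^T x'
    norm_x = math.sqrt(sum(a * a for a in x))   # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))   # ||x'||
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: an empty document matches nothing
    return dot / (norm_x * norm_y)


# Hypothetical term-count vectors over a four-term vocabulary
x = [1, 1, 0, 2]
x_many_shared = [1, 1, 0, 1]  # shares three terms with x
x_few_shared = [0, 0, 3, 1]   # shares one term with x

sim_many = cosine_coefficient(x, x_many_shared)
sim_few = cosine_coefficient(x, x_few_shared)
```

In this example `sim_many` is larger than `sim_few`, in line with the observation that documents sharing many terms have vectors that lie close together.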

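Retrieval and Evaluation Measures

• Precision ($\pi$) - Fraction of retrieved documents that are actually relevant
• Recall ($\rho$) - Fraction of relevant documents that are retrieved

With $R$ the set of retrieved documents and $R^*$ the set of relevant documents:

  • $\pi = \dfrac{|R \cap R^*|}{|R|}$
  • $\rho = \dfrac{|R \cap R^*|}{|R^*|}$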
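
A short sketch of both measures, computed from a retrieved set and a relevant set of document identifiers. The function name `precision_recall`, the document ids, and the decision to return 0.0 when a set is empty are assumptions made for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved documents that are relevant
    # Assumed convention: report 0.0 rather than divide by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents retrieved, three relevant overall
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d7"}
precision, recall = precision_recall(retrieved_docs, relevant_docs)
```

Two of the four retrieved documents are relevant, giving a precision of 2/4, and two of the three relevant documents were found, giving a recall of 2/3.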

Probabilistic Retrieval

• Probabilistic Ranking Principle (PRP) (Robertson, 1977)
  – documents are ranked in order of decreasing probability of relevance to the user query
  – probabilities are estimated as accurately as possible on the basis of available data
  – the overall effectiveness of such a system will be the best obtainable
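
The ranking step of the principle can be illustrated as a sort by estimated probability. This sketch assumes the relevance probabilities have already been estimated by some model; the function name `rank_by_relevance` and the probability values are hypothetical.

```python
def rank_by_relevance(probabilities):
    """Return document ids ordered by decreasing probability of relevance."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


# Hypothetical estimates of P(relevant | document, query)
estimated = {"d1": 0.20, "d2": 0.85, "d3": 0.55}
ranking = rank_by_relevance(estimated)
```

How well such a ranking performs depends entirely on how accurately the probabilities are estimated, which is the part this sketch leaves out.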

Latent Semantic Analysis

• Why is it needed?
  – serious problems for retrieval methods based on term matching
    • the vector-space similarity approach works only if the terms of the query are explicitly present in the relevant documents
  – rich expressive power of natural language
    • queries often contain terms that express concepts related to the text to be retrieved
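
The term-matching problem can be made concrete with a small example: a document that expresses the same concept in different words scores zero under cosine similarity. The function name `term_cosine`, the query, and both documents are made up for illustration, and the sketch shows only the problem, not latent semantic analysis itself.

```python
import math
from collections import Counter


def term_cosine(query_terms, doc_terms):
    """Cosine similarity between two bags of words."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)  # only shared terms contribute
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm = norm_q * norm_d
    return dot / norm if norm else 0.0


query = ["car", "repair"]
doc_same_terms = ["car", "repair", "manual"]
# Same concept, different vocabulary: no term overlaps with the query
doc_synonyms = ["automobile", "maintenance", "manual"]

score_same = term_cosine(query, doc_same_terms)
score_synonyms = term_cosine(query, doc_synonyms)
```

Here `score_synonyms` is exactly 0 even though the second document is arguably relevant to the query, because the match is at the level of concepts rather than terms.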