






These slides introduce inverse document frequency (idf) and full weighting (tf-idf) in information retrieval. Idf measures how much a term helps to discriminate between documents, while tf-idf weights a term in a document by its frequency and its idf. Document similarity is measured using the cosine coefficient of the documents' vector representations. The slides also cover document retrieval and evaluation measures, probabilistic retrieval, and latent semantic analysis.
Typology: Slides







- $n$ - total number of documents in the set
- Inverse document frequency: $\mathrm{IDF}_j = \log \frac{n}{n_j}$
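As a rough illustration of the IDF weighting, the sketch below computes it for a toy collection. It assumes that $n_j$ is the number of documents containing term $j$ (the usual reading, though the definition is not visible on this slide), uses the natural logarithm, and takes tf-idf to be the raw term count times IDF. The corpus and the function names `idf` and `tf_idf` are hypothetical.

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["information", "retrieval", "ranks", "documents"],
    ["documents", "contain", "terms"],
    ["latent", "semantic", "analysis"],
]

def idf(term, docs):
    """IDF_j = log(n / n_j).

    Assumption: n_j is the number of documents that contain the term.
    """
    n = len(docs)
    n_j = sum(1 for d in docs if term in d)
    if n_j == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n / n_j)

def tf_idf(term, doc, docs):
    """Weight of a term in one document.

    Assumption: raw term frequency multiplied by IDF.
    """
    return doc.count(term) * idf(term, docs)

# "documents" appears in 2 of 3 documents, "latent" in only 1,
# so "latent" receives the higher IDF.
idf_common = idf("documents", docs)
idf_rare = idf("latent", docs)
```

A term that occurs in every document gets an IDF of zero under this definition, which matches the idea that it does not help to discriminate between documents.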
- Ranks documents by measuring the similarity between each document and the query
- Similarity between two documents $d$ and $d'$ is a function $s(d, d') \in \mathbb{R}$
- In a vector-space representation, the cosine coefficient of two document vectors is a measure of similarity
- The cosine of the angle formed by two document vectors $x$ and $x'$ is
  $\cos(x, x') = \frac{x^T x'}{\|x\| \cdot \|x'\|}$
- Documents with many common terms will have vectors closer to each other than documents with fewer overlapping terms
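The cosine coefficient and the resulting ranking can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the vectors are hypothetical term counts over a three-term vocabulary, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(x, y):
    """cos(x, x') = x^T x' / (||x|| * ||x'||) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_x * norm_y)

def rank(query, doc_vectors):
    """Return document indices sorted by decreasing cosine similarity to the query."""
    return sorted(
        range(len(doc_vectors)),
        key=lambda i: cosine(query, doc_vectors[i]),
        reverse=True,
    )

# Hypothetical term-count vectors over a 3-term vocabulary.
doc_vectors = [[1, 1, 0], [0, 1, 1], [0, 0, 2]]
query = [1, 1, 0]
ranking = rank(query, doc_vectors)
```

Here the first document shares both query terms, the second shares one, and the third shares none, so they are ranked in that order.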
- Precision ($\pi$) - fraction of retrieved documents that are actually relevant
- Recall ($\rho$) - fraction of relevant documents that are retrieved
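Both measures reduce to set arithmetic over document identifiers. The sketch below assumes the retrieved and relevant sets are known; the identifiers and the function name `precision_recall` are made up for illustration, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

In this example precision is 2/4 and recall is 2/3.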
- Probabilistic Ranking Principle (PRP) (Robertson, 1977)
  – ranking of the documents in the order of decreasing probability of relevance to the user query
  – probabilities are estimated as accurately as possible on the basis of available data
  – overall effectiveness of such a system will be the best obtainable
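Once relevance probabilities are available, the ranking step itself is just a sort. The sketch below shows only that step; how the probabilities are estimated is not covered here, and the values in `estimates` are invented for illustration.

```python
def prp_rank(prob_relevance):
    """Order document ids by decreasing estimated probability of relevance."""
    return sorted(prob_relevance, key=prob_relevance.get, reverse=True)

# Hypothetical estimates of P(relevant | document, query).
estimates = {"d1": 0.20, "d2": 0.85, "d3": 0.55}
prp_order = prp_rank(estimates)
```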
- Why is it needed?
  – serious problems for retrieval methods based on term matching
    • the vector-space similarity approach works only if the terms of the query are explicitly present in the relevant documents
  – rich expressive power of natural language
    • queries often contain terms that express concepts related to the text to be retrieved
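The term-matching problem is easy to reproduce with a bag-of-words model. In the sketch below, a document about the same concept as the query, but worded differently, scores zero. The vocabulary and texts are hypothetical, and whitespace tokenization is a simplification.

```python
import math

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

vocabulary = ["car", "automobile", "engine", "repair"]
query = bag_of_words("car repair", vocabulary)
doc_a = bag_of_words("automobile engine", vocabulary)  # related concept, no shared term
doc_b = bag_of_words("car engine", vocabulary)         # shares the literal term "car"

score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
```

Although "automobile" and "car" express the same concept, `score_a` is zero because the two texts share no literal term, while `score_b` is positive.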