Week 4
Evaluation of IR Performance
- (IIR Chapter 8)
- Evaluation Metrics
- Test Collections
Query Refinement
- (IIR Chapter 9)
- Query Expansion
- Relevance Feedback
Relevance
In what ways can a document be relevant
to a query?
- Answer question precisely (or partially)
- Suggest a source for more information
- Give background information
- Remind the user of other knowledge
Degree of Relevance
- Binary: relevant or not?
- On a scale: 0-
Intrinsic or Subjective
What to Evaluate?
What can be measured that reflects
users’ ability to use the system? (Cleverdon 66)
- Coverage of Information
- Form of Presentation
- Effort required/Ease of Use
- Time and Space Efficiency
- Recall
- proportion of relevant material actually retrieved
- Precision
- proportion of retrieved material actually relevant
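The two definitions above can be sketched in a few lines of Python. This is a minimal illustration for a single query, assuming binary relevance and treating both the retrieved and relevant documents as sets; the function name and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set semantics."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).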
Problems with Precision/Recall
Can’t know true recall value
- except in small collections
Precision/Recall measure different
aspects of search quality
- A combined measure sometimes is more
appropriate
Focused somewhat on set evaluation vs.
ranked lists
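The slide does not name a specific combined measure. One common choice, described in IIR Chapter 8, is the F-measure, a weighted harmonic mean of precision and recall. The sketch below assumes that measure; the function name and the default beta = 1 (equal weighting) are choices made for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing relevant was retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_measure(1.0, 0.5)  # perfect precision, half of the relevant docs found
print(f1)
```

Because the harmonic mean is pulled toward the smaller value, a system cannot score well by maximizing only one of the two measures.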
Precision/Recall Curves
There is a tradeoff between precision and recall
So measure precision at different levels of recall
- Interpolate
- Average over multiple queries
[Figure: three precision/recall curves; x-axis recall, y-axis precision]
Which of these three curves is better?
Existing Test Collections
Historical, Small
- CACM (3204), CISI (1460), CRAN (1397)
- INSPEC (12684), MED (1033), REUTERS (21578)
More recent
- TREC: 1.5M newsprint; 10/18/100GB Web; 1 TB
- CLEF: 1M+ newspaper articles from 1994 – 1995, 13 langs
- NTCIR: Asian languages (JP, KR, ZH)
- FIRE: Hindi, Marathi, Bengali (about 100k docs each)
- Reuters RCV1 (880,000)
Text REtrieval Conference (TREC)
Annual bake-off for text retrieval systems
Sponsored by
Roughly 2.5 gigabytes of text, newswire
- 50 “topics” (queries)
- Return top 1000 documents per topic (~80 groups)
- Results judged by retired intelligence analysts
- Documents are relevant or not
Numerous tracks
- Cross-Language
- Spoken Documents
- Question Answering
http://trec.nist.gov/
Sample TREC Topic
Number: 285
Topic: World submarine forces
Description: Determine the number of submarines, both nuclear-powered and conventional, presently in the inventories of all the countries in the world.
Narrative: We are looking for a count of operable submarines in any country that currently has a navy with submarines. To be relevant a document should give a specific number of submarines, but not necessarily its entire fleet of submarines (although, that is our ultimate goal). A report of a French submarine suffering a mishap in the North sea would not be relevant. However, a report of a new submarine being built in Shanghai that contains other valuable information, such as “this is the third reported unit constructed at this base” would be relevant. Any information that would be considered useful as an intelligence tool in determining a country’s submarine order of battle would be relevant.
Topic format: SGML markup, with a short phrase, a sentence, and a paragraph
How Test Runs are Evaluated
First ranked doc is relevant, which is 10% of the total relevant.
Therefore precision at the 10% recall level is 100%
Next relevant doc gives us 66% precision at the 20% recall level
Etc.
1. d123 *
2. d84
3. d56 *
4. d6
5. d8
6. d9 *
7. d511
8. d129
9. d187
10. d25 *
11. d38
12. d48
13. d250
14. d113
15. d3 *
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}: 10 relevant
Example from Chapter 3 in MIR
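The walk down the ranked list can be sketched as follows. This is a minimal illustration using the ranking and the relevant set Rq from this slide; the function name is an assumption made for the example.

```python
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}


def precision_recall_points(ranking, relevant):
    """Return a (recall, precision) pair at each rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


points = precision_recall_points(ranking, relevant)
for recall, precision in points:
    print(f"recall {recall:.0%}  precision {precision:.1%}")
```

Only five of the ten relevant documents appear in the top 15, so recall never rises above 50% for this run.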
Interpolation
So, at recall levels 0%, 10%, 20%, and
30% the interpolated precision is 33.3%
At recall levels 40%, 50%, and 60%
interpolated precision is 25%
And at recall levels 70%, 80%, 90% and
100%, interpolated precision is 20%
Giving graph…
[Figure: interpolated precision (y-axis) plotted against recall (x-axis)]
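A minimal sketch of the interpolation rule, assuming the usual definition: the interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r. The values on this slide are consistent with a query that has three relevant documents found at ranks 3, 8, and 15; that ranking is an assumption used here only for illustration, as is the function name.

```python
def interpolate(points, levels=None):
    """Interpolated precision at each standard recall level (0%, 10%, ..., 100%).

    points is a list of (recall, precision) pairs observed in a ranking.
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # Small tolerance guards against floating-point error at the boundaries.
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated


# Assumed example: 3 relevant documents, found at ranks 3, 8, and 15.
observed = [(1 / 3, 1 / 3), (2 / 3, 2 / 8), (3 / 3, 3 / 15)]
curve = interpolate(observed)
print([round(p, 3) for p in curve])
```

Averaging such 11-point curves over many queries gives the precision/recall curve for a system.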
Computing Average Precision
We sum the precision at each of the 5 relevant documents seen (1 + 0.67 + 0.5 + 0.4 + 0.33 = 2.9)
We must divide by the total number of relevant documents (10)
- 5 relevant documents weren’t observed
Average Precision is 2.9 / 10 = 0.29
Mean average precision is an average over multiple queries
1. d123 *
2. d84
3. d56 *
4. d6
5. d8
6. d9 *
7. d511
8. d129
9. d187
10. d25 *
11. d38
12. d48
13. d250
14. d113
15. d3 *
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}: 10 relevant
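The computation on this slide can be sketched as below: sum the precision at each rank where a relevant document appears, then divide by the total number of relevant documents, so that relevant documents never retrieved contribute zero. The function names are assumptions made for the example; the data is the ranking and Rq shown above.

```python
def average_precision(ranking, relevant):
    """Sum of precision at each relevant rank, divided by the number of relevant docs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over a list of (ranking, relevant) pairs."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

ap = average_precision(ranking, relevant)
print(round(ap, 2))
```

With a single query, mean average precision is the same as the average precision of that query.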
Using TREC_EVAL
Developed from SMART evaluation programs
for use in TREC
- trec_eval [-q] [-a] [-d] trec_qrel_file top_ranked_file
Uses:
- List of top-ranked documents
- QID iter docno rank sim runid
- 030 Q0 ZF08-175-870 0 4238 prise
- QRELS file for collection
- QID iter docno rel
- 251 0 FT911-1003 1
- 251 0 FT911-101 1
- 251 0 FT911-1300 0
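To make the two file layouts concrete, here is a small sketch that reads lines in those formats into Python dictionaries. It is not part of trec_eval; the function names are hypothetical, and it assumes whitespace-separated fields in the order shown above (four fields per QRELS line, six per results line).

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map qid -> {docno: relevance} from 'QID iter docno rel' lines."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _iteration, docno, rel = line.split()
        qrels[qid][docno] = int(rel)
    return qrels


def parse_run(lines):
    """Map qid -> list of (docno, rank, sim, runid) from results-file lines."""
    run = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        qid, _iteration, docno, rank, sim, runid = line.split()
        run[qid].append((docno, int(rank), float(sim), runid))
    return run


qrels = parse_qrels(["251 0 FT911-1003 1", "251 0 FT911-101 1", "251 0 FT911-1300 0"])
run = parse_run(["030 Q0 ZF08-175-870 0 4238 prise"])
print(qrels["251"], run["030"])
```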
Query Refinement
Global Methods
- Context independent thesaurus expansion
Local Methods
- Manual Feedback
- Automated Relevance Feedback
Query Modification
Problem: how to reformulate the query?
- Thesaurus expansion:
- Suggest terms similar to query terms
- Global method: see car, add ‘automobile’
- Relevance feedback:
- Suggest terms (and documents) similar to retrieved documents that have been judged (by a user) to be relevant (or use top k documents)
- Local method: based on retrieved document set
- Term re-weighting
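Thesaurus expansion as a global method can be sketched as a lookup that is independent of the retrieved documents. The thesaurus below is a hypothetical one-entry dictionary built from the car/automobile example on this slide, and the function name is an assumption.

```python
def expand_query(terms, thesaurus):
    """Append thesaurus entries for each query term, keeping the original order."""
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:  # avoid adding the same term twice
                expanded.append(synonym)
    return expanded


thesaurus = {"car": ["automobile"]}  # hypothetical toy thesaurus
expanded = expand_query(["used", "car"], thesaurus)
print(expanded)
```

A local method would instead derive the added terms from the documents retrieved for this particular query.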
Thesauri
Some electronic thesauri exist
- E.g., Roget’s
- Domain specific thesauri (e.g., chemistry)
- might map NaCl, salt, sodium chloride
Other approach is to induce one from a
collection of text, statistically
- Can capture synonymy
- But, might also get other word relations
My opinion
- Thesaurus-based query expansion is best
with controlled vocabulary search
Automagically building thesauri
Church (“One Term or Two”, SIGIR 1995)
looked at correlations between forms of
words in texts
If hostage and hostages both occur in a
document
- Worth more than single occurrence, but less
than two. Possibly 1.
Church proposes novel term weighting
scheme
             hostages   !hostages
hostage      619        479
!hostage     648        78,
Lexical Associations
Subjects write first word that comes to mind
- doctor/nurse; black/white (Palermo & Jenkins 64)
- Text corpora yield similar associations
One measure: Mutual Information
- See Church and Hanks (Comp. Ling. ’90)
If word occurrences are independent, the numerator and denominator of the ratio P(x,y) / (P(x) P(y)) become equal
- when measured across a large collection
Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks)
MI(x,y)  f(x,y)  f(x)    x         f(y)   y
11.3     12      111     honorary  621    doctor
11.3     8       1105    doctors   44     dentists
10.7     30      1105    doctors   241    nurses
9.4      8       1105    doctors   154    treating
9.0      6       275     examined  621    doctor
8.9      11      1105    doctors   317    treat
8.7      25      621     doctor    1407   bills
0.96     6       621     doctor    73785  with
0.95     41      284690  a         1105   doctors
0.93     12      84716   is        1105   doctors
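The scores in this table can be approximately reproduced from the counts. The sketch below assumes the Church and Hanks form of mutual information, log2 of P(x,y) / (P(x) P(y)), with probabilities estimated as counts divided by the corpus size N; the function name is an assumption.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """log2( P(x,y) / (P(x) * P(y)) ), with probabilities estimated as counts / n."""
    return math.log2((f_xy * n) / (f_x * f_y))


N = 15_000_000  # corpus size given on the slide

mi_honorary_doctor = mutual_information(12, 111, 621, N)
mi_doctors_nurses = mutual_information(30, 1105, 241, N)
print(round(mi_honorary_doctor, 2), round(mi_doctors_nurses, 2))
```

The results come out close to the tabulated 11.3 and 10.7. Small differences are expected because N is given only as a round figure. When the joint count equals what independence predicts, the ratio is 1 and the score is 0.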
Term Similarity Calculations
Multiple metrics possible
- See Chung & Lee, JASIST 52(4), 2001 for comparison
Contingency table (example terms: snow, plow)
           Present   Absent
Present    a         b         a+b
Absent     c         d         c+d
           a+c       b+d       N
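The slide does not say which metrics are compared. As an illustration only, the sketch below computes two widely used co-occurrence measures, the Jaccard and Dice coefficients, from the contingency-table cells: a is the number of documents containing both terms, and b and c are the numbers containing only one of them. The counts in the example are invented.

```python
def jaccard(a, b, c):
    """Documents with both terms divided by documents with at least one term."""
    total = a + b + c
    return a / total if total else 0.0


def dice(a, b, c):
    """Twice the joint count divided by the sum of the two term frequencies."""
    total = 2 * a + b + c
    return 2 * a / total if total else 0.0


# Invented counts: 10 docs contain both terms, 5 contain only one, 5 only the other.
j = jaccard(10, 5, 5)
d = dice(10, 5, 5)
print(j, round(d, 3))
```

Neither measure uses cell d (documents containing neither term), which usually dominates in a large collection.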
Relevance Feedback
Main Idea:
- Modify existing query based on relevance
judgments
- Extract terms from relevant documents and add them to the query
- and/or re-weight the terms already in the query
- Manually
- Users select relevant documents
- Users/system select terms from an automatically- generated list
- Automated (blind/pseudo) rel. feedback
- Assume top k docs are relevant (e.g., 5 to 20)
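A minimal sketch of blind (pseudo) relevance feedback under simple assumptions: documents are already tokenized lists of terms, the top k ranked documents are treated as relevant, and the most frequent terms not already in the query are added. Term selection by raw frequency is an assumption made for illustration, not a method given in the slides; the function name and the toy documents are likewise invented.

```python
from collections import Counter


def pseudo_relevance_feedback(query_terms, ranked_docs, k=5, n_terms=2):
    """Expand the query with the most frequent new terms from the top k documents."""
    counts = Counter()
    for doc in ranked_docs[:k]:  # assume the top k documents are relevant
        counts.update(t for t in doc if t not in query_terms)
    new_terms = [term for term, _ in counts.most_common(n_terms)]
    return list(query_terms) + new_terms


# Toy ranked list of tokenized documents (invented for this example).
docs = [
    ["submarine", "navy", "fleet"],
    ["submarine", "navy", "nuclear"],
    ["navy", "fleet", "port"],
]
new_query = pseudo_relevance_feedback(["submarine"], docs, k=3, n_terms=2)
print(new_query)
```

In the manual variant, the user would choose the relevant documents or pick the terms from the candidate list instead of accepting the top k automatically.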