Lecture 9
Distributed Retrieval
- Algorithms for distributing processing
- Source selection
- Collection fusion
Retrieval of Speech
Distributed Systems Review
Splitting work among N processors
might decrease response time by as
much as a factor of N
Actual speedup is less than N
- Inherently sequential operations
- Rate limiting steps
- Costs of communication and fusion
Performance Measures
Speedup
- Speedup = (time of best sequential algorithm) / (time of parallel algorithm)
Amdahl’s Law
Efficiency
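The three measures can be made concrete with a short sketch. Only the speedup ratio survives in the slide text; the forms used here for Amdahl's Law and for efficiency are the standard textbook ones and are an assumption about what the slide showed. Function names and sample numbers are illustrative.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup = time of best sequential algorithm / time of parallel algorithm."""
    return t_sequential / t_parallel


def amdahl_bound(serial_fraction: float, n_processors: int) -> float:
    """Amdahl's Law: if a fraction f of the work is inherently sequential,
    speedup on N processors is at most 1 / (f + (1 - f) / N)."""
    f = serial_fraction
    return 1.0 / (f + (1.0 - f) / n_processors)


def efficiency(observed_speedup: float, n_processors: int) -> float:
    """Efficiency = speedup / N; 1.0 would mean every processor is fully used."""
    return observed_speedup / n_processors


# Illustrative numbers: 120 s sequentially, 20 s on 8 processors.
s = speedup(120.0, 20.0)
print("speedup:", s)
print("Amdahl bound with 10% sequential work:", amdahl_bound(0.1, 8))
print("efficiency:", efficiency(s, 8))
```

With no sequential work the bound equals N, which is the "factor of N" best case; any sequential fraction pulls it below N.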
Elements processed by a search algorithm
Can be split by terms, or by documents
- Which is better?
Term-document matrix (columns: Indexing Items; rows: Documents)
      k1    k2    ...  ki    ...  kt
d1    w1,1  w2,1  ...  wi,1  ...  wt,1
d2    w1,2  w2,2  ...  wi,2  ...  wt,2
...   ...   ...   ...  ...   ...  ...
dj    w1,j  w2,j  ...  wi,j  ...  wt,j
...   ...   ...   ...  ...   ...  ...
dN    w1,N  w2,N  ...  wi,N  ...  wt,N
Architecture
Logical Document Partitioning
Dictionary representation is extended
[Figure: the Dictionary entry for Term i holds pointers (P) into the Inverted List for item i]
LDP: Algorithms
Inverted File Construction
- The indexer partitions the documents among the processors
- Each indexing process generates a batch of inverted lists, sorted by indexing item
- A merge step creates the final inverted file
Query Evaluation
- Each process executes the same document scoring algorithm on its document subcollection
- The search processes record document scores in a single shared array of document score accumulators
- A broker produces the final ranked list of documents
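A minimal sketch of logical document partitioning, assuming documents are assigned to processors by contiguous document-id ranges (the slides do not say how the assignment is made). The extended dictionary entry is modelled as a list of offsets into one shared inverted list. Each search process scores only its own slice and writes into the shared accumulator array. All names are illustrative, raw term frequency stands in for the real scoring function, and the processes run one after another here to keep the example deterministic.

```python
from bisect import bisect_left

# Inverted list for one term: (doc_id, term_frequency), sorted by doc_id.
postings = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]

# Assumed partitioning: processor p owns doc ids in [first_doc[p], first_doc[p + 1]).
first_doc = [1, 3, 5, 7]


def partition_pointers(postings, first_doc):
    """Offsets into the inverted list where each processor's postings begin
    (the last offset marks the end of the list)."""
    doc_ids = [doc_id for doc_id, _ in postings]
    return [bisect_left(doc_ids, d) for d in first_doc]


def score_slice(postings, start, end, accumulators):
    """One search process: add scores for its slice to the shared accumulators."""
    for doc_id, tf in postings[start:end]:
        accumulators[doc_id] += tf


pointers = partition_pointers(postings, first_doc)
accumulators = [0] * 7  # index 0 unused; doc ids run from 1 to 6
for p in range(len(first_doc) - 1):
    score_slice(postings, pointers[p], pointers[p + 1], accumulators)

# The broker turns the accumulators into the final ranked list.
ranked = sorted(range(1, 7), key=lambda d: (-accumulators[d], d))
print(pointers, ranked)
```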
Physical Document Partitioning
Data Partitioning
- The documents are physically partitioned
into separate collections, one for each
parallel processor
- Each subcollection has its own inverted file
for a fraction of the index
- Generally, the documents are disjoint (i.e.,
not duplicated)
PDP: Algorithms
Inverted File Construction
- Each processor creates, in parallel, its own complete index corresponding to its documents
- Optionally, a merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries
Query Evaluation
- The broker distributes the query to all of the parallel search processes
- Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list
- The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list
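The construction and query evaluation steps above can be sketched as follows. This is an illustrative, single-process stand-in: the function names are hypothetical, the partitions are searched in a loop rather than in parallel, and raw term frequency is used as the score so that no global statistics are needed.

```python
import heapq
from collections import Counter
from itertools import islice


def build_index(docs):
    """Each processor builds a complete index for its own documents only."""
    index = {}
    for doc_id, text in docs.items():
        terms = text.lower().replace(",", "").split()
        for term, tf in Counter(terms).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def search_partition(index, query_terms, k):
    """One search process: score local documents, return its top-k hit-list."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # raw term frequency as a stand-in score
    # (-score, doc_id) sorts best-first and breaks ties by doc id.
    return heapq.nsmallest(k, ((-s, d) for d, s in scores.items()))


def broker(partition_indexes, query_terms, k):
    """Send the query to every partition and merge the sorted hit-lists."""
    hit_lists = [search_partition(ix, query_terms, k) for ix in partition_indexes]
    merged = heapq.merge(*hit_lists)
    return [(doc_id, -neg_score) for neg_score, doc_id in islice(merged, k)]


partitions = [
    {1: "Pease porridge hot", 2: "Pease porridge cold"},
    {3: "Pease porridge in the pot", 4: "Pease porridge hot, pease porridge not cold"},
    {5: "Pease porridge cold, pease porridge not hot", 6: "Pease porridge hot in the pot"},
]
indexes = [build_index(docs) for docs in partitions]
top = broker(indexes, ["hot", "cold"], k=3)
print(top)
```

Scores from different partitions are directly comparable here only because the score ignores collection statistics. With idf-style weights, the optional merge of global statistics is what keeps the intermediate hit-lists comparable.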
Example
Document  Text
1         Pease porridge hot
2         Pease porridge cold
3         Pease porridge in the pot
4         Pease porridge hot, pease porridge not cold
5         Pease porridge cold, pease porridge not hot
6         Pease porridge hot in the pot

Dictionary  Inverted Lists
cold        <2,1> <4,1> <5,1>
hot         <1,1> <4,1> <5,1> <6,1>
in          <3,1> <6,1>
not         <4,1> <5,1>
pease       <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge    <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot         <3,1> <6,1>
the         <3,1> <6,1>
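The example index can be reproduced with a few lines of code. This is a sketch, assuming lowercasing and letters-only tokenization (the slide does not state the tokenizer); each posting is a (document, term frequency) pair, printed in the slide's `<doc,tf>` notation.

```python
import re
from collections import Counter, defaultdict

documents = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_index(documents):
    """Return {term: [(doc_id, term_frequency), ...]}, lists sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        counts = Counter(re.findall(r"[a-z]+", documents[doc_id].lower()))
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)


index = build_inverted_index(documents)
for term in sorted(index):
    print(term, " ".join(f"<{d},{tf}>" for d, tf in index[term]))
```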
Example: Logical Doc Partitioning
Dictionary: cold, hot, in, not, pease, porridge, pot, the
Inverted List for Term “pease”, with one dictionary pointer (P) per processor:
P -> <1,1> <2,1>
P -> <3,1> <4,2>
P -> <5,2> <6,1>
Example: Physical Doc Partitioning
P (documents 1-2)
cold        <2,1>
hot         <1,1>
pease       <1,1> <2,1>
porridge    <1,1> <2,1>

P (documents 3-4)
cold        <4,1>
hot         <4,1>
in          <3,1>
not         <4,1>
pease       <3,1> <4,2>
porridge    <3,1> <4,2>
pot         <3,1>
the         <3,1>

P (documents 5-6)
cold        <5,1>
hot         <5,1> <6,1>
in          <6,1>
not         <5,1>
pease       <5,2> <6,1>
porridge    <5,2> <6,1>
pot         <6,1>
the         <6,1>
Example: Term Partitioning
The complete inverted lists are divided by term among three processors (P):
cold        <2,1> <4,1> <5,1>
hot         <1,1> <4,1> <5,1> <6,1>
in          <3,1> <6,1>
not         <4,1> <5,1>
pease       <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge    <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot         <3,1> <6,1>
the         <3,1> <6,1>
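The difference between splitting by documents and splitting by terms can be sketched on the example index. The document groups follow the physical partitioning example (two documents per processor). The assignment of terms to processors is not recoverable from the slide, so the round-robin rule below is purely an illustrative assumption.

```python
# Global inverted index from the example: term -> [(doc_id, tf), ...]
index = {
    "cold": [(2, 1), (4, 1), (5, 1)],
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "in": [(3, 1), (6, 1)],
    "not": [(4, 1), (5, 1)],
    "pease": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
    "porridge": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
    "pot": [(3, 1), (6, 1)],
    "the": [(3, 1), (6, 1)],
}


def partition_by_document(index, doc_groups):
    """Each processor sees every term, but only postings for its own documents."""
    parts = []
    for group in doc_groups:
        part = {}
        for term, postings in index.items():
            local = [(d, tf) for d, tf in postings if d in group]
            if local:
                part[term] = local
        parts.append(part)
    return parts


def partition_by_term(index, n_processors):
    """Each processor holds complete inverted lists for a subset of the terms.
    Round-robin over the sorted terms is an arbitrary, illustrative assignment."""
    parts = [{} for _ in range(n_processors)]
    for i, term in enumerate(sorted(index)):
        parts[i % n_processors][term] = index[term]
    return parts


doc_parts = partition_by_document(index, [{1, 2}, {3, 4}, {5, 6}])
term_parts = partition_by_term(index, 3)
print(doc_parts[0])
print(sorted(term_parts[0]))
```

Under document partitioning every query goes to all processors and each returns a partial answer. Under term partitioning only the processors that own the query terms do any work, but they must ship whole inverted lists or partial scores to be combined.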
Server Selection Problem
In a truly distributed system, documents
may still be clustered by time or subject
matter
Then we only need to consider a few of
many subcollections
- Can’t ask for postings lists – too large!
- But we can afford a small ranked list
This is much more efficient than making
requests of all servers
How do we know which servers to talk
to?
Server Selection Using IR
One approach treats collections as if
they were large documents
- We need some knowledge about the DB
Pick the collection(s) most likely
to contain relevant information
- rank using, say, the vector cosine model
- rank by highest document freq. of query
terms
If more than one collection is searched,
we have a new problem:
- How to merge disparate results (ranked lists)
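A minimal sketch of the "collections as large documents" idea, assuming each server can supply summed term counts for its collection. Raw counts are used as vector weights for brevity (a real system would likely use tf-idf style weights), and the two tiny collections are made up for illustration.

```python
import math
from collections import Counter


def collection_vector(docs):
    """Treat a whole collection as one large document: sum its term counts."""
    vec = Counter()
    for text in docs:
        vec.update(text.lower().split())
    return vec


def cosine(query_vec, coll_vec):
    """Vector cosine between the query and a collection vector."""
    dot = sum(w * coll_vec.get(t, 0) for t, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_c = math.sqrt(sum(w * w for w in coll_vec.values()))
    return dot / (norm_q * norm_c) if norm_q and norm_c else 0.0


def rank_collections(collections, query, top_n):
    """Return the names of the top_n collections most similar to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, collection_vector(docs)), name)
              for name, docs in collections.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical collections, for illustration only.
collections = {
    "sports": ["the team won the game", "computer rankings of teams"],
    "computing": ["computer architecture and computer networks",
                  "parallel computer algorithms"],
}
print(rank_collections(collections, "computer", top_n=1))
```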
Result Combination Strategies
Round robin or weighted round robin
(Normalized) scores
- Consider a search of “computer” and two
collections
- Sports Illustrated articles
- ACM Digital Library
- How should scores be normalized?
Some linear combination of scores
Etc…
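Two of the strategies above can be sketched briefly: plain round robin, and merging by normalized scores with an optional per-collection weight. Min-max normalization is only one possible answer to "how should scores be normalized?", and the scores, document names, and weights below are made up. Each scored list is assumed to be non-empty.

```python
from itertools import zip_longest


def round_robin(ranked_lists):
    """Interleave results: the first of each list, then the second, and so on."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def min_max_normalize(scored):
    """Rescale one collection's scores to [0, 1]."""
    scores = [s for _, s in scored]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    return [(doc, (s - lo) / span) for doc, s in scored]


def merge_normalized(scored_lists, weights=None):
    """Pool normalized scores, optionally scaled by a per-collection weight."""
    weights = weights or [1.0] * len(scored_lists)
    pooled = [(doc, w * s)
              for scored, w in zip(scored_lists, weights)
              for doc, s in min_max_normalize(scored)]
    return [doc for doc, _ in sorted(pooled, key=lambda p: (-p[1], p[0]))]


# Hypothetical results for the query "computer" from two collections
# whose raw scores are on very different scales.
si = [("si1", 0.9), ("si2", 0.4)]
acm = [("acm1", 12.0), ("acm2", 9.0), ("acm3", 3.0)]
print(round_robin([[d for d, _ in si], [d for d, _ in acm]]))
print(merge_normalized([si, acm]))
print(merge_normalized([si, acm], weights=[0.5, 1.0]))
```

The weighted call shows the effect of down-weighting a collection that is unlikely to be about the query topic: its best document drops below the stronger collection's second hit.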
Collection Weighting
Jamie Callan suggested re-weighting
documents using collection-specific weights:
Callan has since studied collection fusion
using TREC datasets
- His framework was to split data by source/month
- ~400 subcollections for TREC
Map/Reduce
CACM: Disk is the new RAM
- April 2008 article: ‘Solving Rubik’s Cube’
- Bandwidth of disks in a cluster is high
Cloud Computing is here to stay
Tamer Elsayed (UMCP)
- All pairs document similarity
- ACL 2008, “Pairwise document similarity with MapReduce”
- 19-node Hadoop cluster
Tamer Elsayed (UMCP): Pairwise Document Similarity in Large Collections with MapReduce
MapReduce Framework
input -> map -> group by key -> reduce -> output
- map: (k1, v1) -> [(k2, v2)]
- group by key: [(k2, v2)] -> (k2, [v2])
- reduce: (k2, [v2]) -> [(k3, v3)]
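The key/value signatures of the framework can be illustrated with a single-process stand-in. This is not Hadoop's API; `map_reduce`, `count_mapper`, and `count_reducer` are hypothetical names, and word counting is used only because it is the smallest complete example.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework:
    map:    (k1, v1)   -> [(k2, v2)]
    group:  [(k2, v2)] -> (k2, [v2])
    reduce: (k2, [v2]) -> [(k3, v3)]
    """
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # a real framework also sorts keys before reduce
        output.extend(reducer(k2, groups[k2]))
    return output


def count_mapper(doc_id, text):
    """Emit (word, 1) for every word occurrence in the document."""
    return [(word, 1) for word in text.split()]


def count_reducer(word, ones):
    """Sum the partial counts for one word."""
    return [(word, sum(ones))]


result = map_reduce([("d1", "a b a"), ("d2", "b c")], count_mapper, count_reducer)
print(result)
```

In a cluster, the map calls run in parallel on input splits and the reduce calls run in parallel on key groups; only the grouping step moves data between machines.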
Indexing (3-doc toy collection)
Documents
d1  Clinton Obama Clinton
d2  Clinton Cheney
d3  Clinton Barack Obama

Index (term frequency per document)
Clinton  d1:2 d2:1 d3:1
Barack   d3:1
Cheney   d2:1
Obama    d1:1 d3:1
Pairwise Similarity
[Figure: pairwise similarity computed from the index postings: Clinton (2, 1, 1), Barack (1), Cheney (1), Obama (1, 1)]
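A sketch of the two phases on the toy collection, assuming raw term frequencies as term weights and similarity defined as the sum of weight products over shared terms. The document ids d1 to d3 are labels introduced here, and the loops stand in for the map and reduce steps of each job.

```python
from collections import Counter, defaultdict
from itertools import combinations

docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

# Phase 1 (indexing): term -> [(doc_id, tf), ...]
index = defaultdict(list)
for doc_id in sorted(docs):
    for term, tf in Counter(docs[doc_id].split()).items():
        index[term].append((doc_id, tf))

# Phase 2 (pairwise similarity):
# the "map" step emits one partial product per pair of documents sharing a term,
# the "reduce" step sums the partial products for each pair.
similarity = Counter()
for term, postings in index.items():
    for (di, tfi), (dj, tfj) in combinations(postings, 2):
        similarity[(di, dj)] += tfi * tfj

for pair in sorted(similarity):
    print(pair, similarity[pair])
```

Only terms that occur in more than one document contribute, so pairs that share no term are never generated. Very frequent terms dominate the cost, since a postings list of length n yields n(n-1)/2 pairs.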
Meta Search on the Web
Evidence (rankings)
from multiple
search engines can
be combined
Relies on
redundancy
- How should results be combined?
Examples
- Dogpile
- MonsterCrawler
- info.com
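One possible answer to "how should results be combined?" is a Borda count over the rankings, which needs only rank positions and so works when engines do not expose comparable scores. It is offered here as an illustrative assumption, not as the method used by any of the engines listed; the rankings are made up.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Each engine gives a result (list length - rank) points; points are summed.
    Results missing from an engine's list get no points from that engine."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for rank, url in enumerate(ranking):
            points[url] += n - rank
    return sorted(points, key=lambda u: (-points[u], u))


# Hypothetical rankings from three engines.
engine_a = ["x", "y", "z"]
engine_b = ["y", "x", "w"]
engine_c = ["y", "z", "x"]
fused = borda_fuse([engine_a, engine_b, engine_c])
print(fused)
```

Results returned by several engines rise to the top, which is the redundancy the approach relies on.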
Multimedia Retrieval
Image
- Large storage requirements
- Methods are based on content
- Features: color histograms, intensity values, gradients, edges
- On Web, anchor text / captions can be used
Video
- Takes massive amount of space to store
- With shot boundary detection, reverts to image retrieval
Speech
- Acoustic data much larger than text
- Conversion to text shrinks storage and simplifies search; creates word errors
OCR’d Text
- Coping with Misspellings
Retrieval of Speech
Speech sources
- Teleconferences and meetings
- Video with audio (e.g., television)
- Radio
- Medical & Legal transcription
Automated recognition is possible in real time
- Speaker dependent
- Small vocabulary
- Speaker independent and large vocabulary
Classic example
Other examples
I scream (ice cream)
Grade A (gray day)
sixty sick sheep (sixty six sheep)
The boys are hoarse (the boy’s a horse)
The stuffy nose (the stuff he knows)
Rap city (rhapsody)
a girl with colitis goes by (a girl with
kaleidoscope eyes)
Gladly the cross-eyed bear. (Gladly the
cross I'd bear.)
TREC Initiatives
Created ‘large’ collections
- TRECs 7-
- Voice of America, ABC News, CNN
- 20 to 100 hours
Format
- Speech signals with transcripts available
- Queries were text
‘High accuracy’ Broadcast News
- Mistakes in recognition hurt IR performance
- Many out-of-vocabulary words
A Solved Problem?
Speech retrieval was studied in the mid/late 90s at TREC – then dropped 11 years ago
- “progress has occurred so quickly, that one might conclude that SDR is a solved problem” - Garofolo et al. (2000), ‘The TREC Spoken Document Retrieval Track: A Success Story’
A Typical IR/Speech System
Indexing Human Speech (processing per audio sequence)
- Signal Processing & Feature Extraction
- Acoustic Model
- Language Model
- Hypothesis Selection
- Index Creation
- Intermediate representations: phones, word lattice, sentences
Text-based Search of Speech Archive (processing per query)
- User Interaction
- Generate Terms from User Query
- Identify Candidate Documents
- Rank By Similarity Measure
Example
- Query: Who is the president of Indonesia?
- Retrieved: ... Megawati Sukarnoputri, former President of the Republic of Indonesia, ...
Spoken Document Representation
ASR system lexicons are limited (10-100k words)
- Unable to generate out-of-vocabulary (OOV) words
- IR search for an OOV is doomed to failure
Subword representations have been proposed for this scenario
- phone – the smallest unit of spoken language
- About 45 phones are present in English
- Phone 3-grams suggested for broadcast news
- K. Ng (2000), ‘Subword-based Approaches for Spoken Document Retrieval’, PhD Thesis, MIT
Combine Speech-to-Text (STT) & phonetics
- Provides a way to solve the OOV problem
- Requires extra work
- Provides a way to solve the OOV problem
- Requires extra work
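A small sketch of phone 3-grams as index terms, assuming both the query and the recognized audio have already been converted to phone sequences. The phone strings below are rough, made-up transcriptions for illustration only, and the overlap ratio is a stand-in for a real retrieval score.

```python
def phone_ngrams(phones, n=3):
    """Overlapping phone n-grams, used as index terms in place of words."""
    return [" ".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def ngram_overlap(query_phones, doc_phones, n=3):
    """Fraction of the query's distinct n-grams that also occur in the document."""
    q = set(phone_ngrams(query_phones, n))
    d = set(phone_ngrams(doc_phones, n))
    return len(q & d) / len(q) if q else 0.0


# Hypothetical phone sequences for an out-of-vocabulary name:
# the recognizer output differs from the query in one phone (t -> d).
query_phones = ["m", "eh", "g", "ah", "w", "aa", "t", "iy"]
recognized = ["m", "eh", "g", "ah", "w", "aa", "d", "iy"]

print(phone_ngrams(query_phones))
print(ngram_overlap(query_phones, recognized))
```

A word-level index would miss this document entirely because the name is not in the lexicon; with phone 3-grams, four of the six query terms still match despite the recognition error.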