Distributed Info Retrieval: Algorithms for Partitioning, Files, and Queries, Slides of Artificial Intelligence

These slides cover algorithms for distributed retrieval, including logical and physical document partitioning, inverted file construction, and query evaluation. They describe methods for splitting work among processors, partitioning data, and merging results, with examples and comparisons of the different approaches.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist

Lecture 9

 Distributed Retrieval

  • Algorithms for distributing processing
  • Source selection
  • Collection fusion

 Retrieval of Speech

Distributed Systems Review

 Splitting work among N processors might decrease response time by as much as a factor of N

 Actual speedup is less than N

  • Inherently sequential operations
  • Rate limiting steps
  • Costs of communication and fusion

Performance Measures

 Speedup: $S = \frac{\text{Time of best sequential algorithm}}{\text{Time of parallel algorithm}}$

 Amdahl’s Law

 Efficiency
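The slide names Amdahl's Law and efficiency without giving formulas. A minimal sketch, assuming the standard textbook forms (speedup bounded by 1 / (f + (1 - f) / N) for a sequential fraction f, and efficiency defined as speedup divided by N); the function names are illustrative and not taken from the lecture:

```python
def amdahl_speedup(sequential_fraction: float, n_processors: int) -> float:
    """Upper bound on speedup when a fraction of the work is inherently sequential."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_processors)


def efficiency(speedup: float, n_processors: int) -> float:
    """Speedup per processor; 1.0 would mean perfect linear scaling."""
    return speedup / n_processors


# With 10% sequential work, adding processors helps less and less.
for n in (1, 4, 16, 64):
    s = amdahl_speedup(0.1, n)
    print(f"N={n:3d}  speedup={s:5.2f}  efficiency={efficiency(s, n):.2f}")
```

Under these assumptions the speedup can never exceed 1 / f, which is one way to read "actual speedup is less than N".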

 Elements processed by a search algorithm

 Can be split by terms, or by documents

  • Which is better?

Term-document weight matrix (columns: indexing items; rows: documents):

        k1     k2     ...  ki     ...  kt
  d1    w1,1   w2,1   ...  wi,1   ...  wt,1
  d2    w1,2   w2,2   ...  wi,2   ...  wt,2
  ...   ...    ...    ...  ...    ...  ...
  dj    w1,j   w2,j   ...  wi,j   ...  wt,j
  ...   ...    ...    ...  ...    ...  ...
  dN    w1,N   w2,N   ...  wi,N   ...  wt,N

Architecture

Logical Document Partitioning

 Dictionary representation is extended

[Figure: the dictionary entry for item i holds processor pointers (P) into the inverted list for term i]

LDP: Algorithms

 Inverted File Construction

  • The indexer partitions the documents among the processors
  • Each indexing process generates a batch of inverted lists, sorted by indexing item
  • A merge step creates the final inverted file

 Query Evaluation

  • Each process executes the same document scoring algorithm on its document subcollection
  • The search processes record document scores in a single shared array of document score accumulators
  • A broker produces the final ranked list of documents
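The query evaluation steps above can be sketched as follows. This is a rough illustration, not the lecture's implementation: it assumes each processor owns a contiguous range of document ids, that the extended dictionary stores one offset per processor into each inverted list, and that scores are plain term-frequency sums. The processors are simulated with a sequential loop, and all names (`FIRST_DOC`, `processor_offsets`, `evaluate`) are hypothetical.

```python
from bisect import bisect_left

# Global inverted lists for two terms of the "pease porridge" example,
# as (document id, term frequency) pairs sorted by document id.
INVERTED = {
    "cold": [(2, 1), (4, 1), (5, 1)],
    "pease": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
}

# Assumption: processor p owns the doc ids from FIRST_DOC[p] up to the next start.
FIRST_DOC = [1, 3, 5]
NUM_DOCS = 6


def processor_offsets(postings, first_doc):
    """One offset per processor: where its documents begin in the list."""
    doc_ids = [doc_id for doc_id, _ in postings]
    return [bisect_left(doc_ids, start) for start in first_doc]


# Extended dictionary: term -> per-processor offsets into its inverted list.
DICTIONARY = {t: processor_offsets(p, FIRST_DOC) for t, p in INVERTED.items()}


def postings_for(term, processor):
    """The slice of a term's inverted list that belongs to one processor."""
    offsets = DICTIONARY[term]
    start = offsets[processor]
    if processor + 1 < len(offsets):
        end = offsets[processor + 1]
    else:
        end = len(INVERTED[term])
    return INVERTED[term][start:end]


def evaluate(query_terms):
    """Each processor adds scores to a shared accumulator array; the broker ranks."""
    accumulators = [0] * (NUM_DOCS + 1)  # index 0 is unused
    for processor in range(len(FIRST_DOC)):  # sequential stand-in for parallel workers
        for term in query_terms:
            if term in DICTIONARY:
                for doc_id, tf in postings_for(term, processor):
                    accumulators[doc_id] += tf
    hits = [(doc_id, score) for doc_id, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(evaluate(["cold", "pease"]))
```

Because the document ranges are disjoint, no two processors write to the same accumulator slot, which is what makes a single shared array workable in this sketch.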

Physical Document Partitioning

 Data Partitioning

  • The documents are physically partitioned into separate collections, one for each parallel processor
  • Each subcollection has its own inverted file for a fraction of the index
  • Generally, the documents are disjoint (i.e., not duplicated)

PDP: Algorithms

 Inverted File Construction

  • Each processor creates, in parallel, its own complete index corresponding to its documents
  • Optionally, a merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries

 Query Evaluation

  • The broker distributes the query to all of the parallel search processes
  • Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list
  • The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list
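A minimal sketch of the broker pattern described above, assuming three partitions of the "pease porridge" documents and a raw term-frequency score (both are illustrative choices, not the lecture's scoring model). The search processes are simulated by ordinary function calls, and `search_partition` and `broker` are hypothetical names.

```python
import heapq
from collections import Counter

# Assumption: three disjoint subcollections, one per search process.
PARTITIONS = [
    {1: "pease porridge hot", 2: "pease porridge cold"},
    {3: "pease porridge in the pot",
     4: "pease porridge hot pease porridge not cold"},
    {5: "pease porridge cold pease porridge not hot",
     6: "pease porridge hot in the pot"},
]


def build_index(docs):
    """Local inverted file: term -> {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def rank_key(hit):
    """Order hits by descending score, then ascending document id."""
    doc_id, score = hit
    return (-score, doc_id)


def search_partition(index, query_terms, k):
    """Evaluate the query locally and return an intermediate hit-list."""
    scores = Counter()
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=rank_key)[:k]


def broker(indexes, query, k):
    """Send the query to every partition, then merge the sorted hit-lists."""
    query_terms = query.split()
    hit_lists = [search_partition(ix, query_terms, k) for ix in indexes]
    return list(heapq.merge(*hit_lists, key=rank_key))[:k]


INDEXES = [build_index(docs) for docs in PARTITIONS]
print(broker(INDEXES, "hot pot", 3))
```

Raw term frequency is comparable across partitions, so the merge is straightforward here. A weighting that depends on collection statistics (for example idf) would need the optional global-statistics step mentioned above before scores from different partitions could be merged fairly.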

Example

Document  Text
1         Pease porridge hot
2         Pease porridge cold
3         Pease porridge in the pot
4         Pease porridge hot, pease porridge not cold
5         Pease porridge cold, pease porridge not hot
6         Pease porridge hot in the pot

Dictionary  Inverted Lists
cold        <2,1> <4,1> <5,1>
hot         <1,1> <4,1> <5,1> <6,1>
in          <3,1> <6,1>
not         <4,1> <5,1>
pease       <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge    <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot         <3,1> <6,1>
the         <3,1> <6,1>
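As a check on the example, the inverted file can be rebuilt from the six documents with a short sketch. The tokenizer (lowercase, letters only) is an assumption; each posting is a (document id, term frequency) pair, matching the <d,tf> notation used in the example.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_file(docs):
    """Return term -> list of (doc_id, term frequency), terms in sorted order."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(tokenize(docs[doc_id])).items():
            inverted[term].append((doc_id, tf))
    return dict(sorted(inverted.items()))


INDEX = build_inverted_file(DOCS)
for term, postings in INDEX.items():
    print(term, " ".join(f"<{d},{tf}>" for d, tf in postings))
```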

Example: Logical Doc Partitioning

[Figure: dictionary (cold, hot, in, not, pease, porridge, pot, the); the entry for term “pease” holds three processor pointers (P) into its inverted list <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>]

Example: Physical Doc Partitioning

Processor P (documents 1 and 2)
  cold      <2,1>
  hot       <1,1>
  pease     <1,1> <2,1>
  porridge  <1,1> <2,1>

Processor P (documents 3 and 4)
  cold      <4,1>
  hot       <4,1>
  in        <3,1>
  not       <4,1>
  pease     <3,1> <4,2>
  porridge  <3,1> <4,2>
  pot       <3,1>
  the       <3,1>

Processor P (documents 5 and 6)
  cold      <5,1>
  hot       <5,1> <6,1>
  in        <6,1>
  not       <5,1>
  pease     <5,2> <6,1>
  porridge  <5,2> <6,1>
  pot       <6,1>
  the       <6,1>

Example: Term Partitioning

[Figure: the dictionary terms (cold, hot, in, not, pease, porridge, pot, the) and their complete inverted lists are divided among three processors (P); each term's list is stored whole on one processor]

Server Selection Problem

 In a truly distributed system, documents may still be clustered by time or subject matter

 Then we only need to consider a few of many subcollections

  • Can’t ask for postings lists – too large!
  • But we can afford a small ranked list

 This is much more efficient than making requests of all servers

 How do we know which servers to talk to?

Server Selection Using IR

 One approach treats collections as if they were large documents

  • We need some knowledge about the DB

 To pick the collection(s) most likely to contain relevant information

  • rank using, say, the vector cosine model
  • rank by highest document freq. of query terms

 If more than one collection is searched, we have a new problem:

  • How to merge disparate results (ranked lists)
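A minimal sketch of the document-frequency ranking mentioned above, assuming the broker already holds per-collection document frequencies. The collection names and counts below are made up purely for illustration, and `rank_collections` is a hypothetical helper.

```python
# Hypothetical summary statistics: collection -> {term: document frequency}.
# The numbers are invented for illustration only.
COLLECTION_DF = {
    "computing": {"computer": 900, "network": 400},
    "news": {"computer": 120, "network": 60, "election": 300},
    "sports": {"computer": 5, "football": 800},
}


def rank_collections(df_by_collection, query_terms, top_n):
    """Rank collections by the summed document frequency of the query terms."""
    scores = {
        name: sum(df.get(term, 0) for term in query_terms)
        for name, df in df_by_collection.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_n]


# Only the top-ranked servers would then receive the query.
print(rank_collections(COLLECTION_DF, ["computer", "network"], 2))
```

A cosine-based ranking would follow the same shape, with each collection represented as one large term vector instead of a table of document frequencies.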

Result Combination Strategies

 Round robin or weighted round robin

  • Has an obvious defect

 (Normalized) scores

  • Consider a search of “computer” and two collections
    • Sports Illustrated articles
    • ACM Digital Library
  • How should scores be normalized?

 Some linear combination of scores

 Etc…
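Two of the strategies above can be sketched briefly. Round robin interleaves the lists by rank, so a weak collection contributes as many top results as a strong one. The second function uses min-max normalization with optional per-collection weights; that is one possible answer to the normalization question, chosen here as an assumption and not the lecture's prescription. All function names are illustrative.

```python
from itertools import zip_longest


def round_robin(ranked_lists):
    """Take the first hit from each list, then the second from each, and so on."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(hit for hit in tier if hit is not None)
    return merged


def min_max_normalize(hits):
    """Rescale (doc, score) pairs so the scores lie between 0 and 1."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    if high == low:
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (score - low) / (high - low)) for doc, score in hits]


def merge_normalized(ranked_lists, weights=None):
    """Normalize each list, apply an optional collection weight, sort globally."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    combined = []
    for hits, weight in zip(ranked_lists, weights):
        combined.extend((doc, score * weight) for doc, score in min_max_normalize(hits))
    return sorted(combined, key=lambda hit: (-hit[1], hit[0]))


print(round_robin([["a1", "a2", "a3"], ["b1"]]))
print(merge_normalized([[("a1", 10.0), ("a2", 0.0)], [("b1", 3.0), ("b2", 1.0)]],
                       weights=[1.0, 0.5]))
```

Min-max normalization makes each collection's best hit look equally good before weighting, so the weights carry all the information about which collection to trust.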

Collection Weighting

 Jamie Callan suggested re-weighting documents using collection-specific weights:

 Callan has since studied collection fusion using TREC datasets

  • His framework was to split data by source/month
  • ~ 400 sub collections for TREC

Map/Reduce

 CACM: Disk is the new RAM

  • April 2008 article: ‘Solving Rubik’s Cube’
  • Bandwidth of disks in a cluster is high

 Cloud Computing is here to stay

 Tamer Elsayed (UMCP)

  • All pairs document similarity
    • ACL 2008, “Pairwise document similarity with MapReduce”
  • 19 node Hadoop cluster

Tamer Elsayed (UMCP): Pairwise Document Similarity in Large Collections with MapReduce

MapReduce Framework

[Figure: MapReduce data flow from inputs to outputs. map: (k1, v1) → [k2, v2]; values grouped by key: (k2, [v2]); reduce: (k2, [v2]) → [(k3, v3)]]
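The framework's data flow can be imitated in a single process. This is only a sketch of the programming model, assuming an in-memory grouping step in place of the distributed shuffle; it does not use Hadoop, and `map_reduce`, `word_count_mapper`, and `word_count_reducer` are hypothetical names.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run mapper on each (k1, v1), group by k2, then run reducer on (k2, [v2])."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    outputs = []
    for k2 in sorted(groups):  # sorted only to make the output deterministic
        outputs.extend(reducer(k2, groups[k2]))
    return outputs


def word_count_mapper(doc_id, text):
    """Emit (word, 1) for every word occurrence in the document."""
    for word in text.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the partial counts for one word."""
    yield word, sum(counts)


DOCUMENTS = [("A", "Clinton Obama Clinton"), ("B", "Clinton Cheney")]
print(map_reduce(DOCUMENTS, word_count_mapper, word_count_reducer))
```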

Indexing (3-doc toy collection)

Documents:
  “Clinton Obama Clinton”
  “Clinton Cheney”
  “Clinton Barack Obama”

Indexing produces postings with per-document term frequencies:
  Clinton  2, 1, 1
  Barack   1
  Cheney   1
  Obama    1, 1

Pairwise Similarity

[Figure: pairwise similarity computed from the postings of the toy index (Clinton: 2, 1, 1; Barack: 1; Cheney: 1; Obama: 1, 1)]
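A single-process sketch of the idea behind index-based pairwise similarity: every pair of documents that share a term contributes the product of their term frequencies, and the contributions are summed per pair. The document labels A, B, C and the unnormalized term-frequency weighting are assumptions made for illustration; the cited work may weight terms differently.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy collection from the indexing slide; the labels A, B, C are assumed.
DOCS = {
    "A": "Clinton Obama Clinton",
    "B": "Clinton Cheney",
    "C": "Clinton Barack Obama",
}


def build_postings(docs):
    """term -> list of (doc_id, term frequency), in doc_id order."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id].split()).items():
            postings[term].append((doc_id, tf))
    return postings


def pairwise_similarity(postings):
    """Sum tf products over every pair of documents sharing a term."""
    similarity = defaultdict(int)
    for plist in postings.values():
        # "Map" step: each postings list emits one partial score per doc pair.
        for (doc1, tf1), (doc2, tf2) in combinations(plist, 2):
            # "Reduce" step: partial scores for the same pair are summed.
            similarity[(doc1, doc2)] += tf1 * tf2
    return dict(similarity)


print(pairwise_similarity(build_postings(DOCS)))
```

Pairs of documents with no shared term never appear in any postings list together, so they are never scored, which is the main saving over comparing all pairs directly.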

Meta Search on the Web

 Evidence (rankings) from multiple search engines can be combined

 Relies on redundancy

  • How should results be combined?

 Examples

  • Dogpile
  • MonsterCrawler
  • info.com

Multimedia Retrieval

 Image

  • Large storage requirements
  • Methods are based on content
    • Features: color histograms, intensity values, gradients, edges
  • On Web, anchor text / captions can be used

 Video

  • Takes massive amount of space to store
  • With shot boundary detection, reverts to image retrieval

 Speech

  • Acoustic data much larger than text
  • Conversion to text shrinks storage and simplifies search; creates word errors

 OCR’d Text

  • Coping with Misspellings

Retrieval of Speech

 Speech sources

  • Teleconferences and meetings
  • Video with audio (e.g., television)
  • Radio
  • Medical & Legal transcription

 Automated recognition is possible in real time

  • Speaker dependent
  • Small vocabulary
  • Speaker independent and large vocabulary

 Classic example

Other examples

 I scream (ice cream)

 Grade A (gray day)

 sixty sick sheep (sixty six sheep)

 The boys are hoarse (the boy’s a horse)

 The stuffy nose (the stuff he knows)

 Rap city (rhapsody)

 a girl with colitis goes by (a girl with kaleidoscope eyes)

 Gladly the cross-eyed bear. (Gladly the cross I'd bear.)

TREC Initiatives

 Created ‘large’ collections

  • TRECs 7-
  • Voice of America, ABC News, CNN
  • 20 to 100 hours

 Format

  • Speech signals with transcripts available
  • Queries were text

 ‘High accuracy’ Broadcast News

  • Mistakes in recognition hurt IR performance
  • Many out-of-vocabulary words

A Solved Problem?

 Speech retrieval was studied in the mid/late 90s at TREC – then dropped 11 years ago

  • “progress has occurred so quickly, that one might conclude that SDR is a solved problem” - Garofolo et al. (2000), ‘The TREC Spoken Document Retrieval Track: A Success Story’

A Typical IR/Speech System

[Figure: two processing paths.
  Indexing Human Speech (processing per audio sequence): Signal Processing & Feature Extraction, Acoustic Model, Language Model, Hypothesis Selection, Index Creation; intermediate representations: phones, word lattice, sentences.
  Text-based Search of Speech Archive (processing per query): User Interaction, Generate Terms from User Query, Identify Candidate Documents, Rank By Similarity Measure.
  Example query: “Who is the president of Indonesia?” Example transcript: “... Megawati Sukarnoputri, former President of the Republic of Indonesia, ...”]

Spoken Document Representation

 ASR system lexicons are limited (10-100k words)

  • Unable to generate out-of-vocabulary (OOV) words
  • IR search for an OOV is doomed to failure

 Subword representations have been proposed for this scenario

  • phone – the smallest unit of spoken language
  • About 45 phones are present in English
  • Phone 3-grams suggested for broadcast news
  • K. Ng (2000), ‘Subword-based Approaches for Spoken Document Retrieval’, PhD Thesis, MIT

 Combine Speech-to-Text (STT) & phonetics

  • Provides a way to solve the OOV problem
  • Requires extra work
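A minimal sketch of phone 3-gram indexing terms, assuming the recognizer outputs a sequence of phone symbols. The ARPAbet-style symbols below are illustrative, and the overlap count is a simple stand-in for a real retrieval score; neither is taken from the cited thesis.

```python
from collections import Counter


def phone_ngrams(phones, n=3):
    """Overlapping phone n-grams used as indexing terms in place of words."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def ngram_overlap(query_phones, document_phones, n=3):
    """Count the phone n-grams shared by a query and a document."""
    query_grams = Counter(phone_ngrams(query_phones, n))
    document_grams = Counter(phone_ngrams(document_phones, n))
    return sum((query_grams & document_grams).values())


# Illustrative phone strings for "speech" and "the speech" (assumed symbols).
QUERY = ["S", "P", "IY", "CH"]
DOCUMENT = ["DH", "AH", "S", "P", "IY", "CH"]

print(phone_ngrams(QUERY))
print(ngram_overlap(QUERY, DOCUMENT))
```

Because matching happens on subword units, a query word missing from the recognizer's lexicon can still match the phone sequence the recognizer produced for it.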