Distributed Info Retrieval: Algorithms for Partitioning, Files, and Queries, Slides of Artificial Intelligence

Johns Hopkins University (JHU), Artificial Intelligence

Various algorithms for distributed retrieval, including logical and physical document partitioning, inverted file construction, and query evaluation. It covers methods for splitting work among processors, partitioning data, and merging results. The document also includes examples and comparisons between different approaches.

Typology: Slides

2010/2011

Uploaded on 11/09/2011

stagist 🇺🇸

Lecture 9

 Distributed Retrieval

– Algorithms for distributing processing

– Source selection

– Collection fusion

 Retrieval of Speech

Distributed Systems Review

 Splitting work among N processors might decrease response time by as much as a factor of N

 Actual speedup is less than N

– Inherently sequential operations

– Rate limiting steps

– Costs of communication and fusion

Performance Measures

 Speedup = (Time of best sequential algorithm) / (Time of parallel algorithm)

 Amdahl’s Law

 Efficiency
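The three measures named above can be made concrete with a short sketch. It assumes the standard textbook definitions: speedup as the ratio of sequential to parallel time (the ratio given on the slide), Amdahl's Law as the bound on speedup when a fraction of the work is inherently sequential, and efficiency as speedup per processor. The function names and the 10% serial fraction are illustrative assumptions, not taken from the lecture.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup: time of the best sequential algorithm over time of the parallel one."""
    return t_sequential / t_parallel


def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Amdahl's Law: best possible speedup when `serial_fraction` of the
    work cannot be parallelized (standard formulation, assumed here)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


def efficiency(speedup_value: float, n_processors: int) -> float:
    """Efficiency: speedup divided by the number of processors used."""
    return speedup_value / n_processors


# Hypothetical workload: 10% inherently sequential, 8 processors.
bound = amdahl_speedup(serial_fraction=0.1, n_processors=8)
print(f"speedup bound: {bound:.2f}, efficiency: {efficiency(bound, 8):.2f}")
```

Even a small sequential fraction keeps the speedup well below N, which is the point of the "Actual speedup is less than N" bullet earlier in the lecture.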

 Elements processed by a search algorithm

 Can be split by terms, or by documents

Which is better?

Rows are documents, columns are indexing items:

        k1      k2      ...   ki      ...   kt
d1      w1,1    w2,1    ...   wi,1    ...   wt,1
d2      w1,2    w2,2    ...   wi,2    ...   wt,2
...     ...     ...     ...   ...     ...   ...
dj      w1,j    w2,j    ...   wi,j    ...   wt,j
...     ...     ...     ...   ...     ...   ...
dN      w1,N    w2,N    ...   wi,N    ...   wt,N

Architecture

Logical Document Partitioning

 Dictionary representation is extended

[Figure: Dictionary entry for Term i, extended with four P entries pointing into the Inverted List for item i]

LDP: Algorithms

 Inverted File Construction

The indexer partitions the documents among the processors
Each indexing process generates a batch of inverted lists, sorted by indexing item
A merge step creates the final inverted file

 Query Evaluation

Each process executes the same document scoring algorithm on its document subcollection
The search processes record document scores in a single shared array of document score accumulators
A broker produces the final ranked list of documents
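The two LDP algorithms above can be sketched as follows. This is a minimal single-process simulation under several assumptions: the loop over partitions stands in for parallel processes, raw term frequency is used as the document score, and every name (`index_batch`, `merge_batches`, `evaluate`) is hypothetical. A real system would follow dictionary pointers into each processor's slice of the inverted list rather than filter postings by document id.

```python
from collections import Counter, defaultdict

DOCS = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "pease porridge in the pot",
}


def index_batch(doc_ids, docs):
    """One indexing process: a batch of inverted lists, sorted by indexing item."""
    batch = defaultdict(list)
    for doc_id in sorted(doc_ids):
        for term, tf in Counter(docs[doc_id].split()).items():
            batch[term].append((doc_id, tf))
    return dict(sorted(batch.items()))


def merge_batches(batches):
    """Merge step: combine the per-process batches into the final inverted file."""
    merged = defaultdict(list)
    for batch in batches:
        for term, postings in batch.items():
            merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in sorted(merged.items())}


def evaluate(query, inverted_file, partitions, n_docs):
    """Each (simulated) search process scores only its own documents, writing
    into one shared accumulator array; the broker then ranks the scores."""
    accumulators = [0] * (n_docs + 1)  # index = document id, slot 0 unused
    for doc_ids in partitions:  # one iteration per search process
        for term in query.split():
            for doc_id, tf in inverted_file.get(term, []):
                if doc_id in doc_ids:
                    accumulators[doc_id] += tf
    hits = [(d, s) for d, s in enumerate(accumulators) if s > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))  # broker step


PARTITIONS = [{1, 2}, {3}]
INVERTED_FILE = merge_batches(index_batch(p, DOCS) for p in PARTITIONS)
RANKED = evaluate("porridge hot", INVERTED_FILE, PARTITIONS, n_docs=3)
print(RANKED)
```

Because the partitions are disjoint, no two processes write to the same accumulator slot, which is what makes a single shared array workable in this sketch.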

Physical Document Partitioning

 Data Partitioning

The documents are physically partitioned into separate collections, one for each parallel processor

Each subcollection has its own inverted file for a fraction of the index

Generally, the documents are disjoint (i.e., not duplicated)

PDP: Algorithms

 Inverted File Construction

Each processor creates, in parallel, its own complete index corresponding to its documents
Optionally, a merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries

 Query Evaluation

The broker distributes the query to all of the parallel search processes
Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list
The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list
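The PDP query path above can be sketched in a few lines. This is an illustrative simulation, not the lecture's implementation: the partitions are searched in a plain loop rather than in parallel, scores are raw term-frequency sums (so no global statistics are needed), and the names `build_index`, `search_partition`, and `broker` are hypothetical. The merge uses the standard library's `heapq.merge`, which expects each input list to be sorted already.

```python
import heapq
from collections import Counter


def build_index(docs):
    """Each processor builds a complete index over its own documents only."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index.setdefault(term, []).append((doc_id, tf))
    return index


def rank_key(hit):
    """Order hits by descending score, breaking ties by document id."""
    doc_id, score = hit
    return (-score, doc_id)


def search_partition(index, query, k):
    """One search process: evaluate the query locally, return an intermediate hit-list."""
    scores = Counter()
    for term in query.split():
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf
    return sorted(scores.items(), key=rank_key)[:k]


def broker(partition_indexes, query, k):
    """Send the query to every partition, then merge the sorted hit-lists."""
    hit_lists = [search_partition(ix, query, k) for ix in partition_indexes]
    return list(heapq.merge(*hit_lists, key=rank_key))[:k]


PARTITION_DOCS = [
    {1: "pease porridge hot", 2: "pease porridge cold"},
    {3: "pease porridge in the pot", 4: "pease porridge hot pease porridge not cold"},
    {5: "pease porridge cold pease porridge not hot", 6: "pease porridge hot in the pot"},
]
INDEXES = [build_index(docs) for docs in PARTITION_DOCS]
TOP = broker(INDEXES, "hot pot", k=3)
print(TOP)
```

Each partition only needs to return its top k hits, since no document outside a partition's local top k can appear in the global top k when scores are comparable across partitions.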

Example

Document   Text
1          Pease porridge hot
2          Pease porridge cold
3          Pease porridge in the pot
4          Pease porridge hot, pease porridge not cold
5          Pease porridge cold, pease porridge not hot
6          Pease porridge hot in the pot

Dictionary   Inverted Lists
cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>
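The inverted lists in this example can be reproduced with a short script. It is a sketch that assumes lowercasing and letters-only tokenization, which the slide does not state; each posting is a `(document id, term frequency)` pair, matching the `<d,f>` notation. The name `build_inverted_file` is hypothetical.

```python
import re
from collections import Counter

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (an assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_file(docs):
    """Map each term to its postings list of (doc_id, term_frequency) pairs."""
    inverted = {}
    for doc_id in sorted(docs):
        for term, tf in Counter(tokenize(docs[doc_id])).items():
            inverted.setdefault(term, []).append((doc_id, tf))
    return dict(sorted(inverted.items()))  # dictionary in alphabetical order


INVERTED_FILE = build_inverted_file(DOCS)
for term, postings in INVERTED_FILE.items():
    print(f"{term:10s}", " ".join(f"<{d},{tf}>" for d, tf in postings))
```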

Example: Logical Doc Partitioning

[Figure: the dictionary (cold, hot, in, not, pease, porridge, pot, the); the entry for term “pease” has three P entries pointing into its inverted list <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>]

Example: Physical Doc Partitioning

Each of the three processors (P) holds its own dictionary and inverted lists:

P (documents 1-2)
cold         <2,1>
hot          <1,1>
pease        <1,1> <2,1>
porridge     <1,1> <2,1>

P (documents 3-4)
cold         <4,1>
hot          <4,1>
in           <3,1>
not          <4,1>
pease        <3,1> <4,2>
porridge     <3,1> <4,2>
pot          <3,1>
the          <3,1>

P (documents 5-6)
cold         <5,1>
hot          <5,1> <6,1>
in           <6,1>
not          <5,1>
pease        <5,2> <6,1>
porridge     <5,2> <6,1>
pot          <6,1>
the          <6,1>

Example: Term Partitioning

The dictionary and inverted lists are divided by term among three processors (P):

cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>
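To compare the two ways of splitting the same index, the sketch below partitions the pease porridge inverted file once by term and once by document. The contiguous alphabetical split of terms is an assumption, since the slide does not show where the term boundaries fall; the document groups (1-2, 3-4, 5-6) follow the physical partitioning example. Function names are hypothetical.

```python
INVERTED_FILE = {
    "cold": [(2, 1), (4, 1), (5, 1)],
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "in": [(3, 1), (6, 1)],
    "not": [(4, 1), (5, 1)],
    "pease": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
    "porridge": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
    "pot": [(3, 1), (6, 1)],
    "the": [(3, 1), (6, 1)],
}


def partition_by_term(inverted_file, n_processors):
    """Give each processor whole inverted lists for a contiguous range of terms."""
    terms = sorted(inverted_file)
    size = -(-len(terms) // n_processors)  # ceiling division
    return [
        {term: inverted_file[term] for term in terms[start:start + size]}
        for start in range(0, len(terms), size)
    ]


def partition_by_document(inverted_file, doc_groups):
    """Give each processor every term, but only postings for its own documents."""
    parts = []
    for group in doc_groups:
        part = {}
        for term, postings in inverted_file.items():
            kept = [(doc_id, tf) for doc_id, tf in postings if doc_id in group]
            if kept:
                part[term] = kept
        parts.append(part)
    return parts


BY_TERM = partition_by_term(INVERTED_FILE, 3)
BY_DOC = partition_by_document(INVERTED_FILE, [{1, 2}, {3, 4}, {5, 6}])
print([sorted(part) for part in BY_TERM])
print(BY_DOC[0])
```

With term partitioning a query touches only the processors that own its terms, while with document partitioning every processor does a share of the work for every query.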

Server Selection Problem

 In a truly distributed system, documents may still be clustered by time or subject matter

 Then we only need to consider a few of many subcollections

Can’t ask for postings lists – too large!
But we can afford a small ranked list

 This is much more efficient than making requests of all servers

 How do we know which servers to talk to?

Server Selection Using IR

 One approach treats collections as if they were large documents

We need some knowledge about the DB

 To pick which collection(s) is most likely to contain relevant information

rank using, say, the vector cosine model
rank by highest document freq. of query terms

 If more than one collection is searched, we have a new problem:

How to merge disparate results (ranked lists)
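Both ranking options for choosing a collection can be sketched on toy data. The two collections, their contents, and the function names below are invented for illustration; the cosine variant uses raw term counts over the concatenated collection text, which is only one of many possible weightings.

```python
import math
from collections import Counter

# Hypothetical collections, each a list of short documents.
COLLECTIONS = {
    "sports": ["the team won the game", "a computer ranked the players"],
    "computing": ["computer architecture", "computer networks and systems", "operating systems"],
}


def document_frequencies(docs):
    """Number of documents in the collection that contain each term."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df


def rank_by_document_frequency(collections, query):
    """Rank collections by the summed document frequency of the query terms."""
    terms = query.lower().split()
    scores = {}
    for name, docs in collections.items():
        df = document_frequencies(docs)
        scores[name] = sum(df[t] for t in terms)
    return sorted(scores, key=lambda name: (-scores[name], name))


def rank_by_cosine(collections, query):
    """Treat each collection as one large document and rank by cosine similarity."""
    q = Counter(query.lower().split())
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = {}
    for name, docs in collections.items():
        big_doc = Counter(" ".join(docs).lower().split())
        dot = sum(q[t] * big_doc[t] for t in q)
        norm = q_norm * math.sqrt(sum(v * v for v in big_doc.values()))
        scores[name] = dot / norm if norm else 0.0
    return sorted(scores, key=lambda name: (-scores[name], name))


print(rank_by_document_frequency(COLLECTIONS, "computer systems"))
print(rank_by_cosine(COLLECTIONS, "computer systems"))
```

Either way, the broker only needs per-collection term statistics, not the postings lists themselves, which is the kind of "knowledge about the DB" the slide refers to.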

Result Combination Strategies

 Round robin or weighted round robin

Has an obvious defect

 (Normalized) scores

Consider a search of “computer” and two collections

Sports Illustrated articles
ACM Digital Library
How should scores be normalized?

 Some linear combination of scores

 Etc…
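Two of the strategies above, round robin and normalized scores, can be sketched on made-up hit-lists. The document ids and scores are invented, min-max scaling is just one possible answer to "How should scores be normalized?", and the function names are hypothetical.

```python
from itertools import zip_longest

# Hypothetical ranked lists of (document id, raw score) for the query "computer".
SPORTS_HITS = [("si1", 0.2), ("si2", 0.1)]
ACM_HITS = [("acm1", 9.0), ("acm2", 6.0), ("acm3", 3.0)]


def round_robin(*ranked_lists):
    """Interleave the lists rank by rank, ignoring the scores entirely."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(hit for hit in tier if hit is not None)
    return merged


def min_max_normalize(hits):
    """Rescale one collection's scores to the range [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, span = min(scores), max(scores) - min(scores)
    return [(doc, (score - low) / span if span else 1.0) for doc, score in hits]


def merge_by_normalized_score(*ranked_lists):
    """Pool the normalized hits and sort by score (stable for ties)."""
    pooled = [hit for hits in ranked_lists for hit in min_max_normalize(hits)]
    return sorted(pooled, key=lambda hit: -hit[1])


print([doc for doc, _ in round_robin(SPORTS_HITS, ACM_HITS)])
print([doc for doc, _ in merge_by_normalized_score(SPORTS_HITS, ACM_HITS)])
```

In this toy case both strategies place the top Sports Illustrated hit alongside the top ACM hit, even though the raw scores suggest it is a much weaker match; a collection-specific weight is one way to counter that.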

Collection Weighting

 Jamie Callan suggested re-weighting documents using collection-specific weights:

 Callan has since studied collection fusion using TREC datasets

His framework was to split data by source/month
~ 400 sub collections for TREC

Map/Reduce

 CACM: Disk is the new RAM

April 2008 article: ‘Solving Rubik’s Cube’
Bandwidth of disks in a cluster is high

 Cloud Computing is here to stay

 Tamer Elsayed (UMCP)

All pairs document similarity
- ACL 2008, “Pairwise document similarity with MapReduce”
19 node Hadoop cluster

Tamer Elsayed (UMCP): Pairwise Document Similarity in Large Collections with MapReduce

MapReduce Framework

[Figure: several inputs flow through map and reduce stages to several outputs. map: (k1, v1) -> [(k2, v2)]; reduce: (k2, [v2]) -> [(k3, v3)]]

Indexing (3-doc toy collection)

[Figure: three toy documents (“Clinton Obama Clinton”, “Clinton Cheney”, “Clinton Barack Obama”) are indexed into postings lists for the terms Clinton, Barack, Cheney, and Obama; Clinton occurs twice in the first document and every other term frequency is 1]
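Indexing the toy collection fits the map/reduce pattern directly. The sketch below simulates the framework in plain Python rather than calling Hadoop; `run_mapreduce`, `map_index`, and `reduce_index` are hypothetical names, and the document numbering 1 to 3 is an assumption.

```python
from collections import Counter, defaultdict

DOCS = [
    (1, "Clinton Obama Clinton"),
    (2, "Clinton Cheney"),
    (3, "Clinton Barack Obama"),
]


def map_index(doc_id, text):
    """map: (doc_id, text) -> [(term, (doc_id, term_frequency))]"""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]


def reduce_index(term, postings):
    """reduce: (term, [postings]) -> [(term, sorted postings list)]"""
    return [(term, sorted(postings))]


def run_mapreduce(inputs, mapper, reducer):
    """Tiny in-memory stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


INDEX = dict(run_mapreduce(DOCS, map_index, reduce_index))
for term, postings in INDEX.items():
    print(term, postings)
```

The grouping step in the middle is what the framework provides for free; the programmer supplies only the two small functions.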

Pairwise Similarity

[Figure: pairwise document similarities computed from the postings lists of Clinton, Barack, Cheney, and Obama in the toy collection]
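One way to compute all-pairs similarity from an inverted index is to let each term contribute a partial product for every pair of documents in its postings list, then sum the contributions per pair. The sketch below does this in plain Python for the toy collection. It assumes raw term frequencies as weights and an inner-product similarity, with documents numbered 1 to 3; the slide does not confirm these details, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Postings lists for the toy collection: term -> [(doc_id, term_frequency)].
INDEX = {
    "Barack": [(3, 1)],
    "Cheney": [(2, 1)],
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Obama": [(1, 1), (3, 1)],
}


def map_pairs(term, postings):
    """map: one postings list -> a partial product for each document pair."""
    return [
        ((d1, d2), w1 * w2)
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2)
    ]


def reduce_pairs(pair, contributions):
    """reduce: sum the per-term contributions for one document pair."""
    return (pair, sum(contributions))


def pairwise_similarity(index):
    """Group the mapped contributions by pair, then reduce each group."""
    groups = defaultdict(list)
    for term, postings in index.items():
        for pair, contribution in map_pairs(term, postings):
            groups[pair].append(contribution)
    return dict(reduce_pairs(pair, groups[pair]) for pair in sorted(groups))


SIMILARITY = pairwise_similarity(INDEX)
print(SIMILARITY)
```

Terms that occur in a single document, such as Barack and Cheney here, generate no pairs at all, so the work is driven by the length of the longer postings lists.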

Meta Search on the Web

 Evidence (rankings) from multiple search engines can be combined

 Relies on redundancy

How should results be combined?

 Examples

Dogpile
MonsterCrawler
info.com

Multimedia Retrieval

 Image

Large storage requirements
Methods are based on content
- Features: color histograms, intensity values, gradients, edges
On Web, anchor text / captions can be used

 Video

Takes massive amount of space to store
With shot boundary detection, reverts to image retrieval

 Speech

Acoustic data much larger than text
Conversion to text shrinks storage and simplifies search; creates word errors

 OCR’d Text

Coping with Misspellings

Retrieval of Speech

 Speech sources

Teleconferences and meetings
Video with audio (e.g., television)
Radio
Medical & Legal transcription

 Automated recognition is possible in real time

Speaker dependent
Small vocabulary
Speaker independent and large vocabulary

 Classic example

Other examples

 I scream (ice cream)

 Grade A (gray day)

 sixty sick sheep (sixty six sheep)

 The boys are hoarse (the boy’s a horse)

 The stuffy nose (the stuff he knows)

 Rap city (rhapsody)

 a girl with colitis goes by (a girl with kaleidoscope eyes)

 Gladly the cross-eyed bear. (Gladly the cross I'd bear.)

TREC Initiatives

 Created ‘large’ collections

TRECs 7-
Voice of America, ABC News, CNN
20 to 100 hours

 Format

Speech signals with transcripts available
Queries were text

 ‘High accuracy’ Broadcast News

Mistakes in recognition hurt IR performance
Many out-of-vocabulary words

A Solved Problem?

 Speech retrieval was studied in the mid/late 90s at TREC – then dropped 11 years ago

“progress has occurred so quickly, that one might conclude that SDR is a solved problem” - Garofolo et al. (2000), ‘The TREC Spoken Document Retrieval Track: A Success Story’

A Typical IR/Speech System

[Figure: two pipelines. Indexing Human Speech (processing per audio sequence): Signal Processing & Feature Extraction, Acoustic Model, Language Model, Hypothesis Selection, Index Creation, with phones, word lattice, and sentences as intermediate outputs. Text-based Search of Speech Archive (processing per query): User Interaction, Generate Terms from User Query, Identify Candidate Documents, Rank By Similarity Measure. Example query: “Who is the president of Indonesia?”; example transcript: “... Megawati Sukarnoputri, former President of the Republic of Indonesia, ...”]

Spoken Document Representation

 ASR system lexicons are limited (10-100k words)

Unable to generate out-of-vocabulary (OOV) words
IR search for an OOV is doomed to failure

 Subword representations have been proposed for this scenario

phone – the smallest unit of spoken language
About 45 phones are present in English
Phone 3-grams suggested for broadcast news
K. Ng (2000), ‘Subword-based Approaches for Spoken Document Retrieval’, PhD Thesis, MIT

 Combine Speech-to-Text (STT) & phonetics
Provides a way to solve the OOV problem
Requires extra work
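The phone 3-gram idea can be illustrated with a small sketch: both the recognized audio and the query are turned into overlapping phone trigrams, so a query word missing from the ASR lexicon can still match on its sounds. The phone symbols and sequences below are invented placeholders rather than output from a real recognizer or pronunciation dictionary, and the overlap score is a deliberately simple stand-in for a real ranking function.

```python
def phone_ngrams(phones, n=3):
    """Overlapping phone n-grams, each joined into a single index term."""
    return ["_".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def overlap_score(query_phones, doc_phones, n=3):
    """Fraction of the query's distinct phone n-grams found in the document."""
    query_terms = set(phone_ngrams(query_phones, n))
    doc_terms = set(phone_ngrams(doc_phones, n))
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


# Hypothetical phone sequences (placeholder symbols, not a real transcription).
DOC_PHONES = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]
QUERY_PHONES = ["k", "ae", "t", "s"]

print(phone_ngrams(QUERY_PHONES))
print(overlap_score(QUERY_PHONES, DOC_PHONES))
```

Because matching happens on subword units, the index is no longer limited to the recognizer's vocabulary, at the cost of a larger index and noisier matches.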
