Need for Speed, Parallelism Methodologies - Data Warehousing - Lecture Slides, Slides of Data Warehousing

National Institute of Industrial Engineering Data Warehousing

Need for Speed Parallelism Methodologies, Motivation, Data Parallelism Concept, Ensuring Speed UP, Temporal Parallelism, Pipelining Time Chart, Speed Up Calculation. Some Other terms are also described in these data warehousing lecture slides.

Typology: Slides

2011/2012

Uploaded on 11/03/2012

padmal 🇮🇳


Data Warehousing

Lecture-25

Need for Speed: Parallelism Methodologies


Motivation

 There would be no need for parallelism if we had a perfect computer

   with a single, infinitely fast processor,

   with infinite memory and infinite bandwidth,

   that is also infinitely cheap (free!).

 Technology is not delivering (going to Moon analogy)

 The challenge is to build

   an infinitely fast processor out of infinitely many processors of finite speed

   an infinitely large memory with infinite memory bandwidth from infinitely many finite storage units of finite speed


Data Parallelism: Example

[Figure: the Emp table is split into Partition-1 ... Partition-k. Query Server-1 ... Query Server-k each scan one partition and return a partial count (for example, 440) to the Query Coordinator.]

Select count(*) from Emp where age > 50 AND sal > 10,000;

Ans = 62 + 440 + ... + 1,123 = 99,
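The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the three partitions, and the names `count_matching` and `parallel_count` are hypothetical, and threads stand in for the query servers.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matching(partition):
    """Query server: count the rows of one partition that pass the filter."""
    return sum(1 for row in partition if row["age"] > 50 and row["sal"] > 10_000)


def parallel_count(partitions):
    """Query coordinator: run one scan per partition, then add the partial counts."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(count_matching, partitions))
    return partial_counts, sum(partial_counts)


# Hypothetical Emp rows, already split into three partitions.
emp_partitions = [
    [{"age": 55, "sal": 12_000}, {"age": 30, "sal": 20_000}],
    [{"age": 61, "sal": 15_000}, {"age": 52, "sal": 11_000}],
    [{"age": 58, "sal": 9_000}],
]

partial_counts, total = parallel_count(emp_partitions)
print(partial_counts, total)
```

The coordinator only adds k small numbers, which is why it can stay cheap relative to the scans done by the query servers.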

Data Parallelism: Ensuring Speed-Up

To get a speed-up of N with N partitions, it must be ensured that:

 There are enough computing resources.

 The query coordinator is very fast compared to the query servers.

 The work done in each partition is almost the same, to avoid performance bottlenecks.

 The same number of records in each partition would not suffice.

 Records must be uniformly distributed across partitions w.r.t. the filter criterion.
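A rough way to see why balanced work matters: if every partition has its own query server, the query finishes only when the slowest partition finishes. The sketch below assumes a coordinator of negligible cost and uses made-up work units; `data_parallel_speedup` is a hypothetical helper, not part of the lecture.

```python
def data_parallel_speedup(work_per_partition):
    """Speed-up = total sequential work / work of the slowest partition.

    Assumes one query server per partition and a coordinator whose
    cost is negligible.
    """
    return sum(work_per_partition) / max(work_per_partition)


# Same total work (1000 units) on four partitions, spread two ways.
balanced = data_parallel_speedup([250, 250, 250, 250])
skewed = data_parallel_speedup([100, 100, 100, 700])
print(balanced, round(skewed, 2))
```

With the skewed split, three servers sit idle while the fourth finishes, so the speed-up falls well short of 4 even though the total work is unchanged.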


Pipelining: Time Chart

[Figure: time chart of a pipeline whose stages each take Time = T/3, drawn at time steps T = 0, T = 1, T = 2 and T = 3.]

Pipelining: Speed-Up Calculation

Time for sequential execution of 1 task = T

Time for sequential execution of N tasks = N * T

(Ideal) time for pipelined execution of one task using an M stage pipeline = T

(Ideal) time for pipelined execution of N tasks using an M stage pipeline = T + ((N-1) × (T/M))

Speed-up (S) = (N × T) / (T + ((N-1) × (T/M))) = (N × M) / (M + N - 1)

Pipeline parallelism focuses on increasing throughput of task execution, NOT on decreasing sub-task execution time.
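The timings above can be turned into a small calculator. Dividing the sequential time N × T by the ideal pipelined time T + (N-1) × (T/M) gives the speed-up; the function names and the choice of M = 3 below are illustrative assumptions.

```python
def pipelined_time(n_tasks, m_stages, t=1.0):
    """Ideal time for n_tasks on an m_stages pipeline: T + (N-1) * (T/M)."""
    return t + (n_tasks - 1) * (t / m_stages)


def pipeline_speedup(n_tasks, m_stages, t=1.0):
    """Sequential time N * T divided by the ideal pipelined time."""
    return (n_tasks * t) / pipelined_time(n_tasks, m_stages, t)


# The speed-up grows with N but stays below the number of stages M.
for n in (1, 3, 10, 100, 10_000):
    print(n, round(pipeline_speedup(n, m_stages=3), 3))
```

For a single task there is no gain at all, which matches the point that pipelining improves throughput rather than the latency of one task.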

Pipelining: Input vs Speed-Up

[Figure: plot of speed-up (S) against input (N), for N = 1 to 19.]

Asymptotic limit on speed-up for M stage pipeline is M.

The speed-up will NEVER be M, as initially filling the pipeline took T time units.

Pipelining: Limitations

 Relational pipelines are rarely very long.

   Even a chain of length ten is unusual.

 Some relational operators do not produce their first output until they have consumed all their inputs.

   Aggregate and sort operators have this property; one cannot pipeline these operators.

 Often the execution cost of one operator is much greater than that of the others, hence skew.

   e.g. Sum() or Count() vs. Group-by() or Join.


Round Robin

 Advantages

   Best suited for a sequential scan of the entire relation on each query.

   All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.

 Range queries are difficult to process

   No clustering: tuples are scattered across all disks.

Partitioning & Queries


Hash Partitioning

 Good for sequential access

   With uniform hashing and the partitioning attributes used as a key, tuples will be equally distributed between disks.

 Good for point queries on the partitioning attribute

   Can look up a single disk, leaving the others available for answering other queries.

   An index on the partitioning attribute can be local to the disk, making lookups, updates, and even joins very efficient.

 Range queries are difficult to process

   No clustering: tuples are scattered across all disks.
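The contrast between point queries and range queries under hash partitioning can be sketched as follows. Everything here is an assumption made for illustration: four disks, CRC32 as a stand-in hash function, and a hypothetical `emp_id` partitioning attribute.

```python
import zlib

N_DISKS = 4


def disk_for(key):
    """Map a partitioning-attribute value to a disk (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode()) % N_DISKS


def load(rows, key_attr):
    """Distribute rows over the disks by hashing the partitioning attribute."""
    disks = [[] for _ in range(N_DISKS)]
    for row in rows:
        disks[disk_for(row[key_attr])].append(row)
    return disks


def point_query(disks, key_attr, value):
    """Only the one disk selected by the hash is read."""
    target = disk_for(value)
    return target, [r for r in disks[target] if r[key_attr] == value]


def range_query(disks, key_attr, low, high):
    """Hashing destroys key order, so every disk must be scanned."""
    return [r for disk in disks for r in disk if low <= r[key_attr] <= high]


rows = [{"emp_id": i, "age": 20 + i % 40} for i in range(1000)]
disks = load(rows, "emp_id")
disk_no, hits = point_query(disks, "emp_id", 42)
in_range = range_query(disks, "emp_id", 100, 109)
print(disk_no, hits, len(in_range))
```

The point query touches one disk and leaves the rest free for other work; the range query has no way to skip a disk, because neighbouring key values hash to unrelated places.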

Partitioning & Queries


Parallel Sorting

 Scan in parallel, and range partition on the go.

 As partitioned data becomes available, perform “local” sorting.

 The resulting data is sorted and again range partitioned.

 Problem: skew or “hot spot”.

 Solution: sample the data at the start to determine partition points.

[Figure: data range-partitioned across five processors, with a hot spot on one of them.]
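The parallel sorting steps above, including sampling to pick partition points, can be sketched as below. This is a minimal illustration rather than a tuned implementation: threads stand in for processors, the sample size and the skewed test data are arbitrary, and all function names are hypothetical.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def choose_split_points(data, n_parts, sample_size=200, seed=0):
    """Sample the input and pick n_parts - 1 split points at even quantiles."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    return [sample[i * len(sample) // n_parts] for i in range(1, n_parts)]


def range_partition(data, split_points):
    """Send each value to the partition whose key range contains it."""
    parts = [[] for _ in range(len(split_points) + 1)]
    for value in data:
        parts[bisect.bisect_right(split_points, value)].append(value)
    return parts


def parallel_sort(data, n_parts=4):
    split_points = choose_split_points(data, n_parts)
    parts = range_partition(data, split_points)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        sorted_parts = list(pool.map(sorted, parts))  # "local" sorting
    # Partitions are ordered by key range, so concatenating them is globally sorted.
    return [v for part in sorted_parts for v in part], [len(p) for p in parts]


# Skewed input: most values are small, a few are very large.
rng = random.Random(42)
data = [int(rng.expovariate(0.01)) for _ in range(5000)]
sorted_data, part_sizes = parallel_sort(data)
print(part_sizes)
```

Because the split points come from quantiles of a sample rather than from evenly spaced key values, the partitions tend to hold similar numbers of tuples even when the keys are skewed.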

Skew in Partitioning

 The distribution of tuples to disks may be skewed

   i.e. some disks have many tuples, while others may have fewer tuples.

 Types of skew:

   Attribute-value skew.

     Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition.

     Can occur with range-partitioning and hash-partitioning.

   Partition skew.

     With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others.

     Less likely with hash-partitioning if a good hash function is chosen.
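Both kinds of skew can be reproduced with a few lines of Python. The data, the partition vectors, and the use of CRC32 as the hash function are assumptions chosen only to make the effect visible.

```python
import zlib
from collections import Counter


def range_partition_sizes(values, vector):
    """Tuples per partition for a sorted range-partitioning vector."""
    sizes = [0] * (len(vector) + 1)
    for v in values:
        # Partition index = number of split points that are <= v.
        sizes[sum(v >= bound for bound in vector)] += 1
    return sizes


def hash_partition_sizes(values, n_parts):
    """Tuples per partition when hashing the partitioning attribute."""
    counts = Counter(zlib.crc32(str(v).encode()) % n_parts for v in values)
    return [counts[i] for i in range(n_parts)]


# Partition skew: 900 employees aged 20-29 and 100 aged 60-69.
ages = [20 + (i % 10) for i in range(900)] + [60 + (i % 10) for i in range(100)]
bad_vector = range_partition_sizes(ages, [40, 50, 60])
better_vector = range_partition_sizes(ages, [23, 25, 28])

# Attribute-value skew: one value dominates, so even a good hash cannot spread it.
cities = ["Lahore"] * 800 + [f"city-{i}" for i in range(200)]
hashed = hash_partition_sizes(cities, 4)
print(bad_vector, better_vector, hashed)
```

A vector spaced evenly over the key domain leaves two partitions empty, while one chosen from the actual distribution balances them; the repeated city value, by contrast, lands in a single partition whatever the hash function.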

Barriers to Linear Speedup & Scale-up

 Amdahl’s Law

 Startup

   Time needed to start a large number of processors.

   Increases with the number of individual processors.

   May also include time spent opening files, etc.

 Interference

   Slowdown that each processor imposes on all others when sharing a common pool of resources (e.g. memory).

 Skew

   Variance dominating the mean.

   Service time of the job is the service time of its slowest component.
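The slide only names Amdahl's Law; its standard form is S = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. The short sketch below evaluates it; the function name and the sample values of p and n are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's Law: S = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# With 90% of the work parallelisable, the speed-up is capped below 10.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

However many processors are added, the serial fraction bounds the speed-up at 1 / (1 - p), which is why linear speed-up is only reachable when essentially all of the work parallelises.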

Comparison of Partitioning Techniques

Shared disk/memory is less sensitive to partitioning.

Shared nothing can benefit from good partitioning.

[Figures: a Users table distributed over five disks under each scheme.]

Range: Good for equijoins, range queries, and group-by clauses; can result in “hot spots”.

Round Robin: Good for load balancing, but impervious to the nature of queries.

Hash: Good for equijoins; can result in uneven data distribution.
