Need for Speed, Parallelism Methodologies - Data Warehousing - Lecture Slides, Slides of Data Warehousing

Topics covered: Need for Speed, Parallelism Methodologies, Motivation, Data Parallelism Concept, Ensuring Speed-Up, Temporal Parallelism, Pipelining Time Chart, Speed-Up Calculation. Some other terms are also described in these data warehousing lecture slides.

Typology: Slides

2011/2012

Uploaded on 11/03/2012

padmal

Data Warehousing
Lecture-25
Need for Speed: Parallelism Methodologies

Motivation

• No need for parallelism if we had the perfect computer:

  ◦ with a single, infinitely fast processor,

  ◦ with an infinite memory with infinite bandwidth,

  ◦ and it's infinitely cheap too (free!).

• Technology is not delivering (going-to-the-Moon analogy).

• The challenge is to build:

  ◦ an infinitely fast processor out of infinitely many processors of finite speed;

  ◦ an infinitely large memory with infinite memory bandwidth from infinitely many finite storage units of finite speed.


Data Parallelism: Example

[Figure: the Emp table is split into Partition-1, Partition-2, ..., Partition-k. Query Server-1, Query Server-2, ..., Query Server-k each scan one partition and return partial counts of 62, 440, ..., 1,123 to the Query Coordinator.]

SELECT COUNT(*) FROM Emp WHERE age > 50 AND sal > 10000;

Ans = 62 + 440 + ... + 1,123 = 99,
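
The example can be sketched in Python. This is only an illustration under stated assumptions: each partition is an in-memory list of rows, the names `count_matches` and `parallel_count` and the sample rows are hypothetical, and threads stand in for the query servers (a real warehouse would run them on separate processors or nodes).

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(partition):
    """Query server: count the rows of one partition that pass the filter."""
    return sum(1 for row in partition if row["age"] > 50 and row["sal"] > 10000)

def parallel_count(partitions):
    """Query coordinator: one query server per partition, then add the partial counts."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(count_matches, partitions))
    return partial_counts, sum(partial_counts)

# Hypothetical Emp rows, already split into three partitions.
emp_partitions = [
    [{"age": 55, "sal": 12000}, {"age": 30, "sal": 20000}],
    [{"age": 61, "sal": 15000}, {"age": 52, "sal": 11000}],
    [{"age": 58, "sal": 9000}],
]

partial_counts, total = parallel_count(emp_partitions)
print(partial_counts, total)  # [1, 2, 0] 3
```

The coordinator only adds k small numbers, which is why it can stay cheap relative to the scans done by the query servers.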

Data Parallelism: Ensuring Speed-Up

To get a speed-up of N with N partitions, it must be ensured that:

• There are enough computing resources.

• The query coordinator is very fast compared to the query servers.

• The work done in each partition is almost the same, to avoid performance bottlenecks.

  ◦ The same number of records in each partition would not suffice.

  ◦ Records need to be uniformly distributed across partitions w.r.t. the filter criterion.
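
To illustrate why balanced work matters, the sketch below assumes that a parallel scan finishes only when its slowest partition finishes and that the coordinator costs nothing. The function name `speed_up` and the work figures are made up for illustration.

```python
def speed_up(work_per_partition):
    """Speed-up over a serial run when all partitions are processed in parallel.

    Assumption: elapsed parallel time equals the largest per-partition work,
    so speed-up = total work / largest per-partition work.
    """
    return sum(work_per_partition) / max(work_per_partition)

balanced = [250, 250, 250, 250]  # hypothetical work units per partition
skewed = [700, 100, 100, 100]    # same total work, one overloaded partition

print(speed_up(balanced))  # 4.0
print(speed_up(skewed))    # about 1.43
```

With four partitions the balanced layout reaches the full speed-up of 4, while the skewed layout stays below 1.5 even though the total work is identical.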

Pipelining: Time Chart

[Figure: pipelining time chart. Every stage is labelled Time = T/3, and the time axis is marked T = 0, T = 1, T = 2 and T = 3.]

Pipelining: Speed-Up Calculation

Time for sequential execution of 1 task = T

Time for sequential execution of N tasks = N × T

(Ideal) time for pipelined execution of one task using an M-stage pipeline = T

(Ideal) time for pipelined execution of N tasks using an M-stage pipeline = T + ((N-1) × (T/M))

Speed-up (S) = (N × T) / (T + ((N-1) × (T/M))) = (N × M) / (M + N - 1)

Pipeline parallelism focuses on increasing throughput of task execution, NOT on decreasing sub-task execution time.
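
The timing formulas on this slide can be checked numerically. The sketch below assumes the ideal model stated here (equal stage times, no stalls); `pipeline_speed_up` is an illustrative name, not something from the lecture.

```python
def pipeline_speed_up(n_tasks, m_stages, t=1.0):
    """Ratio of sequential time to ideal pipelined time for n_tasks tasks."""
    sequential_time = n_tasks * t
    # The first task takes the full T; each later task completes T/M after the previous one.
    pipelined_time = t + (n_tasks - 1) * (t / m_stages)
    return sequential_time / pipelined_time

for n in (1, 10, 100, 10000):
    print(n, round(pipeline_speed_up(n, m_stages=3), 3))
```

With M = 3 the speed-up is 1.0 for N = 1, 2.5 for N = 10 and about 2.94 for N = 100, so it grows with the number of tasks but stays below the number of stages.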

Pipelining: Input vs Speed-Up

[Figure: Speed-up (S), from 1 to 3, plotted against Input (N), from 1 to 19.]

Asymptotic limit on speed-up for M stage pipeline is M.

The speed-up will NEVER be M, as initially filling the pipeline took T time units.

Pipelining: Limitations

• Relational pipelines are rarely very long.

  ◦ Even a chain of length ten is unusual.

• Some relational operators do not produce their first output until they have consumed all their inputs.

  ◦ Aggregate and sort operators have this property. One cannot pipeline these operators.

• Often, the execution cost of one operator is much greater than that of the others, hence skew.

  ◦ e.g. Sum() or Count() vs. Group-by() or Join.
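
The difference between a pipelinable and a blocking operator can be seen in a small Python generator sketch. It is offered only as an analogy for relational operators: `scan`, `select` and `sort_rows` are hypothetical names, and the `reads` list simply records how many input rows were pulled before the first output appeared.

```python
reads = []  # records every row pulled from the source

def scan(rows):
    """Source operator: yields rows one at a time and logs each read."""
    for row in rows:
        reads.append(row)
        yield row

def select(rows, predicate):
    """Pipelinable: emits a row as soon as it passes the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def sort_rows(rows):
    """Blocking: sorted() must consume all input before the first output."""
    yield from sorted(rows)

data = [5, 3, 8, 1]

reads.clear()
first_selected = next(select(scan(data), lambda r: r > 2))
reads_before_select_output = len(reads)

reads.clear()
first_sorted = next(sort_rows(scan(data)))
reads_before_sort_output = len(reads)

print(reads_before_select_output, reads_before_sort_output)  # 1 4
```

The selection produces its first row after reading one input row, whereas the sort has to read all four before it can emit anything, so no downstream stage can overlap with it.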

Partitioning & Queries

Round Robin

• Advantages

  ◦ Best suited for a sequential scan of the entire relation on each query.

  ◦ All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.

• Range queries are difficult to process

  ◦ No clustering: tuples are scattered across all disks.
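
A minimal sketch of round-robin placement, assuming tuples are plain integers and disks are Python lists; `round_robin_partition` is an illustrative name. It shows both properties from the slide: near-equal disk sizes, and a small range query that still has to visit every disk.

```python
def round_robin_partition(tuples, n_disks):
    """Send the i-th tuple to disk i mod n_disks."""
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):
        disks[i % n_disks].append(t)
    return disks

disks = round_robin_partition(list(range(1, 11)), n_disks=3)
print(disks)                    # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
print([len(d) for d in disks])  # [4, 3, 3]

# Range query 4 <= value <= 6: the three matching tuples sit on three different disks.
touched = [d for d, disk in enumerate(disks) if any(4 <= t <= 6 for t in disk)]
print(touched)                  # [0, 1, 2]
```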

Partitioning & Queries

Hash Partitioning

• Good for sequential access

  ◦ With uniform hashing and using the partitioning attributes as a key, tuples will be equally distributed between disks.

• Good for point queries on the partitioning attribute

  ◦ Can look up a single disk, leaving the others available for answering other queries.

  ◦ An index on the partitioning attribute can be local to a disk, making lookups and updates, and even joins, very efficient.

• Range queries are difficult to process

  ◦ No clustering: tuples are scattered across all disks.
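
The point-query behaviour can be sketched as follows. This is an assumption-laden illustration: `zlib.crc32` is used as a stand-in hash function only because it is deterministic across runs, and `disk_for`, `hash_partition`, `point_query` and the employee rows are hypothetical.

```python
import zlib

def disk_for(key, n_disks):
    """Map a partitioning-attribute value to a disk number."""
    return zlib.crc32(str(key).encode("utf-8")) % n_disks

def hash_partition(tuples, key_of, n_disks):
    """Place every tuple on the disk chosen by hashing its partitioning attribute."""
    disks = [[] for _ in range(n_disks)]
    for t in tuples:
        disks[disk_for(key_of(t), n_disks)].append(t)
    return disks

def point_query(disks, key, key_of):
    """Look up one key: only the single disk that can hold it is searched."""
    target = disk_for(key, len(disks))
    return target, [t for t in disks[target] if key_of(t) == key]

employees = [{"emp_id": i, "name": f"emp{i}"} for i in range(100)]
emp_disks = hash_partition(employees, lambda r: r["emp_id"], n_disks=4)

searched_disk, rows = point_query(emp_disks, 42, lambda r: r["emp_id"])
print(searched_disk, rows)
```

A range query such as `emp_id` between 40 and 49 gets no such shortcut, because consecutive keys hash to unrelated disks.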

Parallel Sorting

• Scan in parallel, and range partition on the go.

• As partitioned data becomes available, perform “local” sorting.

• Resulting data is sorted and again range partitioned.

• Problem: skew or “hot spot”.

• Solution: sample the data at the start to determine partition points.

[Figure: data distributed over processors P1 to P5, with a hot spot.]
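
The steps above can be sketched in Python. This is a single-process illustration, not a definitive implementation: the per-partition sorts run one after another here, where a parallel system would give each partition to its own processor. The names `choose_split_points` and `parallel_sort`, the sample size and the test data are assumptions, and the input is assumed to be a non-empty list.

```python
import bisect
import random

def choose_split_points(data, n_partitions, sample_size, rng):
    """Sample the data and take evenly spaced sample values as partition points."""
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    return [sample[(i * len(sample)) // n_partitions] for i in range(1, n_partitions)]

def parallel_sort(data, n_partitions, sample_size=100, seed=0):
    rng = random.Random(seed)
    splits = choose_split_points(data, n_partitions, sample_size, rng)
    # Range partition on the go: each value goes to the partition covering its range.
    partitions = [[] for _ in range(n_partitions)]
    for value in data:
        partitions[bisect.bisect_right(splits, value)].append(value)
    # "Local" sorting; each partition could be sorted by a different processor.
    sorted_parts = [sorted(p) for p in partitions]
    # Partitions cover increasing ranges, so concatenating them gives sorted output.
    merged = [v for part in sorted_parts for v in part]
    return merged, [len(p) for p in partitions]

# Hypothetical skewed input: most values are small, a few are large.
rng = random.Random(42)
data = [rng.randint(0, 100) for _ in range(900)] + [rng.randint(0, 10000) for _ in range(100)]
sorted_data, sizes = parallel_sort(data, n_partitions=5)
print(sizes)
```

Splitting the value range 0 to 10000 into five equal-width intervals would put nearly all of this data into the first partition; partition points taken from a sample follow the data distribution instead.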

Skew in Partitioning

• The distribution of tuples to disks may be skewed

  ◦ i.e. some disks have many tuples, while others may have fewer tuples.

• Types of skew:

  ◦ Attribute-value skew.

    ▪ Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition.

    ▪ Can occur with range-partitioning and hash-partitioning.

  ◦ Partition skew.

    ▪ With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others.

    ▪ Less likely with hash-partitioning if a good hash function is chosen.
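
Attribute-value skew can be reproduced with a few lines of Python. The sketch assumes integer department ids as the partitioning attribute and uses `key % n_partitions` as a deliberately simple hash function; the names and the data are hypothetical.

```python
def hash_partition_sizes(keys, n_partitions):
    """Count how many tuples land in each partition under key % n_partitions."""
    sizes = [0] * n_partitions
    for key in keys:
        sizes[key % n_partitions] += 1
    return sizes

def skew_ratio(sizes):
    """Largest partition relative to the average partition size (1.0 means no skew)."""
    return max(sizes) / (sum(sizes) / len(sizes))

# 900 tuples share department 7; the remaining 100 tuples have distinct departments 0..99.
dept_ids = [7] * 900 + list(range(100))
sizes = hash_partition_sizes(dept_ids, n_partitions=4)
print(sizes)              # [25, 25, 25, 925]
print(skew_ratio(sizes))  # 3.7
```

No choice of hash function fixes this case, because every tuple with department 7 must hash to the same partition.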

Barriers to Linear Speedup & Scale-up

• Amdahl's Law

• Startup

  Time needed to start a large number of processors. Increases with the number of individual processors. May also include time spent opening files, etc.

• Interference

  Slowdown that each processor imposes on all others when sharing a common pool of resources (e.g. memory).

• Skew

  Variance dominating the mean. The service time of the job is the service time of its slowest component.
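
The slide names Amdahl's Law without stating it. In its standard form, if a fraction p of the work can be parallelised over n processors, the speed-up is 1 / ((1 - p) + p / n). The sketch below evaluates that formula; the function name and the 90% figure are illustrative assumptions, not values from the lecture.

```python
def amdahl_speed_up(parallel_fraction, n_processors):
    """Standard form of Amdahl's Law: the serial part is not sped up at all."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speed_up(0.9, n), 2))
```

With 90% of the work parallelisable, the speed-up is about 5.26 on 10 processors and can never exceed 10, however many processors are added.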

Comparison of Partitioning Techniques

Shared disk/memory is less sensitive to partitioning.

Shared nothing can benefit from good partitioning.

Range: good for equijoins, range queries and group-by clauses; can result in “hot spots”.

Round Robin: good for load balancing, but impervious to the nature of queries.

Hash: good for equijoins; can result in uneven data distribution.

[Figure: for each technique, Users data spread over five disks labelled A…E, F…J, K…N, O…S and T…Z.]
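
Range partitioning over the five letter ranges in the figure can be sketched as below. The assumptions are that users are partitioned on the first letter of a name and that the ranges are A-E, F-J, K-N, O-S and T-Z; the function names and sample names are hypothetical.

```python
import bisect

# Upper bounds of the first four ranges; anything above "S" falls in the last partition.
UPPER_BOUNDS = ["E", "J", "N", "S"]

def range_partition(name):
    """Return the partition number (0..4) for a name, based on its first letter."""
    return bisect.bisect_left(UPPER_BOUNDS, name[0].upper())

def partitions_for_range(low, high):
    """Partitions that a range query low <= name <= high has to visit."""
    return list(range(range_partition(low), range_partition(high) + 1))

names = ["Alice", "Eve", "Frank", "Judy", "Kim", "Nina", "Oscar", "Sam", "Tom", "Zoe"]
print([range_partition(n) for n in names])  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(partitions_for_range("G", "M"))       # [1, 2]
```

A range query touches only the partitions whose ranges overlap it, which is the advantage listed for range partitioning; if most names began with the same few letters, one partition would become the hot spot.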