Applications: Parallel Sorting - Lecture Notes | CSE 160, Study notes of Computer Science

University of California - San Diego Computer Science

Material Type: Notes; Class: Intro to Parallel Computing; Subject: Computer Science & Engineering; University: University of California - San Diego; Term: Spring 2005;

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-qkf 🇺🇸


Lecture #7, Slide 1

CSE 160 Chien, Spring 2005

Applications: Parallel Sorting

•Last Time

»Caches and Performance

»Threads and Caches

»Parallel Machine Structure

•Today

»Parallel Cluster Communication

»Parallel Sorting

•Reminders/Announcements

»Homework #2 due Thursday (April 21)

»CSE 160 Midterm, Thursday April 28, In Class

The HPVM System

•Goals

»Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution)

»Provide usable access thru convenient standard parallel interfaces

»Deliver highest possible performance and simple programming model

Fast Messages Design

Elements

• User-level network access
• Lightweight protocols
» flow control, reliable delivery
» tightly-coupled link, buffer, and I/O bus management
• Poll-based notification
• Streaming API for efficient composition
• Many generations 1994-
» [IEEE Concurrency, 6/97]
» [Supercomputing ’95, 12/95]
• Related efforts: UCB AM, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP => VIA standard

Improved Bandwidth

• 20 MB/s -> 200+ MB/s (10x)
» Much of the advance is software structure: APIs and implementation
» Deliver all of the underlying hardware performance

[Chart: Performance (megabytes/sec) by year, 1995 to 1999; y-axis in MB/s]

HPVM Communication Performance

• Delivers underlying performance for small messages; endpoints are the limits
• 100MB/s at 1K vs 60MB/s at 1000K
» >1500x improvement

[Chart: bandwidth (MB/s) vs. message size (bytes) for FM on Myrinet and MPI on FM-Myrinet; N1/2 ~ 400 Bytes]

HPVM/FM on VIA

• FM Protocol/techniques portable to Giganet VIA
• Slightly lower performance, comparable N1/2

[Chart: bandwidth (MB/s) vs. message size (bytes) for FM on Giganet VIA and MPI-FM on Giganet VIA; N1/2 ~ 400 Bytes]
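
N1/2 is the message size at which delivered bandwidth reaches half of its peak. The sketch below is only an illustration under the common linear cost model T(n) = t0 + n / r_inf, for which N1/2 = t0 * r_inf. The model itself, the names `t0` and `r_inf`, and the 100 MB/s peak are assumptions made here, not values taken from the slides; only the ~400 byte N1/2 comes from the charts.

```python
def transfer_time(n_bytes, t0, r_inf):
    """Time to move n_bytes under the assumed linear model T(n) = t0 + n / r_inf."""
    return t0 + n_bytes / r_inf


def bandwidth(n_bytes, t0, r_inf):
    """Delivered bandwidth in bytes/s for a message of n_bytes."""
    return n_bytes / transfer_time(n_bytes, t0, r_inf)


def half_power_point(t0, r_inf):
    """Message size at which bandwidth equals r_inf / 2 in this model."""
    return t0 * r_inf


R_INF = 100e6          # assumed peak bandwidth, bytes/s (illustrative)
N_HALF = 400.0         # N1/2 read off the charts, bytes
T0 = N_HALF / R_INF    # per-message overhead implied by the model, seconds

half_bw = bandwidth(N_HALF, T0, R_INF)    # bandwidth at n = N1/2
large_bw = bandwidth(1_000_000, T0, R_INF)  # bandwidth for a large message
```

Under these assumptions a small N1/2 means the per-message overhead is small, so even short messages see a large share of the peak rate.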

Supercomputer Performance Characteristics (11/99)

System                  MF/Proc      Flops/Byte   Flops/NetworkRT
Cray T3E                1200         ~2           ~2,
SGI Origin2000          500          ~0.5         ~1,
HPVM NT Supercluster    600          ~8           ~12,
IBM SP2 (4 or 8-way)    2.6-5.2GF    ~12-25       ~150-300K
Beowulf (100Mbit)       600          ~50          ~200,

Sorting

• Postcondition: all items in order, data is distributed!
• Movement of data can be within a node, or across nodes

Unsorted keys: 13 14 10 8 1 11 3 7 9 2 4 5 12 6
Sorted keys: 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Bucket Sort Steps

• Divide Keys
• Each Worker divides his/her keys and sends/receives the keys to the appropriate worker
• Each Worker waits until it has received all of its keys (synchronization)
• Each Worker sorts its keys locally; the concatenated sections are then in order
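
The steps above can be sketched in a few lines. This is a single-process simulation rather than a cluster implementation: the "workers" are list indices, "sending" is appending to another worker's inbox, and the equal-width key ranges, the function names, and the split of the example keys across three workers are all assumptions made for illustration.

```python
def bucket_of(key, lo, hi, num_workers):
    """Map a key in [lo, hi] to the worker that owns its key range (equal-width ranges)."""
    width = (hi - lo + 1) / num_workers
    return min(int((key - lo) / width), num_workers - 1)


def parallel_bucket_sort(local_keys, lo, hi):
    """Simulate the steps; local_keys[w] is the list initially held by worker w."""
    num_workers = len(local_keys)
    inbox = [[] for _ in range(num_workers)]

    # Divide and send: every worker routes each of its keys to the owning worker.
    for keys in local_keys:
        for key in keys:
            inbox[bucket_of(key, lo, hi, num_workers)].append(key)

    # Synchronize: in this sequential simulation, finishing the loop above
    # plays the role of "every worker has received all of its keys".

    # Local sort: each worker sorts only what it received.
    return [sorted(bucket) for bucket in inbox]


# The 14 keys from the Sorting slide, split across three workers (assumed split).
initial = [[13, 14, 10, 8, 1], [11, 3, 7, 9, 2], [4, 5, 12, 6]]
buckets = parallel_bucket_sort(initial, lo=1, hi=14)
result = [key for bucket in buckets for key in bucket]
```

Because worker 0 owns the smallest key range and the last worker owns the largest, reading the buckets in worker order gives the fully sorted sequence without any merge step.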

Implementing Bucket Sort

• Divide Keys (easy, but often hard)
• Each Worker divides his/her keys (easy) and sends/receives the keys to the appropriate worker (HARD)
• Each Worker receives keys
» Wait until it has received all of its keys (synchronization) (HARD)
• Each Worker sorts its keys locally (easy)

Cross-worker interaction
- Threads and Locks
- Nodes and Messages

Synchronization
- How to know when a worker has all of its keys
- Synchronization Messages or Flags
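
One way to answer "how to know when a worker has all of its keys" is a synchronization message: every sender follows its keys with an end-of-stream marker, and a receiver is finished once it has seen one marker per sender. The sketch below uses Python threads and `queue.Queue` as stand-ins for workers and message channels; the `DONE` marker, the `splitters` list, and the function names are hypothetical and not from the slides.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, one per sender per inbox


def worker(rank, my_keys, inboxes, splitters, results):
    num_workers = len(inboxes)

    # Send phase: route each key to the worker whose range contains it.
    for key in my_keys:
        dest = sum(key > s for s in splitters)
        inboxes[dest].put(key)
    # Synchronization message: tell every worker this sender is finished.
    for box in inboxes:
        box.put(DONE)

    # Receive phase: block until a DONE marker has arrived from every sender.
    received, done_seen = [], 0
    while done_seen < num_workers:
        item = inboxes[rank].get()
        if item is DONE:
            done_seen += 1
        else:
            received.append(item)

    results[rank] = sorted(received)  # local sort


def run(keys_per_worker, splitters):
    n = len(keys_per_worker)
    inboxes = [queue.Queue() for _ in range(n)]  # unbounded, so sends never block
    results = [None] * n
    threads = [
        threading.Thread(target=worker,
                         args=(r, keys_per_worker[r], inboxes, splitters, results))
        for r in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


sorted_buckets = run([[13, 14, 10, 8, 1], [11, 3, 7, 9, 2], [4, 5, 12, 6]],
                     splitters=[5, 10])
```

The marker works because each queue preserves the order of a single sender's messages, so a sender's DONE can never overtake its own keys. On a real cluster the same idea needs a transport with in-order delivery per sender, or an explicit key count in the final message.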

One more thing…

• Data Starts on Disk and Ends on Disk (many gigabytes!)
• Read Data from Disk
• Divide Keys (easy)
• Each Worker divides his/her keys (easy) and sends/receives the keys to the appropriate worker (HARD)
• Each Worker receives keys
» Wait until it has received all of its keys (synchronization) (HARD)
• Each Worker sorts its keys locally (easy)
• Write Data to Disk

One Pass MinuteSort

(Luis Rivera UIUC, Xianan Zhang UCSD)

• Kayak: 32 HP Kayaks, 3Ware Controllers, 4 x 20GB IDE disks
• Netserver: 32 HP Netservers, 2 x 16GB SCSI disks
• HPVM & 1Gbps Myrinet

Myrinet

32 Kayaks
• 300 MHz Pentium II
• Dual Processors
• 384 MB Memory
• Four 20 GB 7200 RPM Disks Connected to a 3Ware Controller

32 Netservers
• 400 MHz Pentium II
• Dual Processors
• 1 GB Memory
• Two SCSI 18.2 GB 7200 RPM Disks

Modeling Performance for Scaleup!

• Kayak Read Rate: 50 MB/s
• Kayak Communication
» Send Rate: 100 MB/s
» Receive Rate: 70 MB/s
• NetServer Communication
» Send/Receive Rate: 100 MB/s
• In-core Sorting Rate: 189 MB/s
• Total Writing Rate: 60 MB/s
» Kayak: 40 MB/s
» Netserver: 20 MB/s
• Launch Time: 5 - 10 seconds
• Total Time: < 52 s

[Chart: Execution Time, broken into Launch Time, Read and Distribution Time, In-core Sort Time, and Write Time and Send results to Kayaks Time; labeled values: 19.2s, 10s, 16s]
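
The rates above can be combined into a rough phase-by-phase estimate. The sketch below assumes the phases run one after another and that each phase is limited by a single rate from the list. The 960 MB data size is an assumption, not a figure from the slides; it was picked because it reproduces the 19.2 s and 16 s values at the 50 MB/s read rate and 60 MB/s total write rate. Which chart label belongs to which phase is likewise an inference.

```python
DATA_MB = 960.0     # assumed data per Kayak; not stated in the slides
READ_RATE = 50.0    # Kayak read rate, MB/s
SORT_RATE = 189.0   # in-core sorting rate, MB/s
WRITE_RATE = 60.0   # total writing rate (40 Kayak + 20 Netserver), MB/s
LAUNCH_S = 10.0     # upper end of the 5 - 10 second launch time


def phase_times(data_mb, launch_s=LAUNCH_S):
    """Seconds per phase, assuming strictly sequential, rate-limited phases."""
    return {
        "launch": launch_s,
        "read_and_distribute": data_mb / READ_RATE,
        "in_core_sort": data_mb / SORT_RATE,
        "write": data_mb / WRITE_RATE,
    }


times = phase_times(DATA_MB)
total = sum(times.values())
sort_fraction = times["in_core_sort"] / total

for name, seconds in times.items():
    print(f"{name:>20}: {seconds:5.1f} s")
print(f"{'total':>20}: {total:5.1f} s")
```

Under these assumptions the total stays below the 52 s bound, and the in-core sort accounts for roughly a tenth of it, with the remainder spent on launch, disk I/O, and distribution.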

Read and Distribute (hard)

[Diagram: nodes Kayak 1, Kayak 2, Netserver 1, and Netserver 2 connected by the Myrinet Network; labeled components: Read Thread, Double Buffer, Communication Thread]
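
The Read Thread, Double Buffer, and Communication Thread in the diagram suggest overlapping disk reads with network sends. The following is a minimal sketch of that idea, not the MinuteSort code: it assumes two reusable buffers passed between the threads through a pair of queues, and `blocks`, `send`, and the function names are hypothetical stand-ins for disk blocks and a network send call.

```python
import queue
import threading


def read_and_distribute(blocks, send):
    """Overlap 'reading' and 'sending' using two buffers shared by two threads."""
    free = queue.Queue()   # buffers ready to be filled by the read thread
    full = queue.Queue()   # buffers ready to be sent by the communication thread
    for _ in range(2):     # exactly two buffers: the double buffer
        free.put([])

    def read_thread():
        for block in blocks:       # stands in for reading one block from disk
            buf = free.get()       # wait until one of the two buffers is empty
            buf[:] = block         # fill the buffer in place
            full.put(buf)
        full.put(None)             # signal that there is no more data

    def communication_thread():
        while True:
            buf = full.get()       # wait for a filled buffer
            if buf is None:
                break
            send(list(buf))        # stands in for a network send of the buffer
            free.put(buf)          # hand the buffer back for the next read

    threads = [threading.Thread(target=read_thread),
               threading.Thread(target=communication_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


sent = []
read_and_distribute([[3, 1], [2], [5, 4]], sent.append)
```

While the communication thread is sending one buffer, the read thread can already fill the other, so neither the disk nor the network has to sit idle waiting for the other to finish a block.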

Perspective

• Local Sort in a Small Fraction of the time
• Global Sort
» Sending + Receiving => Sorting amongst the buckets!
» + Disk I/O
» => 90% of the time
• Communication is expensive
• Variable Numbers of Senders and Receivers Expensive
• Synchronization is expensive

Summary

• Clusters and High Performance Computing
• Heroic Sorting
» Gigabytes disk to disk in a minute
» Easy Parts
- Divide the Work
- Read and Write Data (but slow)
» Hard Parts
- Communication, Many to Many
- Synchronization
• Initial Exposure: Scaling and Speedup

» Constant time Benchmark