CS 352H: Computer Systems Architecture, Slides of Computer Architecture and Organization

University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell

CS 352H: Computer Systems Architecture
Topic 14: Multicores, Multiprocessors, and Clusters

Introduction

Goal: connecting multiple computers to get higher performance

Multiprocessors

Scalability, availability, power efficiency

Job-level (process-level) parallelism

High throughput for independent jobs

Parallel processing program

Single program run on multiple processors

Multicore microprocessors

Chips with multiple processors (cores)

What We’ve Already Covered

§2.11: Parallelism and Instructions

Synchronization

§3.6: Parallelism and Computer Arithmetic

Associativity

§4.10: Parallelism and Advanced Instruction-Level Parallelism

§5.8: Parallelism and Memory Hierarchies

Cache Coherence

§6.9: Parallelism and I/O

Redundant Arrays of Inexpensive Disks
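The associativity item in this recap matters for parallel sums: splitting a sum across processors changes the order of the additions, and floating-point addition is not associative. A minimal sketch in C, where the function names and the sample values are assumptions chosen for illustration:

```c
/* Sums three values with the two possible groupings. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;   /* a and b are combined first */
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);   /* b and c are combined first */
}
```

With a = 1e20, b = -1e20, c = 1.0, sum_left returns 1.0 but sum_right returns 0.0, because adding 1.0 to -1e20 is lost to rounding in double precision.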

Parallel Programming

Parallel software is the problem

Need to get significant performance improvement

Otherwise, just use a faster uniprocessor, since it’s easier!

Difficulties

Partitioning

Coordination

Communications overhead

Scaling Example

Workload: sum of 10 scalars, and 10 × 10 matrix sum

Speed up from 10 to 100 processors

Single processor: Time = $(10 + 100) \times t_{\text{add}}$

10 processors

Time = $10 \times t_{\text{add}} + (100/10) \times t_{\text{add}} = 20 \times t_{\text{add}}$

Speedup = $110/20 = 5.5$ (55% of potential)

100 processors

Time = $10 \times t_{\text{add}} + (100/100) \times t_{\text{add}} = 11 \times t_{\text{add}}$

Speedup = $110/11 = 10$ (10% of potential)

Assumes load can be balanced across processors
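The two cases above follow one pattern. As a sketch, write $s$ for the number of sequential scalar additions, $m$ for the number of matrix additions, and $p$ for the processor count (these symbols are introduced here for illustration and are not in the slides), with $t_{\text{add}}$ the time of one addition and perfect load balance assumed:

$$T(p) = \left(s + \frac{m}{p}\right) t_{\text{add}}, \qquad \text{Speedup}(p) = \frac{T(1)}{T(p)} = \frac{s + m}{s + m/p}$$

With $s = 10$ and $m = 100$, $p = 10$ gives $110/20 = 5.5$ and $p = 100$ gives $110/11 = 10$. The sequential term $s$ does not shrink with $p$, which is why the speedup falls further from the ideal as $p$ grows.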

Scaling Example (cont)

What if matrix size is 100 × 100?

Single processor: Time = $(10 + 10000) \times t_{\text{add}}$

10 processors

Time = $10 \times t_{\text{add}} + (10000/10) \times t_{\text{add}} = 1010 \times t_{\text{add}}$

Speedup = $10010/1010 \approx 9.9$ (99% of potential)

100 processors

Time = $10 \times t_{\text{add}} + (10000/100) \times t_{\text{add}} = 110 \times t_{\text{add}}$

Speedup = $10010/110 = 91$ (91% of potential)

Assuming load balanced
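The arithmetic in both scaling examples can be reproduced with a few lines of C. This is a minimal sketch: the function names and parameters are assumptions made for illustration, time is measured in units of one addition, and perfect load balance is assumed as in the slides.

```c
/* Time, in units of one addition, for `scalars` sequential additions
   plus an n x n matrix sum divided evenly among p processors. */
double parallel_time(int scalars, int n, int p)
{
    return scalars + ((double)n * n) / p;
}

/* Speedup of p processors relative to a single processor. */
double speedup(int scalars, int n, int p)
{
    return parallel_time(scalars, n, 1) / parallel_time(scalars, n, p);
}
```

For instance, speedup(10, 10, 100) corresponds to the 10 × 10 matrix on 100 processors, and speedup(10, 100, 100) to the 100 × 100 matrix on 100 processors.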

Shared Memory

SMP: shared memory multiprocessor

Hardware provides single physical address space for all processors

Synchronize shared variables using locks

Memory access time: UMA (uniform) vs. NUMA (nonuniform)
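To illustrate what synchronizing a shared variable with a lock can look like, here is a minimal spin-lock sketch built on the C11 atomic_flag type. The SpinLock type, the helper names, and the shared counter are assumptions made for this example; the slides do not prescribe a particular lock implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock: the flag is set while the lock is held. */
typedef struct { atomic_flag held; } SpinLock;

/* Tries once to take the lock; returns true on success. */
static inline bool spin_trylock(SpinLock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-waits until the lock is acquired. */
static inline void spin_lock(SpinLock *l)
{
    while (!spin_trylock(l)) {
        /* spin */
    }
}

/* Releases the lock so another processor can take it. */
static inline void spin_unlock(SpinLock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* A shared variable and the lock that protects it. */
static SpinLock counter_lock = { ATOMIC_FLAG_INIT };
static long shared_counter = 0;

/* Every update to shared_counter happens while holding counter_lock. */
void add_to_shared(long delta)
{
    spin_lock(&counter_lock);
    shared_counter += delta;
    spin_unlock(&counter_lock);
}
```

Each thread on an SMP would call add_to_shared(); the atomic test-and-set ensures only one of them is inside the update at a time.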

Example: Sum Reduction

Sum 100,000 numbers on a 100-processor UMA machine

Each processor has ID: 0 ≤ Pn ≤ 99

Partition 1000 numbers per processor

Initial summation on each processor

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

Now need to add these partial sums

Reduction: divide and conquer

Half the processors add pairs, then quarter, …

Need to synchronize between reduction steps
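The partial-sum and reduction steps above can be sketched as a sequential simulation in C. This is an illustration under stated assumptions: the loop over Pn stands in for 100 processors running at once, each pass of the while loop stands in for one reduction step separated by a barrier, and the names reduce_sum, NPROC, and PER_PROC are made up for the example.

```c
#include <stddef.h>

#define NPROC 100      /* number of simulated processors */
#define PER_PROC 1000  /* elements summed by each processor */

/* Sequentially simulates the shared-memory sum reduction.
   A must hold NPROC * PER_PROC elements. */
long long reduce_sum(const long long *A)
{
    long long sum[NPROC];

    /* Phase 1: each processor Pn sums its own contiguous slice. */
    for (int Pn = 0; Pn < NPROC; Pn++) {
        sum[Pn] = 0;
        for (size_t i = (size_t)PER_PROC * Pn; i < (size_t)PER_PROC * (Pn + 1); i++)
            sum[Pn] += A[i];
    }

    /* Phase 2: tree reduction. On real hardware a barrier would
       separate consecutive passes of this loop. */
    int half = NPROC;
    while (half > 1) {
        if (half % 2 != 0)           /* odd count: processor 0 absorbs the last element */
            sum[0] += sum[half - 1];
        half = half / 2;
        for (int Pn = 0; Pn < half; Pn++)
            sum[Pn] += sum[Pn + half];  /* lower half adds the upper half */
    }
    return sum[0];
}
```

With 100 processors the active count shrinks 100, 50, 25, 12, 6, 3, 1, so the reduction takes a handful of steps instead of 99 sequential additions.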

Message Passing

Each processor has private physical address space

Hardware sends/receives messages between processors

Loosely Coupled Clusters

Network of independent computers

Each has private memory and OS

Connected using I/O system

E.g., Ethernet/switch, Internet

Suitable for applications with independent tasks

Web servers, databases, simulations, …

High availability, scalable, affordable

Problems

Administration cost (prefer virtual machines)

Low interconnect bandwidth

cf. processor/memory bandwidth on an SMP

Sum Reduction (Again)

Given send() and receive() operations

    limit = 100; half = 100; /* 100 processors */
    repeat
        half = (half+1)/2; /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);
        if (Pn < (limit/2))
            sum = sum + receive();
        limit = half; /* upper limit of senders */
    until (half == 1); /* exit with final sum */

Send/receive also provide synchronization

Assumes send/receive take similar time to addition
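The reduction loop above can be traced with a sequential simulation in C. This is only a sketch: the one-slot mailbox array and the sim_send()/sim_receive() helpers are hypothetical stand-ins for the send() and receive() operations, which the slides leave abstract, and the loops over Pn stand in for processors acting concurrently.

```c
#define NPROC 100  /* number of simulated processors */

/* Hypothetical one-slot mailbox per processor. */
static long long mailbox[NPROC];

/* Stand-in for send(): deposit a value in the destination's mailbox. */
static void sim_send(int dest, long long value) { mailbox[dest] = value; }

/* Stand-in for receive(): read the value waiting in our own mailbox. */
static long long sim_receive(int self) { return mailbox[self]; }

/* sum[Pn] holds processor Pn's partial sum on entry;
   the total ends up in sum[0], which is also returned. */
long long message_passing_reduce(long long sum[NPROC])
{
    int limit = NPROC, half = NPROC;
    do {
        half = (half + 1) / 2;                    /* send vs. receive dividing line */
        for (int Pn = half; Pn < limit; Pn++)     /* upper part sends */
            sim_send(Pn - half, sum[Pn]);
        for (int Pn = 0; Pn < limit / 2; Pn++)    /* matching receivers add */
            sum[Pn] += sim_receive(Pn);
        limit = half;                             /* upper limit of senders */
    } while (half != 1);
    return sum[0];
}
```

Rounding half up handles odd counts: when limit is 25, processors 13 to 24 send and processor 12 simply keeps its value for the next step.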

Grid Computing

Separate computers interconnected by long-haul networks

E.g., Internet connections

Work units farmed out, results sent back

Can make use of idle time on PCs

E.g., SETI@home, World Community Grid

Simultaneous Multithreading

In multiple-issue dynamically scheduled processor

Schedule instructions from multiple threads

Instructions from independent threads execute when function units are available

Within threads, dependencies handled by scheduling and register renaming

Example: Intel Pentium-4 HT

Two threads: duplicated registers, shared function units and caches

Multithreading Example