Parallel Processing and Multiprocessor Architectures: An Overview - Prof. Josep Torrellas, Study notes of Computer Architecture and Organization

An overview of the progress towards multiprocessors, Flynn's classification of parallel architectures, and various types of MIMD machines, including message passing machines and distributed shared memory systems. It also covers performance metrics for communications and different locking mechanisms.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

Copyright Josep Torrellas 1999, 2001, 2002
Chapter 6
Instructor: Josep Torrellas
CS433

Progress Towards Multiprocessors

  • Rate of speed growth in uniprocessors is saturating
  • Modern multiple-issue processors are becoming very complex → multicores
  • Steady progress in parallel software: the major obstacle to parallel processing

  • Multiple I streams, single D stream (MISD): no commercial machine
  • Multiple I streams, multiple D streams (MIMD)
    - each processor fetches its own instructions and operates on its own data
    - usually off-the-shelf microprocessors
  • MIMD is the architecture of choice for general-purpose multiprocessors (MPs)
    - flexible: can be used in single-user mode or multiprogrammed
    - uses off-the-shelf microprocessors
  • See figure 6.1 and 6.

  • Also reduces the memory latency
    - of course, interprocessor communication is more costly and complex
  • Often each node is a cluster (bus-based multiprocessor)
  • 2 types, depending on the method used for interprocessor communication:
    1. Distributed shared memory (DSM) or scalable shared memory
    2. Message passing machines or multicomputers

DSMs:
  • memories are addressed as one shared address space: processor P1 writes address X, processor P2 reads address X
  • shared memory means that the same address in 2 processors refers to the same memory location, not that memory is centralized
  • also called NUMA (Non-Uniform Memory Access)
  • processors communicate implicitly via loads and stores

Multicomputers:
  • each processor has its own address space, disjoint from those of other processors, which cannot be addressed by other processors

  • processors are notified of the arrival of a message: polling → interrupt
  • standard message passing libraries: Message Passing Interface (MPI)

Performance Metrics for Communications
  1. Communication bandwidth
  2. Communication latency = sender overhead + transfer + receiver overhead
  3. Communication latency hiding
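The latency formula above can be sketched as a small model. This is only an illustration: the `Link` fields, the split of "transfer" into a fixed time of flight plus size divided by bandwidth, and all numbers are assumptions and do not come from the notes.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical parameters for one message transfer (times in microseconds)."""
    sender_overhead_us: float
    receiver_overhead_us: float
    time_of_flight_us: float      # assumed fixed wire/switch delay
    bandwidth_mb_per_s: float     # assumed sustained bandwidth, 1 MB = 10**6 bytes


def transfer_time_us(link: Link, message_bytes: int) -> float:
    # 1 MB/s equals 1 byte per microsecond, so bytes / (MB/s) is already in us.
    return link.time_of_flight_us + message_bytes / link.bandwidth_mb_per_s


def latency_us(link: Link, message_bytes: int) -> float:
    # Communication latency = sender overhead + transfer + receiver overhead.
    return (link.sender_overhead_us
            + transfer_time_us(link, message_bytes)
            + link.receiver_overhead_us)


def effective_bandwidth_mb_per_s(link: Link, message_bytes: int) -> float:
    # Bandwidth actually seen by the program once overheads are included.
    return message_bytes / latency_us(link, message_bytes)


example_link = Link(sender_overhead_us=1.0, receiver_overhead_us=1.5,
                    time_of_flight_us=0.5, bandwidth_mb_per_s=100.0)

for size in (10, 1000, 100_000):
    print(size, latency_us(example_link, size),
          effective_bandwidth_mb_per_s(example_link, size))
```

With these made-up parameters, small messages are dominated by the fixed overheads, so their effective bandwidth is far below the raw link bandwidth. Latency hiding (metric 3) is not modeled here.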

Shared memory communication (DSM)
  • compatibility with well-understood mechanisms in centralized MPs
  • ease of programming / compiler design for programs with irregular communication patterns
  • lower overhead of communication; better use of bandwidth when using small communications
  • reduced remote communication by using automatic caching of data
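To make the contrast between implicit (load/store) and explicit (send/receive) communication concrete, here is a minimal sketch. It is an analogy only: Python threads stand in for processors, a dictionary stands in for the shared address space, and a `queue.Queue` stands in for the interconnect. All function names are hypothetical.

```python
import queue
import threading


def run_pair(p1, p2):
    """Run two stand-in 'processors' as threads and wait for both."""
    threads = [threading.Thread(target=p2), threading.Thread(target=p1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_memory_exchange(value):
    """P1 stores to address X; P2 loads the same address X."""
    memory = {"X": None}           # one shared address space
    ready = threading.Event()      # synchronization only, carries no data
    received = []

    def p1():
        memory["X"] = value        # ordinary store
        ready.set()

    def p2():
        ready.wait()
        received.append(memory["X"])   # ordinary load of the same address

    run_pair(p1, p2)
    return received[0]


def message_passing_exchange(value):
    """P1 explicitly sends; P2 explicitly receives. No address is shared."""
    channel = queue.Queue()
    received = []

    def p1():
        channel.put(value)             # explicit send

    def p2():
        received.append(channel.get()) # explicit blocking receive

    run_pair(p1, p2)
    return received[0]


print(shared_memory_exchange(7), message_passing_exchange(7))
```

In the shared memory version the data moves through a named location and the program only adds synchronization; in the message passing version the data movement itself is spelled out by the programmer.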

Amdahl's law:

\[
\text{Speedup} = \frac{1}{(1 - f_{\text{enh}}) + \dfrac{f_{\text{enh}}}{Sp_{\text{enh}}}}
= \frac{1}{(1 - f_{\text{parallel}}) + \dfrac{f_{\text{parallel}}}{100}}
\]

2) Large latency of remote accesses (50-10,000 clock cycles)

Machine            Round trip time
Cray T3D           1 μsec
Convex Exemplar    2 μsec
KSR-               2- μsec
CM-                10 μsec
Intel Paragon      10- μsec
IBM SP-            30- μsec

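Example: a machine with a 10 ns clock cycle has a round-trip remote latency of 2 μsec (2000 ns = 200 cycles). 0.5% of instructions are remote requests. All local accesses hit in the cache (CPI = 1).

What is the new CPI?

CPI = 1 + 0.5% × 2000/10 = 1 + 0.005 × 200 = 2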
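Both quantitative points on this slide, Amdahl's law and the CPI example, are easy to check numerically. The sketch below uses hypothetical function and parameter names; the 0.99 parallel fraction is an assumed value for illustration, while the CPI inputs are the ones given in the example.

```python
def amdahl_speedup(f_enh: float, sp_enh: float) -> float:
    """Speedup = 1 / ((1 - f_enh) + f_enh / Sp_enh)."""
    return 1.0 / ((1.0 - f_enh) + f_enh / sp_enh)


def effective_cpi(base_cpi: float, remote_fraction: float,
                  remote_latency_ns: float, cycle_time_ns: float) -> float:
    """Base CPI plus the average stall cycles contributed by remote requests."""
    remote_cost_cycles = remote_latency_ns / cycle_time_ns
    return base_cpi + remote_fraction * remote_cost_cycles


# Assumed parallel fraction of 0.99 with the parallel part sped up 100 times.
print(f"speedup = {amdahl_speedup(0.99, 100):.2f}")

# The example from the notes: 10 ns cycle, 2 usec round trip, 0.5% remote requests.
print(f"CPI = {effective_cpi(1.0, 0.005, 2000.0, 10.0):.2f}")
```

Even with 99% of the work parallel, the speedup stays near half of the enhancement factor, and a remote request rate of only 0.5% is enough to double the CPI.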


The Cache Coherence Problem

  • Caches are critical to modern high-speed processors
  • Multiple copies of a block can easily get inconsistent
    - processor writes, I/O writes, ...

[Figure: two processors (P), each with its own cache, above a shared memory. Both caches and memory hold A = 5; one copy is then written as A = 7.]
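The inconsistency described above can be reproduced with a toy model. The `Cache` class below is a hypothetical write-back cache with no coherence mechanism at all; it is meant only to show how stale copies arise.

```python
class Cache:
    """A private write-back cache with no coherence mechanism."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # only the local copy changes


memory = {"A": 5}
cache1, cache2 = Cache(memory), Cache(memory)

cache1.read("A")        # both processors cache A = 5
cache2.read("A")
cache1.write("A", 7)    # processor 1 writes A = 7 into its own cache

print(cache1.read("A"), cache2.read("A"), memory["A"])  # prints "7 5 5"
```

After the write, processor 2 and memory still see the old value, which is exactly the situation a coherence scheme has to prevent.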


Snoopy Cache Coherence Schemes

  • A distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.
  • Most commonly used method in commercial multiprocessors

[Figure: state-transition diagram of a snoopy protocol with three states (Invalid, Shared, Dirty). Transitions are labeled P-read, P-write, Bus-read, Bus write miss, and Bus invalidate.]
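A three-state diagram of this kind can be written down as a transition table. The sketch below encodes the common textbook write-invalidate, write-back protocol using the state and event names from the diagram; the exact arcs are an assumption, since they cannot be recovered from these notes, and the helper `access` is hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"
    DIRTY = "Dirty"


# (current state, event) -> (next state, bus action issued, or None)
TRANSITIONS = {
    # Events from the local processor
    (State.INVALID, "P-read"): (State.SHARED, "Bus-read"),
    (State.INVALID, "P-write"): (State.DIRTY, "Bus write miss"),
    (State.SHARED, "P-read"): (State.SHARED, None),
    (State.SHARED, "P-write"): (State.DIRTY, "Bus invalidate"),
    (State.DIRTY, "P-read"): (State.DIRTY, None),
    (State.DIRTY, "P-write"): (State.DIRTY, None),
    # Events snooped from the bus (caused by another cache)
    (State.INVALID, "Bus-read"): (State.INVALID, None),
    (State.INVALID, "Bus write miss"): (State.INVALID, None),
    (State.INVALID, "Bus invalidate"): (State.INVALID, None),
    (State.SHARED, "Bus-read"): (State.SHARED, None),
    (State.SHARED, "Bus write miss"): (State.INVALID, None),
    (State.SHARED, "Bus invalidate"): (State.INVALID, None),
    (State.DIRTY, "Bus-read"): (State.SHARED, None),        # owner supplies the block
    (State.DIRTY, "Bus write miss"): (State.INVALID, None),  # owner gives up the block
    # (DIRTY, "Bus invalidate") cannot occur: no other cache is Shared.
}


def access(states, cpu, event):
    """Apply a processor event on `cpu`; every other cache snoops the bus action."""
    states[cpu], bus_action = TRANSITIONS[(states[cpu], event)]
    if bus_action is not None:
        for other in range(len(states)):
            if other != cpu:
                states[other], _ = TRANSITIONS[(states[other], bus_action)]
    return states


states = [State.INVALID, State.INVALID]   # one block, two caches
for cpu, event in [(0, "P-read"), (1, "P-read"), (0, "P-write"), (1, "P-read")]:
    access(states, cpu, event)
    print(cpu, event, [s.value for s in states])
```

The trace shows the key invariant: whenever one cache holds the block Dirty, every other copy is Invalid.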


Write-Back/Ownership Schemes

  • When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.
  • Most bus-based multiprocessors nowadays use such schemes.
  • Many variants of ownership-based protocols exist:
    - Goodman's write-once scheme
    - Berkeley ownership scheme
    - Firefly update protocol
    - …
  • We will discuss a few of these
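The bandwidth argument for ownership can be illustrated by counting bus transactions caused by writes. This is a deliberately simplified sketch: the function name, the trace format, and the policy names are assumptions, and eventual write-backs of dirty blocks are ignored.

```python
def count_write_bus_transactions(trace, policy):
    """trace: list of (cpu, block) writes. Returns bus transactions caused by writes."""
    owner = {}      # block -> cpu that currently owns it (ownership policy only)
    bus = 0
    for cpu, block in trace:
        if policy == "write-through":
            bus += 1                        # every write goes out on the bus
        elif policy == "ownership":
            if owner.get(block) != cpu:     # must first acquire ownership
                bus += 1
                owner[block] = cpu
            # otherwise the write is done locally with no bus traffic
        else:
            raise ValueError(f"unknown policy: {policy}")
    return bus


# Hypothetical trace: CPU 0 writes block A ten times, CPU 1 once, CPU 0 three more.
trace = [(0, "A")] * 10 + [(1, "A")] + [(0, "A")] * 3

print(count_write_bus_transactions(trace, "write-through"))  # prints 14
print(count_write_bus_transactions(trace, "ownership"))      # prints 3
```

Under ownership, the bus is used only when ownership changes hands, so runs of writes by one processor cost a single transaction.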


Invalidation vs. Update Strategies

  1. Invalidation: On a write, all other caches with a copy are invalidated
  2. Update: On a write, all other caches with a copy are updated

  • Invalidation is bad when:
    - single producer and many consumers of data.
  • Update is bad when:
    - multiple writes by one PE before data is read by another PE.
    - junk data accumulates in large caches (e.g. process migration).
  • Overall, invalidation schemes are more popular as the default
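The two "bad" cases above can be compared with a rough bus-traffic count for a single block. This is a simplified model under stated assumptions: the function and trace format are hypothetical, a miss costs one transaction, and a write to a block that others hold costs one more (either an invalidate or an update broadcast). Junk data accumulation is not modeled.

```python
def bus_transactions(trace, protocol):
    """trace: list of (op, cpu) with op in {"r", "w"}, all on the same block."""
    holders = set()     # caches currently holding a valid copy
    bus = 0
    for op, cpu in trace:
        if cpu not in holders:
            bus += 1                    # miss: fetch the block over the bus
            holders.add(cpu)
        if op == "w" and len(holders) > 1:
            bus += 1                    # tell the other holders about the write
            if protocol == "invalidate":
                holders = {cpu}         # the others lose their copy
            # under "update" the others keep a fresh copy
    return bus


rounds = 5
# Single producer (CPU 0), many consumers (CPUs 1 to 3).
producer_consumer = [("w", 0), ("r", 1), ("r", 2), ("r", 3)] * rounds
# Ten writes by one PE before the data is read by another PE.
write_runs = ([("w", 0)] * 10 + [("r", 1)]) * rounds

for name, trace in [("producer/consumer", producer_consumer),
                    ("write runs", write_runs)]:
    print(name,
          "invalidate:", bus_transactions(trace, "invalidate"),
          "update:", bus_transactions(trace, "update"))
```

With these assumed traces, invalidation costs 20 transactions against 8 for update in the producer/consumer pattern, while update costs 42 against 10 for invalidation when writes come in runs.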