Multiprocessors and Thread-Level Parallelism - Notes | CS 433, Study notes of Computer Architecture and Organization

Material Type: Notes; Professor: Torrellas; Class: Computer System Organization; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Fall 2008;

Typology: Study notes

Pre 2010

Uploaded on 03/10/2009

Copyright Josep Torrellas 1999, 2001, 2002
Chapter 4: Multiprocessors and Thread-Level Parallelism
Instructor: Josep Torrellas
CS433

Progress Towards Multiprocessors

  • Rate of speed growth in uniprocessors is saturating
  • Modern multiple-issue processors are becoming very complex → multicores
  • Steady progress in parallel software, which has been the major obstacle to parallel processing

  • Multiple I streams, single D stream (MISD): no commercial machine
  • Multiple I streams, multiple D streams (MIMD)
    – each processor fetches its own instructions and operates on its own data
    – usually off-the-shelf microprocessors
    – architecture of choice for general-purpose multiprocessors (mps)
    – Flexible: can be used in single-user mode or multiprogrammed
    – uses off-the-shelf microprocessors

  • See figure 4.1 and 4.

  • Also reduces the memory latency
  • Of course, interprocessor communication is more costly and complex
  • Often each node is a cluster (bus-based multiprocessor)
  • 2 types, depending on the method used for interprocessor communication:
    1. Distributed shared memory (DSM) or scalable shared memory
    2. Message-passing machines or multicomputers

DSMs:
  • memories are addressed as one shared address space: processor P1 writes address X, processor P2 reads address X
  • shared memory means that the same address on 2 processors refers to the same memory location, not that memory is centralized
  • also called NUMA (Non-Uniform Memory Access)
  • processors communicate implicitly via loads and stores

Multicomputers:
  • each processor has its own address space, disjoint from those of the other processors, which cannot be addressed by other processors

  • processors are notified of the arrival of a message: polling or interrupt
  • standard message-passing libraries: Message Passing Interface (MPI)
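The contrast between the two communication models can be mimicked in a few lines of Python. This is only an analogy, not how a DSM or a multicomputer is built: Python threads always share memory, so the "disjoint address space" of the message-passing half is simulated by discipline (P2 never touches P1's local variable). The names `shared`, `channel`, `p1_shared`, `p2_shared`, `p1_msg`, `p2_msg`, and `run` are all hypothetical.

```python
import queue
import threading

# Shared-address-space style: communicate implicitly via loads and stores.
shared = {"X": None}          # stands in for shared memory location X
written = threading.Event()   # synchronization only; it carries no data

def p1_shared():
    shared["X"] = 42          # P1 "stores" to address X
    written.set()

def p2_shared(result):
    written.wait()            # wait until the store has happened
    result.append(shared["X"])  # P2 "loads" from the same address X

# Message-passing style: private data, explicit send and receive.
channel = queue.Queue()       # stands in for the interconnect

def p1_msg():
    local_x = 42              # private to P1 by convention
    channel.put(local_x)      # explicit send

def p2_msg(result):
    result.append(channel.get())  # explicit blocking receive

def run(producer, consumer):
    """Run one producer and one consumer thread; return what the consumer saw."""
    result = []
    threads = [threading.Thread(target=consumer, args=(result,)),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

shared_value = run(p1_shared, p2_shared)
msg_value = run(p1_msg, p2_msg)
print(shared_value, msg_value)
```

In the first half the data moves as a side effect of ordinary reads and writes; in the second half nothing moves unless the sender calls `put` and the receiver calls `get`, which is the role a library such as MPI plays on a real multicomputer.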

Advantages of shared memory communication (DSM):
  • Compatibility with well-understood mechanisms in centralized mps
  • Ease of programming and compiler design for programs with irregular communication patterns
  • Lower overhead of communication; better use of bandwidth when using small communications
  • Reduced remote communication by using automatic caching of data

Amdahl’s law:

$$\text{Speedup} = \frac{1}{(1 - f_{enh}) + \dfrac{f_{enh}}{Sp_{enh}}} = \frac{1}{(1 - f_{parallel}) + \dfrac{f_{parallel}}{100}}$$

2) Large latency of remote accesses (50-1,000 clock cycles)
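Amdahl's law is easy to evaluate numerically. The sketch below assumes the enhancement is parallel execution on `n_processors` processors, so the speedup of the enhanced fraction equals the processor count. The function names and the target speedup of 80 are illustrative choices, not values taken from the slide.

```python
def amdahl_speedup(f_parallel, n_processors):
    """Overall speedup when a fraction f_parallel runs n_processors times faster."""
    return 1.0 / ((1.0 - f_parallel) + f_parallel / n_processors)

def required_parallel_fraction(target_speedup, n_processors):
    """Solve target = 1 / ((1 - f) + f / n) for f."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)

# Hypothetical question: what parallel fraction gives a speedup of 80 on 100 processors?
f_needed = required_parallel_fraction(80, 100)
print(f"parallel fraction needed: {f_needed:.4f}")
print(f"check: speedup = {amdahl_speedup(f_needed, 100):.2f}")
print(f"speedup with f = 0.90: {amdahl_speedup(0.90, 100):.2f}")
```

The point of the exercise is how little sequential code it takes to cap the speedup: even with 90% of the work parallel, 100 processors deliver less than a 10x improvement.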

Example: a machine with a 0.5 ns clock cycle has a round-trip latency of 200 ns. 0.2% of instructions cause a cache miss (processor stall). Base CPI without misses is 0.5. What is the new CPI?

CPI = 0.5 + 0.2% × (200 / 0.5) = 0.5 + 0.002 × 400 = 1.3
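The same arithmetic as a small function, so the parameters can be varied. The function and parameter names are hypothetical; the model assumes every miss stalls the processor for the full round-trip latency.

```python
def cpi_with_remote_stalls(base_cpi, miss_rate, remote_latency_ns, cycle_time_ns):
    """Effective CPI when a fraction miss_rate of instructions stalls for a remote access."""
    stall_cycles = remote_latency_ns / cycle_time_ns   # latency expressed in clock cycles
    return base_cpi + miss_rate * stall_cycles

base_cpi = 0.5
new_cpi = cpi_with_remote_stalls(base_cpi, miss_rate=0.002,
                                 remote_latency_ns=200.0, cycle_time_ns=0.5)
slowdown = new_cpi / base_cpi
print(f"new CPI = {new_cpi:.2f}, slowdown = {slowdown:.1f}x")
```

With the numbers from the example, a miss rate of only 0.2% makes the machine about 2.6 times slower than the all-local case.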

The Cache Coherence Problem

  • Caches are critical to modern high-speed processors
  • Multiple copies of a block can easily get inconsistent
    – processor writes, I/O writes, ...

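[Figure: two processors (P), each with its own cache, above a shared memory. Memory and both caches hold A = 5; a write then changes one copy to A = 7 while the other copies still hold A = 5.]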
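The inconsistency can be reproduced with a toy model of two write-back caches that have no coherence mechanism at all. The classes `Memory` and `IncoherentCache` are hypothetical illustrations, not a model of any real cache.

```python
class Memory:
    def __init__(self):
        self.data = {"A": 5}

class IncoherentCache:
    """Toy write-back cache with no coherence mechanism."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:                  # miss: fetch from memory
            self.lines[addr] = self.memory.data[addr]
        return self.lines[addr]                     # hit: return the cached copy

    def write(self, addr, value):
        self.lines[addr] = value                    # write-back: memory is not updated

memory = Memory()
cache1 = IncoherentCache(memory)
cache2 = IncoherentCache(memory)

cache1.read("A")          # both caches load A = 5
cache2.read("A")
cache1.write("A", 7)      # processor 1 writes A = 7 into its own cache only

values = (cache1.read("A"), cache2.read("A"), memory.data["A"])
print(values)             # (7, 5, 5)
```

After the write, three copies of A exist and only one is current. A coherence protocol has to make sure the second processor cannot keep reading the stale 5.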

Snoopy Cache Coherence Schemes

  • A distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.
  • Most commonly used method in commercial multiprocessors

[Figure: state-transition diagram of a snoopy protocol with three states (Invalid, Shared, Dirty). Transitions are labeled with processor events (P-read, P-write) and bus events (Bus-read, Bus write miss, Bus invalidate).]
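A three-state protocol of this kind can be written down as a transition table. The exact arcs of the slide's diagram are not recoverable from the text, so the table below is an assumption: it follows the usual textbook write-invalidate protocol and reuses the state and event names that appear in the figure. The helpers `step` and `access` are hypothetical.

```python
INVALID, SHARED, DIRTY = "Invalid", "Shared", "Dirty"
P_READ, P_WRITE = "P-read", "P-write"
BUS_READ, BUS_WRITE_MISS, BUS_INVALIDATE = "Bus-read", "Bus write miss", "Bus invalidate"

# (state, event) -> (next state, bus request issued by this cache, or None)
TRANSITIONS = {
    # Events from the local processor
    (INVALID, P_READ):  (SHARED, BUS_READ),
    (INVALID, P_WRITE): (DIRTY, BUS_WRITE_MISS),
    (SHARED, P_READ):   (SHARED, None),
    (SHARED, P_WRITE):  (DIRTY, BUS_INVALIDATE),
    (DIRTY, P_READ):    (DIRTY, None),
    (DIRTY, P_WRITE):   (DIRTY, None),
    # Events snooped on the bus (issued by another cache)
    (SHARED, BUS_READ):       (SHARED, None),
    (SHARED, BUS_WRITE_MISS): (INVALID, None),
    (SHARED, BUS_INVALIDATE): (INVALID, None),
    (DIRTY, BUS_READ):        (SHARED, None),    # owner writes the block back
    (DIRTY, BUS_WRITE_MISS):  (INVALID, None),   # owner writes the block back
}

def step(state, event):
    """Next state and bus request; pairs not in the table leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))

def access(states, who, op):
    """Apply processor event op at cache who; all other caches snoop the bus request."""
    states[who], request = step(states[who], op)
    if request is not None:
        for i in range(len(states)):
            if i != who:
                states[i], _ = step(states[i], request)
    return request

states = [INVALID, INVALID]      # one block, two caches
history = []
for who, op in [(0, P_READ), (1, P_READ), (0, P_WRITE),
                (1, P_READ), (1, P_WRITE), (0, P_WRITE)]:
    request = access(states, who, op)
    history.append((who, op, request, tuple(states)))

for row in history:
    print(row)
```

In this sketch at most one cache is ever in the Dirty state for the block, and a cache in Dirty handles its processor's reads and writes without any bus request.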

Write-Back/Ownership Schemes

  • When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.
  • Most bus-based multiprocessors nowadays use such schemes.
  • Many variants of ownership-based protocols exist:
    – Goodman’s write-once scheme
    – Berkeley ownership scheme
    – Firefly update protocol
    – …
  • We will discuss a few of these
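The bandwidth argument for ownership can be made concrete by counting bus transactions for one block under two simplified policies. This is a toy model under stated assumptions, not any of the named protocols: writes invalidate other copies in both cases, the write-through cache keeps a copy after writing, and the function `count_bus_transactions` and its trace format are hypothetical.

```python
def count_bus_transactions(trace, write_back):
    """Count bus transactions for a single block.

    trace is a list of (pe, op) pairs with op "read" or "write".
    write_back=False: write-through, every write goes on the bus.
    write_back=True:  ownership, writes to an owned block stay in the cache.
    """
    valid = set()     # PEs that hold a valid copy
    owner = None      # PE holding the block exclusively (ownership scheme only)
    bus = 0
    for pe, op in trace:
        if op == "read":
            if pe not in valid:
                bus += 1          # read miss goes on the bus
                valid.add(pe)
                owner = None      # block is shared again
        else:
            if write_back and owner == pe:
                continue          # owned block: no bus traffic
            bus += 1              # write-through, or ownership acquisition
            valid = {pe}          # other copies are invalidated
            owner = pe if write_back else None
    return bus

# PE 0 writes the block ten times, then PE 1 reads it.
trace = [(0, "write")] * 10 + [(1, "read")]
write_through_count = count_bus_transactions(trace, write_back=False)
ownership_count = count_bus_transactions(trace, write_back=True)
print(write_through_count, ownership_count)   # 11 2
```

Under ownership only the first write (to acquire the block) and the later read miss reach the bus; the nine writes in between are absorbed by the owning cache.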

Invalidation vs. Update Strategies

  1. Invalidation: On a write, all other caches with a copy are invalidated
  2. Update: On a write, all other caches with a copy are updated

  • Invalidation is bad when:
    – single producer and many consumers of data.
  • Update is bad when:
    – multiple writes by one PE before data is read by another PE.
    – Junk data accumulates in large caches (e.g. process migration).
  • Overall, invalidation schemes are more popular as the default
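The two bad cases can be checked with a rough traffic count for a single block. The model is deliberately crude and every name in it is hypothetical: each miss, invalidation, or update broadcast costs one bus transaction, caches never evict, and an update is only broadcast when some other cache holds a copy.

```python
def invalidate_traffic(trace):
    """Bus transactions under write-invalidate with ownership."""
    valid, owner, bus = set(), None, 0
    for pe, op in trace:
        if op == "read":
            if pe not in valid:
                bus += 1          # read miss
                valid.add(pe)
                owner = None
        elif owner != pe:
            bus += 1              # invalidate other copies, gain ownership
            valid = {pe}
            owner = pe
    return bus

def update_traffic(trace):
    """Bus transactions under write-update; copies are never dropped."""
    holders, bus = set(), 0
    for pe, op in trace:
        if pe not in holders:
            bus += 1              # first touch: miss brings the block in
            holders.add(pe)
        if op == "write" and len(holders) > 1:
            bus += 1              # broadcast the new value to the other copies
    return bus

def producer_consumers(rounds=10, consumers=4):
    """PE 0 writes once, then every consumer reads; repeated."""
    trace = []
    for _ in range(rounds):
        trace.append((0, "write"))
        trace.extend((c, "read") for c in range(1, consumers + 1))
    return trace

def write_burst(rounds=10, writes=10):
    """PE 0 writes many times before PE 1 reads once; repeated."""
    trace = []
    for _ in range(rounds):
        trace.extend([(0, "write")] * writes)
        trace.append((1, "read"))
    return trace

results = {
    "producer/consumers": (invalidate_traffic(producer_consumers()),
                           update_traffic(producer_consumers())),
    "write burst": (invalidate_traffic(write_burst()),
                    update_traffic(write_burst())),
}
for name, (inv, upd) in results.items():
    print(f"{name}: invalidate = {inv}, update = {upd}")
```

With one producer and four consumers, invalidation forces every consumer to miss again after each write, so it generates more traffic than update. With bursts of writes, update broadcasts every single write to a copy that is read only once per burst, so invalidation wins. Because `holders` never shrinks, the update model also shows the junk-data effect: a cache that touched the block once keeps receiving updates indefinitely.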