Lecture 19
Beyond Low-level Parallelism
Outline
- Models for exploiting large-grained
parallelism
- Issues in parallelization: communication,
synchronization, Amdahl’s Law
- Basic Single-Bus Shared Memory Machines
Flynn’s Classification
- Single Instruction, Single Data (SISD)
- simple CPU (1-wide pipeline)
- Single Instruction, Multiple Data (SIMD)
- vector computer, multimedia processors
- Multiple Instruction, Single Data (MISD)
- ???, systolic arrays perhaps?
- Multiple Instruction, Multiple Data (MIMD)
- multiprocessor (now also, superscalar CPU)
- Also: Single Program Multiple Data (SPMD)
A Simple Problem
min = D[0];
for (i = 1; i < n; i++) {   // D[0] is already the initial minimum
    if (D[i] < min)
        min = D[i];
}
cout << min;
Program to find the minimum of a large set of input data.
- Objective is to write parallel programs that can use as many
processors as are available.
Issue 1: Communication
- Parallelization is limited by communication.
- In terms of latency
- In terms of amount of communication
Issue 2: Synchronization
- Often, parallel tasks need to coordinate
and wait for each other.
- Example: barrier in parallel min
algorithm
- The art of parallel programming involves
minimizing synchronization
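The parallel min algorithm mentioned above can be sketched as follows. This is one possible decomposition, not necessarily the one used in the lecture: each thread scans its own slice of the data, the `join()` calls act as the synchronization point, and a short serial step combines the per-thread results. The function name `parallel_min` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: minimum of `data` computed by `num_threads` workers.
int parallel_min(const std::vector<int>& data, std::size_t num_threads) {
    assert(!data.empty() && num_threads > 0);
    // Never use more workers than elements, so every slice is non-empty.
    num_threads = std::min(num_threads, data.size());

    // One result slot per worker, so no two workers write the same variable.
    std::vector<int> local_min(num_threads);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * data.size() / num_threads;
        const std::size_t end = (t + 1) * data.size() / num_threads;
        workers.emplace_back([&data, &local_min, t, begin, end] {
            local_min[t] = *std::min_element(data.begin() + begin,
                                             data.begin() + end);
        });
    }

    // Synchronization: wait until every worker has finished its slice.
    for (std::thread& w : workers) w.join();

    // Communication: combine the per-thread minima in a serial step.
    return *std::min_element(local_min.begin(), local_min.end());
}
```

The only communication is one value per thread, and the only synchronization is the single wait before the final reduction, which is the kind of structure the bullets above argue for.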
Example Speedup Graph
[Figure: speedup over a single processor (y-axis) against number of processors (x-axis), with one curve for ideal speedup and one for observed speedup.]
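One reason an observed speedup curve can fall below the ideal one is Amdahl's Law, listed in the outline. A minimal sketch, assuming a fraction of the single-processor run time parallelizes perfectly and the rest stays serial; the function name and parameters are illustrative, not from the slides.

```cpp
#include <cassert>

// Amdahl's Law: speedup on n processors when `parallel_fraction` of the
// single-processor run time can be spread perfectly across processors.
double amdahl_speedup(double parallel_fraction, int n) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0 && n >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / n);
}
```

For example, with 90% of the work parallelizable, the speedup stays below 10 no matter how many processors are added, because the serial 10% is never shortened.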
Two parallel programming models
- Single, shared address space
- implicit communication through load/store (LD/ST) instructions
- complicates hardware
- Multiple address spaces
- explicit communication through send/receive
- simplifies hardware
Message Passing Machines
- Network characteristics are of primary
importance
- Latency, bisection bandwidth, node
bandwidth, occupancy of communication
Latency = sender overhead + transport latency + receiver overhead
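The latency formula above can be turned into a small calculation. This sketch assumes that transport latency splits into a fixed time of flight plus the time to push the message through the link (message size divided by bandwidth); that breakdown, the struct, and all field names are assumptions for illustration.

```cpp
// All times are in microseconds; any values plugged in are hypothetical.
struct MessageCost {
    double sender_overhead_us;
    double time_of_flight_us;
    double bandwidth_bytes_per_us;
    double receiver_overhead_us;
};

double message_latency_us(const MessageCost& c, double message_bytes) {
    // Assumption: transport latency = time of flight + transmission time.
    const double transport_us =
        c.time_of_flight_us + message_bytes / c.bandwidth_bytes_per_us;
    return c.sender_overhead_us + transport_us + c.receiver_overhead_us;
}
```

For short messages the two overhead terms dominate, which is why sender and receiver overhead appear as separate terms rather than being folded into the network time.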
Synchronization support for
shared memory machines
- Example: two processors trying to increment a shared
variable
- Requires an atomic memory update instruction
- example: test-and-set or compare-and-exchange
- Using cmpxchg, implement barrier();
- Scalability issues with traditional synchronization
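The shared-variable increment and the barrier mentioned above can be sketched with C++ `std::atomic`, whose `compare_exchange_weak` typically compiles to a cmpxchg-style instruction on x86. This is one possible answer, a sense-reversing spin barrier, and not necessarily the construction intended in the lecture; `atomic_increment` and `SpinBarrier` are hypothetical names.

```cpp
#include <atomic>
#include <thread>

// Atomically add 1 to `counter` using a compare-and-exchange retry loop.
// Returns the value observed just before the increment.
int atomic_increment(std::atomic<int>& counter) {
    int seen = counter.load();
    // On failure, compare_exchange_weak refreshes `seen` with the current value.
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
    }
    return seen;
}

// Hypothetical sense-reversing spin barrier for a fixed number of threads.
class SpinBarrier {
public:
    explicit SpinBarrier(int num_threads) : num_threads_(num_threads) {}

    void wait() {
        const bool my_sense = !sense_.load();
        if (atomic_increment(arrived_) == num_threads_ - 1) {
            arrived_.store(0);       // last arrival resets the count...
            sense_.store(my_sense);  // ...then releases the waiting threads
        } else {
            while (sense_.load() != my_sense) {
                std::this_thread::yield();  // spin until released
            }
        }
    }

private:
    const int num_threads_;
    std::atomic<int> arrived_{0};
    std::atomic<bool> sense_{false};
};
```

Every waiting thread polls the same `sense_` variable and every arrival updates the same counter, so contention on those locations grows with the thread count. That is one form of the scalability issue noted in the last bullet.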
Memory Coherence, continued
[Figure: CPU 0 and CPU 1 each hold location X with value A in their caches; memory also holds X with value A.]
Essential Question: What happens when CPU 0 writes value B to memory location X?
One solution: snooping caches
[Figure: two CPUs, each with a cache and its own set of snoop tags, connected to memory over a shared bus.]
- Snoop logic watches bus transactions and either invalidates or updates matching cache lines.
- Requires an extra set of cache tags.
- Generally, write-back caches are used. Why?
- What about the L1 cache?
The Illinois Protocol
[Figure: cache tag layout with state bits M, E, S, I followed by the tag.]
The tag for each cache block now contains 4 bits that specify whether the block is modified, exclusively owned, shared, or invalid.
Illinois Protocol
from the CPU’s perspective
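[State diagram, transitions as seen by the local CPU:]
- Invalid or no tag match -> Exclusive: Read, supplied by memory
- Invalid or no tag match -> Shared: Read, supplied by another CPU
- Invalid or no tag match -> Modified: Write, but invalidate others
- Shared -> Shared: Read
- Shared -> Modified: Write, but invalidate others
- Exclusive -> Exclusive: Read
- Exclusive -> Modified: Write
- Modified -> Modified: Read or Write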
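The CPU-side transitions of the protocol can be sketched as a small next-state function. This models only requests from the local CPU, not the transitions triggered by snooping other CPUs' bus traffic; the enum and function names are hypothetical, and the `other_cache_has_copy` flag stands in for whatever signal the bus provides on a read miss.

```cpp
enum class State { Invalid, Shared, Exclusive, Modified };
enum class CpuOp { Read, Write };

// Next state of a cache block after a request from the local CPU.
// `other_cache_has_copy` only matters for a read that misses (Invalid).
State next_state(State s, CpuOp op, bool other_cache_has_copy) {
    if (op == CpuOp::Write) {
        // From Invalid or Shared, the write must also invalidate other copies.
        // From Exclusive or Modified, no other cache holds the block.
        return State::Modified;
    }
    switch (s) {
        case State::Invalid:
            // Read miss: supplied by another CPU -> Shared,
            // supplied by memory -> Exclusive.
            return other_cache_has_copy ? State::Shared : State::Exclusive;
        default:
            return s;  // read hit: the state does not change
    }
}
```

The Exclusive state is what lets a block that was read and is later written skip the invalidation step: the cache already knows no other copy exists.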