ECE 4100/
Advanced Computer Architecture
Lecture 13 Multiprocessor and Memory Coherence
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Memory Hierarchy in a Multiprocessor
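Figure: four organizations of processors (P), caches ($) and memory: shared cache (the processors share one cache in front of memory); bus-based shared memory (private caches on a shared bus to memory); fully-connected shared memory, or "Dancehall" (private caches reach memory through an interconnection network); distributed shared memory (each node has its own P, $ and memory, and the nodes are linked by an interconnection network).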
Cache Coherency
• The cache level closest to the processor is private
• Multiple copies of a cache line can be present
across different processor nodes
• Local updates
– Lead to an incoherent state
– The problem arises in both write-through and
writeback caches
• Bus-based interconnect: transactions are globally visible
• Point-to-point interconnect: transactions are visible only to
the communicating processor nodes
Example (Writeback Cache)
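Figure: three processors with private writeback caches. One cache holding X = -100 updates its copy to X = 505 locally; memory is not updated, so when the other processors read X (Rd?), they obtain the old value -100.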
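A minimal sketch of the problem, assuming a toy model with a hypothetical `WritebackCache` class: each cache keeps private copies, stores never leave the cache, and no coherence mechanism exists. A store in one cache is invisible to the others, which keep reading the stale memory value.

```python
class WritebackCache:
    """Private writeback cache with no coherence mechanism (hypothetical toy)."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.lines = {}           # this cache's private copies

    def load(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        # Writeback: only the private copy changes; memory would be updated
        # on eviction, which this sketch does not model.
        self.lines[addr] = value


memory = {"X": -100}
p1, p2, p3 = (WritebackCache(memory) for _ in range(3))
p1.load("X")
p1.store("X", 505)    # P1 now holds 505; memory still holds -100
print(p2.load("X"))   # -100: a stale value
```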
Sounds Easy?
Figure: processors P0 to P3 with A = 0 initially. At T1, A = 1 and B = 2 are written by different processors; as the updates propagate over T2 and T3, one processor sees A's update before B's while another sees B's update before A's.
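A toy model of why "easy" is not enough, under stated assumptions: P0 writes A and P3 writes B at the same moment, and each update reaches each observer after a fixed, hypothetical delay, with no shared bus to serialize the writes. The names `writes`, `delay` and `observed_order` are illustrative only.

```python
# (variable, value, writer); both writes are assumed to be issued at time 0.
writes = [("A", 1, "P0"), ("B", 2, "P3")]

# Hypothetical delivery delays (time steps) from each writer to each observer.
delay = {
    ("P0", "P1"): 1, ("P3", "P1"): 2,   # P1 is "closer" to P0
    ("P0", "P2"): 2, ("P3", "P2"): 1,   # P2 is "closer" to P3
}


def observed_order(observer):
    """Return the variables in the order their updates reach `observer`."""
    arrivals = sorted((delay[(writer, observer)], var) for var, _, writer in writes)
    return [var for _, var in arrivals]


print(observed_order("P1"))  # ['A', 'B']: sees A's update before B's
print(observed_order("P2"))  # ['B', 'A']: sees B's update before A's
```

A shared bus avoids this by forcing every write through a single point, so all snoopers observe the writes in the same order.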
Bus Snooping based on Write-Through Cache
• Every write appears as a transaction
on the shared bus to memory
• Two protocols
– Update-based Protocol
– Invalidation-based Protocol
Bus Snooping
(Update-based Protocol on Write-Through cache)
- Each processor’s cache controller constantly snoops on the bus
- Update local copies upon snoop hit
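Figure: a processor stores X = 505; the write goes on the bus as a transaction to memory (X = 505), and the other cache controllers snoop the transaction and update their copies to X = 505.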
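A minimal sketch of update-based snooping on write-through caches, assuming hypothetical `WriteThroughBus` and `UpdateCache` classes and a single bus that delivers every write to memory and to all other cache controllers.

```python
class WriteThroughBus:
    """Shared bus: every store is written through to memory and snooped."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, sender, addr, value):
        self.memory[addr] = value                 # bus transaction to memory
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(addr, value)          # every controller snoops


class UpdateCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:                # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop(self, addr, value):
        if addr in self.lines:                    # snoop hit: update local copy
            self.lines[addr] = value


memory = {"X": -100}
bus = WriteThroughBus(memory)
p1, p2, p3 = UpdateCache(bus), UpdateCache(bus), UpdateCache(bus)
p1.load("X")
p2.load("X")
p3.store("X", 505)   # memory and both sharers now hold 505
```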
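Bus Snooping
(Invalidation-based Protocol on Write-Through cache)
- Each processor’s cache controller constantly snoops on the bus
- Invalidate local copies upon snoop hit
Figure: a processor stores X = 505, which is written through to memory; the other cache controllers snoop the bus transaction and invalidate their copies. A later Load X by another processor misses and gets X = 505 from memory.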
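The invalidation variant differs only in the snoop action. A minimal sketch, again with hypothetical `WriteThroughBus` and `InvalidateCache` classes: a snoop hit drops the local copy, so the next load misses and refetches the up-to-date value from memory.

```python
class WriteThroughBus:
    """Shared bus: every store is written through to memory and snooped."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, sender, addr, value):
        self.memory[addr] = value                 # bus transaction to memory
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(addr)


class InvalidateCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:                # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop(self, addr):
        self.lines.pop(addr, None)                # snoop hit: invalidate copy


memory = {"X": -100}
bus = WriteThroughBus(memory)
p1, p2, p3 = InvalidateCache(bus), InvalidateCache(bus), InvalidateCache(bus)
p1.load("X")
p2.load("X")
p3.store("X", 505)      # P1 and P2 lose their copies
reloaded = p1.load("X")  # miss: 505 comes from memory
```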
Cache Coherence Protocols for WB caches
- A cache has an exclusive copy of a line if
- It is the only cache having a valid copy
- Memory may or may not have an up-to-date copy
- Modified (dirty) cache line
- The cache having the line is the owner of the line, because it must supply the block when another cache requests it
Cache Coherence Protocol
(Update-based Protocol on Writeback cache)
- Update data for all processor nodes that share the same data
- If a processor node keeps updating the same memory location, a lot of bus traffic
is incurred
Figure: the caches hold X = -100; a processor stores X = 505 and issues update transactions on the bus, so the sharers' copies become X = 505.
Figure (continued): all three caches hold X = 505. A Load X hits in the local cache; a Store X = 333 again issues updates on the bus, and the other copies become X = 333.
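A minimal sketch of the traffic cost, assuming hypothetical `UpdateBus` and `UpdateWBCache` classes. As a simplification, the writer checks the other caches directly to learn whether the line is shared (hardware would use a shared signal). Every store to a shared line costs one bus update, so repeated writes to the same location keep generating traffic.

```python
class UpdateBus:
    def __init__(self):
        self.caches = []
        self.transactions = 0                   # bus transactions issued

    def update(self, sender, addr, value):
        self.transactions += 1
        for cache in self.caches:
            if cache is not sender and addr in cache.lines:
                cache.lines[addr] = value       # sharers take the new value


class UpdateWBCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                         # addr -> value
        bus.caches.append(self)

    def store(self, addr, value):
        self.lines[addr] = value
        # Simplification: look directly for other sharers of the line.
        shared = any(addr in c.lines for c in self.bus.caches if c is not self)
        if shared:
            self.bus.update(self, addr, value)


bus = UpdateBus()
p1, p2, p3 = UpdateWBCache(bus), UpdateWBCache(bus), UpdateWBCache(bus)
for cache in (p1, p2, p3):
    cache.lines["X"] = 505                      # all three share X = 505
p3.store("X", 333)                              # one update reaches P1 and P2
for value in range(10):                         # P3 keeps writing X
    p3.store("X", value)
print(bus.transactions)                         # 11
```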
Cache Coherence Protocol
(Invalidation-based Protocol on Writeback cache)
- Invalidate the data copies of the sharing processor nodes
- Traffic is reduced when a processor node keeps updating the same
memory location
Figure: the caches hold X = -100; a processor stores X = 505, and invalidate transactions on the bus remove the other cached copies, leaving X = 505 only in the writer's cache.
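For comparison, a minimal sketch of invalidation on writeback caches, assuming hypothetical `InvalidationBus` and `InvalidationWBCache` classes. Only the first store to a line the cache does not own goes on the bus; later stores hit the owned copy locally. Data transfer on a write miss and flushing a dirty copy held elsewhere are not modeled.

```python
class InvalidationBus:
    def __init__(self):
        self.caches = []
        self.transactions = 0

    def invalidate(self, sender, addr):
        self.transactions += 1
        for cache in self.caches:
            if cache is not sender:
                cache.lines.pop(addr, None)     # drop the sharer's copy
                cache.owned.discard(addr)


class InvalidationWBCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                         # addr -> value
        self.owned = set()                      # lines held exclusively (dirty)
        bus.caches.append(self)

    def store(self, addr, value):
        if addr not in self.owned:              # first write: invalidate others
            self.bus.invalidate(self, addr)
            self.owned.add(addr)
        self.lines[addr] = value                # later writes stay in this cache


bus = InvalidationBus()
p1, p2, p3 = (InvalidationWBCache(bus) for _ in range(3))
for cache in (p1, p2, p3):
    cache.lines["X"] = -100
p3.store("X", 505)                              # invalidates P1 and P2
for value in range(10):                         # repeated writes: no bus traffic
    p3.store("X", value)
print(bus.transactions)                         # 1
```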
MSI Writeback Invalidation Protocol
• Modified
– Dirty
– Only this cache has a valid copy
• Shared
– Memory is consistent
– One or more caches have a valid copy
• Invalid
• Writeback protocol: A cache line can be
written multiple times before the memory is
updated.
MSI Writeback Invalidation Protocol
• Two types of request from the processor
– PrRd
– PrWr
• Three types of bus transaction posted by the cache
controller
– BusRd
- PrRd misses the cache
- Memory or another cache supplies the line
– BusRdX: BusRd eXclusive (read-to-own)
- PrWr is issued to a line that is not in the Modified state
– BusWB
- Writeback due to replacement
- The processor is not directly involved in initiating this operation
MSI Writeback Invalidation Protocol
(Processor Request)
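Processor-initiated transitions (request / bus transaction issued):
– Invalid → Shared: PrRd / BusRd
– Invalid → Modified: PrWr / BusRdX
– Shared → Shared: PrRd / ---
– Shared → Modified: PrWr / BusRdX
– Modified → Modified: PrRd / --- and PrWr / ---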
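The processor-side transitions above fit in a lookup table. A minimal sketch, assuming a hypothetical `MSI_PROCESSOR` dictionary keyed by (state, request); `None` stands for "---" (no bus transaction).

```python
# MSI processor-side transitions: (state, request) -> (next state, bus transaction).
# None means the request is satisfied locally with no bus transaction.
MSI_PROCESSOR = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}


def processor_request(state, request):
    """Look up the next state and the bus transaction a request needs."""
    return MSI_PROCESSOR[(state, request)]


# Walk one line through a read miss, a write upgrade and a write hit.
state = "I"
for request in ("PrRd", "PrWr", "PrWr"):
    state, bus = processor_request(state, request)
    print(request, "->", state, bus or "---")
```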
MSI Writeback Invalidation Protocol
(Bus Transaction)
- Flush: put the data on the bus
- Both memory and the requestor grab the copy
- The requestor gets the data by
- Cache-to-cache transfer; or
- Memory
Bus-snooper-initiated transitions (snooped transaction / action):
– Modified → Shared: BusRd / Flush
– Modified → Invalid: BusRdX / Flush
– Shared → Shared: BusRd / ---
– Shared → Invalid: BusRdX / ---
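The snooper side can be written the same way. A minimal sketch with a hypothetical `MSI_SNOOP` table; an Invalid line simply ignores the transaction, and `None` again stands for "---".

```python
# MSI snoop-side transitions: (state, snooped transaction) -> (next state, action).
MSI_SNOOP = {
    ("M", "BusRd"): ("S", "Flush"),    # supply data; memory grabs the copy too
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("I", "BusRd"): ("I", None),       # no copy: ignore the transaction
    ("I", "BusRdX"): ("I", None),
}


def snoop(state, transaction):
    """Look up how a cache reacts to a transaction issued by another cache."""
    return MSI_SNOOP[(state, transaction)]


# A remote write (BusRdX) invalidates a Modified copy after flushing it.
print(snoop("M", "BusRdX"))
```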
MSI Example
Figure: processors P1, P2 and P3, each with a private cache, on a shared bus to memory; memory initially holds X = 10.

Processor Action  State in P1  State in P2  State in P3  Bus Transaction  Data Supplier
P1 reads X        S            ---          ---          BusRd            Memory
P3 reads X        S            ---          S            BusRd            Memory
P3 writes X       I            ---          M            BusRdX
P1 reads X        S            ---          S            BusRd            P3 Cache

After P3 writes X = -25, only P3's cache holds the new value. When P1 reads X, P3 flushes the line, so both P1's cache and memory receive X = -25.
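The example trace can be replayed with a small simulator. This is a minimal sketch with a hypothetical `MSISystem` class that tracks one line X across n caches; P2, which never caches X, shows up as I rather than "---". For the S → M upgrade the sketch reports no data supplier, since the writer already holds the data.

```python
class MSISystem:
    """Toy MSI simulator for a single line X shared by n snooping caches."""

    def __init__(self, n, memory_value):
        self.memory = memory_value
        self.state = ["I"] * n       # per-cache state: 'M', 'S' or 'I'
        self.value = [None] * n      # per-cache copy of X

    def _snoop(self, requester, txn):
        """Let every other cache react to txn; return the data supplier."""
        supplier = "Memory"
        for i, st in enumerate(self.state):
            if i == requester or st == "I":
                continue
            if st == "M":                         # Flush: memory grabs the copy
                self.memory = self.value[i]
                supplier = f"P{i + 1} Cache"
            if txn == "BusRd":
                self.state[i] = "S"               # M -> S, S -> S
            else:
                self.state[i] = "I"               # BusRdX: M -> I, S -> I
                self.value[i] = None
        return supplier

    def read(self, p):
        if self.state[p] != "I":
            return "---", "---"                   # read hit: no bus transaction
        supplier = self._snoop(p, "BusRd")
        self.state[p], self.value[p] = "S", self.memory
        return "BusRd", supplier

    def write(self, p, v):
        txn, supplier = "---", "---"
        if self.state[p] != "M":
            supplier = self._snoop(p, "BusRdX")
            txn = "BusRdX"
            if self.state[p] == "S":
                supplier = "---"                  # writer already holds the data
        self.state[p], self.value[p] = "M", v
        return txn, supplier


msi = MSISystem(n=3, memory_value=10)
steps = [
    ("P1 reads X", lambda: msi.read(0)),
    ("P3 reads X", lambda: msi.read(2)),
    ("P3 writes X", lambda: msi.write(2, -25)),
    ("P1 reads X", lambda: msi.read(0)),
]
results = []
for action, run in steps:
    txn, supplier = run()
    results.append((action, list(msi.state), txn, supplier))
    print(f"{action:<12} {' '.join(msi.state)}  {txn:<7} {supplier}")
```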
MESI Writeback Invalidation Protocol Processor Request (Illinois Protocol)
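Processor-initiated transitions (request / bus transaction issued):
– Invalid → Exclusive: PrRd / BusRd (not-S)
– Invalid → Shared: PrRd / BusRd (S)
– Invalid → Modified: PrWr / BusRdX
– Exclusive → Exclusive: PrRd / ---
– Exclusive → Modified: PrWr / ---
– Shared → Shared: PrRd / ---
– Shared → Modified: PrWr / BusRdX
– Modified → Modified: PrRd, PrWr / ---
S: Shared signal (asserted during BusRd when another cache holds the line)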
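A minimal sketch of the MESI processor side, assuming a hypothetical `mesi_processor` function whose `shared_signal` argument models the S line sampled during a BusRd. The usage shows the benefit of the Exclusive state: reading and then writing private data costs a single bus transaction, because the E → M upgrade is silent.

```python
def mesi_processor(state, request, shared_signal=False):
    """Return (next_state, bus_transaction) for a processor request.

    shared_signal models the S line during a BusRd: True if some other
    cache asserts that it also holds the line.
    """
    if state == "I":
        if request == "PrRd":
            return ("S" if shared_signal else "E"), "BusRd"
        return "M", "BusRdX"
    if state == "S":
        return ("S", None) if request == "PrRd" else ("M", "BusRdX")
    if state == "E":
        # Exclusive-clean: a write upgrades silently, with no bus transaction.
        return ("E", None) if request == "PrRd" else ("M", None)
    if state == "M":
        return "M", None
    raise ValueError(f"unknown state {state!r}")


# Private data: read, then write, with no other sharer.
state, txns = "I", []
for request in ("PrRd", "PrWr"):
    state, bus = mesi_processor(state, request, shared_signal=False)
    if bus:
        txns.append(bus)
print(state, txns)   # M ['BusRd']
```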
MESI Writeback Invalidation Protocol Bus Transactions (Illinois Protocol)
Bus-snooper-initiated transitions (snooped transaction / action):
– Modified → Shared: BusRd / Flush
– Modified → Invalid: BusRdX / Flush
– Exclusive → Shared: BusRd / Flush (or ---)
– Exclusive → Invalid: BusRdX / ---
– Shared → Shared: BusRd / Flush*
– Shared → Invalid: BusRdX / Flush*
Flush*: flush for the data supplier; no action for the other sharers
- Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data
- Use a selection algorithm if there are multiple suppliers (alternative: add an O state or force a memory update)
- Most MESI implementations simply write to memory
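A minimal sketch of the snooper side with supplier selection, assuming a hypothetical `snoop_all` function over a list of per-cache states and a hypothetical selection algorithm (the lowest-numbered sharer supplies). For an Exclusive line on BusRd it chooses Flush ($-to-$ transfer); the diagram also permits "---" there. The requester's own transition is handled on the processor side and is left untouched here.

```python
def snoop_all(states, requester, txn):
    """Apply a snooped BusRd or BusRdX to every other cache's copy of one line.

    states holds 'M', 'E', 'S' or 'I' per cache and is updated in place.
    Returns the index of the cache that flushes the data, or None when
    memory supplies it.
    """
    # Hypothetical selection algorithm: the lowest-numbered sharer supplies.
    sharers = [i for i, s in enumerate(states) if i != requester and s == "S"]
    selected = sharers[0] if sharers else None
    supplier = None
    for i, s in enumerate(states):
        if i == requester or s == "I":
            continue
        if s == "M":
            supplier = i                  # Flush: dirty data must come from here
        elif s == "E" and txn == "BusRd":
            supplier = i                  # Flush ($-to-$); "---" is also allowed
        elif s == "S" and i == selected:
            supplier = i                  # Flush*: only the selected sharer
        states[i] = "S" if txn == "BusRd" else "I"
    return supplier


line = ["S", "S", "I"]
print(snoop_all(line, requester=2, txn="BusRd"), line)
```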
MOESI Protocol
• Adds one additional state: the Owner state
• Similar to the Shared state
• The processor holding the line in the O state is responsible for
supplying data (the copy in memory may be stale)
• Employed by
– Sun UltraSparc
– AMD Opteron
- In the dual-core Opteron, cache-to-cache
transfer is done through a system
request interface (SRI) running at full
CPU speed
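Figure: dual-core Opteron: two CPUs, each with a private cache, connect to the System Request Interface, which connects through a crossbar to HyperTransport and the memory controller.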
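A minimal sketch of what the Owner state buys, assuming a single line X, caches modeled as hypothetical dictionaries, and a hypothetical `bus_read` function that handles only read misses. A Modified holder that snoops a BusRd moves to O and keeps supplying the data, so memory is not updated (unlike MSI, where M → S flushes to memory).

```python
def bus_read(caches, memory, requester, addr="X"):
    """Service a read miss for `requester`; return (value, supplier)."""
    owner = next((i for i, c in enumerate(caches)
                  if i != requester and c["state"] in ("M", "O")), None)
    for i, c in enumerate(caches):
        if i != requester and c["state"] == "E":
            c["state"] = "S"                  # clean exclusive copy is now shared
    if owner is not None:
        caches[owner]["state"] = "O"          # M -> O; memory is NOT updated
        value, supplier = caches[owner]["value"], f"cache {owner}"
    else:
        value, supplier = memory[addr], "memory"
    caches[requester].update(state="S", value=value)
    return value, supplier


memory = {"X": 10}                            # stale once a cache owns X
caches = [{"state": "M", "value": -25},
          {"state": "I", "value": None},
          {"state": "I", "value": None}]
first = bus_read(caches, memory, requester=1)
second = bus_read(caches, memory, requester=2)
print(first, second, memory["X"])             # the owner supplied both reads
```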
Implication on Multi-Level Caches
• How do you guarantee coherence in a multi-level
cache hierarchy?
– Snoop all cache levels?
– Intel’s 8870 chipset has a “snoop filter” for quad-core
• Maintain the inclusion property
– Ensure that data in the inner level is also present in the
outer level
– Only snoop the outermost level (e.g., L2)
– L2 needs to know when L1 has write hits
- Use a write-through cache
- Use Write-back but maintain another “modified-but-stale” bit in
L
Inclusion Property
• Not so easy …
– Replacement: each level observes different access
activity, e.g., L2 may replace a line
frequently accessed in L
– Split L1 caches: imagine all caches are direct-mapped.
– Different cache line sizes
Inclusion Property
• Use specific cache configurations
– E.g., DM L1 + bigger DM or set-associative L2 with the
same cache line size
• Explicitly propagate L2 action to L
– L2 replacement will flush the corresponding L1 line
– An observed BusRdX bus transaction invalidates the
corresponding L1 line
– To avoid excess traffic, L2 maintains an inclusion bit for
filtering (indicating whether the line is also in L1)
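A minimal sketch of enforcing inclusion, assuming a hypothetical `InclusiveHierarchy` class in which both levels are fully associative LRU caches with the same line size (one address per line). L1 hits are never seen by L2, which reproduces the replacement problem: L2 can evict a line that is hot in L1, so it must back-invalidate that L1 line. The per-line inclusion bit lets L2 skip L1 probes for lines L1 does not hold.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()             # addr -> None, oldest first (LRU)
        self.l2 = OrderedDict()             # addr -> inclusion bit (in L1?)
        self.l1_size, self.l2_size = l1_size, l2_size
        self.back_invalidations = 0         # L1 probes actually performed

    def _back_invalidate(self, addr):
        if self.l2.get(addr):               # inclusion bit filters L1 probes
            self.l1.pop(addr, None)
            self.l2[addr] = False
            self.back_invalidations += 1

    def access(self, addr):
        if addr in self.l1:                 # L1 hit: L2 never sees it
            self.l1.move_to_end(addr)
            return "L1 hit"
        if addr in self.l2:                 # L1 miss, L2 hit
            self.l2.move_to_end(addr)
            result = "L2 hit"
        else:                               # miss in both levels
            if len(self.l2) >= self.l2_size:
                victim = next(iter(self.l2))    # LRU line in L2
                self._back_invalidate(victim)   # keep L1 a subset of L2
                del self.l2[victim]
            result = "miss"
        if len(self.l1) >= self.l1_size:
            old, _ = self.l1.popitem(last=False)
            self.l2[old] = False            # clear the inclusion bit
        self.l1[addr] = None
        self.l2[addr] = True                # set the inclusion bit
        return result

    def snoop_busrdx(self, addr):
        """Observed BusRdX: invalidate in L2 and, if needed, in L1."""
        if addr in self.l2:
            self._back_invalidate(addr)
            del self.l2[addr]

    def inclusive(self):
        return all(addr in self.l2 for addr in self.l1)


h = InclusiveHierarchy(l1_size=2, l2_size=3)
for addr in "ABACAD":                       # A is hot in L1, yet L2 evicts it
    h.access(addr)
print(list(h.l1), list(h.l2), h.back_invalidations, h.inclusive())
```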
Directory-based Coherence Protocol
- Snooping-based protocol
- N transactions for an N-node MP
- All caches need to watch every memory request from each processor
- Not a scalable solution for maintaining coherence in large shared
memory systems
- Directory protocol
- Directory-based control of who has what;
- HW overheads to keep the directory (~ # lines * # processors)
Figure: processor nodes (P with $) connected through an interconnection network to memory and a directory; each directory entry holds a modified bit and presence bits, one for each node.
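A minimal sketch of a full bit-vector directory entry, assuming a hypothetical `DirectoryEntry` class that only records bookkeeping (no messages, no races) and is called on read misses and on write requests. Instead of broadcasting, the directory contacts just the nodes whose presence bits are set. The hypothetical `directory_bits` helper gives the storage cost, one modified bit plus one presence bit per node for every line, matching the "~ # lines * # processors" overhead.

```python
class DirectoryEntry:
    """Full bit-vector directory entry for one memory line (sketch)."""

    def __init__(self, num_nodes):
        self.modified = False
        self.presence = [False] * num_nodes    # one presence bit per node

    def read_miss(self, node):
        """Record a read miss; return the nodes that must be contacted."""
        contacts = []
        if self.modified:
            owner = self.presence.index(True)   # the only node with the line
            if owner != node:
                contacts.append(owner)          # owner supplies the dirty data
            self.modified = False               # line becomes shared
        self.presence[node] = True
        return contacts

    def write_request(self, node):
        """Record a write; return the sharers that must be invalidated."""
        others = [i for i, bit in enumerate(self.presence) if bit and i != node]
        self.presence = [False] * len(self.presence)
        self.presence[node] = True
        self.modified = True
        return others                           # point-to-point, no broadcast


def directory_bits(num_lines, num_nodes):
    """Storage for a full bit-vector directory: (presence bits + 1) per line."""
    return num_lines * (num_nodes + 1)


entry = DirectoryEntry(num_nodes=4)
entry.read_miss(0)
entry.read_miss(2)
print(entry.write_request(3))   # [0, 2]: only these sharers get invalidations
print(entry.read_miss(1))       # [3]: the owner supplies the data
print(directory_bits(num_lines=2**20, num_nodes=64) / 8 / 2**20, "MiB")
```

With a hypothetical 2^20 lines and 64 nodes, the directory needs 65 × 2^20 bits, about 8.1 MiB, and the cost grows linearly with the node count.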