Understanding Snooping Protocols & Directory-Based Approaches for Cache Coherency - Prof. , Study notes of Computer Architecture and Organization

These notes cover the cache coherency problem in multiprocessor systems, focusing on snooping protocols and directory-based approaches. They cover the fundamentals of cache coherency, performance issues, and the industry-standard MESI protocol. The text also discusses scaling multiprocessors and additional concepts related to cache coherency.

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009

© Sudhakar Yalamanchili, Georgia Institute of Technology
Memory Coherence and Consistency

ECE 4100/6100

Reading for this Module

• Memory Coherence
  • Snooping bus protocols
    • Section 6.3 and 6.4
  • Directory protocols
    • Section 6.5 and Section 6.6
• Memory Consistency
  • Section 6.8
  • Reference: Adve, S. V. and K. Gharachorloo, “Shared memory consistency models: A tutorial,” IEEE Computer, December 1996, pp. 66-76

Shared Memory Multiprocessors

  • Additional processors are used to improve performance
  • We adopt a simplified model wherein each processor executes a distinct thread
  • Threads share data

[Figure: processors (P), each with a cache, connected through a network to several memory modules; cache lines A, B, and C are shown. Threads 1, 2, and 3 each have their own registers and stack and share code and data in a shared address space.]

Handling Shared Data

  • Intuitively we must ensure that the most “recent” value of a shared variable is read → coherency!
  • Multiple cache copies exist during read-only sharing

[Figure: processors with caches connected through a network to memory; memory and several caches each hold a copy of A (the values are truncated in the source). Annotations: “What happens to this value?” and “When is this value updated?”]
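To make the problem concrete, here is a minimal Python sketch of two write-back caches with no coherence mechanism at all. The names `Memory`, `PrivateCache`, and the address `"A"` are illustrative assumptions, not part of the course material; the point is only that P2 keeps reading its old copy after P1 writes.

```python
class Memory:
    """Main memory: a simple address -> value map (hypothetical model)."""

    def __init__(self):
        self.data = {}

    def read(self, addr):
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value


class PrivateCache:
    """A write-back cache with NO coherence protocol (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # addr -> cached value
        self.dirty = set()  # addrs modified locally but not yet written back

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch the block from memory
            self.lines[addr] = self.memory.read(addr)
        return self.lines[addr]         # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value        # only the local copy changes
        self.dirty.add(addr)

    def write_back(self, addr):
        if addr in self.dirty:
            self.memory.write(addr, self.lines[addr])
            self.dirty.discard(addr)


memory = Memory()
memory.write("A", 1)
p1, p2 = PrivateCache(memory), PrivateCache(memory)

before = (p1.read("A"), p2.read("A"))  # both caches now hold A = 1
p1.write("A", 2)                       # P1 updates only its own copy
stale = p2.read("A")                   # P2 hits on its old copy
in_memory = memory.read("A")           # memory has not been updated either
p1.write_back("A")
after_write_back = (memory.read("A"), p2.read("A"))
```

Even after the write-back, P2 still holds the old value, which is why a coherence protocol has to invalidate or update the other copies rather than rely on memory being current.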

Performance Issues

  • Use of memory bandwidth
    • Different protocols make different demands on bandwidth of memory and the bus

  • Memory traffic
    • Different protocols produce different levels of bus traffic
  • Implementation complexity
    • Hardware complexity of the cache state machines
    • Impact on bus protocol

System Model: Snooping Protocols

  • Single physical address space with uniform memory access (UMA) times
  • Basic cache operation remains unchanged
  • State of a cache line indicates sharing status
    • State is associated with the physical, processor cache line and not with the contents of the line

[Figure: the caches of Processor 0 through Processor N- (label truncated in the source), each attached to the shared BUS through a snooping controller. Every cache line holds State Bits, a Tag, and Data; the state bits store the state of this physical cache line.]

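Cache State Transitions: Based on CPU Requests

[Figure: state diagram over the states INVALID, SHARED, and EXCLUSIVE. The source and destination of each arc are not recoverable from the extracted text; the arc labels are listed below.]

  • CPU Read: Place read miss on bus
  • CPU Read hit
  • CPU Read miss: place read miss on bus
  • CPU Read miss: write-back block, place read on bus
  • CPU Write: place write miss on bus
  • CPU Read hit
  • CPU Write hit
  • CPU write miss: write back cache block, place write miss on bus
  • CPU write: place write miss on bus

[Figure: cache line with State Bits, Tag, and Data fields, selected through a mux; the state bits encode invalid, shared, or exclusive.]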
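The CPU-side transitions can be sketched as a lookup table. This is a minimal sketch, assuming the usual three-state write-back invalidate protocol: the pairing of each arc label with a source and destination state is my reading of the labels, since the diagram itself did not survive extraction. The names `State`, `CPU_TRANSITIONS`, and `cpu_request` are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"  # the only copy; it may differ from memory


# (current state, CPU event) -> (next state, actions placed on the bus)
# Assumption: arcs follow the standard three-state snooping protocol.
CPU_TRANSITIONS = {
    (State.INVALID, "read"): (State.SHARED, ("place read miss on bus",)),
    (State.INVALID, "write"): (State.EXCLUSIVE, ("place write miss on bus",)),
    (State.SHARED, "read hit"): (State.SHARED, ()),
    (State.SHARED, "read miss"): (State.SHARED, ("place read miss on bus",)),
    (State.SHARED, "write"): (State.EXCLUSIVE, ("place write miss on bus",)),
    (State.EXCLUSIVE, "read hit"): (State.EXCLUSIVE, ()),
    (State.EXCLUSIVE, "write hit"): (State.EXCLUSIVE, ()),
    (State.EXCLUSIVE, "read miss"): (
        State.SHARED,
        ("write back block", "place read miss on bus"),
    ),
    (State.EXCLUSIVE, "write miss"): (
        State.EXCLUSIVE,
        ("write back block", "place write miss on bus"),
    ),
}


def cpu_request(state, event):
    """Return (next_state, bus_actions) for a CPU request on one cache line."""
    return CPU_TRANSITIONS[(state, event)]


# Follow one cache line through a short sequence of CPU requests.
state = State.INVALID
trace = []
for event in ["read", "read hit", "write", "write hit", "read miss"]:
    state, actions = cpu_request(state, event)
    trace.append((event, state.name, actions))
```

A "miss" in the SHARED or EXCLUSIVE state means the request maps to this physical line but carries a different tag, so the line is refilled; this is why the state belongs to the line and not to its contents.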

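Cache State Transitions: Based on Bus Requests

[Figure: state diagram over the states INVALID, SHARED, and EXCLUSIVE. The source and destination of each arc are not recoverable from the extracted text; the arc labels are listed below.]

  • Write miss for this block
  • CPU read miss
  • Read miss for this block: write back block, abort memory access
  • Write miss for this block: write back block, abort memory access

[Figure: two processors (P) with caches, each holding a copy of A, connected through a network to memory.]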
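The bus-side (snooping) half of the state machine can be sketched the same way. Again this is a sketch under assumptions: the arcs follow the usual three-state write-back invalidate protocol, a SHARED line is assumed to take no action on another processor's read miss, and the names `SNOOP_TRANSITIONS` and `snoop` are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"


# (current state, bus event for this block) -> (next state, actions)
SNOOP_TRANSITIONS = {
    (State.SHARED, "write miss"): (State.INVALID, ()),
    (State.SHARED, "read miss"): (State.SHARED, ()),  # assumption: no action
    (State.EXCLUSIVE, "read miss"): (
        State.SHARED,
        ("write back block", "abort memory access"),
    ),
    (State.EXCLUSIVE, "write miss"): (
        State.INVALID,
        ("write back block", "abort memory access"),
    ),
}


def snoop(state, bus_event):
    """Return (next_state, actions) when the controller snoops a bus event."""
    # An INVALID line holds no copy, so it ignores all bus traffic.
    return SNOOP_TRANSITIONS.get((state, bus_event), (state, ()))


# This cache holds the only copy; another processor reads, then writes.
line = State.EXCLUSIVE
line, on_read = snoop(line, "read miss")
after_read = line
line, on_write = snoop(line, "write miss")
after_write = line
line, ignored = snoop(line, "write miss")
```

The owner aborts the memory access because memory may hold an old value; the written-back block is the one the requester must receive.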

Implementation Issues

  • In reality the preceding state transitions are not atomic
  • For example, the sequence of taking a miss, acquiring the bus, and receiving a response will not in practice be atomic
  • Split transaction buses introduce non-atomic operations
  • Multiple coordinating entities on the same bus
  • Interference between snooping and CPU accesses
  • Duplicate the cache tags
  • Use multi-level inclusion for L2/L3 caches

Further Optimizations

  • In practice, protocols distinguish between write hits and write misses
  • Utilize the notions of invalidations and “ownership”
  • Distinguish the exclusive, consistent state of the cache line
  • Let us refer to this as a clean-private state
  • MESI protocol
  • Allow blocks to be shared without writing back
  • Distinguish shared, but dirty state.

A commercial protocol: MESI

  • Industry standard, invalidation-based protocol for SMPs
  • Reading: Find a complete specification as used in the Pentium and understand all of the transitions

[Figure: “Major Transitions”: two state diagrams over the states Mod (modified), Exc (exclusive), Sh (shared), and Inv (invalid). Arc labels: Read miss, shared; Read miss, exclusive; Read hit; Write hit; Write miss; Snoop hit on read; Snoop hit on write or read with intent to modify.]
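The major MESI transitions can be sketched for a single memory block as follows. This is a simplified model under stated assumptions, not the Pentium specification: every snoop is resolved instantly, a modified copy is written back whenever another cache reads or writes the block, and the names `Mesi` and `Bus` are hypothetical.

```python
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Bus:
    """Tracks the MESI state of ONE memory block in each of n caches."""

    def __init__(self, n):
        self.state = [Mesi.INVALID] * n
        self.write_backs = 0

    def _others(self, i):
        return [j for j in range(len(self.state)) if j != i]

    def snapshot(self):
        return "".join(s.value for s in self.state)

    def read(self, i):
        if self.state[i] is not Mesi.INVALID:
            return                          # read hit: no state change
        shared = False
        for j in self._others(i):           # the other caches snoop the read
            if self.state[j] is Mesi.MODIFIED:
                self.write_backs += 1       # the dirty copy is written back
            if self.state[j] is not Mesi.INVALID:
                self.state[j] = Mesi.SHARED
                shared = True
        # "Read miss, shared" -> S; "Read miss, exclusive" -> E
        self.state[i] = Mesi.SHARED if shared else Mesi.EXCLUSIVE

    def write(self, i):
        if self.state[i] in (Mesi.MODIFIED, Mesi.EXCLUSIVE):
            self.state[i] = Mesi.MODIFIED   # write hit on a private copy
            return
        # Write hit on a shared copy, or write miss: invalidate the others.
        for j in self._others(i):
            if self.state[j] is Mesi.MODIFIED:
                self.write_backs += 1
            self.state[j] = Mesi.INVALID
        self.state[i] = Mesi.MODIFIED


bus = Bus(3)
history = []
for op, cpu in [("read", 0), ("write", 0), ("read", 1), ("write", 1), ("read", 2)]:
    getattr(bus, op)(cpu)
    history.append(bus.snapshot())
```

The E state is what lets the write by processor 0 proceed without any bus transaction: the cache already knows it holds the only copy.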

Scaling Multiprocessors

  • A bus is a bottleneck to scaling to large systems
    • Electrical issues
    • Contention issues
  • Goal: scalable memory and interconnection bandwidth
    • Message passing networks for scalable bandwidth
    • Physically distributed memories for scalable memory bandwidth
  • Problem: snooping schemes are not scalable

[Figure: four processors (P), each with a cache, and four memory modules.]

Using Distributed Directories

  • Single physical, distributed address space with non-uniform memory access (NUMA) times
  • Basic snooping protocol state machine transitions are preserved

[Figure: four nodes, each consisting of a processor and cache (P + C), a directory (Dir), and a memory, connected by an interconnection network.]

Some Additional Concepts

  • Local node generates a memory reference (it is the node generating the request)
  • Home node is the physical memory location of a memory reference
  • Remote node has a copy of the block
  • Messages are received in the order sent
  • Directory entry indicates state of cached blocks and the members of the sharing set

[Figure: three nodes, each with a processor and cache (P + C), a directory (Dir), and a memory, connected by a network.]

Directory Protocol Features

  • The sharing set is the set of processors with a copy of a memory block
  • Implementation
    • Bit vectors and fully mapped entries
    • Linked lists
  • Consistency strategy
    • Shared lines always consistent with home copy
  • Notification strategy
    • Invalidation rather than update
The Local Processor State Machine

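[Figure: state diagram over the states INVALID, SHARED, and EXCLUSIVE for the cache at a node. The source and destination of each arc are not recoverable from the extracted text; the arc labels are listed below.]

  • CPU read miss
  • CPU read hit
  • Invalidate
  • CPU Read: send read miss msg
  • CPU write miss: data write back
  • CPU write hit
  • CPU read hit
  • Fetch Invalidate: data write back
  • CPU write: send write miss msg
  • CPU read miss: data write back
  • Fetch; data write back
  • CPU write: data send write msg

[Figure: a node with a processor and cache (P + C), a directory (Dir), and memory, attached to the network; each cache line holds State Bits, a Tag, and Data.]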

Example

Summary

  • Performance scaling is achieved via the use of multiple processors each working on one part of the application
  • Caching of shared data leads to the cache coherency problem
    • Essentially a synchronization problem
  • Solutions depend on the scale of the system
    • Small scale machines using a shared bus → snooping protocols
    • Large scale machines using a message passing network → directory based protocols

Memory Consistency Models

Memory Consistency

• What can the programmer assume about the servicing of memory operations?
  • For example, will they occur in program order?
  • Why are these assumptions important?

[Figure: three processors (P) connected through a network to three memory modules.]

Sequential Consistency

[Lamport] “A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by the program”

[Figure: three processors (P) issuing stores (SD) and loads (LD) to a single memory.]

Memory references from all processors are serialized
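The definition can be checked mechanically on a small example by enumerating every interleaving that respects program order. The two-processor program below (each processor stores to one variable and then loads the other) is an assumed illustration, not taken from the slides, and the helper names are hypothetical.

```python
from itertools import combinations

# Each operation is (kind, address, value-or-destination-register).
P1 = [("store", "A", 1), ("load", "B", "r1")]
P2 = [("store", "B", 1), ("load", "A", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        merged, ia, ib = [], iter(a), iter(b)
        for k in range(n):
            merged.append(next(ia) if k in positions else next(ib))
        yield merged


def run(sequence):
    """Execute one serialized sequence against a single shared memory."""
    memory = {"A": 0, "B": 0}
    regs = {}
    for op, addr, arg in sequence:
        if op == "store":
            memory[addr] = arg
        else:
            regs[arg] = memory[addr]
    return regs["r1"], regs["r2"]


# The set of (r1, r2) results that sequential consistency allows.
outcomes = {run(seq) for seq in interleavings(P1, P2)}
```

There are six legal interleavings, and none of them produces `(0, 0)`: at least one store always precedes both loads. Any hardware or compiler optimization that lets both loads return 0 therefore violates sequential consistency.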

Implications of Sequential Consistency

  • Program order requirement
    • Note that memory systems may be parallel and that the network between processors and memories may re-order memory operations
  • Atomicity requirement
    • Informally, a write takes place instantaneously with respect to the ability of all other processors to read it

[Figure: four processors (P) connected through a network to four memory modules.]

Reference: Adve, S. V. and K. Gharachorloo, “Shared memory consistency models: A tutorial,” IEEE Computer, December 1996, pp. 66-76

Program Ordering Issues

  • Violation of program order requirement → can lead to incorrect parallel programs
  • Use of write buffers
    • Note: Does not violate data dependence in uniprocessor systems
  • Ordering issues arise naturally in systems with caches
  • Compiler re-ordering of instructions can lead to violations of sequential consistency
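The write-buffer case can be sketched in a few lines. This is a toy model under stated assumptions: each processor has a FIFO write buffer, loads bypass pending stores to other addresses, and a load forwards from the processor's own buffer. The class `Processor` and the two-variable program are hypothetical.

```python
from collections import deque


class Processor:
    """A processor whose stores wait in a FIFO write buffer (toy model)."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = deque()  # pending (addr, value) stores

    def store(self, addr, value):
        self.write_buffer.append((addr, value))  # not yet visible to others

    def load(self, addr):
        # Forward the newest matching store from our own buffer, which
        # preserves data dependences within a single processor.
        for pending_addr, value in reversed(self.write_buffer):
            if pending_addr == addr:
                return value
        return self.memory[addr]     # otherwise the load bypasses the buffer

    def drain(self):
        while self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.memory[addr] = value


memory = {"A": 0, "B": 0}
p1, p2 = Processor(memory), Processor(memory)

p1.store("A", 1)    # P1: A = 1
p2.store("B", 1)    # P2: B = 1
r1 = p1.load("B")   # P1: r1 = B, issued while its store is still buffered
r2 = p2.load("A")   # P2: r2 = A, issued while its store is still buffered
own = p1.load("A")  # P1 still sees its own store
p1.drain()
p2.drain()
```

Both loads return 0, a result that no sequentially consistent interleaving of the two programs allows, even though each processor on its own behaves correctly.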

Atomicity Issues in the Presence of Caches

  • Consider the example in the figure below and an update based protocol
  • Atomicity can be ensured by the following two conditions
    • Write completion: the result of a write cannot be used until all copies have been updated/invalidated, i.e., a write must be atomic “system wide”
    • Writes to a location are serialized, i.e., all processors see writes to the same location in the same order

[Figure: four processors (P), each caching a copy of A, connected through a network to four memory modules. Two writes are issued, A = 1 and A = 2. Annotation: “What order do P3 and P4 see the updates?”]
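The question in the figure can be explored with a small sketch of update messages travelling over a network with unequal delays. The latencies, the choice of P1 and P2 as the writers, and the function names `deliver` and `serialize` are assumptions made for illustration.

```python
def deliver(writes, delay):
    """Return the values of A in the order one cache receives the updates.

    writes: list of (issue_time, writer, value)
    delay:  network latency from each writer to this cache
    """
    arrivals = sorted((time + delay[writer], value) for time, writer, value in writes)
    return [value for _, value in arrivals]


def serialize(writes):
    """Order all writes at one point (e.g. the home node) before broadcast."""
    # Ties are broken by writer name here; any single fixed order would do.
    return [value for _, _, value in sorted(writes)]


# Assumed scenario: P1 writes A = 1 and P2 writes A = 2 at the same time.
writes = [(0, "P1", 1), (0, "P2", 2)]

# P3 is close to P1, P4 is close to P2 (assumed latencies).
p3_view = deliver(writes, {"P1": 1, "P2": 3})
p4_view = deliver(writes, {"P1": 3, "P2": 1})

# With a serialization point, every cache applies the same sequence.
serialized_view = serialize(writes)
```

Without serialization, P3 ends with A = 2 and P4 ends with A = 1, so the two caches disagree permanently. Funnelling writes to the same location through one ordering point gives every processor the same sequence, which is the second condition listed above.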

Relaxed Memory Models (cont.)

  • Processor Consistency
    • Writes by any processor are seen by all processors in the order they were issued
    • For any variable, all processors see writes in the same order
    • Weaker than sequential consistency since the same ordering is not guaranteed to be seen by all processors
  • Weak Consistency
    • Distinguish between data operations and synchronization operations
    • Synchronization operations are sequentially consistent
      • All processors see synchronization operations in the same order
    • When a synchronization operation is issued, the memory pipeline is flushed
      • All pending writes must complete before the synchronization operation executes

Relaxed Memory Models (cont.)

  • Release Consistency
    • Increases overlap in memory operations restricted by the weak consistency model
    • Similarity with instruction issue and dependences?

The Programmer's View

  • The use of synchronized programs
    • All accesses to shared data are synchronized
      • Data references are ordered by synchronization primitives
    • Thus these programs are data-race free
      • Outcome does not depend on the relative speed of processors, network, and system software
  • Utilize a programmer-centric view of consistency
    • Do not have to reason about ordering and atomicity constraints
    • Write programs to conform to program semantics and let the compiler and system libraries bring optimizations to bear based on the model supported in the language
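A synchronized program in this sense might look like the sketch below, where every access to the shared variable is ordered by a lock. The thread count, iteration count, and names are arbitrary assumptions, and Python's threading model is not a hardware memory model; the sketch illustrates only the programming discipline.

```python
import threading

counter = 0                # shared data
lock = threading.Lock()    # synchronization primitive guarding counter


def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:         # every access to counter is ordered by the lock
            counter += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because all accesses to `counter` are synchronized, the final value is the same on every run regardless of how the threads are scheduled, and the programmer never has to reason about the ordering of individual loads and stores.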

Summary

  • Consistency models are a set of rules that can be relied on by the programmer and compiler
  • Consistency models determine the system optimizations that are possible
    • Overlapping of memory operations from multiple processors
  • Many optimizations can violate consistency model semantics
    • Leads to incorrect execution
  • Consistency models are distinct from coherence
    • The latter is concerned with updates/invalidations to a single shared variable
    • The former is concerned with the behavior of memory references from multiple concurrent threads