Properties of Parallel Computers, Lecture notes of Computer Networks

This document covers the main properties of parallel computers and the challenges they pose to programmers. It emphasizes the lack of a standard architecture and the need for original thinking about numerical analysis and data management. It surveys different types of parallel computers, including multicore chips, symmetric multiprocessors, large-scale parallel machines, and clusters. It explains the shared memory model and how memory requests can back up in an interconnection network, and it discusses multithreading at different grain sizes. It also describes the MESI cache-coherence protocol and its four states.

Typology: Lecture notes

2021/2022

Uploaded on 05/11/2023

tomcrawford



Part II: Architecture

Goal: Understand the main properties of parallel computers

“The parallel approach to computing … does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue, this is a decided disadvantage.”
-- Dan Slotnick, 1967

What’s The Deal With Hardware?

• Facts concerning hardware
  • Parallel computers differ dramatically from each other; there is no standard architecture
    • No single programming target!
  • Parallelism introduces costs not present in von Neumann (vN) machines: communication and the influence of external events
  • Many parallel architectures have failed
  • Details of a parallel computer should be of no greater concern to programmers than details of a vN machine
The “no single target” problem is the key problem to solve

Our Plan

• Think about the problem abstractly
• Introduce instances of basic parallel designs:
  • Multicore
  • Symmetric multiprocessors (SMPs)
  • Large-scale parallel machines
  • Clusters
  • Blue Gene/L
• Formulate a model of computation
• Assess the model of computation

Shared Memory

• Global memory shared among parallel processors is the natural generalization of the sequential memory model
• Thinking about it, programmers assume sequential consistency (SC) when they reason about parallelism
• Recall Lamport’s definition of SC:
  "...the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
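The following small Python sketch shows what SC rules out. The variable names `x`, `y`, `r1`, `r2` are illustrative and do not come from the notes. The sketch enumerates every interleaving of the classic store-buffering test that Lamport's definition allows. In each interleaving, each processor's operations stay in program order and all operations act on a single memory. Under SC, the outcome r1 = r2 = 0 never appears. Hardware with store buffers (x86's TSO model, for example) can produce that outcome, which is why the SC assumption matters.

```python
from itertools import combinations

# Store-buffering test: each processor writes one shared variable, then reads
# the other. An operation is (kind, variable, destination register).
PROC_A = [("store", "x", None), ("load", "y", "r1")]
PROC_B = [("store", "y", None), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each processor's program order."""
    n = len(a) + len(b)
    for a_slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in a_slots else next(ib) for i in range(n)]


def run(order):
    """Execute one sequential order against a single memory; return (r1, r2)."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, var, reg in order:
        if kind == "store":
            memory[var] = 1
        else:
            regs[reg] = memory[var]
    return regs["r1"], regs["r2"]


def sc_outcomes():
    """All (r1, r2) results reachable under sequential consistency."""
    return {run(order) for order in interleavings(PROC_A, PROC_B)}


print(sorted(sc_outcomes()))  # [(0, 1), (1, 0), (1, 1)]
```

Only 6 orders exist for two 2-operation programs. In every one of them, some store precedes both loads, so at least one load sees a 1.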

Reduce Contention

• Replace the bus with a network, an early design
• Network delays make the memory latency of a single reference higher than with a bus, but simultaneous use should help when many references are in the air (as with multithreading, MT)
[Figure: “dance hall” organization, with eight processors (P) and eight memories (M) on opposite sides of an interconnection network]
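A toy timing model makes the latency/throughput trade-off concrete. This sketch rests on its own assumptions and is not a model from the notes. It assumes the bus holds each reference for the reference's full latency. It assumes the network is pipelined and accepts a new, conflict-free reference every cycle. The latencies (10 cycles for the bus, 30 for the network) are hypothetical.

```python
def bus_finish_time(k, bus_latency):
    """A bus serves one reference at a time, so k references take k full latencies."""
    return k * bus_latency


def network_finish_time(k, network_latency, issue_interval=1):
    """A pipelined network (assumed conflict-free) overlaps references: one enters
    every issue_interval cycles, and the last finishes network_latency later."""
    if k == 0:
        return 0
    return (k - 1) * issue_interval + network_latency


BUS, NET = 10, 30  # hypothetical latencies in cycles
for k in (1, 4, 8, 64):
    print(k, bus_finish_time(k, BUS), network_finish_time(k, NET))

# Smallest number of outstanding references at which the network wins.
crossover = next(k for k in range(1, 1000)
                 if network_finish_time(k, NET) < bus_finish_time(k, BUS))
print("network is faster from", crossover, "references")  # 4
```

With one reference the bus wins (10 vs 30 cycles). With four or more references in flight the network wins, which is why the design depends on having many references outstanding. Conflicts inside the network would break the conflict-free assumption.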

An Implementation

• An Ω-network is one possible interconnect
• Example: processor 2 references memory 6 (binary 110)
[Figure: 8 × 8 Ω-network of 2 × 2 switches; inputs labeled by processor ID (000 to 111), outputs by the high-order memory bits (000 to 111)]
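An Ω-network can route with destination tags. Before each stage, the link number is perfect-shuffled (rotated left one bit). Each 2 × 2 switch then uses one bit of the destination address, most significant bit first, to choose its upper (0) or lower (1) output. The Python sketch below assumes this standard wiring for N = 8, and the labeling in the lecture figure may differ in detail. The function name `omega_route` is illustrative. It returns the link a request occupies after each of the log2 N stages.

```python
def omega_route(src, dst, stages=3):
    """Destination-tag routing through an omega network with 2**stages ports.

    Returns the output-link number after each stage; the last one equals dst.
    """
    n_links = 1 << stages
    link = src
    path = []
    for s in range(stages):
        # Perfect shuffle between stages: rotate the link number left by one bit.
        link = ((link << 1) | (link >> (stages - 1))) & (n_links - 1)
        # The switch sets the low bit from the destination, high-order bit first:
        # 0 selects the upper output, 1 the lower output.
        bit = (dst >> (stages - 1 - s)) & 1
        link = (link & ~1) | bit
        path.append(link)
    return path


print([format(link, "03b") for link in omega_route(2, 6)])  # ['101', '011', '110']
```

After three stages, the three bits of its destination have steered each request, so it reaches memory 6 whatever its source. Each switch needs to examine only one bit.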

Backing Up In Network

• Even if processors work on different data, their requests can back up in the network
• Example: everyone references data in memory 6
[Figure: the same Ω-network, with every processor's request routed toward memory 6 (110)]
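You can measure contention by counting how many requests need each link. This Python sketch uses destination-tag routing with the standard perfect-shuffle wiring. That wiring is an assumption, and the figure's labels may differ. The function names are illustrative. The sketch counts link loads for the hot-spot pattern on the slide. It also checks whether two requests to *different* memories can collide.

```python
from collections import Counter

STAGES = 3  # 8 processors, 8 memories


def omega_route(src, dst, stages=STAGES):
    """Destination-tag routing: output link after each stage of an omega network."""
    n_links = 1 << stages
    link, path = src, []
    for s in range(stages):
        link = ((link << 1) | (link >> (stages - 1))) & (n_links - 1)  # perfect shuffle
        link = (link & ~1) | ((dst >> (stages - 1 - s)) & 1)            # switch setting
        path.append(link)
    return path


def link_loads(requests):
    """Count how many (src, dst) requests cross each (stage, link) pair."""
    loads = Counter()
    for src, dst in requests:
        for stage, link in enumerate(omega_route(src, dst)):
            loads[(stage, link)] += 1
    return loads


def max_load_per_stage(loads):
    return [max(n for (s, _), n in loads.items() if s == stage) for stage in range(STAGES)]


# Hot spot: all eight processors reference memory 6.
hot_spot = [(p, 6) for p in range(8)]
print(max_load_per_stage(link_loads(hot_spot)))  # [2, 4, 8]

# Requests to different memories can still need the same links.
shared = set(enumerate(omega_route(0, 0))) & set(enumerate(omega_route(4, 1)))
print(sorted(shared))  # [(0, 0), (1, 0)]
```

The load doubles at every stage. The final link into memory 6 must carry all eight requests one at a time, and the queues behind it back up into switches that other traffic also needs. The second check shows that the Ω-network is blocking. Requests from processors 0 and 4 to memories 0 and 1 compete for the same first two links.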

One-At-A-Time Use

• The critical problem is that only one processor at a time can use or change a given datum
  • Cache read-only data (and programs) only
  • A check-in/check-out model is most appropriate
• Conclusion: processors stall a lot …
• Solution: multithreading
  • When stalled, switch to another waiting activity
  • The transition must be made quickly, keeping the stalled activity's context
  • An ample supply of waiting activities is needed
  • Available at different granularities
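A small simulation shows why an ample supply of waiting activities matters. This sketch uses hypothetical parameters. Each thread computes for 4 cycles and then issues a memory reference that stalls it for 20 cycles. On a stall, the processor switches at no cost to the next ready thread in round-robin order (switch-on-stall). With T threads, utilization approaches min(1, T·4/24), so about six threads hide the latency completely.

```python
def utilization(threads, compute=4, latency=20, cycles=10_000):
    """Fraction of cycles doing useful work with `threads` hardware contexts.

    Each (hypothetical) thread alternates `compute` busy cycles with one memory
    reference that stalls it for `latency` cycles. On a stall the processor
    switches, at zero cost, to the next ready thread in round-robin order.
    """
    ready_at = [0] * threads        # first cycle at which each thread may run
    left = [compute] * threads      # busy cycles left before its next reference
    current, busy = 0, 0
    for cycle in range(cycles):
        for k in range(threads):
            t = (current + k) % threads
            if ready_at[t] <= cycle:
                current = t
                break
        else:
            continue                # every thread is waiting on memory: idle cycle
        busy += 1
        left[current] -= 1
        if left[current] == 0:      # issue a reference; stall until it returns
            ready_at[current] = cycle + 1 + latency
            left[current] = compute
    return busy / cycles


for t in (1, 2, 3, 6, 8):
    print(t, "threads:", round(utilization(t), 3))
```

No single thread runs faster. The processor simply stops waiting, which matches the point that no individual activity improves.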

Fine Grain Multithreading: Tera


Coarse Grain Multithreading: Alewife

Simultaneous Multi-threading: SMT

Multi-threading Grain Size

• The point at which the activity switches can be at the
  • Instruction level, at a memory reference: Tera MTA
  • Basic-block level, on an L1 cache miss: Alewife
  • …
  • Process level, on a page fault: time sharing
• Another variation (at the 3-address-code level) is to execute many threads (P log P) in batches, called Bulk Synchronous Programming
• No individual activity is improved, but there is less wait time
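The following Python sketch illustrates the batching idea, with threads standing in for P processors. The program is organized as supersteps. In each superstep, each of the P processors runs a batch of log P virtual threads, P log P in total, assigned round-robin for illustration. All processors then meet at a barrier before anyone reads a value written in that superstep. The computation itself is a hypothetical placeholder: squaring, then reading a neighbor's result. Python threads show the structure here, not a speedup.

```python
import math
import threading


def run_bsp(P=4):
    """Two BSP-style supersteps over P * log2(P) virtual threads on P workers."""
    V = P * max(1, int(math.log2(P)))   # number of virtual threads
    values = [0] * V                    # written in superstep 1
    shifted = [0] * V                   # written in superstep 2
    barrier = threading.Barrier(P)

    def worker(proc):
        batch = range(proc, V, P)       # this processor's batch of virtual threads
        for v in batch:                 # superstep 1: local computation only
            values[v] = v * v
        barrier.wait()                  # every write of superstep 1 is now complete
        for v in batch:                 # superstep 2: read another thread's result
            shifted[v] = values[(v + 1) % V]

    workers = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return shifted


print(run_bsp(4))  # [1, 4, 9, 16, 25, 36, 49, 0]
```

The barrier is the only synchronization. Within a superstep, virtual threads never wait on one another.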

Diversity Among Small Systems

Intel Core Duo

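• Two 32-bit Pentiums
• Private 32 KB L1 caches
• Shared 2 MB to 4 MB L2 cache
• MESI cache-coherence (cc) protocol
• Shared bus control and memory bus
[Figure: two processors (P), each with private L1-I and L1-D caches, sharing an L2 cache and a memory bus controller attached to the front side bus]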

MESI Protocol

• Standard protocol for cache-coherent shared memory
• Mechanism for multiple caches to give a single memory image
• We will not study it
• Its 4 states can be amazingly rich
Thanks: Slater & Tibrewala of CMU

MESI, Intuitively

• Upon loading, a line is marked E (when no other cache holds it); subsequent reads are OK; a write marks it M
• On seeing another cache load the line, mark it S
• A write to an S line sends I (invalidate) to all other copies and marks the line M
• Another cache's read of an M line causes it to be written back and marked S
• A read or write to an I line misses
• Related scheme: MOESI (used by AMD)
States: Modified, Exclusive, Shared, Invalid
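These rules can be captured in a few lines. The Python sketch models one cache line held by n caches and follows the slide's informal rules. It assumes a snooping bus on which every cache sees every miss. It counts write-backs of dirty (M) data and omits MOESI's extra Owned state. The class and method names are illustrative.

```python
class MesiLine:
    """State of one memory line in each of n caches, using MESI."""

    def __init__(self, n):
        self.state = ["I"] * n
        self.writebacks = 0

    def _others(self, c):
        return [o for o in range(len(self.state)) if o != c and self.state[o] != "I"]

    def read(self, c):
        if self.state[c] != "I":
            return "hit"                    # M, E and S lines can be read locally
        others = self._others(c)
        for o in others:
            if self.state[o] == "M":
                self.writebacks += 1        # dirty owner writes the line back
            self.state[o] = "S"             # E or M holders drop to S
        self.state[c] = "S" if others else "E"
        return "miss"

    def write(self, c):
        result = "miss" if self.state[c] == "I" else "hit"
        for o in self._others(c):
            if self.state[o] == "M":
                self.writebacks += 1        # dirty data is written back first
            self.state[o] = "I"             # send I to all other copies
        self.state[c] = "M"                 # writer now holds the only, dirty copy
        return result


line = MesiLine(3)
for op, cache in [("read", 0), ("write", 0), ("read", 1), ("write", 1), ("read", 0)]:
    outcome = getattr(line, op)(cache)
    print(f"{op} by cache {cache}: {outcome:4} -> {line.state}")
```

The trace visits all four states. A write to an E line needs no invalidations, because no other copy exists. That saving is the reason the E state is distinguished from S.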

Comparing Core Duo/Dual Core

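[Figure: Intel Core Duo (two processors with private L1-I/L1-D caches sharing an L2 cache, a memory bus controller and the front side bus) beside AMD dual core (two processors, each with private L1-I/L1-D caches and its own L2 cache, joined through a system request interface and cross-bar interconnect to an on-chip memory controller and HyperTransport (HT) links)]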

Comparing Core Duo/Dual Core

[Figure: the Intel Core Duo beside a multichip AMD system, in which each dual-core chip has its own system request interface, cross-bar interconnect, memory controller and HT links]

Symmetric Multiprocessor on a Bus

• The bus is a point that serializes references
• A serializing point is a shared-memory enabler
[Figure: four processors, each with private L1-I/L1-D caches, an L2 cache and cache control, attached to a single bus shared with four memory modules]
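A tiny simulation shows why a serializing point enables shared memory. The sketch assumes that each reference is one atomic bus transaction. It also assumes that a hypothetical arbiter picks any waiting processor at random each cycle. Whatever the arbiter does, the bus produces one global order that keeps every processor's program order, which is Lamport's SC condition. In the store-buffering program below, the result in which both loads return 0 therefore never occurs.

```python
import random

# Each processor's program: ("store", var, value) or ("load", var).
STORE_BUFFERING = [
    [("store", "x", 1), ("load", "y")],
    [("store", "y", 1), ("load", "x")],
]


def bus_execute(programs, seed=0):
    """Push every reference through a single shared bus, one at a time.

    Returns the global bus order as (processor, program index) pairs and the
    values each processor's loads returned.
    """
    rng = random.Random(seed)
    memory = {}
    pcs = [0] * len(programs)
    order, loads = [], [[] for _ in programs]
    while True:
        waiting = [p for p, prog in enumerate(programs) if pcs[p] < len(prog)]
        if not waiting:
            return order, loads
        p = rng.choice(waiting)                 # arbitration: grant the bus to p
        kind, var, *value = programs[p][pcs[p]]
        if kind == "store":
            memory[var] = value[0]
        else:
            loads[p].append(memory.get(var, 0))
        order.append((p, pcs[p]))
        pcs[p] += 1


results = set()
for seed in range(200):
    order, loads = bus_execute(STORE_BUFFERING, seed)
    for p, prog in enumerate(STORE_BUFFERING):   # program order is preserved
        assert [i for q, i in order if q == p] == list(range(len(prog)))
    results.add((loads[0][0], loads[1][0]))
print(sorted(results))
```

Different seeds yield different global orders, but every order is one a single sequential machine could have produced.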

Sun Fire E25K

Co-Processor Architectures

• A powerful parallel design is to add one or more subordinate processors to a standard design
  • Floating-point instructions were once implemented this way
  • Graphics Processing Units: deep pipelining
  • Cell Processor: multiple SIMD units
  • Attached FPGA chip(s): compile to a circuit
• These architectures will be discussed later

Clusters

• Interconnecting with InfiniBand
  • Switch-based technology
  • Host channel adapters (HCA)
  • Peripheral Component Interconnect (PCI)
Thanks: IBM’s Clustering Systems Using InfiniBand Hardware

Clusters

• Cheap to build using commodity technologies
• Effective when the interconnect is “switched”
• Easy to extend, usually in increments of 1
• Processors often have disks “nearby”
• No shared memory
• Latencies are usually large
• Programming uses message passing
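Without shared memory, nodes cooperate only by sending messages. In this Python sketch, each thread plays a cluster node that owns a private list and a private mailbox (a `queue.Queue`). Nodes never read each other's data. Partial sums travel up a binary tree to node 0. Threads are only a stand-in for separate machines. On a real cluster, this pattern would typically be written with MPI point-to-point calls or a single `MPI_Reduce`. The function name `cluster_sum` is illustrative.

```python
import queue
import threading


def cluster_sum(local_data):
    """Sum data spread over len(local_data) simulated nodes via message passing.

    Binary-tree reduction: at step s, each surviving node whose rank is a
    multiple of 2s receives from rank + s; the others send to rank - s and stop.
    Messages may arrive in any order; addition is commutative, so the total
    is unaffected.
    """
    P = len(local_data)
    mailboxes = [queue.Queue() for _ in range(P)]   # one private inbox per node
    result = {}

    def node(rank):
        total = sum(local_data[rank])               # purely local computation
        step = 1
        while step < P:
            if rank % (2 * step) == 0:
                if rank + step < P:
                    total += mailboxes[rank].get()  # receive a partial sum
            else:
                mailboxes[rank - step].put(total)   # send partial sum, then quit
                return
            step *= 2
        result["sum"] = total                       # only node 0 reaches this line

    nodes = [threading.Thread(target=node, args=(r,)) for r in range(P)]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()
    return result["sum"]


print(cluster_sum([[1, 2], [3], [4, 5, 6], [], [7]]))  # 28
```

The tree takes about log2 P message steps rather than P - 1. This matters when, as the slide notes, each message has a large latency.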

Networks

[Figure: four interconnection topologies: torus (mesh), hypercube, fat tree, and Omega network]
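Topologies are commonly compared by node degree (links per node) and diameter (worst-case hop count). This Python sketch builds two of the four networks pictured, a hypercube and a 2-D torus, as adjacency lists. It measures both properties by breadth-first search. Fat trees and Ω-networks route through switches and are not modeled here. The sizes (64 nodes each) are an arbitrary choice.

```python
from collections import deque


def hypercube(d):
    """2**d nodes; neighbors differ in exactly one address bit."""
    return {v: [v ^ (1 << i) for i in range(d)] for v in range(1 << d)}


def torus(k):
    """k x k mesh with wraparound links in both dimensions."""
    return {
        (r, c): [((r + 1) % k, c), ((r - 1) % k, c), (r, (c + 1) % k), (r, (c - 1) % k)]
        for r in range(k)
        for c in range(k)
    }


def degree(graph):
    return max(len(set(nbrs)) for nbrs in graph.values())


def diameter(graph):
    """Largest shortest-path hop count, using BFS from every node."""
    worst = 0
    for src in graph:
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            u = frontier.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        worst = max(worst, max(dist.values()))
    return worst


for name, g in [("hypercube, d = 6", hypercube(6)), ("torus, 8 x 8", torus(8))]:
    print(f"{name}: {len(g)} nodes, degree {degree(g)}, diameter {diameter(g)}")
```

At 64 nodes, the hypercube achieves a diameter of log2 N = 6 at the cost of degree 6. The torus keeps degree 4, but its diameter is 2·⌊k/2⌋ = 8. This is the usual trade-off between wiring cost and latency.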

Summarizing Architectures

• Two main classes
  • Complete connection: CMPs (chip multiprocessors), SMPs, crossbar (X-bar)
    • Preserve a single memory image
    • Complete connection limits scaling to …
    • Available to everyone
  • Sparse connection: clusters, supercomputers, networked computers used for parallelism (Grid)
    • Separate memory images
    • Can grow “arbitrarily” large
    • Available to everyone with air conditioning
• Differences are significant; world views diverge

Break

• During the break, consider which aspects of the architectures we’ve seen should be highlighted and which should be abstracted away

The Parallel Programming Problem

• Some computations can be platform specific
• Most should be platform independent
• Parallel Software Development Problem: how do we neutralize the machine differences, given that
  • some knowledge of execution behavior is needed to write programs that perform, and
  • programs must port across platforms effortlessly, meaning by at most recompilation?

Options for Solving the PPP

• Leave the problem to the compiler …