

































This document covers the main properties of parallel computers and the challenges they pose to programmers. It emphasizes the lack of a standard architecture and the need for original thinking about numerical analysis and data management. It surveys different types of parallel computers, including multicore, symmetric multiprocessors, large scale parallel machines, and clusters. It explains the shared memory model and how memory requests can back up in the network. It also discusses multithreading and the facts of instruction execution, and it concludes with a discussion of the MESI protocol and its four states.
Typology: Lecture notes


































Goal: Understand the main properties of parallel computers

"The parallel approach to computing … does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue, this is a decided disadvantage."
-- Dan Slotnick, 1967
Facts Concerning Hardware
Parallel computers differ dramatically from each other -- there is no standard architecture
  No single programming target!
Parallelism introduces costs not present in von Neumann (vN) machines -- communication; influence of external events
Many parallel architectures have failed
Details of a parallel computer should be of no greater concern to programmers than details of a vN machine
The “no single target” is the key problem to solve
Think about the problem abstractly
Introduce instances of basic parallel designs
  Multicore
  Symmetric Multiprocessors (SMPs)
  Large scale parallel machines
  Clusters
  Blue Gene/L
Formulate a model of computation
Assess the model of computation
Global memory shared among parallel processors is the natural generalization of the sequential memory model
Programmers assume sequential consistency (SC) when they think about parallelism
Recall Lamport’s definition of SC:
  "...the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
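Lamport's definition can be made concrete with a small enumeration. The sketch below is an illustration only, not part of the original slides: the two-processor program (`P1`, `P2`) and the helper names are hypothetical. It lists every sequential order that respects each processor's program order and collects the register values that result.

```python
# Hypothetical two-processor program, each a list of operations in program order.
# ("w", var, value) writes a shared variable; ("r", var, reg) reads it into a register.
P1 = [("w", "x", 1), ("r", "y", "r1")]
P2 = [("w", "y", 1), ("r", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one sequential order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, var, arg in schedule:
        if kind == "w":
            memory[var] = arg
        else:
            registers[arg] = memory[var]
    return registers["r1"], registers["r2"]


def sc_outcomes():
    """All (r1, r2) results that sequential consistency permits."""
    return {run(s) for s in interleavings(P1, P2)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

Under these assumptions the outcome r1 = 0, r2 = 0 never appears: in any single sequential order, at least one write precedes the other processor's read. A machine that can produce (0, 0) for this program is therefore not sequentially consistent.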
Replace bus with network, an early design
Network delays cause memory latency to be higher for a single reference than with the bus, but simultaneous use should help when many references are in the air (MT)
[Figure: dance hall organization, with eight processors (P) on one side of the interconnection network and eight memories (M) on the other]
Ω-Network is one possible interconnect
Example: processor 2 references memory 6 (110)
[Figure: 8-input Ω-network; left ports labeled by processor ID (000-111), right ports by high memory address bits (000-111), switch outputs labeled 0 and 1]
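The routing in the example above can be traced with a short sketch. It assumes the usual Ω-network construction (a perfect shuffle before each stage of 2×2 switches, with each switch steered by one destination bit, high bit first); the slide does not spell these details out, and the function names are hypothetical.

```python
def shuffle(port, bits):
    """Perfect shuffle: rotate the port number left by one bit."""
    mask = (1 << bits) - 1
    return ((port << 1) | (port >> (bits - 1))) & mask


def route(source, dest, bits=3):
    """Return the port reached after each stage on the way from source to dest."""
    port = source
    path = []
    for stage in range(bits):
        port = shuffle(port, bits)
        # The switch looks at one destination bit, high bit first:
        # 0 selects its upper output, 1 selects its lower output.
        dest_bit = (dest >> (bits - 1 - stage)) & 1
        port = (port & ~1) | dest_bit
        path.append(port)
    return path


if __name__ == "__main__":
    # Processor 2 (010) referencing memory 6 (110)
    print([format(p, "03b") for p in route(2, 6)])
```

Each stage fixes one more bit of the destination address, so after log2(N) stages the request arrives at the right memory whatever the source was.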
Even if processors work on different data, the requests can back up in the network
Example: everyone references data in memory 6
[Figure: 8-input Ω-network with every processor's request converging on output 110]
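A rough way to see the backup is to count how many requests want the same switch output at each stage. The sketch below assumes the same shuffle-then-switch Ω-network construction (an assumption, not stated on the slide) and ignores timing and buffering; `worst_link` is a hypothetical helper name.

```python
from collections import Counter


def shuffle(port, bits):
    """Perfect shuffle: rotate the port number left by one bit."""
    mask = (1 << bits) - 1
    return ((port << 1) | (port >> (bits - 1))) & mask


def worst_link(dests, bits=3):
    """For each stage, return the largest number of requests sharing one link.

    dests[s] is the memory that processor s references.
    """
    ports = list(range(len(dests)))
    worst = []
    for stage in range(bits):
        shift = bits - 1 - stage
        ports = [
            (shuffle(p, bits) & ~1) | ((d >> shift) & 1)
            for p, d in zip(ports, dests)
        ]
        worst.append(max(Counter(ports).values()))
    return worst


if __name__ == "__main__":
    print(worst_link([6] * 8))          # everyone references memory 6
    print(worst_link(list(range(8))))   # processor i references memory i
```

With all eight processors targeting memory 6, contention doubles at every stage, while the one-to-one pattern passes through with one request per link. A link carries one request at a time, so a count of k means k - 1 requests are waiting.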
The critical problem is that only one processor at a time can use/change data
  Cache read-only data (& pgms) only
  Check-in/Check-out model most appropriate
Conclusion: Processors stall a lot …
Solution: Multi-threading
  When stalled, change to another waiting activity
  Must make transition quickly, keeping context
  Need ample supply of waiting activities
  Available at different granularities
The point when the activity switches can be
  Instruction level, at memory reference: Tera MTA
  Basic block level, with L1 cache miss: Alewife …
  At process level, with page fault: Time sharing
Another variation (3-address code level) is to execute many threads (P log P) in batches, called Bulk Synchronous Programming
No individual activity improved, but less wait time
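How many waiting activities count as an "ample supply" can be estimated with a simple model. This is a common textbook-style approximation, not something taken from the slides, and all parameter names are hypothetical: each thread runs for `run` cycles, then stalls for `latency` cycles, and a switch costs `switch` cycles.

```python
import math


def utilization(threads, run, latency, switch=0.0):
    """Fraction of time the processor does useful work under the simple model."""
    # With enough threads, only the switch overhead is lost.
    busy_limit = run / (run + switch)
    # With too few threads, the processor idles until some stall completes.
    supply_limit = threads * run / (run + switch + latency)
    return min(busy_limit, supply_limit)


def threads_to_saturate(run, latency, switch=0.0):
    """Smallest thread count at which adding threads no longer helps."""
    return math.ceil((run + switch + latency) / (run + switch))


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, utilization(n, run=10, latency=90))
    print(threads_to_saturate(run=10, latency=90))
```

In this model no single thread finishes sooner; the gain comes only from overlapping one thread's stall with another thread's work, which matches the last point above.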
Two 32-bit Pentiums
Private 32K L1s
Shared 2M-4M L2
MESI cc-protocol
Shared bus control and memory bus
[Figure: two processors, each with private L1-I and L1-D caches, sharing an L2 cache; a memory bus controller connects the chip to the front side bus]
Standard protocol for cache-coherent shared memory
  Mechanism for multiple caches to give single memory image
  We will not study it
  4 states can be amazingly rich
Thanks: Slater & Tibrewala of CMU
States: Modified, Exclusive, Shared, Invalid
Upon loading, a line is marked E; subsequent reads are OK; a write marks it M
Seeing another processor's load, mark as S
A write to an S line sends I to all, marks as M
Another’s read to an M line writes it back, marks it S
Read/write to an I line misses
Related scheme: MOESI (used by AMD)
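The transitions listed above can be sketched as a tiny simulation of one cache line shared by several caches. This is a simplified illustration under stated assumptions, not the protocol as any real processor implements it: it tracks only the state letters, treats every bus action as instantaneous, and uses hypothetical class and method names.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Line:
    """One memory line as seen by n_caches private caches."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches
        self.writebacks = 0

    def read(self, p):
        if self.states[p] is not State.I:
            return  # hit: M, E and S can all be read locally
        holders = [q for q, s in enumerate(self.states)
                   if q != p and s is not State.I]
        for q in holders:
            if self.states[q] is State.M:
                self.writebacks += 1  # dirty copy goes back to memory
            self.states[q] = State.S
        self.states[p] = State.S if holders else State.E

    def write(self, p):
        for q, s in enumerate(self.states):
            if q != p and s is not State.I:
                if s is State.M:
                    self.writebacks += 1
                self.states[q] = State.I  # invalidate every other copy
        self.states[p] = State.M


def demo():
    """Replay a short access sequence on two caches and record the states."""
    line = Line(2)
    trace = []
    for op, p in [("read", 0), ("write", 0), ("read", 1),
                  ("write", 1), ("read", 0)]:
        getattr(line, op)(p)
        trace.append("".join(s.name for s in line.states))
    return trace, line.writebacks


if __name__ == "__main__":
    print(demo())
```

Each entry in the trace gives the state in cache 0 followed by the state in cache 1 after one access, so the sequence shows a line moving from E to M, being shared, and then changing owner.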
[Figure: Intel dual core (two processors with private L1-I and L1-D caches, a shared L2 cache, and a memory bus controller on the front side bus) beside AMD dual core (two processors with private L1 and L2 caches, a System Request Interface, a cross-bar interconnect, a memory controller, and HyperTransport (HT) links)]
[Figure: the same Intel design beside several AMD chips connected to one another]
The bus is a point that serializes references
A serializing point is a shared mem enabler
[Figure: four processors, each with L1-I, L1-D, an L2 cache, and cache control, attached to a single bus together with four memory modules]
A powerful parallel design is to add 1 or more subordinate processors to the standard design
  Floating point instructions once implemented this way
  Graphics Processing Units - deep pipelining
  Cell Processor - multiple SIMD units
  Attached FPGA chip(s) - compile to a circuit
These architectures will be discussed later
Interconnecting with InfiniBand
  Switch-based technology
  Host channel adapters (HCA)
  Peripheral component interconnect (PCI)
Thanks: IBM’s Clustering systems using InfiniBand Hardware
Cheap to build using commodity technologies
Effective when interconnect is “switched”
Easy to extend, usually in increments of 1
Processors often have disks “nearby”
No shared memory
Latencies are usually large
Programming uses message passing
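The message-passing style can be illustrated with a small sketch. It is only a stand-in: threads in one process play the role of cluster nodes, and queues play the role of the network, whereas a real cluster would use a message-passing library across machines. The function and variable names are hypothetical.

```python
import queue
import threading


def parallel_sum(values, n_nodes=4):
    """Sum values by giving each simulated node a private slice."""
    inboxes = [queue.Queue() for _ in range(n_nodes)]  # one mailbox per node
    result = queue.Queue()

    def node(rank, local):
        partial = sum(local)  # each node touches only its own data
        if rank != 0:
            inboxes[0].put((rank, partial))  # send the partial sum to node 0
            return
        total = partial
        for _ in range(n_nodes - 1):
            _sender, value = inboxes[0].get()  # blocking receive
            total += value
        result.put(total)

    slices = [values[r::n_nodes] for r in range(n_nodes)]
    threads = [threading.Thread(target=node, args=(r, slices[r]))
               for r in range(n_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result.get()


if __name__ == "__main__":
    print(parallel_sum(list(range(100))))
```

No node reads another node's slice; all cooperation happens through explicit sends and receives, which is why large latencies translate directly into waiting time at the receiver.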
[Figure: interconnect topologies: torus (mesh), hypercube, fat tree, Omega network]
Two main classes
Complete connection: CMPs, SMPs, X-bar
  Preserve single memory image
  Complete connection limits scaling to …
  Available to everyone
Sparse connection: Clusters, Supercomputers, Networked computers used for parallelism (Grid)
  Separate memory images
  Can grow “arbitrarily” large
  Available to everyone with air conditioning
Differences are significant; world views diverge
During the break, consider which aspects of the architectures we’ve seen should be highlighted and which should be abstracted away
Some computations can be platform specific
Most should be platform independent
Parallel Software Development Problem: How do we neutralize the machine differences given that
  Some knowledge of execution behavior is needed to write programs that perform
  Programs must port across platforms effortlessly, meaning, by at most recompilation
Leave the problem to the compiler …