Parallel Architecture
Jingke Li
Portland State University
Jingke Li (Portland State University) CS 415/515 Parallel Architecture 1 / 36
Parallel Computers
A “conventional” computer consists of
- a single CPU (typically with a set of pipelined functional units)
- a single memory hierarchy (i.e. caches and main memory)
and operations are performed one after another in a sequential order.
A parallel computer, in contrast, has multiple components. There are many ways to organize a parallel computer:
- Single core with multiple functional units
- Multiple cores with shared cache hierarchy
- Multiple processors with shared memory module
- Multiple processors with individual memory modules
- Clusters of subsystems
Flynn’s Taxonomy
S: Single, M: Multiple, I: Instruction stream, D: Data stream
- SISD — Conventional uniprocessors
- MISD — Not very realistic
- SIMD — Same operation is simultaneously applied to multiple data items
  - Can take different forms, e.g. single CPU computers, vector computers, and large-scale super-computers
  - Recent SIMD computers are mostly small-scale
- MIMD — Multiple threads of instructions operating on multiple threads of data. In the most general case, threads could be independent programs!
  - A very broad category; can be refined into many sub-categories
  - Most current large-scale parallel systems fall into this category
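As a quick illustration of the taxonomy, the sketch below derives the category name from the number of instruction streams and data streams. The helper `flynn_class` and its parameters are made up for this example.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given stream counts (hypothetical helper)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"  # Single or Multiple instruction streams
    d = "S" if data_streams == 1 else "M"         # Single or Multiple data streams
    return f"{i}I{d}D"


print(flynn_class(1, 1))  # conventional uniprocessor
print(flynn_class(1, 8))  # one instruction applied to eight data items
print(flynn_class(4, 4))  # four independent threads of computation
```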
Parallel Architectures Today
- Processor-Level Parallelism
- Single CPU with special parallel instructions
- Multi-core processors
- GPUs
- Vector processors
- System-Level Architectures
- SIMD systems
- Symmetric multiprocessors (SMPs)
- Non-uniform memory access machines (NUMAs)
- Network of workstations (NOWs)
- Supercomputer Architectures
- Massive Parallel Processing Systems (MPPs) (thousands of processors)
- Large-scale clusters (tens of thousands of processors)
- Constellations (clusters of powerful vector processors)
Multi-Core Processors
Two or more processors on the same chip — chip-level multiprocessing. The processors have individual L1 caches, but share a common L2 cache.
Advantages (in comparison to multi-chip SMP designs):
- Faster cache snoop operations — the signals do not have to travel off-chip
- Smaller physical package — smaller circuitry space is needed
- Less power consumption — since signals are on-chip, the cores can operate at lower voltages
Disadvantages:
- Require OS and application software support
- More difficult to manage thermally than single-CPU chips (due to the higher integration)
- CPU power may be underutilized for some applications, since scaling efficiency is largely dependent on the application or problem set.
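One common way to see why scaling depends on the application is Amdahl's law. It is not part of the original slide and is used here only as an illustration: if a fraction p of the work can run in parallel, the speedup on n cores is 1 / ((1 - p) + p / n). The function name and the sample fractions below are assumptions.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law (illustrative helper, not from the slides)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical applications with different parallel fractions.
for p in (0.5, 0.9, 0.99):
    speedups = [round(amdahl_speedup(p, n), 2) for n in (2, 4, 8)]
    print(f"parallel fraction {p}: speedup on 2/4/8 cores = {speedups}")
```

An application that is only half parallel gains little from extra cores, which is one sense in which the CPU power can go underutilized.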
Graphic Processing Units (GPUs)
GPUs are dedicated graphics rendering devices for PCs or game consoles.
- The typical modern GPU sits on a graphics card separate from the motherboard, connected to the CPU and main RAM through the AGP or PCI Express bus.
- A GPU implements a number of graphics primitive operations with highly parallel hardware components, making them run much faster than they would on a typical CPU.
The limitation: GPUs typically have very limited programmability, and hence have a very narrow application domain.
General-Purpose GPUs (GPGPUs)
A new trend in this field:
- Add more flexibility to GPU’s programming model
- Extend to non-graphical, but still matrix-based applications
GPGPUs are best-described as co-processors. To handle general applications, CPU hosts are still needed.
Two GPU programming languages have emerged:
- CUDA (Compute Unified Device Architecture) — developed by nVidia.
- OpenCL (Open Computing Language) — initially conceived by Apple, now managed by the non-profit consortium Khronos Group.
Challenges:
- Data movements between the host CPU and the GPU are slow.
- Integer operations are weak.
- Programming is still very hard.
Vector Processors
Vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture which can execute special vector instructions very efficiently. They are the “traditional” supercomputers.
Key Components:
- A set of pipelined functional units.
- Special vector registers — Data is read into the vector registers, which are FIFO queues capable of holding 50-100 floating point values.
- Special vector instructions — such as loading/storing a vector register from/to a location in memory, performing operations on elements in the vector registers.
Sample Machines: Early CRAY Series, CDC Cyber 205, IBM 3090 family, FPS-164.
MIMD Systems
MIMD systems consist of a collection of processors:
- Each processor is capable of running a distinct thread of computation.
- The processors coordinate on a joint program via a shared address space or through message passing.
Shared-Memory MIMD Systems
The processors share a single address space. This address space can be realized either through a single physical memory accessible to all processors, or through a set of distributed memory modules attached to the processors. Respectively, the two sub-categories of systems are called symmetric multiprocessors (SMPs) and non-uniform memory access machines (NUMAs).
Advantages:
- No need to partition or duplicate data
- Less communication overhead
- Programming style close to sequential programming style
Main Issues:
- Scalability of the interconnection network
- Memory-cache consistency
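A minimal sketch of the shared-address-space style, using Python threads as stand-ins for processors (an assumption made for illustration; it says nothing about performance). Every worker reads the same `data` list directly, so nothing is partitioned or copied, and a lock protects the one shared total. The names `parallel_sum` and `worker` are hypothetical.

```python
import threading


def parallel_sum(data, num_workers=4):
    """Sum a list with several threads that all see the same memory."""
    total = [0]                 # shared accumulator, visible to every thread
    lock = threading.Lock()     # serializes updates to the shared accumulator

    def worker(worker_id):
        # Read a strided subset of the shared list in place (no copy is made).
        partial = sum(data[i] for i in range(worker_id, len(data), num_workers))
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


print(parallel_sum(list(range(1, 101))))
```

Apart from the lock, the worker body looks like ordinary sequential code, which is the "close to sequential programming style" advantage listed above.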
Symmetric Multiprocessors (SMPs)
Commodity microprocessors connected to a single shared memory through a high-speed interconnect (typically bus or crossbar).
- Symmetric — each processor has exactly the same abilities, any processor can do anything
- Single physical address space — other than processors, there is one copy of everything else (memory, I/O system, OS, etc)
- Hardware-supported cache coherence — typically via snoopy protocols
- Typically small scale
The OS on an SMP can hide the multiprocessor from applications, so programming such a machine can be no different from programming a sequential machine. For this reason, SMPs are heavily favored for running commercial applications.
All major vendors of computer systems produce and sell these types of machines: Sun, SGI, HP/Compaq, IBM, Intel, ...
Sample Machine: Sun Enterprise 6000
- 30 UltraSparc (each processing board has two processors)
- Gigaplane bus (peak 2.67 GB/s)
- Non-multiplexed, 256-bit data lines, 41-bit address lines, and 18-bit arbitration lines, clock rate 83.5 MHz
- The data lines are really a 16x16 crossbar
- Split-transaction, supporting 112 outstanding transactions
- Up to 30 GB 16-way main memory
- The I/O bandwidth scales with the number of I/O boards
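The quoted peak bandwidth of the Gigaplane bus can be checked from its data width and clock rate, assuming one full 256-bit transfer per clock cycle (an assumption; the slide does not state it). The function name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(data_bits: int, clock_hz: float) -> float:
    """Peak bus bandwidth, assuming one full-width transfer per clock cycle."""
    bytes_per_transfer = data_bits // 8
    return bytes_per_transfer * clock_hz


# 256 data lines at 83.5 MHz, the figures given for the Gigaplane bus.
bw = peak_bandwidth_bytes_per_s(256, 83.5e6)
print(f"{bw / 1e9:.2f} GB/s")
```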
Memory-Cache Consistency
- In modern computer architectures, a memory hierarchy (main memory plus multiple levels of cache) is used to overcome the memory access latency problem.
- A single data item may have multiple copies residing in different levels of the memory hierarchy, and these copies may not always be identical.
- On a uniprocessor, the disagreement between the cache and the memory is not a problem, because the cache copy is always accessed first.
- But on a shared-memory multiprocessor system, multiple caches are connected to the same memory.
Consistency Problems — Example 1
With Write-Through Caches:
- Processor P1 reads x from main memory, bringing a copy to its cache
- Processor P2 reads x from main memory, bringing a copy to its cache
- P2 changes x’s value, the new value will be copied to main memory
- Processor P1 reads x’s value again — it gets the old value from its cache!
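The four steps above can be replayed in a toy model. This is a sketch under simplifying assumptions: each processor has a private dictionary as its cache, there is no coherence protocol, and the class and method names are invented for illustration.

```python
class WriteThroughSystem:
    """Toy model: private write-through caches with no coherence protocol."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.caches = {}  # processor name -> that processor's private cache

    def read(self, proc, addr):
        cache = self.caches.setdefault(proc, {})
        if addr not in cache:              # miss: fetch from main memory
            cache[addr] = self.memory[addr]
        return cache[addr]                 # hit: main memory is not consulted

    def write(self, proc, addr, value):
        self.caches.setdefault(proc, {})[addr] = value
        self.memory[addr] = value          # write-through: memory updated at once


system = WriteThroughSystem({"x": 0})
system.read("P1", "x")        # P1 caches x = 0
system.read("P2", "x")        # P2 caches x = 0
system.write("P2", "x", 42)   # new value reaches P2's cache and main memory
print("memory:", system.memory["x"], "P1 sees:", system.read("P1", "x"))
```

Main memory holds the new value, yet P1 still reads its own stale cached copy.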
Consistency Problems — Example 2
With Write-Back Caches:
- P2 changes x’s value, the new value stays in P2’s cache
- P1 reads x, gets the stale value from its cache
- Any other processor that reads x will also get the stale value from memory
In addition, if multiple processors have distinct values of x to write back, the final value of x in main memory will be determined by the order in which the cache lines arrive at the destination, which may not have anything to do with the order of the writes to x.
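A toy model of the write-back case, again with private dictionary caches and no coherence protocol (all names are invented for illustration). It shows both the stale reads and the write-back ordering problem: the value left in memory is decided by eviction order, not by the order of the writes.

```python
class WriteBackSystem:
    """Toy model: private write-back caches with no coherence protocol."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.caches = {}    # processor name -> private cache
        self.dirty = set()  # (processor, address) pairs not yet written back

    def read(self, proc, addr):
        cache = self.caches.setdefault(proc, {})
        if addr not in cache:
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, proc, addr, value):
        self.caches.setdefault(proc, {})[addr] = value
        self.dirty.add((proc, addr))       # memory is NOT updated yet

    def evict(self, proc, addr):
        value = self.caches[proc].pop(addr)
        if (proc, addr) in self.dirty:     # write the line back on eviction
            self.dirty.discard((proc, addr))
            self.memory[addr] = value


system = WriteBackSystem({"x": 0})
system.read("P1", "x")               # P1 caches x = 0
system.write("P2", "x", 1)           # new value stays in P2's cache
stale_p1 = system.read("P1", "x")    # stale value from P1's cache
stale_p3 = system.read("P3", "x")    # stale value from main memory

system.write("P1", "x", 2)           # P1 writes AFTER P2 ...
system.evict("P1", "x")              # ... but P1's line is written back first
system.evict("P2", "x")              # P2's older value lands last
final = system.memory["x"]
print(stale_p1, stale_p3, final)
```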
Solutions: Invalidate or Update
- Invalidate — whenever a data item is written, all other copies in the memory system are invalidated.
- Update — whenever a data item is written, all other copies are updated.
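The two policies can be compared in a toy model. This is a sketch under stated assumptions: caches are write-through for simplicity, every write is visible to all caches at once, and the class, the `misses` counter, and the `run` helper are invented for illustration.

```python
class CoherentSystem:
    """Toy write-through caches kept coherent by invalidating or updating copies."""

    def __init__(self, memory, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.memory = dict(memory)
        self.policy = policy
        self.caches = {}
        self.misses = 0

    def read(self, proc, addr):
        cache = self.caches.setdefault(proc, {})
        if addr not in cache:
            self.misses += 1
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, proc, addr, value):
        self.caches.setdefault(proc, {})[addr] = value
        self.memory[addr] = value
        for other, cache in self.caches.items():
            if other == proc or addr not in cache:
                continue
            if self.policy == "invalidate":
                del cache[addr]        # the other copy is dropped
            else:
                cache[addr] = value    # the other copy is refreshed in place


def run(policy):
    system = CoherentSystem({"x": 0}, policy)
    system.read("P1", "x")
    system.read("P2", "x")
    system.write("P2", "x", 42)
    return system.read("P1", "x"), system.misses


print(run("invalidate"))
print(run("update"))
```

Both policies let P1 see the new value. In this model, invalidation costs P1 an extra miss on its next read, while updating costs a data transfer on every write.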
Directory Schemes
Non-Cache Coherent NUMA Systems
These systems are highly scalable. However, the hardware does not provide assistance in maintaining cache coherence.
Sample Systems: Cray T3E
- 3D mesh topology, scale up to 1024 processors
- High-bandwidth links: 480MB/s
- Distributed shared memory, memory controller generates communication requests for non-local references
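To make the idea of non-local references concrete, the sketch below splits a global address space into equal blocks, one per node, and classifies each reference as local or remote. The block layout, the function names, and the sizes are assumptions for illustration; this is not the T3E's actual address mapping.

```python
def home_node(address: int, words_per_node: int):
    """Map a global address to (home node, offset within that node's memory)."""
    return divmod(address, words_per_node)


def classify_reference(requesting_node: int, address: int, words_per_node: int):
    """Decide whether a reference stays local or needs a communication request."""
    node, offset = home_node(address, words_per_node)
    kind = "local" if node == requesting_node else "remote"
    return kind, node, offset


# Node 0 touches one address it owns and one owned by another node.
print(classify_reference(0, 100, 1024))
print(classify_reference(0, 3000, 1024))
```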
Network of Workstations (NOWs)
A cluster of workstations (or PCs) interconnected by a high-performance network. System software ensures that all processors in a cluster work together collectively as a single, integrated computing resource.
NOWs have been used successfully to solve large problems and as servers:
- Berkeley NOW Project
- Inktomi (now Yahoo) search engine
Beowulf Clusters: They are “supercomputing” clusters built primarily out of low-cost commodity hardware components, running a free-software operating system like Linux or FreeBSD.
Massive Parallel Processing Systems (MPPs)
These are message-passing MIMD systems:
- Each node consists of a processor and a local memory
- Nodes are connected by a point-to-point interconnection network
  - flat interconnection with a single topology, or
  - hierarchical interconnection
- Nodes share data by explicitly passing messages
Advantages:
- Scalable
- Good support for locality
Main Issues:
- Interconnection network topology
- Message passing protocols
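A minimal sketch of the message-passing style described above, with threads standing in for nodes and `queue.Queue` objects standing in for network links (both are assumptions for illustration; a real MPP would use a message-passing library over the interconnect). Each node touches only its own `local_data` and shares results by explicit send and receive. The names `node` and `distributed_sum` are hypothetical.

```python
import queue
import threading


def node(rank, local_data, inboxes, results):
    """One node: compute on local memory, then exchange explicit messages."""
    partial = sum(local_data)              # uses only this node's local memory
    if rank != 0:
        inboxes[0].put((rank, partial))    # explicit send to node 0
    else:
        total = partial
        for _ in range(len(inboxes) - 1):
            _, value = inboxes[0].get()    # explicit receive, blocks until a message arrives
            total += value
        results.append(total)


def distributed_sum(chunks):
    """Sum data that is already partitioned across nodes, one chunk per node."""
    inboxes = [queue.Queue() for _ in chunks]
    results = []
    threads = [
        threading.Thread(target=node, args=(rank, chunk, inboxes, results))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]


print(distributed_sum([[1, 2], [3, 4], [5, 6]]))
```

Unlike the shared-memory style, the data has to be partitioned up front and every exchange is written out explicitly.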
MPPs with a Hierarchical Interconnection
Example: ASCI Blue/White Supercomputers (i.e. IBM SP2)
Each cabinet (system frame) holds sixteen nodes, communicating through an SP Switch at 110MB/second peak, full duplex. To make a 128-processor setup, use eight cabinets.
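The cabinet count is a ceiling division. The sketch below assumes one processor per node, which is what the 128-processor figure implies; the function name and its defaults are hypothetical.

```python
import math


def cabinets_needed(processors: int, nodes_per_cabinet: int = 16,
                    processors_per_node: int = 1) -> int:
    """Number of cabinets required to house the given number of processors."""
    per_cabinet = nodes_per_cabinet * processors_per_node
    return math.ceil(processors / per_cabinet)


print(cabinets_needed(128))
```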
IBM SP2 Node and Frame:
IBM SP2 Communication System
Large-Scale Cluster Systems
Large-scale clusters offer an attractive alternative to MPPs for supercomputing:
- The latest processors can easily be incorporated into the system as they become available.
- They tend to be more scalable.
The IBM Roadrunner System
World’s fastest computer (since 07/2008).
- Is considered an Opteron cluster with Cell accelerators.
- Each node consists of a Cell attached to an Opteron core, and the Opterons are connected to each other.
- Total of 6,948 dual-core Opterons and 12,960 Cell chips in 294 racks.
- The final cluster is made up of 18 connected units, which are connected via eight additional (second-stage) Infiniband ISR switches.