Parallel Architecture: Understanding Different Types of Parallel Computing Systems

This document, from Portland State University, provides an overview of parallel computing systems, including their taxonomy, current architectures, and solutions to cache consistency problems. Topics covered include single instruction, multiple data (SIMD) systems; multiple instruction, multiple data (MIMD) systems; cache coherence; and non-cache-coherent NUMA systems. The document also discusses processor-level parallelism, system-level architectures, and massively parallel processing systems.

Uploaded on 08/16/2009

Parallel Architecture

Jingke Li

Portland State University

Jingke Li (Portland State University) CS 415/515 Parallel Architecture 1 / 36

Parallel Computers

A “conventional” computer consists of

  • a single CPU (typically with a set of pipelined functional units)
  • a single memory hierarchy (i.e. caches and main memory)

and operations are performed one after another in a sequential order.

A parallel computer, in contrast, has multiple components. There are many ways to organize a parallel computer:

  • Single core with multiple functional units
  • Multiple cores with shared cache hierarchy
  • Multiple processors with shared memory module
  • Multiple processors with individual memory modules
  • Clusters of subsystems

Flynn’s Taxonomy

S: Single, M: Multiple, I: Instruction stream, D: Data stream

  • SISD — Conventional uniprocessors
  • MISD — Not very realistic
  • SIMD — Same operation is simultaneously applied to multiple data items
    • Can take different forms, e.g. single CPU computers, vector computers, and large-scale supercomputers
    • Recent SIMD computers are mostly small-scale
  • MIMD — Multiple threads of instructions operating on multiple threads of data. In the most general case, threads could be independent programs!
    • A very broad category; can be refined into many sub-categories
    • Most current large-scale parallel systems fall into this category
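
The four category names follow mechanically from the two stream counts. A minimal Python sketch of that mapping; the function name `flynn_class` and its integer arguments are illustrative assumptions, not part of the slides:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category name for the given stream counts.

    Hypothetical helper: a count of 1 maps to "S" (single), anything
    larger maps to "M" (multiple).
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"
```

For example, `flynn_class(1, 8)` gives `SIMD` (one instruction stream applied to eight data items), and `flynn_class(4, 4)` gives `MIMD`.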

Parallel Architectures Today

  • Processor-Level Parallelism
    • Single CPU with special parallel instructions
    • Multi-core processors
    • GPUs
    • Vector processors
  • System-Level Architectures
    • SIMD systems
    • Symmetric multiprocessors (SMPs)
    • Non-uniform memory access machines (NUMAs)
    • Network of workstations (NOWs)
  • Supercomputer Architectures
    • Massively Parallel Processing Systems (MPPs) (thousands of processors)
    • Large-scale clusters (tens of thousands of processors)
    • Constellations (clusters of powerful vector processors)

Multi-Core Processors

Two or more processors on the same chip — chip-level multiprocessing. The processors have individual L1 caches, but share a common L2 cache.

Advantages (in comparison to multi-chip SMP designs):

  • Faster cache snoop operations — the signals do not have to travel off-chip
  • Smaller physical package — smaller circuitry space is needed
  • Less power consumption — since signals are on-chip, the cores can operate at lower voltages

Disadvantages:

  • Require OS and application software support
  • More difficult to manage thermally than single-CPU chips (due to the higher integration)
  • CPU power may be underutilized for some applications, since scaling efficiency is largely dependent on the application or problem set.

Graphics Processing Units (GPUs)

GPUs are dedicated graphics rendering devices for PCs or game consoles.

  • The typical modern GPU sits on a graphics card separate from the motherboard, connected to the CPU and main RAM through the AGP or PCI Express bus.
  • A GPU implements a number of graphics primitive operations with highly parallel hardware components, so these operations run much faster than they would on a typical CPU.

The limitation: GPUs typically have very limited programmability, and hence have a very narrow application domain.

General-Purpose GPUs (GPGPUs)

A new trend in this field:

  • Add more flexibility to GPU’s programming model
  • Extend to non-graphical, but still matrix-based applications

GPGPUs are best described as co-processors. To handle general applications, CPU hosts are still needed.

Two GPU programming languages have emerged:

  • CUDA (Compute Unified Device Architecture) — developed by nVidia.
  • OpenCL (Open Computing Language) — initially conceived by Apple, now managed by the non-profit consortium Khronos Group.

Challenges:

  • Data movements between the host CPU and the GPU are slow.
  • Integer operations are weak.
  • Programming is still very hard.

Vector Processors

Vector processors are machines built primarily to handle large scientific and engineering calculations. Their performance derives from a heavily pipelined architecture which can execute special vector instructions very efficiently. They are the “traditional” supercomputers.

Key Components:

  • A set of pipelined functional units.
  • Special vector registers — Data is read into the vector registers, which are FIFO queues capable of holding 50-100 floating-point values.
  • Special vector instructions — such as loading/storing a vector register from/to a location in memory, performing operations on elements in the vector registers.
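
To make the register and instruction components concrete, here is a minimal Python sketch that simulates adding two long arrays with fixed-length vector registers. The register length of 64 is an assumption chosen from within the 50-100 range above, the function names are hypothetical, and the chunking loop is the technique usually called strip-mining, which the slides do not name:

```python
VECTOR_LENGTH = 64  # assumed register capacity, within the 50-100 range above


def vector_load(memory, start, count):
    """Simulated vector load: copy `count` values from memory into a register."""
    return memory[start:start + count]


def vector_add(reg_a, reg_b):
    """Simulated vector instruction: element-wise add of two registers."""
    return [a + b for a, b in zip(reg_a, reg_b)]


def vector_store(memory, start, reg):
    """Simulated vector store: write a register back to memory."""
    memory[start:start + len(reg)] = reg


def strip_mined_add(a, b):
    """Add two equal-length arrays one register-sized strip at a time."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result = [0.0] * len(a)
    for start in range(0, len(a), VECTOR_LENGTH):
        count = min(VECTOR_LENGTH, len(a) - start)  # last strip may be short
        reg_a = vector_load(a, start, count)
        reg_b = vector_load(b, start, count)
        vector_store(result, start, vector_add(reg_a, reg_b))
    return result
```

On real hardware each simulated call would be a single vector instruction streaming through the pipelined functional units; the Python loops here only model the data movement.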

Sample Machines: Early CRAY Series, CDC Cyber 205, IBM 3090 family, FPS-164.

MIMD Systems

MIMD systems consist of a collection of processors:

  • Each processor is capable of running a distinct thread of computation.
  • The processors coordinate on a joint program via a shared address space or through message passing.
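
As a rough illustration of the shared-address-space style, the Python sketch below runs three different computations as separate threads that all write into one shared dictionary. Python threads stand in for processors here, and the function name `run_mimd_shared` and the choice of tasks are assumptions made for the example:

```python
import threading


def run_mimd_shared(data):
    """Run distinct computations concurrently, coordinating via shared memory."""
    results = {}             # shared address space: visible to every thread
    lock = threading.Lock()  # protects concurrent updates to `results`

    def record(name, func):
        value = func(data)   # each thread executes its own instruction stream
        with lock:
            results[name] = value

    tasks = {"sum": sum, "max": max, "count": len}
    threads = [
        threading.Thread(target=record, args=(name, func))
        for name, func in tasks.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

No data is copied between workers: each thread reads `data` and writes `results` directly, which is the property the shared-address-space model provides.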

Shared-Memory MIMD Systems

The processors share a single address space. This address space can be realized either through a single physical memory accessible to all processors, or through a set of distributed memory modules attached to the processors. Respectively, the two sub-categories of systems are called symmetric multiprocessors (SMPs) and non-uniform memory access machines (NUMAs).

Advantages:

  • No need to partition or duplicate data
  • Less communication overhead
  • Programming style close to sequential programming style

Main Issues:

  • Scalability of the interconnection network
  • Memory-cache consistency

Symmetric Multiprocessors (SMPs)

Commodity microprocessors connected to a single shared memory through a high-speed interconnect (typically bus or crossbar).

  • Symmetric — each processor has exactly the same abilities; any processor can do anything
  • Single physical address space — other than processors, there is one copy of everything else (memory, I/O system, OS, etc.)
  • Hardware-supported cache coherence — typically via snoopy protocols
  • Typically small scale

The OS on an SMP can hide the multiprocessor from applications, so programming such a machine can be no different from programming a sequential machine. For this reason, SMPs are heavily favored for running commercial applications.

All major vendors of computer systems are producing and selling these types of machines: Sun, SGI, HP/Compaq, IBM, Intel, ...

Sample Machine: Sun Enterprise 6000

  • 30 UltraSPARC processors (each processing board has two processors)
  • Gigaplane bus (peak 2.67 GB/s)
    • Non-multiplexed, 256-bit data lines, 41-bit address lines, and 18-bit arbitration lines, clock rate 83.5 MHz
    • The data lines are really a 16x16 crossbar
    • Split-transaction, supporting 112 outstanding transactions
  • Up to 30 GB 16-way main memory
  • The I/O bandwidth scales with the number of I/O boards
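
The quoted peak bus bandwidth is consistent with the data width and clock rate listed above. A small Python check, assuming one full-width transfer per clock cycle and 1 GB = 10^9 bytes (neither assumption is stated on the slide; the function name is hypothetical):

```python
def peak_bandwidth_gb_per_s(data_width_bits, clock_mhz):
    """Peak bandwidth, assuming one full-width transfer per clock cycle."""
    bytes_per_cycle = data_width_bits / 8
    bytes_per_second = bytes_per_cycle * clock_mhz * 1e6
    return bytes_per_second / 1e9  # 1 GB taken as 10**9 bytes


gigaplane_peak = peak_bandwidth_gb_per_s(256, 83.5)
```

With 256 data lines, each cycle moves 32 bytes, and 32 bytes × 83.5 MHz is 2.672 GB/s, which matches the 2.67 GB/s figure.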

Memory-Cache Consistency

  • In modern computer architectures, a memory hierarchy (main memory plus multiple levels of cache) is used to overcome the memory access latency problem.
  • A single data item may have multiple copies residing in different levels of a memory hierarchy, and these copies may not always be identical.
  • On a uniprocessor, the disagreement between the cache and the memory is not a problem, because the cache copy is always accessed first.
  • But on a shared-memory multiprocessor system, multiple caches are connected to the same memory, so one processor’s cache can hold a copy that another processor has since changed.

Consistency Problems — Example 1

With Write-Through Caches:

  1. Processor P1 reads x from main memory, bringing a copy to its cache
  2. Processor P2 reads x from main memory, bringing a copy to its cache
  3. P2 changes x’s value; the new value is copied to main memory
  4. Processor P1 reads x’s value again — it gets the old value from its cache!
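
The four steps can be replayed with a toy model. The Python sketch below is an assumption-laden simulation, not a description of any real cache: memory is a dictionary, each cache is a private dictionary, and no coherence protocol exists. The class and function names are hypothetical.

```python
class WriteThroughCache:
    """Private write-through cache with no coherence protocol (toy model)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory, modeled as a dict
        self.lines = {}       # this processor's cached copies

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: memory is not consulted

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value           # write-through to main memory


def example_1():
    memory = {"x": 0}
    p1 = WriteThroughCache(memory)
    p2 = WriteThroughCache(memory)
    p1.read("x")          # step 1: P1 caches x
    p2.read("x")          # step 2: P2 caches x
    p2.write("x", 1)      # step 3: P2 updates its cache and main memory
    return p1.read("x"), memory["x"]  # step 4: P1 hits its own stale copy
```

`example_1()` returns `(0, 1)`: main memory holds the new value 1, but P1 still reads the old value 0 from its cache.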

Consistency Problems — Example 2

With Write-Back Caches:

  1. P2 changes x’s value; the new value stays in P2’s cache
  2. P1 reads x and gets the stale value from its cache
  3. Any other processor that reads x also gets the stale value, this time from main memory

In addition, if multiple processors hold distinct values of x to write back, the final value of x in main memory will be determined by the order in which the cache lines arrive at the destination, which may have nothing to do with the order of the writes to x.
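
The same kind of toy model shows both effects. In this Python sketch the class and function names are hypothetical, P1 is assumed to have cached x before step 1, and a third processor P3 is introduced to play the role of "any other processor" and to make a second, later write:

```python
class WriteBackCache:
    """Private write-back cache with no coherence protocol (toy model)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory, modeled as a dict
        self.lines = {}       # this processor's cached copies
        self.dirty = set()    # addresses modified but not yet written back

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value  # main memory is NOT updated here
        self.dirty.add(addr)

    def write_back(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)


def example_2():
    memory = {"x": 0}
    p1, p2, p3 = (WriteBackCache(memory) for _ in range(3))
    p1.read("x")                # assumed: P1 cached x earlier
    p2.write("x", 1)            # step 1: new value stays in P2's cache
    seen_by_p1 = p1.read("x")   # step 2: stale value from P1's cache
    seen_by_p3 = p3.read("x")   # step 3: stale value from main memory
    p3.write("x", 2)            # a later write by another processor
    p3.write_back("x")          # P3's line happens to arrive first
    p2.write_back("x")          # P2's older value arrives last and wins
    return seen_by_p1, seen_by_p3, memory["x"]
```

`example_2()` returns `(0, 0, 1)`: both readers see the stale 0, and main memory ends with 1 even though 2 was written later, because the write-backs arrived in the opposite order.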

Solutions: Invalidate or Update

  • Invalidate — whenever a data item is written, all other copies in the memory system are invalidated.
  • Update — whenever a data item is written, all other copies are updated.
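
A minimal Python sketch of the two policies, replaying the write-through scenario in which P1 and P2 both cache x and P2 then writes it. The model is an assumption for illustration: every cache can see a list of all caches and acts on it at write time, whereas real hardware does this through snooping or directories. The names `CoherentCache` and `rerun_with_policy` are hypothetical.

```python
class CoherentCache:
    """Toy write-through cache that invalidates or updates other copies."""

    def __init__(self, memory, all_caches, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.memory = memory
        self.all_caches = all_caches  # shared list of every cache in the system
        self.policy = policy
        self.lines = {}
        all_caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value  # write-through, to keep the sketch short
        for other in self.all_caches:
            if other is not self and addr in other.lines:
                if self.policy == "invalidate":
                    del other.lines[addr]      # force a miss on the next read
                else:
                    other.lines[addr] = value  # refresh the copy in place


def rerun_with_policy(policy):
    memory, all_caches = {"x": 0}, []
    p1 = CoherentCache(memory, all_caches, policy)
    p2 = CoherentCache(memory, all_caches, policy)
    p1.read("x")
    p2.read("x")
    p2.write("x", 1)
    p1_still_cached = "x" in p1.lines
    return p1_still_cached, p1.read("x")
```

`rerun_with_policy("invalidate")` returns `(False, 1)`: P1 lost its copy and re-fetched the new value. `rerun_with_policy("update")` returns `(True, 1)`: P1 kept its copy, which was refreshed in place. Either way P1 no longer reads the stale 0.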

Directory Schemes

Non-Cache Coherent NUMA Systems

These systems are highly scalable. However, the hardware does not provide assistance in maintaining cache coherence.

Sample Systems: Cray T3E

  • 3D mesh topology, scales up to 1024 processors
  • High-bandwidth links: 480 MB/s
  • Distributed shared memory; the memory controller generates communication requests for non-local references
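
The local versus non-local distinction can be sketched as an address-mapping rule. The Python model below is a simplification under stated assumptions and does not describe the T3E's actual addressing: each node owns a contiguous block of `WORDS_PER_NODE` words (a hypothetical size), and a direct lookup in another node's memory stands in for the network request that the memory controller would generate.

```python
WORDS_PER_NODE = 1024  # hypothetical size of each node's local memory


class NumaNode:
    """One processor plus its local slice of the global address space."""

    def __init__(self, node_id, machine):
        self.node_id = node_id
        self.machine = machine            # list of all nodes in the system
        self.local = [0] * WORDS_PER_NODE
        self.remote_requests = 0          # counts non-local references

    def _locate(self, global_addr):
        # Owner node and offset within that node's local memory.
        return divmod(global_addr, WORDS_PER_NODE)

    def load(self, global_addr):
        owner, offset = self._locate(global_addr)
        if owner != self.node_id:
            self.remote_requests += 1     # would be a network request
        return self.machine[owner].local[offset]

    def store(self, global_addr, value):
        owner, offset = self._locate(global_addr)
        if owner != self.node_id:
            self.remote_requests += 1
        self.machine[owner].local[offset] = value


def build_machine(num_nodes):
    machine = []
    for node_id in range(num_nodes):
        machine.append(NumaNode(node_id, machine))
    return machine
```

A store by node 2 to an address in its own block touches only local memory, while a load of the same address by node 0 increments node 0's `remote_requests` counter. That counter is a stand-in for the extra latency that makes memory access non-uniform.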

Network of Workstations (NOWs)

A cluster of workstations (or PCs) interconnected by a high-performance network. System software ensures that all processors in a cluster work together collectively as a single, integrated computing resource.

NOWs have been used successfully to solve large problems and as servers:

  • Berkeley NOW Project
  • Inktomi (now Yahoo) search engine

Beowulf Clusters: They are “supercomputing” clusters built primarily out of low-cost commodity hardware components, running a free-software operating system like Linux or FreeBSD.

Massively Parallel Processing Systems (MPPs)

These are message-passing MIMD systems:

  • Each node consists of a processor and a local memory
  • Nodes are connected by a point-to-point interconnection network
    • flat interconnection with a single topology, or
    • hierarchical interconnection
  • Nodes share data by explicitly passing messages
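
A minimal Python sketch of this model, with threads standing in for nodes and `queue.Queue` objects standing in for point-to-point links. The linear chain topology and the names `node` and `chain_sum` are assumptions made for the example. Each node touches only its own partition of the data and learns about the others solely through received messages:

```python
import queue
import threading


def node(local_data, inbox, outbox):
    """One node: compute on local memory, then pass a message downstream."""
    partial = sum(local_data)      # uses only this node's local memory
    if inbox is not None:
        partial += inbox.get()     # explicit receive from the previous node
    outbox.put(partial)            # explicit send to the next node


def chain_sum(partitions):
    """Sum data that is distributed across nodes connected in a chain."""
    if not partitions:
        raise ValueError("need at least one node")
    links = [queue.Queue() for _ in partitions]  # links[i] carries node i's output
    threads = []
    for rank, part in enumerate(partitions):
        inbox = links[rank - 1] if rank > 0 else None
        threads.append(
            threading.Thread(target=node, args=(part, inbox, links[rank]))
        )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return links[-1].get()         # the last node's message holds the total
```

For example, `chain_sum([[1, 2], [3, 4], [5]])` returns 15. Unlike the shared-memory style, the data has to be partitioned up front and every exchange is an explicit send or receive.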

Advantages:

  • Scalable
  • Good support for locality

Main Issues:

  • Interconnection network topology
  • Message passing protocols

MPPs with a Hierarchical Interconnection

Example: ASCI Blue/White Supercomputers (i.e. IBM SP2)

Each cabinet (system frame) holds sixteen nodes, communicating through an SP Switch at a peak of 110 MB/s, full duplex. A 128-processor setup uses eight cabinets.

IBM SP2 Node and Frame:

IBM SP2 Communication System

Large-Scale Cluster Systems

Large-scale clusters offer an attractive alternative to MPPs for supercomputing:

  • The latest processors can easily be incorporated into the system as they become available.
  • They tend to be more scalable.

The IBM Roadrunner System

World’s fastest computer (since 07/2008).

  • Is considered an Opteron cluster with Cell accelerators.
  • Each node consists of a Cell attached to an Opteron core, and the Opterons are connected to each other.
  • Total of 6,948 dual-core Opterons and 12,960 Cell chips in 294 racks.
  • The final cluster is made up of 18 connected units, which are connected via eight additional (second-stage) Infiniband ISR switches.