Advanced Computer Architecture: Parallel Systems and Parallel Computers, Schemes and Mind Maps of Advanced Computer Architecture

An overview of parallel computers, their role in the future of computing, and the different types of parallelism. It covers the concepts of bit-level, instruction-level, process/thread-level, and job-level parallelism, as well as applications in scientific computing and commercial industries. The document also discusses programming models, communication abstractions, and taxonomy of parallel architecture. It includes examples of communication architectures and performance keys.

Typology: Schemes and Mind Maps

2021/2022

Uploaded on 09/07/2022

adnan_95

CMSC 611: Advanced
Computer Architecture
Parallel Systems

Parallel Computers

Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”

  • Almasi and Gottlieb, Highly Parallel Computing

Parallel machines are expected to play a bigger role in the future because:

  • Microprocessors are likely to remain dominant in the uniprocessor arena, and the logical way to extend performance is to connect multiple microprocessors
  • Microprocessor technology is not expected to keep up its pace of performance improvement, given the increasing level of complexity
  • There has been steady progress in software development for parallel architectures in recent years

  • Slide is a courtesy of Dave Patterson

Levels of Parallelism

Bit-level parallelism

• ALU parallelism: 1-bit, 4-bit, 8-bit, ...

Instruction-level parallelism (ILP)

• Pipelining, Superscalar, VLIW, Out-of-Order execution

Process/Thread-level parallelism

• Divide job into parallel tasks

Job-level parallelism

• Independent jobs on one computer system
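As a rough illustration of process/thread-level parallelism ("divide job into parallel tasks"), the sketch below splits one job, a sum of squares, into chunks and hands each chunk to a worker thread. The function names (`split`, `sum_of_squares`, `parallel_sum_of_squares`) and the choice of workload are assumptions made for this example, not part of the slides. In CPython the threads will not speed up pure computation because of the global interpreter lock; the point here is only the decomposition into independent tasks and the final combination step.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Cut a job into at most `parts` contiguous chunks (hypothetical helper)."""
    items = list(items)
    size = max(1, -(-len(items) // parts))  # ceiling division, never zero
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The task each worker runs independently on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=4):
    """Divide the job into tasks, run them on a thread pool, combine results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
        return sum(partial_sums)
```

Calling `parallel_sum_of_squares(range(1000))` should give the same value as the sequential `sum(x * x for x in range(1000))`; only the way the work is partitioned differs.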

Applications

Scientific Computing

  • Nearly Unlimited Demand (Grand Challenge):
  • Successes in some real industries:
    • Petroleum: reservoir modeling
    • Automotive: crash simulation, drag analysis, engine
    • Aeronautics: airflow analysis, engine, structural mechanics
    • Pharmaceuticals: molecular modeling
      • Slide is a courtesy of Dave Patterson

App                      Perf (GFLOPS)   Memory (GB)
48 hour weather          0.1             0.
72 hour weather          3               1
Pharmaceutical design    100             10
Global Change, Genome    1000            1000

Framework

Extend traditional computer architecture with a communication architecture

  • abstractions (HW/SW interface)
  • organizational structure to realize abstraction efficiently

Programming Model:

  • Multiprogramming: lots of jobs, no communication
  • Shared address space: communicate via memory
  • Message passing: send and receive messages
  • Data Parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)

Communication Abstraction:

  • Shared address space: e.g., load, store, atomic swap
  • Message passing: e.g., send, receive library calls
  • Debate over this topic (ease of programming, scaling)

→ many hardware designs map 1:1 onto a programming model
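To make the shared address space abstraction (load, store, atomic swap) more concrete, here is a minimal sketch in which threads communicate only through shared memory. Python has no user-level atomic swap instruction, so the `AtomicCell` class below simulates one with an internal `threading.Lock`; `AtomicCell`, `SpinLock`, and `shared_counter_demo` are hypothetical names introduced for this example. The spinlock built on top of the swap protects an ordinary load-add-store sequence on a shared counter.

```python
import threading
import time


class AtomicCell:
    """Simulated atomic swap: an assumption standing in for a hardware primitive."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def swap(self, new):
        # Atomically store `new` and return the previous value.
        with self._guard:
            old, self._value = self._value, new
            return old


class SpinLock:
    """Lock built only from atomic swap: 1 means held, 0 means free."""

    def __init__(self):
        self._flag = AtomicCell(0)

    def acquire(self):
        # Keep swapping in 1 until the old value was 0 (the lock was free).
        while self._flag.swap(1) == 1:
            time.sleep(0)  # yield so the current holder can make progress

    def release(self):
        self._flag.swap(0)


def shared_counter_demo(n_threads=4, increments=200):
    shared = {"count": 0}  # memory visible to every thread
    lock = SpinLock()

    def work():
        for _ in range(increments):
            lock.acquire()
            shared["count"] += 1  # load, add, store on shared memory
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

With the defaults, `shared_counter_demo()` should return 800, since every increment happens while the lock is held.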

Taxonomy of Parallel Architecture

Flynn Categories

• SISD (Single Instruction Single Data)

• MISD (Multiple Instruction Single Data)

• SIMD (Single Instruction Multiple Data)

• MIMD (Multiple Instruction Multiple Data)

  • Slide is a courtesy of Dave Patterson

MISD

No commercial examples

Apply different operations to the same data

• Find primes

• Crack passwords
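Under Flynn's definition of MISD (multiple instruction streams, single data stream), the prime-finding example can be read as many units that each apply a different divisibility test to the same number. The sketch below models that reading; it is an interpretation of the slide, and the name `is_prime_misd` is invented for this example. The units run one after another here, whereas an MISD machine would evaluate them at the same time.

```python
import math


def is_prime_misd(n):
    """Test one datum `n` with many different 'instruction streams'."""
    if n < 2:
        return False
    # Each unit is a different operation: "is the input divisible by d?"
    units = [lambda x, d=d: x % d == 0 for d in range(2, math.isqrt(n) + 1)]
    # Every unit examines the same single datum n; n is prime if no unit fires.
    return not any(unit(n) for unit in units)
```

Password cracking fits the same pattern: each unit tries a different candidate key against the same stored hash.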

SIMD

Vector/Array computers

Data Parallel Model

Operations performed in parallel on each element of a large regular data structure, such as an array

  • One control processor broadcasts to many processing elements (PEs), with a condition flag per PE so that a PE can skip an operation

For a distributed memory architecture, data is distributed among the memories

  • Data parallel model requires fast global synchronization
  • Data parallel programming languages lay out data across processors
  • Vector processors have similar ISAs, but no data placement restriction

  • Slide is a courtesy of Dave Patterson

SIMD Utilization

Conditional Execution

  • PE Enable
    • if (f<.5) {...}
  • Global PE enable check
    • while (t > 0) {...}

[Figure: a controller with program memory and data, connected to 16 PEs, each with its own data]

[Figure: the same 16 PEs with example values of f: 1, 2, 1.5, 0, 3, -1, 1, 0, 2.5, 2, .2, -3, 0, -1, -6, 0]
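The effect of conditional execution on SIMD utilization can be sketched with the sixteen values of f from the figure. In this hypothetical model, every PE receives the broadcast instruction, but only PEs whose enable flag is set carry it out; the helper names (`pe_enable_mask`, `masked_step`, `global_enable_loop`) are assumptions for illustration. The loop version models the global PE enable check: the controller keeps iterating as long as at least one PE still satisfies the condition.

```python
# Values of f held by the 16 PEs in the figure.
f = [1, 2, 1.5, 0, 3, -1, 1, 0, 2.5, 2, 0.2, -3, 0, -1, -6, 0]


def pe_enable_mask(values, predicate):
    """Per-PE condition flag: True where the PE should execute."""
    return [predicate(v) for v in values]


def masked_step(values, mask, op):
    """Broadcast `op` to all PEs; disabled PEs keep their old value."""
    return [op(v) if enabled else v for v, enabled in zip(values, mask)]


# if (f < .5) {...}: only some PEs do useful work inside the branch.
mask = pe_enable_mask(f, lambda v: v < 0.5)
utilization = sum(mask) / len(mask)
result = masked_step(f, mask, lambda v: v + 1)


def global_enable_loop(t):
    """while (t > 0) {...}: iterate until no PE is enabled any more."""
    t = list(t)
    steps = 0
    while any(v > 0 for v in t):       # global check across all PEs
        mask = [v > 0 for v in t]      # per-PE enable for this iteration
        t = masked_step(t, mask, lambda v: v - 1)
        steps += 1
    return steps, t
```

For these values, 9 of the 16 PEs are enabled inside the branch, so the remaining 7 sit idle while it executes. In the loop, the whole array is held for as many iterations as the slowest PE needs.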

Communication: CM

Hypercube local routing

Wormhole global routing

[Figure: a controller with program memory and data, connected to 16 PEs, each with its own data]
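For readers unfamiliar with hypercube interconnects, the following sketch shows the general idea: in a d-dimensional hypercube, two nodes are neighbors when their addresses differ in exactly one bit, and a message can be routed by correcting the differing bits one dimension at a time. This is generic dimension-order routing, offered as an assumption for illustration; the slides do not say which algorithm the CM router uses, and the function names are invented here.

```python
def hypercube_neighbors(node, dims):
    """Addresses reachable in one hop: flip exactly one of the `dims` bits."""
    return [node ^ (1 << d) for d in range(dims)]


def dimension_order_route(src, dst, dims):
    """Path from src to dst, fixing differing address bits from low to high."""
    path = [src]
    current = src
    for d in range(dims):
        bit = 1 << d
        if (current ^ dst) & bit:   # addresses differ in dimension d
            current ^= bit          # hop along that dimension
            path.append(current)
    return path
```

With 16 PEs (`dims=4`), `dimension_order_route(0b0101, 0b0110, 4)` visits nodes 5, 4, 6: two hops, one per differing address bit.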

Communication: PixelFlow

Dense connections within block

  • Single swizzle operation collects one word from each PE in block
    • Designed for antialiasing
  • NO inter-block connections
  • NO global routing

[Figure: a controller with program memory and data, connected to 16 PEs, each with its own data]

Message passing

Processors have private memories, communicate via messages

Advantages:

• Less hardware, easier to design

• Focuses attention on costly non-local operations

Message Passing Model

Each PE has local processor, data, (I/O)

• Explicit I/O to communicate with other PEs

• Essentially NUMA but integrated at I/O vs. memory system

Free run between Send & Receive

• Send + Receive = Synchronization between processes (event model)

– Send: local buffer, remote receiving process/port

– Receive: remote sending process/port, local buffer
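The send/receive pairing can be sketched with two threads that share no data and interact only through ports. The `Port` class, built on the standard `queue.Queue`, and the `worker` and `run_exchange` functions are hypothetical names for this example; a real message-passing machine would move the buffers over an interconnect. Each side runs freely until it reaches a receive, which blocks until the matching send has happened, so the pair acts as a synchronization point.

```python
import queue
import threading


class Port:
    """A mailbox owned by the receiving process (hypothetical abstraction)."""

    def __init__(self):
        self._messages = queue.Queue()

    def send(self, local_buffer):
        # Copy the sender's local buffer; nothing is shared afterwards.
        self._messages.put(list(local_buffer))

    def receive(self):
        # Blocks until a message arrives, then fills the local buffer.
        return self._messages.get()


def worker(inbox, outbox):
    data = inbox.receive()      # wait for work from the other PE
    outbox.send([sum(data)])    # reply with a one-word result


def run_exchange(values):
    to_worker, to_main = Port(), Port()
    t = threading.Thread(target=worker, args=(to_worker, to_main))
    t.start()
    to_worker.send(values)      # Send: local buffer, remote port
    reply = to_main.receive()   # Receive: remote sender's port, local buffer
    t.join()
    return reply
```

For example, `run_exchange([1, 2, 3, 4])` returns `[10]`; the main thread cannot read the reply before the worker has sent it.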