“The Architecture of Massively Parallel Processor CP-PACS”


Outline

• Introduction
• Specification of CP-PACS
• Pseudo Vector Processor PVP-SW
• Interconnection Network of CP-PACS
  – Hyper-crossbar Network
  – Remote DMA message transfer
  – Message broadcasting
  – Barrier synchronization
• Performance Evaluation
• Conclusion, References, Questions & Comments


Introduction

• CP-PACS: Computational Physics by Parallel Array Computer Systems

• Goal: construct a dedicated MPP for computational physics, in particular the study of Quantum Chromodynamics (QCD).

• Built at the Center for Computational Physics, University of Tsukuba, Japan.


Specification of CP-PACS

• MIMD parallel processing system with distributed memory.

• Each Processing Unit (PU) has a RISC processor and a local memory.

• 2048 such PUs, connected by an interconnection network.

• 128 I/O units that support a distributed disk space.



Specification of CP-PACS

• Theoretical performance
  – To solve problems such as QCD and astrophysical fluid dynamics, a great number of PUs is required.
  – For budget and reliability reasons, the number of PUs is limited to 2048.


Specification of CP-PACS

• Node processor
  – Improve the node processors first.
  – Caches do not work efficiently on ordinary RISC processors for these workloads.
  – A new technique for the cache function is introduced: PVP-SW.


Specification of CP-PACS

• Interconnection Network
  – 3-dimensional Hyper-Crossbar (3-D HXB)
  – Peak throughput of a single link: 300 MB/sec
  – Provides:
    • Hardware message broadcasting
    • Block-stride message transfer
    • Barrier synchronization


Specification of CP-PACS

• I/O system
  – 128 I/O units, each equipped with a RAID-5 hard disk system.
  – 528 GB of total system disk space.
  – RAID-5 increases fault tolerance: distributed parity lets the array survive a single disk failure (see the sketch below).
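To make the fault-tolerance claim concrete, here is a tiny C illustration of the XOR-parity idea behind RAID-5. Parity rotation and real stripe sizes are simplified away, and the block values are arbitrary.

```c
#include <stdio.h>

#define DISKS 4  /* 3 data blocks + 1 parity block per stripe */

int main(void) {
    unsigned char stripe[DISKS] = { 0x5a, 0x3c, 0x9f, 0 };
    stripe[3] = stripe[0] ^ stripe[1] ^ stripe[2];   /* write parity */

    /* Disk 1 fails: rebuild its block by XOR-ing the survivors. */
    unsigned char rebuilt = stripe[0] ^ stripe[2] ^ stripe[3];
    printf("lost 0x%02x, rebuilt 0x%02x\n", stripe[1], rebuilt);
    return 0;
}
```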


Pseudo Vector Processor PVP-SW

• MPPs require high performance node processors.

• A node processor cannot achieve high performance unless the cache system works efficiently. In these applications:
  – Little temporal locality exists.
  – The data space of the application is much larger than the cache size.


Pseudo Vector Processor PVP-SW

• Vector processors
  – Main memory access is pipelined.
  – The vector length of load/store is long.
  – Load/store is executed in parallel with arithmetic execution.
• We require these properties in our node processor.
  – PVP-SW is introduced.
  – It is pseudo-vector: the same effect is obtained on a RISC processor.


Pseudo Vector Processor PVP-SW

• The number of registers cannot simply be increased, because the register field in the instruction format is limited.

• So, a new technique, Slide-Windowed Registers is introduced.


Pseudo Vector Processor PVP-SW

• Slide-Windowed Registers
  – The physical register file is organized as logical windows; each window consists of 32 registers.
  – The total number of physical registers is 128.
  – Two classes: global registers and window registers.
    • Global registers are static and shared by all windows.
    • Window (local) registers are not shared.
  – Only one window is active at any given time.


Pseudo Vector Processor PVP-SW

• Slide-Windowed Registers
  – The active window is identified by a pointer, FW-STP.
  – New instructions are introduced to deal with FW-STP (see the sketch below):
    • FWSTPSet: sets a new location for FW-STP.
    • FRPreload: loads data from memory into a window.
    • FRPoststore: stores data from a window into memory.
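A minimal C model of the sliding-window mechanism may help. The 128-register file and 32-register windows come from the slides; the number of global registers and the wrap-around mapping are assumptions for illustration, and FWSTPSet is modeled by assigning to a pointer variable.

```c
#include <stdio.h>

#define PHYS_REGS 128  /* total physical registers (from the slides) */
#define WINDOW     32  /* registers visible in one window            */
#define GLOBALS     8  /* assumed number of shared global registers  */

static double regfile[PHYS_REGS];
static int    fw_stp = 0;          /* window pointer, set by FWSTPSet */

/* Map a logical register (0..31) to a physical one: globals bypass
 * the window; window registers are offset by FW-STP, wrapping mod 128. */
static int phys(int logical) {
    return logical < GLOBALS ? logical
                             : (fw_stp + logical) % PHYS_REGS;
}

int main(void) {
    regfile[phys(10)] = 1.0;   /* write window register r10           */
    fw_stp = 24;               /* FWSTPSet: slide to a new window      */
    regfile[phys(10)] = 2.0;   /* same logical name, new physical reg */
    printf("%.1f %.1f\n", regfile[10], regfile[34]);   /* 1.0 2.0     */
    return 0;
}
```

In a PVP-SW loop, FRPreload can fill an upcoming window and FRPoststore can drain a finished one while arithmetic proceeds in the active window, which is how the load/store-arithmetic overlap of a vector pipeline is emulated.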



Interconnection Network of CP-PACS

• The topology is a Hyper-Crossbar Network (HXB).
• 8 x 17 x 16 = 2176 nodes: 2048 PUs plus 128 I/O units.
• Along each dimension, the nodes are interconnected by a crossbar.
  – For example, along the Y dimension, a Y x Y crossbar is used.
• Routing is simple: route along the three dimensions consecutively (see the sketch below).
  – Wormhole routing is employed.
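Dimension-order routing on the 3-D HXB can be sketched in a few lines of C; the node coordinates and the next_hop helper are our own illustration, not CP-PACS hardware names. One crossbar traversal corrects a whole coordinate, so any message reaches its destination in at most three hops.

```c
#include <stdio.h>

typedef struct { int x, y, z; } node_t;

/* Dimension-order routing: fix X first, then Y, then Z. */
static node_t next_hop(node_t cur, node_t dst) {
    if (cur.x != dst.x) { cur.x = dst.x; return cur; } /* X crossbar */
    if (cur.y != dst.y) { cur.y = dst.y; return cur; } /* Y crossbar */
    cur.z = dst.z;                                     /* Z crossbar */
    return cur;
}

int main(void) {
    node_t cur = {0, 0, 0}, dst = {5, 12, 7};
    while (cur.x != dst.x || cur.y != dst.y || cur.z != dst.z) {
        cur = next_hop(cur, dst);
        printf("hop to (%d,%d,%d)\n", cur.x, cur.y, cur.z);
    }
    return 0;
}
```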


Interconnection Network of CP-PACS

• Wormhole routing and the HXB together have these properties:
  – Small network diameter.
  – A torus of the same size can be simulated.
  – Message broadcasting by hardware.
  – A binary hypercube can be emulated.
  – Throughput is high even under random transfer patterns.


Interconnection Network of CP-PACS

• Remote DMA transfer
  – Making a system call to the OS and copying data into the OS area is messy.
  – Instead, the remote node's memory is accessed directly (see the sketch below).
• Remote DMA is good because:
  – Mode switching (kernel/user mode) is avoided.
  – Redundant data copying (user ↔ kernel space) is not done.
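A hedged sketch of what a user-level remote-DMA put could look like follows; rdma_put() and its signature are hypothetical, not the CP-PACS interface. The point it illustrates: data moves straight from a user buffer to a remote user buffer, with no kernel-mode switch and no copy through kernel space.

```c
#include <stdio.h>
#include <string.h>

/* Stub standing in for the network hardware: on a real machine the
 * adapter writes len bytes from src into dst_addr on node dst_node. */
static int rdma_put(int dst_node, void *dst_addr,
                    const void *src, size_t len) {
    (void)dst_node;              /* single-process stand-in */
    memcpy(dst_addr, src, len);  /* models the remote write */
    return 0;
}

int main(void) {
    double halo[4]   = { 1, 2, 3, 4 };  /* boundary data to send   */
    double remote[4] = { 0 };           /* "remote" receive buffer */
    rdma_put(1, remote, halo, sizeof halo);  /* one user-mode call */
    printf("%g %g %g %g\n", remote[0], remote[1], remote[2], remote[3]);
    return 0;
}
```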


Interconnection Network of CP-PACS

• Message Broadcasting
  – Supported by hardware:
    • First, broadcast along one dimension.
    • Then, broadcast along the other dimensions.
  – Hardware mechanisms are present to prevent the deadlock that could be caused by two nodes broadcasting at the same time.
  – Hardware partitioning is possible: a broadcast message is sent only to the nodes in the sender's partition (see the sketch below).
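The dimension-wise expansion is easy to see by counting message holders after each phase; the sketch below assumes an X, then Y, then Z order and uses the CP-PACS dimension sizes.

```c
#include <stdio.h>

int main(void) {
    const int dim[3] = { 8, 17, 16 };  /* X, Y, Z crossbar sizes   */
    int holders = 1;                   /* only the source at first */
    for (int d = 0; d < 3; d++) {
        /* Every holder broadcasts along one crossbar, so coverage
         * multiplies by that dimension's size. */
        holders *= dim[d];
        printf("after phase %d: %d nodes\n", d + 1, holders);
    }
    return 0;                          /* final phase: 2176 nodes  */
}
```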


Interconnection Network of CP-PACS

• Barrier Synchronization
  – A synchronization mechanism is required in interprocessor communication.
  – CP-PACS supports a hardware barrier synchronization facility (see the sketch below).
    • It makes use of special synchronization packets, distinct from the usual data packets.
  – CP-PACS also allows partitioned pieces of the network to use barrier synchronization.
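The sketch below shows the role such a barrier plays in an SPMD timestep; hw_barrier() is a hypothetical wrapper around the synchronization-packet facility, stubbed out here since the slides do not give the actual call.

```c
#include <stdio.h>

static void hw_barrier(void) {
    /* Real hardware: emit a synchronization packet and stall until
     * every PU in the partition has done the same. Stub: no-op.   */
}

int main(void) {
    /* Phase 1: each PU writes halo data into its neighbors
     * (e.g., by remote DMA). */
    printf("exchange boundaries\n");

    hw_barrier();  /* no PU proceeds until all transfers complete */

    /* Phase 2: each PU may now safely read the data it received. */
    printf("update local field\n");
    return 0;
}
```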


Performance Evaluation

• Based on the LINPACK benchmark.
  – LU decomposition of a matrix.
  – The outer-product method is used, based on a 2-dimensional block-cyclic distribution (see the sketch below).

• All floating-point and data load/store operations are done in the PVP-SW manner.
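For reference, 2-D block-cyclic ownership can be computed as below; the process-grid shape P x Q and the block size NB are illustrative values, not the CP-PACS configuration. In the outer-product method, each elimination step broadcasts a panel and every process updates the trailing blocks it owns.

```c
#include <stdio.h>

#define P  2   /* process-grid rows (assumed)       */
#define Q  2   /* process-grid cols (assumed)       */
#define NB 64  /* distribution block size (assumed) */

/* Owner of global element (i, j) under 2-D block-cyclic distribution:
 * block (i/NB, j/NB) is assigned to process (i/NB mod P, j/NB mod Q). */
static void owner(int i, int j, int *pr, int *pc) {
    *pr = (i / NB) % P;
    *pc = (j / NB) % Q;
}

int main(void) {
    int pr, pc;
    owner(200, 70, &pr, &pc);
    printf("element (200,70) lives on process (%d,%d)\n", pr, pc);
    return 0;
}
```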


Performance Evaluation

[Figure: Rmax (Gflops/sec) vs. number of PUs (2^n, n = 1..12)]


Performance Evaluation

[Figure: Nmax (matrix size) vs. number of PUs (2^n, n = 1..12)]


Performance Evaluation

[Figure: Rmax/peak (effectiveness) vs. number of PUs (2^n, n = 1..12)]


Conclusion

• CP-PACS is operational at the University of Tsukuba.

• It is working on large-scale QCD calculations.
• The project is sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture, Japan.

