Vector Processors in Advanced Computer Architecture: Lecture Notes, Slides of Advanced Computer Architecture

Advanced Computer Architecture. Fall 2010. Lecture 7: Vector Processors ... Styles of Vector Architectures ... Vector equivalent of load-store architectures.

563 L06.1 Fall 2010
ECE 563
Advanced Computer Architecture
Fall 2010
Lecture 7: Vector Processors

Supercomputers

Definition of a supercomputer:

• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

CDC 6600 (Cray, 1964) regarded as first supercomputer

Vector Processing

Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

Properties of Vector Processing

• Each result independent of previous result
  • long pipeline, compiler ensures no dependencies
  • high clock rate
• Vector instructions access memory with known pattern
  • highly interleaved memory
  • amortize memory latency over 64 elements
  • no (data) caches required! (Do use instruction cache)
• Reduces branches and branch problems in pipelines
• Single vector instruction implies lots of work (~ loop)
  • fewer instruction fetches
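To make the "fewer instruction fetches" point concrete, the sketch below counts instructions fetched for an element-wise add of two n-element arrays. The per-element and per-strip instruction counts (6 and 4), the 64-element maximum vector length, and the function names are illustrative assumptions rather than figures from the lecture; the overhead of looping over 64-element strips for long arrays is ignored.

```python
import math

# Assumed scalar loop body: 2 loads, 1 add, 1 store, 1 index update, 1 branch.
SCALAR_INSTRS_PER_ELEMENT = 6
# Assumed vector sequence per strip: 2 vector loads, 1 vector add, 1 vector store.
VECTOR_INSTRS_PER_STRIP = 4
# Assumed maximum vector length (elements per vector register).
MAX_VECTOR_LENGTH = 64


def scalar_instruction_fetches(n):
    """Instructions fetched by a scalar loop over n elements."""
    return n * SCALAR_INSTRS_PER_ELEMENT


def vector_instruction_fetches(n):
    """Instructions fetched when the same work is done one strip at a time."""
    strips = math.ceil(n / MAX_VECTOR_LENGTH)
    return strips * VECTOR_INSTRS_PER_STRIP


if __name__ == "__main__":
    for n in (64, 1000):
        print(n, scalar_instruction_fetches(n), vector_instruction_fetches(n))
```

Under these assumptions a 64-element add needs 4 vector instructions where the scalar loop fetches several hundred, which is the effect the last bullet describes.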

Components of Vector Processor

• Vector Register: fixed length bank holding a single vector
  • has at least 2 read and 1 write ports
  • typically 8-32 vector registers, each holding 64-128 64-bit elements
• Vector Functional Units (FUs): fully pipelined, start new operation every clock
  • typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of same unit
• Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs
• Scalar registers: single element for FP scalar or address
• Cross-bar to connect FUs, LSUs, registers

“DLXV” Vector Instructions

Cray-1 (1976)

[Figure: Cray-1 block diagram]

• Single-port memory: 16 banks of 64-bit words, 8-bit SECDED
  • 80 MW/sec data load/store
  • 320 MW/sec instruction buffer refill
• 4 Instruction Buffers (64-bit x); NIP, LIP, CIP
• 8 scalar registers (S0-S7) with 64 T Regs (Si <-> Tjk)
• 8 address registers (A0, A1, ...) with 64 B Regs (Ai <-> Bjk)
• 64-element vector registers (V0, V1, ...), with V. Mask and V. Length
• Memory addressed by (A0) or ((Ah) + j k m)
• 12 FUs
  • FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt (operands Sj, Sk -> Si and Vj, Vk -> Vi)
  • Addr Add, Addr Mul (operands Aj, Ak -> Ai)
• memory bank cycle: 50 ns
• processor cycle: 12.5 ns (80 MHz)
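The timing figures on this slide are related by simple arithmetic, worked through below as a rough check. Only the four quoted numbers come from the slide; the constant and function names are hypothetical, and times are held in picoseconds so the integer arithmetic stays exact.

```python
# Figures quoted for the Cray-1 on the slide.
PROCESSOR_CYCLE_PS = 12_500   # 12.5 ns processor cycle
BANK_CYCLE_PS = 50_000        # 50 ns memory bank cycle
DATA_RATE_MWORDS = 80         # 80 MW/sec data load/store
REFILL_RATE_MWORDS = 320      # 320 MW/sec instruction buffer refill


def clock_mhz(cycle_ps):
    """Clock frequency in MHz for a cycle time given in picoseconds."""
    return 1_000_000 / cycle_ps


def bank_busy_cycles():
    """Processor cycles for which one memory bank stays busy per access."""
    return BANK_CYCLE_PS // PROCESSOR_CYCLE_PS


def words_per_cycle(rate_mwords):
    """64-bit words transferred per processor cycle at the given rate."""
    return rate_mwords / clock_mhz(PROCESSOR_CYCLE_PS)


if __name__ == "__main__":
    print("clock (MHz):", clock_mhz(PROCESSOR_CYCLE_PS))
    print("bank busy (cycles):", bank_busy_cycles())
    print("data words/cycle:", words_per_cycle(DATA_RATE_MWORDS))
    print("refill words/cycle:", words_per_cycle(REFILL_RATE_MWORDS))
```

The 50 ns bank cycle works out to 4 processor cycles, and 80 MW/sec corresponds to one 64-bit word per 12.5 ns cycle.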

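Vector Programming Model

[Figure: vector programming model]

• Scalar Registers (r)
• Vector Registers (v), each holding elements [0], [1], [2], ..., [VLRMAX-1]
• Vector Length Register (VLR)
• Vector Arithmetic Instructions, e.g. ADDV v3, v1, v
  • operate on elements [0], [1], ..., [VLR-1] of the source registers
• Vector Load and Store Instructions, e.g. LV v1, r1, r
  • move data between Memory and a Vector Register, with Base and Stride held in scalar registers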
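The programming model can be mimicked in a few lines of Python. This is a minimal sketch, not the DLXV definition: the `VectorMachine` class, its method names, the register counts, the 64-element maximum vector length, and the choice to count the stride in elements rather than bytes are all assumptions made for illustration.

```python
MAX_VECTOR_LENGTH = 64  # assumed value of VLRMAX


class VectorMachine:
    """Toy model: scalar registers r, vector registers v, and a VLR."""

    def __init__(self, memory, num_vregs=8, num_sregs=8):
        self.mem = list(memory)
        self.v = [[0] * MAX_VECTOR_LENGTH for _ in range(num_vregs)]
        self.r = [0] * num_sregs
        self.vlr = MAX_VECTOR_LENGTH  # number of elements each instruction touches

    def lv(self, vd, base_reg, stride_reg):
        """Strided vector load: v[vd][i] = mem[base + i * stride] for i < VLR."""
        base, stride = self.r[base_reg], self.r[stride_reg]
        for i in range(self.vlr):
            self.v[vd][i] = self.mem[base + i * stride]

    def addv(self, vd, va, vb):
        """Element-wise add over elements [0] .. [VLR-1]."""
        for i in range(self.vlr):
            self.v[vd][i] = self.v[va][i] + self.v[vb][i]


if __name__ == "__main__":
    m = VectorMachine(range(100))
    m.vlr = 4
    m.r[1], m.r[2] = 10, 3      # base 10, stride 3
    m.lv(1, 1, 2)               # loads mem[10], mem[13], mem[16], mem[19]
    m.r[3], m.r[4] = 0, 1       # base 0, unit stride
    m.lv(2, 3, 4)
    m.addv(3, 1, 2)
    print(m.v[3][:m.vlr])
```

Elements at index VLR and above are left untouched, which is how the vector length register lets one instruction handle vectors shorter than the register.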

Vector Instruction Set Advantages

• Compact
  • one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
  • are independent
  • use the same functional unit
  • access disjoint registers
  • access registers in same pattern as previous instructions
  • access a contiguous block of memory (unit-stride load/store)
  • access memory in a known pattern (strided load/store)
• Scalable
  • can run same code on more parallel pipelines (lanes)

Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: six stage multiply pipeline between vector registers V1, V2 and V3, executing V3 <- v1 * v]
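Because the elements are independent, a new element can enter the pipeline every cycle, so n elements finish in roughly depth + n - 1 cycles instead of depth * n. The sketch below checks that count with a small cycle-by-cycle simulation; the function names and the shift-register model of the pipeline are assumptions for illustration, with the depth of 6 taken from the figure.

```python
def pipelined_cycles(n, depth=6):
    """Cycles to push n independent elements through a depth-stage pipeline."""
    return depth + n - 1 if n > 0 else 0


def simulate_multiply_pipeline(a, b, depth=6):
    """Model the pipeline as a shift register; return (products, cycles)."""
    operands = list(zip(a, b))
    stages = [None] * depth          # stages[0] is the entry, stages[-1] the exit
    results = []
    issued = 0
    cycles = 0
    while len(results) < len(operands):
        cycles += 1
        stages.pop()                 # drop the slot that finished last cycle
        if issued < len(operands):
            stages.insert(0, operands[issued])   # one new element per cycle
            issued += 1
        else:
            stages.insert(0, None)               # pipeline drains
        if stages[-1] is not None:               # element reached the last stage
            x, y = stages[-1]
            results.append(x * y)
    return results, cycles


if __name__ == "__main__":
    products, cycles = simulate_multiply_pipeline([1, 2, 3], [4, 5, 6])
    print(products, cycles, pipelined_cycles(3))
```

No hazard checks appear anywhere in the loop, which reflects the slide's point that independent elements simplify control of a deep pipeline.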

Vector Memory System

[Figure: an Address Generator combines Base and Stride to steer accesses between the Vector Registers and Memory Banks 0-F]

Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency

• Bank busy time: Cycles between accesses to same bank
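The effect of bank busy time on strided accesses can be explored with a small model. This is a sketch under stated assumptions: banks are word-interleaved (bank = address mod 16), the address generator issues at most one address per cycle and stalls while the target bank is busy, and the 12 cycle latency is not modeled. The function name is hypothetical.

```python
def strided_access_cycles(base, stride, n, banks=16, busy=4):
    """Cycles needed to issue n strided accesses, stalling on busy banks."""
    bank_free_at = [0] * banks       # first cycle each bank can accept an access
    cycle = 0
    for i in range(n):
        bank = (base + i * stride) % banks
        cycle = max(cycle, bank_free_at[bank])   # wait until the bank is free
        bank_free_at[bank] = cycle + busy
        cycle += 1                               # at most one address per cycle
    return cycle


if __name__ == "__main__":
    for stride in (1, 4, 8, 16):
        print(stride, strided_access_cycles(0, stride, 64))
```

In this model unit stride and stride 4 run at one element per cycle, while stride 16 hits the same bank every time and is limited to one element per bank busy time.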

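Vector Unit Structure

[Figure: vector unit divided into four lanes; each lane holds a Functional Unit pipeline and a slice of the Vector Registers and connects to the Memory Subsystem]

Vector register elements held by each lane:

• Elements 0, 4, 8, …
• Elements 1, 5, 9, …
• Elements 2, 6, 10, …
• Elements 3, 7, 11, …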
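The element-to-lane mapping in the figure is an interleaving: element i lives in lane i mod 4. The short sketch below reproduces it and shows how the lane count sets the cycles per vector instruction; the function names are hypothetical, and one element per lane per cycle is assumed.

```python
def elements_per_lane(vector_length, lanes=4):
    """List the element indices held by each lane (element i -> lane i % lanes)."""
    return [list(range(lane, vector_length, lanes)) for lane in range(lanes)]


def cycles_per_vector_instruction(vector_length, lanes=4):
    """Cycles a functional unit is occupied, at one element per lane per cycle."""
    return -(-vector_length // lanes)   # ceiling division


if __name__ == "__main__":
    for lane, elements in enumerate(elements_per_lane(12)):
        print("lane", lane, elements)
    print(cycles_per_vector_instruction(64, lanes=4))
```

Doubling the number of lanes halves the cycle count without changing the instruction stream, which is the sense in which vector code is scalable.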

Vector Instruction Parallelism

Can overlap execution of multiple vector instructions

• example machine has 32 elements per vector register and 8 lanes

[Figure: load, mul and add instructions issued in sequence over time, overlapping in the Load Unit, Multiply Unit and Add Unit]

Complete 24 operations/cycle while issuing 1 short instruction/cycle
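The 24 operations/cycle figure follows from 3 units each working on 8 lanes. The sketch below reproduces it with a small issue model: instructions repeat in the order load, mul, add, at most one issues per cycle, and each occupies its unit for 32 / 8 = 4 cycles. Data dependences between the instructions are ignored, and the function name and issue policy are assumptions made for illustration.

```python
def simulate_overlap(num_instructions, vector_length=32, lanes=8,
                     units=("load", "mul", "add")):
    """Return element operations completed in each cycle."""
    busy_cycles = -(-vector_length // lanes)     # cycles one instruction holds a unit
    unit_free_at = {unit: 0 for unit in units}
    ops_per_cycle = []
    cycle = 0
    issued = 0
    while issued < num_instructions:
        unit = units[issued % len(units)]
        if unit_free_at[unit] <= cycle:          # in-order issue, stall if unit busy
            unit_free_at[unit] = cycle + busy_cycles
            remaining = vector_length
            for c in range(cycle, cycle + busy_cycles):
                if c >= len(ops_per_cycle):
                    ops_per_cycle.extend([0] * (c + 1 - len(ops_per_cycle)))
                done = min(lanes, remaining)     # up to one element per lane
                ops_per_cycle[c] += done
                remaining -= done
            issued += 1
        cycle += 1                               # at most one issue per cycle
    return ops_per_cycle


if __name__ == "__main__":
    print(simulate_overlap(6))
```

After a two-cycle ramp-up all three units are busy in every cycle, so the machine sustains 24 element operations per cycle even though it never issues more than one instruction per cycle.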

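Vector Chaining

• Vector version of register bypassing (forwarding)
  • introduced with Cray-

[Figure: Memory feeds the Load Unit, which writes V1; the Mult. unit writes V3 and the Add unit writes V5, with V2 and V4 as the other operands; Chain links forward the load result to the multiplier and the multiply result to the adder]

LV v
MULV v3,v1,v
ADDV v5, v3, v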
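A rough timing model shows what chaining buys for a dependent load, multiply, add sequence. The sketch assumes a single lane, that an unchained instruction waits for its predecessor to write its last element, and that a chained instruction starts as soon as the first element is available. The latencies used (12 cycles for the load, 6 for the multiply, 4 for the add) and the function names are illustrative assumptions, not values given for this example.

```python
def unchained_cycles(n, latencies):
    """Each instruction starts only after the previous one has fully completed."""
    return sum(latency + n - 1 for latency in latencies)


def chained_cycles(n, latencies):
    """Each instruction starts as soon as the first input element is forwarded."""
    return sum(latencies) + n - 1


if __name__ == "__main__":
    latencies = (12, 6, 4)   # assumed load, multiply, add pipeline depths
    n = 64
    print("unchained:", unchained_cycles(n, latencies))
    print("chained:  ", chained_cycles(n, latencies))
```

With chaining the three instructions behave like one longer pipeline, so the vector length is paid once rather than three times.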