Vector Processors in Advanced Computer Architecture: Lecture Notes, Slides of Advanced Computer Architecture

Ahlulbait International University (AIU), Advanced Computer Architecture

Advanced Computer Architecture. Fall 2010. Lecture 7: Vector Processors ... Styles of Vector Architectures ... Vector equivalent of load-store architectures.

Typology: Slides

2021/2022

Uploaded on 09/07/2022

adnan_95 🇮🇶


563 L06.1 Fall 2010

ECE 563

Advanced Computer Architecture

Fall 2010

Lecture 7: Vector Processors


Supercomputers

Definition of a supercomputer:

- Fastest machine in the world at a given task
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray

The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

Vector Processing

Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

Properties of Vector Processing

- Each result is independent of the previous result
  => long pipeline, compiler ensures no dependencies
  => high clock rate
- Vector instructions access memory with a known pattern
  => highly interleaved memory
  => amortize memory latency over 64 elements
  => no (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
- A single vector instruction implies lots of work (a whole loop)
  => fewer instruction fetches
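
The amortization point can be checked with a little arithmetic. The sketch below assumes a simple model that the slides do not spell out: a vector access pays one startup latency, after which elements stream in at one per cycle. The function name is illustrative, and the 12-cycle latency is the Cray-1 figure quoted in these notes.

```python
def cycles_per_element(latency: int, n_elements: int) -> float:
    """Average cycles per element when a single startup latency is shared
    by the whole vector and elements then arrive one per cycle
    (an assumed, simplified model)."""
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    return (latency + n_elements) / n_elements


# Compare a single-element access with progressively longer vectors.
for n in (1, 8, 64):
    print(n, cycles_per_element(12, n))
```

Under this model a lone access costs 13 cycles, while a 64-element vector averages a little under 1.2 cycles per element, which is why long vectors can tolerate memory latency without a data cache.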

Components of Vector Processor

- Vector Register: fixed-length bank holding a single vector
  - has at least 2 read ports and 1 write port
  - typically 8-32 vector registers, each holding 64-128 64-bit elements
- Vector Functional Units (FUs): fully pipelined, start a new operation every clock
  - typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of the same unit
- Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs
- Scalar registers: single element for FP scalar or address
- Cross-bar to connect FUs, LSUs, registers

“DLXV” Vector Instructions


Cray-1 (1976)

[Figure: Cray-1 block diagram. Recoverable labels:]

- Memory: single port, 16 banks of 64-bit words, 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill
- 4 instruction buffers; instruction registers NIP, LIP, CIP
- 8 scalar registers (S0-S7) and 8 address registers (A0-A7), with 64 T registers and 64 B registers
- Vector registers V0-V7, 64 elements each, plus vector mask and vector length registers
- 12 FUs; those labeled in the figure are FP add, FP mul, FP recip, int add, int logic, int shift, pop count, address add, address mul
- Memory bank cycle: 50 ns; processor cycle: 12.5 ns (80 MHz)

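Vector Programming Model

[Figure: vector programming model. Recoverable content:]

- Scalar registers (r) and vector registers (v); each vector register holds elements [0], [1], [2], ..., [VLRMAX-1]
- Vector Length Register (VLR)
- Vector arithmetic instructions, e.g. ADDV v3, v1, v2: operate on elements [0], [1], ..., [VLR-1] of v1 and v2 and write v3
- Vector load and store instructions, e.g. LV v1, r1, r2: move a vector between memory and a vector register, with the base address in r1 and the stride in r2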
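
The semantics of the two example instructions can be sketched in a few lines. This is a minimal model, not the DLXV definition: memory is assumed to be a word-addressed Python list, and the helper names `lv` and `addv` are hypothetical stand-ins for LV and ADDV.

```python
def lv(memory: list[float], base: int, stride: int, vlr: int) -> list[float]:
    """Model of LV: load vlr elements starting at memory[base],
    stepping by stride words (word addressing is an assumption)."""
    return [memory[base + i * stride] for i in range(vlr)]


def addv(v1: list[float], v2: list[float], vlr: int) -> list[float]:
    """Model of ADDV: element-wise add of elements [0] .. [vlr-1]."""
    return [v1[i] + v2[i] for i in range(vlr)]


memory = [float(x) for x in range(32)]
VLR = 4                                      # vector length register
v1 = lv(memory, base=0, stride=2, vlr=VLR)   # words 0, 2, 4, 6
v2 = lv(memory, base=1, stride=2, vlr=VLR)   # words 1, 3, 5, 7
v3 = addv(v1, v2, VLR)
print(v3)
```

Only the first VLR elements take part, so the same instructions handle any vector length up to the register size.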

Vector Instruction Set Advantages

- Compact
  - one short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in the same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run the same code on more parallel pipelines (lanes)

Vector Arithmetic Execution

- Use a deep pipeline (=> fast clock) to execute element operations
- Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline reading V1 and V2 and writing V3, for V3 <- v1 * v2]
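
Because the elements are independent, a new one can enter the pipeline every cycle. The sketch below compares that with finishing each element before starting the next. The one-element-per-cycle issue rate and the function names are assumptions made for illustration.

```python
def pipelined_cycles(stages: int, n_elements: int) -> int:
    """Cycles to finish n independent element operations when one enters
    the pipeline each cycle: fill the pipeline once, then one result per cycle."""
    if n_elements <= 0:
        return 0
    return stages + n_elements - 1


def unpipelined_cycles(stages: int, n_elements: int) -> int:
    """Cycles if each element must fully complete before the next starts."""
    return stages * n_elements


# Six-stage multiply pipeline, 64-element vector.
print(pipelined_cycles(6, 64))
print(unpipelined_cycles(6, 64))
```

For a 64-element vector the six-stage pipeline finishes in 69 cycles instead of 384, and no hazard-detection logic is needed between the elements.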

Vector Memory System

[Figure: an address generator combines Base and Stride to send addresses to the memory banks, which exchange data with the vector registers]

Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency

Bank busy time: cycles between accesses to the same bank
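
The effect of stride on an interleaved memory can be explored with a small simulation. This is a sketch under stated assumptions: word `a` lives in bank `a % banks`, at most one access issues per cycle, and an access stalls until its bank's busy time has elapsed. The function name and defaults (taken from the Cray-1 figures above) are illustrative.

```python
def issue_cycles(stride: int, n_elements: int = 64, banks: int = 16,
                 busy_time: int = 4, base: int = 0) -> int:
    """Cycles needed to issue all element accesses of one strided vector
    load, stalling whenever the target bank is still busy."""
    free_at = [0] * banks              # earliest cycle each bank accepts an access
    cycle = 0
    for i in range(n_elements):
        bank = (base + i * stride) % banks
        cycle = max(cycle, free_at[bank])   # wait for the bank if needed
        free_at[bank] = cycle + busy_time   # bank is busy after this access
        cycle += 1                          # at most one access per cycle
    return cycle


for s in (1, 4, 8, 16):
    print(s, issue_cycles(s))
```

In this model strides of 1 and 4 never revisit a bank within its busy time, so 64 accesses issue in 64 cycles. A stride of 8 alternates between two banks and a stride of 16 hits the same bank every time, so both stall.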

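Vector Unit Structure

[Figure: vector unit divided into lanes, each containing a slice of the vector registers and of each functional unit, all connected to the memory subsystem]

Elements are interleaved across the four lanes shown:

- Elements 0, 4, 8, ...
- Elements 1, 5, 9, ...
- Elements 2, 6, 10, ...
- Elements 3, 7, 11, ...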
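
The element-to-lane interleaving in the figure is a modulo mapping. The sketch below assumes four lanes, as in the figure, and uses hypothetical helper names.

```python
def lane_of(element: int, lanes: int = 4) -> int:
    """Lane that holds a given vector element under round-robin interleaving."""
    return element % lanes


def elements_in_lane(lane: int, vector_length: int, lanes: int = 4) -> list[int]:
    """All element indices of a vector that live in one lane."""
    return list(range(lane, vector_length, lanes))


for lane in range(4):
    print(lane, elements_in_lane(lane, 12))
```

Since each lane only ever touches its own elements, adding lanes needs no communication between them for element-wise operations.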

Vector Instruction Parallelism

Can overlap execution of multiple vector instructions

- example machine has 32 elements per vector register and 8 lanes

[Figure: instruction issue over time; successive load, mul, and add instructions overlap in the Load Unit, Multiply Unit, and Add Unit]

Complete 24 operations/cycle while issuing 1 short instruction/cycle
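
The 24 operations/cycle figure follows from the example machine's parameters. The sketch below works through the arithmetic, assuming the three units from the figure (load, multiply, add) are all kept busy. The function names are illustrative.

```python
import math


def cycles_per_instruction(elements: int, lanes: int) -> int:
    """Cycles one vector instruction occupies its unit."""
    return math.ceil(elements / lanes)


def peak_ops_per_cycle(units: int, lanes: int) -> int:
    """Element operations completed per cycle with every unit busy."""
    return units * lanes


def issue_rate_needed(units: int, elements: int, lanes: int) -> float:
    """Instructions per cycle that must issue to keep every unit busy."""
    return units / cycles_per_instruction(elements, lanes)


print(cycles_per_instruction(32, 8))   # cycles each instruction holds a unit
print(peak_ops_per_cycle(3, 8))        # operations completed per cycle
print(issue_rate_needed(3, 32, 8))     # instructions per cycle required
```

Each instruction keeps its unit busy for 4 cycles, so three units need only 0.75 instructions per cycle on average. A single-issue front end is therefore enough to sustain 24 operations per cycle.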

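Vector Chaining

Vector version of register bypassing (forwarding)

- introduced with Cray-1

[Figure: the Load Unit fills V1 from memory; V1 is chained into the multiplier, which combines it with V2 to produce V3; V3 is chained into the adder, which combines it with V4 to produce V5]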
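
A rough timing model shows what chaining buys for a dependent load, multiply, add sequence. This is a sketch under simplifying assumptions: each unit has a fixed startup latency and then produces one element per cycle, and off-by-one details are ignored. The 12-cycle load and 6-cycle multiply latencies echo figures quoted in these notes. The 6-cycle add latency is a made-up value for illustration.

```python
def unchained_cycles(latencies: list[int], n_elements: int) -> int:
    """Without chaining, each dependent instruction waits for the whole
    previous vector to be written before it starts."""
    return sum(lat + n_elements for lat in latencies)


def chained_cycles(latencies: list[int], n_elements: int) -> int:
    """With chaining, each dependent instruction starts as soon as the first
    element of its source appears, so only the startup latencies add up."""
    return sum(latencies) + n_elements


LATENCIES = [12, 6, 6]   # load, multiply, add (the add value is assumed)
print(unchained_cycles(LATENCIES, 64))
print(chained_cycles(LATENCIES, 64))
```

Under these assumptions the three-instruction sequence drops from 216 cycles to 88, because the vector length is paid once rather than once per instruction.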
