Vector Processing: Properties, Advantages, and Challenges, Slides of Electronics engineering

Bharat Ratna Dr. B. R. Ambedkar University Electronics engineering

These slides cover vector processing: its properties, advantages, and challenges. Vector processors have high-level operations that work on linear arrays of numbers, leading to significant reductions in operations and instructions. However, they also have start-up penalties and vector length limitations. The slides also present examples of vector machines and vector instructions, as well as optimizations such as conditional execution and sparse matrices.

Typology: Slides

2012/2013

Uploaded on 03/23/2013

dhrupad 🇮🇳


Lecture 7:

Vector Processing


Computers in the News

At ISSCC (San Francisco)
- 1 GHz Alpha Processor (Compaq)
  - 1.5 V 0.18 micron CMOS, 7-layer Al, 65 W
- 1 GHz Single Issue 64b PowerPC Processor (IBM)
  - 0.22 micron CMOS, 6-layer Copper interconnect
- 1 GHz IA-32 Microprocessor
  - 0.18 micron CMOS, 6-layer Al, low-k dielectric
- Other IBM processors
  - 760 MHz processor using multiple Vt and Copper interconnects
  - 660 MHz SOI processor with Cu interconnect
- Memory trends: non-volatile; embedded DRAM

Computer News

Thermal gradients: traditional mobile processor versus Crusoe, running a DVD application

Review: Instruction Level Parallelism

High-speed execution based on instruction level parallelism (ILP): the potential of short instruction sequences to execute in parallel
High-speed microprocessors exploit ILP by:
1. pipelined execution: overlap instructions
2. superscalar execution: issue and execute multiple instructions per clock cycle
3. out-of-order execution (commit in-order)
Memory accesses for a high-speed microprocessor? Data cache, possibly multiported, multiple levels

Review: Theoretical Limits to ILP?

(Figure 4.48, Page 332) Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

Figure: IPC versus window size (Infinite, 256, 128, 64, 32, 16, 8, 4) for the programs gcc, espresso, li, fpppp, doduc, and tomcatv

Integer: 6 - 12 IPC
FP: 8 - 45 IPC

Problems with conventional approach

Limits to conventional exploitation of ILP:
1. pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
2. instruction fetch and decode: at some point, it is hard to fetch and decode more instructions per clock cycle
3. cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

Properties of Vector Processors

Each result independent of previous result => long pipeline, compiler ensures no dependencies => high clock rate
Vector instructions access memory with known pattern => highly interleaved memory => amortize memory latency over 64 elements => no (data) caches required! (Do use instruction cache)
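
The amortization argument can be illustrated with a small calculation. This is a minimal sketch under assumed numbers: the 50-cycle memory latency and the rate of one element per cycle after start-up are hypothetical, not figures from the lecture, and `cycles_per_element` is an illustrative name.

```python
def cycles_per_element(latency_cycles: int, vector_length: int) -> float:
    """Average cycles per element for one vector load.

    Assumption (not from the slides): interleaved memory delivers one
    element per cycle once the initial latency has elapsed.
    """
    total_cycles = latency_cycles + vector_length
    return total_cycles / vector_length


# Hypothetical 50-cycle memory latency.
scalar_cost = cycles_per_element(50, 1)    # each access pays the full latency
vector_cost = cycles_per_element(50, 64)   # latency is paid once per 64 elements
print(scalar_cost, vector_cost)
```

Under these assumptions the per-element cost drops from 51 cycles to under 2, which is the sense in which a long vector access hides memory latency without a data cache.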

Operation & Instruction Count: RISC v. Vector Processor

Spec92fp (from F. Quintana, U. Barcelona)

           Operations (Millions)      Instructions (Millions)
Program    RISC   Vector   R / V      RISC   Vector   R / V
swim256    115    95       1.1x       115    0.8      142x
hydro2d    58     40       1.4x       58     0.8      71x
nasa7      69     41       1.7x       69     2.2      31x
su2cor     51     35       1.4x       51     1.8      29x
tomcatv    15     10       1.4x       15     1.3      11x
wave5      27     25       1.1x       27     7.2      4x
mdljdp2    32     52       0.6x       32     15.8     2x

Vector reduces ops by 1.2X, instructions by 20X
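
As a quick check on the table, the R / V ratios can be recomputed from the listed counts. This is a sketch only: the counts are the rounded values printed above, so the recomputed ratios can differ slightly from the R / V columns, and the names `SPEC92FP` and `ratios` are illustrative.

```python
# (program, RISC ops, vector ops, RISC instructions, vector instructions),
# all in millions, copied from the table above.
SPEC92FP = [
    ("swim256", 115, 95, 115, 0.8),
    ("hydro2d", 58, 40, 58, 0.8),
    ("nasa7", 69, 41, 69, 2.2),
    ("su2cor", 51, 35, 51, 1.8),
    ("tomcatv", 15, 10, 15, 1.3),
    ("wave5", 27, 25, 27, 7.2),
    ("mdljdp2", 32, 52, 32, 15.8),
]


def ratios(rows):
    """Return {program: (ops ratio, instruction ratio)}, each as RISC / vector."""
    return {
        name: (r_ops / v_ops, r_ins / v_ins)
        for name, r_ops, v_ops, r_ins, v_ins in rows
    }


for name, (ops, ins) in ratios(SPEC92FP).items():
    print(f"{name:8s} ops {ops:4.1f}x   instructions {ins:6.1f}x")
```

The operation ratios stay close to 1, while the instruction ratios span two orders of magnitude, which is the point of the slide: each vector instruction stands for many operations.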

Components of Vector Processor

Vector Register: fixed length bank holding a single vector
- has at least 2 read and 1 write ports
- typically 8-32 vector registers, each holding 64-128 64-bit elements
Vector Functional Units (FUs): fully pipelined, start new operation every clock
- typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of same unit

“DLXV” Vector Instructions

Instr.   Operands   Operation          Comment
ADDV     V1,V2,V3   V1=V2+V3           vector + vector
ADDSV    V1,F0,V2   V1=F0+V2           scalar + vector
MULTV    V1,V2,V3   V1=V2xV3           vector x vector
MULSV    V1,F0,V2   V1=F0xV2           scalar x vector
LV       V1,R1      V1=M[R1..R1+63]    load, stride=1
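
The semantics of these instructions can be sketched in a few lines of Python. This is an illustrative model, not DLXV itself: registers are plain lists, memory is a list indexed by element rather than by byte address, and the function names simply mirror the mnemonics.

```python
MVL = 64  # elements per vector register, matching LV's M[R1..R1+63]


def addv(v2, v3):
    """ADDV V1,V2,V3: element-wise vector + vector."""
    return [a + b for a, b in zip(v2, v3)]


def addsv(f0, v2):
    """ADDSV V1,F0,V2: scalar + vector."""
    return [f0 + b for b in v2]


def multv(v2, v3):
    """MULTV V1,V2,V3: element-wise vector x vector."""
    return [a * b for a, b in zip(v2, v3)]


def mulsv(f0, v2):
    """MULSV V1,F0,V2: scalar x vector."""
    return [f0 * b for b in v2]


def lv(memory, r1):
    """LV V1,R1: load MVL consecutive elements (stride 1) starting at index r1."""
    return memory[r1:r1 + MVL]


memory = [float(i) for i in range(128)]   # toy memory, element-indexed
v1 = lv(memory, 0)
v2 = lv(memory, 64)
v3 = addv(v1, v2)
v4 = mulsv(2.0, v1)
print(v3[:3], v4[:3])
```

Each call stands for a single instruction that produces 64 results, which is where the reduction in instruction count comes from.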

DAXPY (Y = a * X + Y)

      LD    F0,a
      ADDI  R4,Rx,#512   ;last address to load
loop: LD    F2,0(Rx)     ;load X(i)
      MULTD F2,F0,F2     ;a*X(i)
      LD    F4,0(Ry)     ;load Y(i)
      ADDD  F4,F2,F4     ;a*X(i) + Y(i)
      SD    F4,0(Ry)     ;store into Y(i)
      ADDI  Rx,Rx,#8     ;increment index to X
      ADDI  Ry,Ry,#8     ;increment index to Y
      SUB   R20,R4,Rx    ;compute bound
      BNZ   R20,loop     ;check if done

      LD    F0,a         ;load scalar a
      LV    V1,Rx        ;load vector X
      MULTS V2,F0,V1     ;vector-scalar mult.
      LV    V3,Ry        ;load vector Y
      ADDV  V4,V2,V3     ;add
      SV    Ry,V4        ;store the result

Assuming vectors X, Y are length 64
Scalar vs. Vector
- 578 (2+9*64) vs. 321 (1+5*64) ops (1.8X)
- 578 (2+9*64) vs. 6 instructions (96X)
- 64 operation vectors + no loop overhead
- also 64X fewer pipeline hazards
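
The counts quoted above (578, 321, and 6) follow directly from the two listings and can be checked with a short sketch. The `daxpy` function is an illustrative reference implementation of the kernel, not code from the lecture.

```python
N = 64  # vector length assumed in the comparison


def daxpy(a, x, y):
    """Return a*x + y element-wise (the DAXPY kernel)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


# Scalar loop: 2 setup instructions + 9 instructions per iteration.
scalar_instructions = 2 + 9 * N
# Vector code: 1 scalar load + 5 vector instructions.
vector_instructions = 1 + 5
# Vector operations: 1 scalar load + 5 vector instructions of N elements each.
vector_operations = 1 + 5 * N

print(scalar_instructions, vector_operations, vector_instructions)
print(round(scalar_instructions / vector_operations, 1))   # ops ratio
print(round(scalar_instructions / vector_instructions))    # instruction ratio
```

The scalar loop spends 4 of its 9 instructions per iteration on index updates and the branch, none of which appear in the vector version.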

Example Vector Machines

Machine      Year   Clock     Regs    Elements   FUs   LSUs
Cray 1       1976   80 MHz    8       64         6     1
Cray XMP     1983   120 MHz   8       64         8     2 L, 1 S
Cray YMP     1988   166 MHz   8       64         8     2 L, 1 S
Cray C-90    1991   240 MHz   8       128        8     4
Cray T-90    1996   455 MHz   8       128        8     4
Conv. C-1    1984   10 MHz    8       128        4     1
Conv. C-4    1994   133 MHz   16      128        3     1
Fuj. VP200   1982   133 MHz   8-256   32-1024

Vector Surprise

Use vectors for inner loop parallelism (no surprise)
- One dimension of array: A[0, 0], A[0, 1], A[0, 2], ...
- think of machine as, say, 32 vector regs each with 64 elements
- 1 instruction updates 64 elements of 1 vector register
and for outer loop parallelism!
- 1 element from each column: A[0,0], A[1,0], A[2,0], ...
- think of machine as 64 “virtual processors” (VPs) each with 32 scalar registers! (multithreaded processor)
- 1 instruction updates 1 scalar register in 64 VPs
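
The two views can be contrasted on a small array. This is a minimal sketch with hypothetical function names: the inner-loop version treats each row as one vector, while the outer-loop version gives every row its own virtual processor that keeps a running sum in a "scalar register".

```python
def scale_rows_inner(matrix, factor):
    """Inner-loop view: each row A[i][0], A[i][1], ... is one vector, and
    one vector instruction updates all of its elements."""
    return [[factor * x for x in row] for row in matrix]


def row_sums_outer(matrix):
    """Outer-loop view: VP i owns row i and holds its running sum in a
    scalar register; every step of the j loop is one vector instruction
    that updates that register in all VPs at once."""
    n_vps = len(matrix)
    acc = [0.0] * n_vps                                 # one scalar register per VP
    for j in range(len(matrix[0])):                     # sequential inner loop
        column = [matrix[i][j] for i in range(n_vps)]   # A[0][j], A[1][j], ...
        acc = [a + x for a, x in zip(acc, column)]      # one vector add
    return acc


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
print(scale_rows_inner(A, 2.0))
print(row_sums_outer(A))
```

In the outer-loop version the vector length equals the number of rows, so the reduction over each row never has to be vectorized itself.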

Virtual Processor Vector Model

Vector operations are SIMD (single instruction multiple data) operations
Each element is computed by a virtual processor (VP)
Number of VPs given by vector length
- vector control register
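
A toy model makes the role of the vector length register concrete. This is a sketch under stated assumptions: the class `VectorUnit`, its method names, and the maximum length of 64 are illustrative, and elements beyond the current vector length are simply left untouched.

```python
MVL = 64  # assumed maximum vector length (elements per vector register)


class VectorUnit:
    """Toy model: the vector length register sets how many VPs are active."""

    def __init__(self):
        self.vl = MVL

    def set_vl(self, n):
        """Write the vector length control register."""
        if not 0 <= n <= MVL:
            raise ValueError("vector length must be between 0 and MVL")
        self.vl = n

    def addv(self, v1, v2, v3):
        """One SIMD add: virtual processor vp computes v1[vp] = v2[vp] + v3[vp].

        Only the first vl virtual processors take part; the remaining
        elements of v1 are left unchanged.
        """
        for vp in range(self.vl):
            v1[vp] = v2[vp] + v3[vp]
        return v1


unit = VectorUnit()
unit.set_vl(3)                         # only 3 virtual processors active
v1 = [0.0] * MVL
unit.addv(v1, [1.0] * MVL, [2.0] * MVL)
print(v1[:4])
```

Setting the register to a value below the maximum lets the same instruction handle vectors shorter than a full register.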
