Download Vector Processing: Properties, Advantages, and Challenges and more Slides Electronics engineering in PDF only on Docsity!
Lecture 7:
Vector Processing
Computers in the News
- At ISSCC (San Francisco)
- 1 GHz Alpha Processor (Compaq)
- 1.5 V 0.18 micron CMOS, 7-layer Al, 65 W
- 1 GHz Single Issue 64b PowerPC Processor (IBM)
- 0.22 micron CMOS, 6-layer Copper interconnect
- 1 GHz IA-32 Microprocessor
- 0.18 micron CMOS, 6-layer Al, low-k dielectric
- Other IBM processors
- 760 MHz processor using multiple Vt and Copper interconnects
- 660 MHz SOI processor with Cu interconnect
- Memory trends: non-volatile; embedded DRAM
Computer News
Thermal gradients Traditional mobile processor versus Crusoe running DVD application
Review: Instructon Level Parallelism
- High speed execution based on instruction level parallelism (ilp): potential of short instruction sequences to execute in parallel
- High-speed microprocessors exploit ILP by:
- pipelined execution: overlap instructions
- superscalar execution: issue and execute multiple instructions per clock cycle
- Out-of-order execution (commit in-order)
- Memory accesses for high-speed microprocessor? - Data Cache, possibly multiported, multiple levels
Program
0
10
20
30
40
50
60
gcc expresso li fpppp doducd tomcatv
10
(^1512)
52
17
56
10
(^1512)
47
16 10 13 11
35
15
34
9 10 11
22 12 8 8 9
14 9
14 (^6 4 3 6 4 2 6 4 3 8 5 3 74 3 )
45
22
Infinite 256 128 64 32 16 8 4
Review: Theoretical Limits to ILP?
(Figure 4.48, Page 332) Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
Infinite 256 128 64 32 16 8 4
Integer: 6 - 12
FP: 8 - 45
IPC
Window
Problems with conventional approach
- Limits to conventional exploitation of ILP:
- pipelined clock rate : at some point, each increase in clock rate has corresponding CPI increase (branches, other hazards)
- instruction fetch and decode : at some point, its hard to fetch and decode more instructions per clock cycle
- cache hit rate : some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
Properties of Vector Processors
- Each result independent of previous result => long pipeline, compiler ensures no dependencies => high clock rate
- Vector instructions access memory with known pattern => highly interleaved memory => amortize memory latency of over 64 elements => no (data) caches required! (Do use instructionDocsity.com
Operation & Instruction Count: RISC v. Vector Processor
Spec92fp Operations (Millions)(from F. Quintana, U. Barcelona.) Instructions (M)
Program RISC Vector R / V RISC Vector R / V
swim256 115 95 1.1x 115 0.8 142x
hydro2d 58 40 1.4x 58 0.8 71x
nasa7 69 41 1.7x 69 2.2 31x
su2cor 51 35 1.4x 51 1.8 29x
tomcatv 15 10 1.4x 15 1.3 11x
wave5 27 25 1.1x 27 7.2 4x
mdljdp2 32 52 0.6x 32 15.8 2x Vector reduces ops by 1.2X, instructions by 20X
Components of Vector Processor
- Vector Register : fixed length bank holding a single vector - has at least 2 read and 1 write ports - typically 8-32 vector registers, each holding 64- 64-bit elements
- Vector Functional Units (FUs) : fully pipelined, start new operation every clock - typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of same unit
“DLXV” Vector
Instr. Operands OperationInstructions Comment
- ADDV V1,V2,V3 V1=V2+V vector + vector
- ADDSV V1,F0,V2 V1=F0+V
scalar + vector
- MULTV V1,V2,V3 V1=V2xV vector x vector
- MULSV V1,F0,V2 V1=F0xV
scalar x vector
- LV V1,R1 V1=M[R1..R1+63] load, stride=1Docsity.com
DAXPY (Y = a * X + Y)
LD F0,a ADDI R4,Rx,#512 ;last address to load loop: LD F2, 0(Rx) ;load X(i) MULTD F2,F0, F2 ;aX(i) LD F4, 0(Ry) ;load Y(i) ADDD F4,F2, F4 ;aX(i) + Y(i) SD F4 ,0(Ry) ;store into Y(i) ADDI Rx,Rx,#8 ;increment index to X ADDI Ry,Ry,#8 ;increment index to Y SUB R20,R4,Rx ;compute bound BNZ R20,loop ;check if done
LD F0,a ;load scalar a LV V1,Rx ;load vector X MULTS V2,F0,V1 ;vector-scalar mult. LV V3,Ry ;load vector Y ADDV V4,V2,V3 ;add SV Ry,V4 ;store the result
Assuming vectors X, Y are length 64 Scalar vs. Vector
578 (2+964) vs. 321 (1+564) ops (1.8X) 578 (2+964) vs. 6 instructions (96X) 64 operation vectors + no loop overhead also 64X fewer pipeline hazards*
Example Vector Machines
- Machine Year Clock Regs Elements FUs LSUs
- Cray 1 197680 MHz 8 64 6 1
- Cray XMP 1983120 MHz 8 64 8 2 L, 1 S
- Cray YMP 1988166 MHz 8 64 8 2 L, 1 S
- Cray C-90 1991240 MHz 8 128 8 4
- Cray T-90 1996455 MHz 8 128 8 4
- Conv. C-1 198410 MHz 8 128 4 1
- Conv. C-4 1994133 MHz 16 128 3 1
- Fuj. VP200 1982133 MHz8-25632-1024Docsity.com
Vector Surprise
- Use vectors for inner loop parallelism (no surprise
- One dimension of array: A[0, 0], A[0, 1], A[0, 2], ...
- think of machine as, say, 32 vector regs each with 64 elements
- 1 instruction updates 64 elements of 1 vector register
- and for outer loop parallelism!
- 1 element from each column: A[0,0], A[1,0], A[2,0], ...
- think of machine as 64 “virtual processors” (VPs) each with 32 scalar registers! ( multithreaded processo
- 1 instruction updates 1 scalar register in 64 VPs
Virtual Processor Vector Model
- Vector operations are SIMD (single instruction multiple data) operations
- Each element is computed by a virtual processor (VP)
- Number of VPs given by vector length