SIMD - Vector Processing - Lecture Slides | ECE 511, Study notes of Computer Architecture and Organization

Material Type: Notes; Class: Computer Architecture; Subject: Electrical and Computer Engr; University: University of Illinois - Urbana-Champaign; Term: Fall 2002;

Uploaded on 03/16/2009

Lecture 18
SIMD: Vector Processing

General-purpose to Specific Application Domains

  • General-purpose computing presents tough problems in architecture.
  • One pathway to better architectures is to “know” the application domain.
  • Example: Scientific applications

SIMD at work

© W. W. Hwu and S. J. Patel, 2002, ECE 412, University of Illinois

Types of Vector Processing

  • Attached co-processor to improve scientific application performance
    – TI ASC, CDC STAR 100, IBM 3838, FPS-
  • Supercomputers designed to run scientific applications
    – CRAY-1, Cyber 205, CRAY-XMP, CRAY-2, CRAY-YMP, Fujitsu VP 100/200, Hitachi S810/820, NEC SX/
  • Minicomputers designed to give better price-performance than supercomputers
    – CONVEX C-1, Alliant FX-
  • Instruction set extension to improve performance
    – IBM 3090, VAX 6000, X86 MMX, 3DNow, Alpha

Typical Vector Architecture

  • A vector unit typically consists of:
    – a vector instruction processor
    – a collection of vector registers (e.g. 8 64-entry registers in CRAY-1)
    – a vector length register (e.g. 6 bits in CRAY-1), implicit in MMX
    – a mask register (e.g. 64 bits in CRAY-1)
    – a set of pipelined function units (e.g. load/store, FP add, FP multiply, FP reciprocal, integer add, logic, shift in CRAY-1)

Vector Code

  • Vector code generated for a register-to-register vector architecture:
    – VL ← N
    – v0 ← B
    – v1 ← C
    – v2 ← v0 + v1
    – A ← v2
  • An outer loop may be required if N is greater than the maximum length allowed; details are discussed later.
  • If N is sufficiently big, each vector instruction would take about N cycles to execute.
    – With an aggressive design (chaining), all the vector instructions can overlap so that they all finish in about N cycles.
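
The outer loop mentioned above (needed when N exceeds the maximum vector length) can be sketched in Python. This is only an illustration under stated assumptions: `MAX_VL = 64` is taken from the 64-entry CRAY-1 registers mentioned earlier, the function name `vector_add_strip_mined` is hypothetical, and Python list slices stand in for vector registers.

```python
MAX_VL = 64  # assumed maximum vector length (64-entry registers, as in the CRAY-1 example)


def vector_add_strip_mined(b, c, max_vl=MAX_VL):
    """Compute a[i] = b[i] + c[i], processing at most max_vl elements per strip."""
    n = len(b)
    a = [0] * n
    for start in range(0, n, max_vl):         # outer loop over strips
        vl = min(max_vl, n - start)           # vector length of this strip
        v0 = b[start:start + vl]              # vector load from B
        v1 = c[start:start + vl]              # vector load from C
        v2 = [x + y for x, y in zip(v0, v1)]  # element-wise vector add
        a[start:start + vl] = v2              # vector store to A
    return a
```

Each pass of the outer loop corresponds to one run of the vector code sequence, with the last strip using a shorter vector length when N is not a multiple of the maximum.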

Example

  • DO I = 1, N
    – S1: D(I) = A(I-1) * D(I)
    – S2: A(I) = B(I) + C(I)
  • END DO
  • The execution of S1 and S2 in different iterations:
    – S1: D(1) = A(0) * D(1)
    – S2: A(1) = B(1) + C(1)
    – S1: D(2) = A(1) * D(2)
    – S2: A(2) = B(2) + C(2)
  • There is a flow dependence from S2 of iteration i to S1 of iteration i+1.

Loop Distribution

  • Basic transformation for vectorization
    – Transform a multi-statement loop into a sequence of single-statement loops.
  • Example
    – DO I = 1, N
      • S1
      • …
      • SN
    – END DO
  • Becomes:
    – DO I = 1, N
      • S1
    – END DO
    – …
    – DO I = 1, N
      • SN
    – END DO
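
As a minimal sketch of the transformation, the Python below distributes a two-statement loop whose statements do not depend on each other. The statement bodies (`x[i] = b[i] + c[i]` and `y[i] = b[i] * c[i]`) and the function names are hypothetical, chosen only so the two versions can be compared.

```python
def fused_loop(b, c):
    """Original multi-statement loop: S1 and S2 run in every iteration."""
    n = len(b)
    x, y = [0] * n, [0] * n
    for i in range(n):
        x[i] = b[i] + c[i]  # S1 (hypothetical statement)
        y[i] = b[i] * c[i]  # S2 (hypothetical statement)
    return x, y


def distributed_loops(b, c):
    """After loop distribution: one single-statement loop per statement."""
    n = len(b)
    x, y = [0] * n, [0] * n
    for i in range(n):
        x[i] = b[i] + c[i]  # all iterations of S1 first
    for i in range(n):
        y[i] = b[i] * c[i]  # then all iterations of S2
    return x, y
```

Because neither statement reads what the other writes, both versions produce the same arrays, and each single-statement loop is a candidate for one vector instruction sequence.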

Problem

  • Not all multi-statement loops can be distributed.
    – DO I = 1, N
      • S1: C(I) = A(I-1) + ...
      • S2: A(I) = ...
    – END DO
  • The execution of iterations looks like:
    – C(1) = A(0) + ...
    – A(1) = ...
    – C(2) = A(1) + ...
    – A(2) = ...
    – S2 in iteration i delivers its result to S1 in iteration i+1.

Problem (Cont.)

  • Loop distribution generates single-statement loops:
    – DO I = 1, N
      • S1: C(I) = A(I-1) + ...
    – END DO
    – DO I = 1, N
      • S2: A(I) = ...
    – END DO
  • All iterations of S1 are executed before those of S2.
    – The result of S2 in iteration i cannot be delivered to S1 in iteration i+1. Therefore, the execution is invalid after loop distribution.
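
The failure can be reproduced with a small Python sketch. The slides elide the right-hand sides, so the concrete terms here are placeholders: `+ 1` stands in for the elided part of S1 and `b[i]` for the elided right-hand side of S2. Index 0 holds A(0), and the function names are hypothetical.

```python
def original_loop(a, b):
    """S1 then S2 in each iteration; a[0] plays the role of A(0)."""
    a = list(a)
    n = len(a) - 1
    c = [None] * (n + 1)
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1  # S1: "+ 1" is a placeholder for the elided term
        a[i] = b[i]          # S2: b[i] is a placeholder right-hand side
    return a, c


def naive_distribution(a, b):
    """Distributing the loop as written: every S1 runs before any S2."""
    a = list(a)
    n = len(a) - 1
    c = [None] * (n + 1)
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1  # S1 reads stale A(I-1) values
    for i in range(1, n + 1):
        a[i] = b[i]          # S2 results arrive too late for S1
    return a, c
```

With `a = [0, 0, 0, 0]` and `b = [None, 10, 20, 30]`, the original loop yields `c = [None, 1, 11, 21]`, while the naively distributed version yields `c = [None, 1, 1, 1]`, because S1 never sees the values S2 writes.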

Backward Dependence (Cont.)

  • Statement reordering: If S2 does not depend on S1 in the same iteration, one can swap the syntactic order of S1 and S2.
  • Before
    – DO I = 1, N
      • S1: C(I) = A(I-1) + ...
      • S2: A(I) = ...
    – END DO
  • After
    – DO I = 1, N
      • S2: A(I) = ...
      • S1: C(I) = A(I-1) + ...
    – END DO

Backward Dependence (Cont.)

  • Now with statement reordering and loop distribution, the reordered loop becomes:
    – DO I = 1, N
      • S2: A(I) = ...
    – END DO
    – DO I = 1, N
      • S1: C(I) = A(I-1) + ...
    – END DO
  • Note that all results of S2 are now generated before the execution of S1. The execution result remains valid after loop distribution.
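
A short Python check of this claim, under the same kind of assumptions as any runnable version of these loops must make: the elided right-hand sides are replaced by placeholders (`+ 1` in S1, `b[i]` in S2), index 0 holds A(0), and both function names are hypothetical.

```python
def sequential_reference(a, b):
    """Original statement order: S1 then S2 inside a single loop."""
    a = list(a)
    n = len(a) - 1
    c = [None] * (n + 1)
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1  # S1 (placeholder for the elided term)
        a[i] = b[i]          # S2 (placeholder right-hand side)
    return a, c


def reordered_and_distributed(a, b):
    """S2 moved ahead of S1, then each statement given its own loop."""
    a = list(a)
    n = len(a) - 1
    c = [None] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i]          # S2 loop: every A(I) is produced first
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1  # S1 loop: reads the already-updated A(I-1)
    return a, c
```

Both functions return the same arrays for any input, since each S1 instance still reads A(I-1) after the S2 instance that wrote it.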

Problem

  • Cyclic Dependence: A loop cannot be distributed if there is a cyclic loop-carried dependence.
  • Question: Can we increase the success rate of vectorization in the presence of cyclic loop-carried dependence?

Common Solution

  • Loop interchange: Reverse the roles of the inner and outer loops.
  • In the example, the inner loop has a cyclic loop-carried dependence but the outer loop does not.
    – DO I = 1, N
      • DO J = 1, N
        – S: A(I, J+1) = A(I,J) * B(I, J)
      • END DO
    – END DO
  • With the cyclic dependence, the inner loop cannot be converted to a vector statement.
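
The slides shown here stop before the interchanged loop, so the following Python is only a sketch of what interchange could look like for this example. It assumes 0-based indexing, an `a` with one more column than `b`, and hypothetical function names; a list comprehension stands in for the vector statement.

```python
def j_inner(a, b):
    """Original nest: the inner J loop carries the dependence on A(I, J)."""
    a = [row[:] for row in a]
    n = len(b)
    for i in range(n):
        for j in range(n):
            a[i][j + 1] = a[i][j] * b[i][j]  # needs the value written at j - 1
    return a


def i_inner(a, b):
    """Interchanged nest: the inner I loop has no loop-carried dependence."""
    a = [row[:] for row in a]
    n = len(b)
    for j in range(n):
        # All rows are independent for a fixed j, so this acts as one vector multiply.
        column = [a[i][j] * b[i][j] for i in range(n)]
        for i in range(n):
            a[i][j + 1] = column[i]          # vector store into column j + 1
    return a
```

The dependence runs only along J, so moving I to the inner position keeps every read of A(I, J) after the write that produces it while leaving the inner loop free of dependences.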