Data and Object Parallel Models - Lecture Slides | CSE 160

Material Type: Notes; Class: CSE 160; University: University of California - San Diego; Term: Spring 2005;

CSE 160 Chien, Spring 2005 Lecture #16, Slide 1

Data and Object Parallel Models

  • Last Time
      » Pipelined
      » Systolic
      » Workflow
  • Today
      » Data Parallel Programming
      » Object Parallel Programming
  • Reminders/Announcements
      » HW#3’s returned today
      » HW#4 is out today… and due Thursday, June 2 in lecture.

Data Parallel Programming

  • Observation: In many cases, large collections of data (vectors, matrices, etc.) can be operated on in parallel.
  • Idea: provide a clear mechanism for enabling users to specify such parallelism
  • => extend the sequential programming model with “data parallel” operations
  • => these operations explicitly specify the parallel structure, so the compiler does not need to prove independence before executing them in parallel
  • => idea similar to vector/MMX machine instructions


Fortran 90 -- Data Parallel Extension

  • Arrays can be treated as data parallel collections
  • Elementwise operations on ensembles
  • We have told the compiler what is parallel within an operation, but how much flexibility have we granted between operations?
  • How useful is this? Need more power...

real A(100), B(100), C(100)

A = B * C   ! computes elementwise product
A = A + C   ! and sum

Data Parallel vs. Sequential Program

  • Computationally equivalent sequential programs
  • Parallelism didn’t affect semantics because there are no hazards in the read/write sets (almost)
  • Specifying the parallelism means the compiler doesn’t have to prove it for this case; does it help much?
  • Parallel execution model is equivalent to copy-in, copy-out (no dependences)

real A(100), B(100), C(100)
do i = 1, 100
  A(i) = B(i) * C(i)
enddo
do i = 1, 100
  A(i) = A(i) + C(i)
enddo
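The equivalence between the array form and the loop form, and the copy-in, copy-out model, can be sketched in Python (used here only for illustration; the slides' own code is Fortran 90). All function names are hypothetical. The shifted assignment is an assumed example that is not on the slides; it is one plausible reading of the "(almost)" caveat, where the right-hand side overlaps the left-hand side and a naive loop no longer matches the array semantics.

```python
def dp_product_then_sum(b, c):
    """Data parallel form: A = B * C, then A = A + C, each on whole arrays."""
    a = [bi * ci for bi, ci in zip(b, c)]   # A = B * C
    a = [ai + ci for ai, ci in zip(a, c)]   # A = A + C
    return a


def sequential_product_then_sum(b, c):
    """Sequential form: the two do-loops from the slide."""
    n = len(b)
    a = [0.0] * n
    for i in range(n):
        a[i] = b[i] * c[i]
    for i in range(n):
        a[i] = a[i] + c[i]
    return a


def shift_copy_in_copy_out(a):
    """Array semantics for A(2:n) = A(1:n-1): read everything, then write."""
    rhs = a[:-1]        # copy-in: snapshot of the whole right-hand side
    out = a[:]
    out[1:] = rhs       # copy-out: write all results at once
    return out


def shift_naive_loop(a):
    """Ascending loop A(i) = A(i-1): each write feeds the next read."""
    out = a[:]
    for i in range(1, len(out)):
        out[i] = out[i - 1]
    return out
```

For the slide's two operations both forms agree, because no element written by one iteration is read by another. For the shifted assignment the naive loop propagates the first element everywhere, so a compiler has to either buffer the right-hand side or reorder the loop.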


Implementing Data Parallel Languages

  • How do we get performance?
      » Exploit parallelism within an operation (vector parallelism)
      » Exploit concurrency across data parallel operations (pipelining) -- still requires compiler support
  • Two approaches
      » Deeply pipelined machines (already have these for uniprocessors)
      » Parallel machines, distributing parts of the parallel operations and operating in parallel
      » => Parallel vector processors, parallel microprocessors with some type of interconnect

Exploiting Data Parallelism

  • Sequential issue and completion of Data Parallel operations
  • Problem: data movement and coordination don’t scale well (how much do you need? how fast?)
      » Important in SIMD machines, REALLY IMPORTANT in message passing machines
  • More aggressive: pipeline the operations (equivalent of vector chaining); for sequential processors, you’d like to do strip mining and loop fusion (registers and cache)
  • => Analyze across DP operations, and control data placement
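Strip mining and loop fusion can be sketched on the earlier pair of operations (A = B * C followed by A = A + C). This is a minimal Python illustration, not compiler output; the function names and the `strip` parameter are hypothetical.

```python
def unfused(b, c):
    """Two separate passes over the data, one per data parallel operation."""
    n = len(b)
    a = [0.0] * n
    for i in range(n):              # A = B * C
        a[i] = b[i] * c[i]
    for i in range(n):              # A = A + C (re-reads A and C from memory)
        a[i] = a[i] + c[i]
    return a


def fused_strip_mined(b, c, strip=4):
    """One pass: the loop is cut into strips and both operations share a body."""
    n = len(b)
    a = [0.0] * n
    for start in range(0, n, strip):                   # strip mining: loop over strips
        for i in range(start, min(start + strip, n)):  # work inside one strip
            t = b[i] * c[i]                            # fused body; t stands in for a register
            a[i] = t + c[i]
    return a
```

The fused version touches each element of B and C once and never stores the intermediate product, which is the register and cache benefit the slide alludes to. Fusion is legal here only because iteration i of the second loop depends on nothing but iteration i of the first.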


Compiler Analysis for Overlap

  • Classical Dependence Testing
  • Similar to what’s required for parallelization
  • => Data parallelism doesn’t solve the entire problem, but does provide some control
  • => Simple programming interface enables compiler analysis, but also restricts expressible parallelism...

leftcols(1:200:1,1:100:1) = A(1:200:1,1:200:2)    ! can these be
rightcols(1:200:1,1:100:1) = A(1:200:1,2:200:2)   ! overlapped?

do i = 1, 200                      ! can these be parallel/overlapped
  do j = 1, 100
    leftcols(i,j) = A(i,2*j-1)     ! odd columns of A, as in A(:,1:200:2)
    rightcols(i,j) = A(i,2*j)      ! even columns of A, as in A(:,2:200:2)
  enddo
enddo
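Whether the two section assignments can be overlapped comes down to comparing their read and write sets. The Python sketch below does this by brute-force enumeration of the column indices, which is an assumption made for clarity; a real compiler would use a symbolic dependence test instead. The names `section`, `independent`, `LEFT`, and `RIGHT` are hypothetical.

```python
def section(lo, hi, stride):
    """Index set of the Fortran 90 triplet lo:hi:stride (upper bound inclusive)."""
    return set(range(lo, hi + 1, stride))


def independent(stmt1, stmt2):
    """True if neither statement writes anything the other reads or writes."""
    writes1, reads1 = stmt1
    writes2, reads2 = stmt2
    return not (writes1 & reads2) and not (writes2 & reads1) and not (writes1 & writes2)


# Each statement is (write set, read set) over (array name, column) pairs.
# Rows are omitted because both statements use every row 1:200.
LEFT = ({("leftcols", j) for j in section(1, 100, 1)},
        {("A", j) for j in section(1, 200, 2)})
RIGHT = ({("rightcols", j) for j in section(1, 100, 1)},
         {("A", j) for j in section(2, 200, 2)})

can_overlap = independent(LEFT, RIGHT)
```

Here the statements write different arrays and only read A, so they are independent. They also happen to read disjoint columns of A, although two reads of the same element would not create a hazard anyway.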

Importance of Controlling Data Placement: Locality

  • What is the flop/communication balance in most machines?
      » Microprocessors
          - DEC Alpha 21064 ~ 4 Flops/64-bit word
          - DEC Alpha 21164 ~ 12 Flops/64-bit word
          - AMD Opteron ~ 18 Flops/64-bit word
      » Larger Scale Parallel Machines (100+ nodes)
          - IBM SP2 ~ 200/(40/8) = 40
          - T3D ~ 150/(200/8) = 6
          - FWGrid ~ 1800/(120/8) = 120
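The ratios for the larger machines can be reproduced with a few lines of Python. The slide does not state units, so this sketch assumes the numerator is a per-node floating-point rate, the denominator is a communication bandwidth in bytes per unit time, and the division by 8 converts bytes to 64-bit words. The function name and dictionary are hypothetical.

```python
def flops_per_word(flop_rate, bandwidth, bytes_per_word=8):
    """Flops that must be performed per 64-bit word communicated to stay balanced."""
    return flop_rate / (bandwidth / bytes_per_word)


# (flop rate, bandwidth) pairs exactly as they appear on the slide
machines = {"IBM SP2": (200, 40), "T3D": (150, 200), "FWGrid": (1800, 120)}
balance = {name: flops_per_word(rate, bw) for name, (rate, bw) in machines.items()}
```

A higher number means more computation is needed per word moved before communication stops being the bottleneck, which is why data placement matters more on such machines.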


Decomposition Statement

DECOMPOSITION D(N,N)

Data Decomposition - Alignment

  • Controls how arrays are aligned with respect to one another
  • Enables reducing data movement when operating across arrays
  • Array operations between aligned arrays are usually more efficient than array operations between arrays that are not known to be aligned


Alignment Example

REAL A(N,N)

DECOMPOSITION D(N,N)

ALIGN A(I,J) with D(J-2,I+3)
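The ALIGN statement above is an index mapping from array elements to decomposition positions. A minimal Python sketch of that mapping follows; the function name is hypothetical, and the sketch deliberately ignores how positions that fall outside D(1:N,1:N) are handled, since the slides do not say.

```python
def align_a_to_d(i, j):
    """Position in D of element A(i,j) under ALIGN A(I,J) with D(J-2,I+3)."""
    return (j - 2, i + 3)


# Two elements in the same row of A land in the same column of D,
# so the alignment both transposes and shifts the array.
same_row = [align_a_to_d(5, j) for j in (3, 4)]
```

Arrays aligned to the same positions of D end up on the same processor once D is distributed, which is what makes operations between aligned arrays cheap.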

Data Decomposition - Distribution

  • The 2nd level of parallelism is distribution (machine mapping), that is, how arrays are distributed across the physical machine
  • Choice and performance of distribution is affected by the topology, communication mechanisms, size of local memory, and number of processors on the underlying machine
  • Specified by assigning an independent attribute to each dimension.
  • Predefined attributes include BLOCK, CYCLIC, and BLOCK_CYCLIC; dimensions marked ":" are not distributed
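The three predefined attributes can be read as owner functions that map an index to a processor. The Python sketch below uses 1-based array indices and 0-based processor numbers; the function names are hypothetical, and the block size of ceil(n/p) is an assumed convention rather than something the slides specify.

```python
def block_owner(i, n, p):
    """Owner of index i when n elements are BLOCK-distributed over p processors."""
    block = -(-n // p)              # ceil(n / p), assumed block size
    return (i - 1) // block


def cyclic_owner(i, p):
    """Owner of index i under CYCLIC: indices are dealt out round-robin."""
    return (i - 1) % p


def block_cyclic_owner(i, p, b):
    """Owner of index i under BLOCK_CYCLIC with blocks of b elements."""
    return ((i - 1) // b) % p
```

For eight elements on four processors, BLOCK gives owners 0,0,1,1,2,2,3,3 and CYCLIC gives 0,1,2,3,0,1,2,3. BLOCK keeps neighbours together (good for stencils), while CYCLIC spreads work evenly when the cost per element varies.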


Partition Analysis

Original program

REAL A(100)
do i = 1, 100
  A(i) = 0.
enddo

SPMD node program

REAL A(25)
do i = 1, 25
  A(i) = 0.
enddo

  • Converting global to local indices
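The global-to-local conversion for this example (100 elements, 25 per node) can be sketched as follows. This assumes a BLOCK distribution over four processors and that the processor count divides the array size evenly; the function names and the 0-based processor numbering are hypothetical.

```python
def global_to_local(g, n=100, p=4):
    """Map 1-based global index g to (processor, 1-based local index)."""
    block = n // p                  # assumes p divides n evenly
    return ((g - 1) // block, (g - 1) % block + 1)


def local_to_global(proc, l, n=100, p=4):
    """Inverse mapping: processor number and local index back to the global index."""
    block = n // p
    return proc * block + l


def spmd_node_program(block=25):
    """What every node runs: the same loop, but over its local index range only."""
    a_local = [None] * block
    for i in range(1, block + 1):
        a_local[i - 1] = 0.0
    return a_local
```

Each node executes the same program text on its own 25 elements, and together the four nodes initialise all 100 elements of the original array.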

Jacobi Relaxation Code

REAL A(100,100), B(100,100)
DECOMPOSITION D(100,100)
ALIGN A, B with D
DISTRIBUTE D(:,BLOCK)
do k = 1,time
  do j = 2,99
    do i = 2,99
S1    A(i,j) = (B(i,j-1)+B(i-1,j)+B(i+1,j)+B(i,j+1))/4
    enddo
  enddo
  do j = 2,99
    do i = 2,99
S2    B(i,j) = A(i,j)
    enddo
  enddo
enddo


Jacobi Relaxation Processor Layout

  • Compiling for a four-processor machine.
  • Both arrays A and B are aligned identically with decomposition D, so they have the same distribution as D.
  • Because the first dimension of D is local and the second dimension is block-distributed, the local index set for both A and B on each processor (in local indices) is [1:100,1:25].

Jacobi Relaxation cont.


Generated Jacobi cont.

  do j = lb1,ub
    do i = 2,99
      B(i,j) = A(i,j)
    enddo
  enddo
enddo

  • The only true cross-processor dependences are carried by the k loop, so messages can be vectorized
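The effect of message vectorization can be sketched with a single-process Python simulation of the column-block Jacobi code. This is not the compiler's generated code, which the slides show only in part. It assumes the standard four-point average, a processor count that divides the array width, and one ghost column on each side of every local block. `jacobi_sequential`, `jacobi_spmd`, and the `messages` counter are hypothetical names.

```python
def jacobi_sequential(b, steps):
    """Reference version: update interior points, then copy A back into B."""
    n = len(b)
    b = [row[:] for row in b]
    a = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                a[i][j] = (b[i][j-1] + b[i-1][j] + b[i+1][j] + b[i][j+1]) / 4
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                b[i][j] = a[i][j]
    return b


def jacobi_spmd(b, steps, p):
    """Simulated SPMD version: columns are BLOCK-distributed over p processors."""
    n = len(b)
    w = n // p                                   # local width; assumes p divides n
    # local[q][i][0] and local[q][i][w+1] are ghost columns of processor q
    local = [[[0.0] + row[q*w:(q+1)*w] + [0.0] for row in b] for q in range(p)]
    messages = 0
    for _ in range(steps):
        for q in range(p):                       # one whole-column message per neighbour
            if q > 0:
                for i in range(n):
                    local[q-1][i][w+1] = local[q][i][1]
                messages += 1
            if q < p - 1:
                for i in range(n):
                    local[q+1][i][0] = local[q][i][w]
                messages += 1
        for q in range(p):                       # purely local computation
            bl = local[q]
            lb = 2 if q == 0 else 1              # skip the global boundary columns
            ub = w - 1 if q == p - 1 else w
            a = [row[:] for row in bl]
            for j in range(lb, ub + 1):
                for i in range(1, n - 1):
                    a[i][j] = (bl[i][j-1] + bl[i-1][j] + bl[i+1][j] + bl[i][j+1]) / 4
            for j in range(lb, ub + 1):
                for i in range(1, n - 1):
                    bl[i][j] = a[i][j]
    gathered = [sum((local[q][i][1:w+1] for q in range(p)), []) for i in range(n)]
    return gathered, messages
```

Because neighbouring values are only needed once per k iteration, each processor sends one column-sized message per neighbour per time step instead of one message per element, giving 2(p-1) messages per step in this sketch.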

Controlling Data Layout, HPF Style

  • Arrays are the major aggregate data structure
  • DISTRIBUTEd over an abstract processor array
  • ALIGNed with each other (syntactic convenience)
  • BLOCK, CYCLIC distributions
  • Basic control => what are the limitations?

Arrays, ALIGNed, DISTRIBUTEd, Implemented

ALIGNed:
ALIGN A(I) with B(I)
ALIGN A(I) with B(2*I)
ALIGN C(*,:) with B(:) (columns with elts)

DISTRIBUTEd:
DISTRIBUTE A(block)
DISTRIBUTE C(block,cyclic)
DISTRIBUTE B(cyclic)

Implemented: various


Other Data Parallel Languages

  • Model applies in numerous language settings
      » Data Parallel C; C*, MPC
  • Object Parallelism
      » pC++ (parallel over object arrays)
      » ICC++, RWC++
      » Parallel Javas
  • Essential elements
      » Aligned parallelism, single threaded semantics
      » Explicit data placement (?)
  • Perspective: which of the application types would you want to write in a data parallel language?
      » forall, independent add some flexibility

Object Parallelism

  • Challenge in Data Parallelism is getting Aligned Parallelism
      » Irregularity due to Boundary Conditions
      » Irregularity due to Problem Structure
      » Ex: Jacobi, Finite Element Types, Different Bucket Sizes, Web Pages with different # of outlinks
  • Object Parallel Languages
      » Paralation Lisp, HPC++, Parallel Java Dialects
      » Arrays of Objects + Subtyping => Polymorphism
          - Can express irregularity
          - Can implement it efficiently on MIMD machines
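The "arrays of objects + subtyping" idea can be sketched in Python, used here only for illustration since the slides name pC++, HPC++, and similar languages without showing code. The example is an assumed one-dimensional relaxation in which boundary irregularity is expressed by subclassing rather than by index tests. All class and function names are hypothetical.

```python
class Cell:
    """Base type for elements of the object array."""

    def __init__(self, value):
        self.value = value

    def update(self, neighbours):
        raise NotImplementedError


class InteriorCell(Cell):
    def update(self, neighbours):
        # Regular case: average of the neighbouring values
        return sum(neighbours) / len(neighbours)


class BoundaryCell(Cell):
    def update(self, neighbours):
        # Irregular case: a boundary keeps its fixed value
        return self.value


def parallel_step(cells):
    """Apply update() to every object; each call is independent of the others."""
    old = [c.value for c in cells]                      # copy-in
    new = []
    for k, cell in enumerate(cells):                    # conceptually a parallel loop
        neighbours = [old[m] for m in (k - 1, k + 1) if 0 <= m < len(old)]
        new.append(cell.update(neighbours))             # polymorphic dispatch
    for cell, v in zip(cells, new):                     # copy-out
        cell.value = v
    return [c.value for c in cells]
```

The same collection-wide operation runs over every element, but each object decides what the operation means for its own type. On a MIMD machine each processor can follow its own control path, so this kind of per-element variation does not force all processors into lockstep.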