CSE 160 Chien, Spring 2005 Lecture #16, Slide 1
Data and Object Parallel Models
- Last Time
  » Pipelined
  » Systolic
  » Workflow
- Today
  » Data Parallel Programming
  » Object Parallel Programming
- Reminders/Announcements
  » HW#3 is returned today
  » HW#4 is out today and due Thursday, June 2, in lecture.
Data Parallel Programming
- Observation: In many cases, large collections of data (vectors, matrices, etc.) can be operated on in parallel.
- Idea: provide a clear mechanism for enabling users to specify such parallelism
- => extend the sequential programming model with “data parallel” operations
- => these operations explicitly specify the parallel structure, so the compiler need prove nothing to execute in parallel
- => idea similar to vector/MMX machine instructions
Fortran 90 -- Data Parallel Extension
- Arrays can be treated as data parallel collections
- Elementwise operations on ensembles
- We have told the compiler what is parallel, but how much flexibility have we granted? What about between operations?
- How useful is this? Need more power...
real A(100), B(100), C(100)
A = B * C    ! computes elementwise product
A = A + C    ! and sum
Data Parallel vs. Sequential Program
- Computationally equivalent sequential programs
- Parallelism did not affect semantics because there are no hazards in the read/write sets (almost)
- Specifying the parallelism means the compiler does not have to prove independence in this case; does that help much?
- Parallel execution model is equivalent to copy-in, copy-out (no dependences)
real A(100), B(100), C(100)
do i = 1, 100
  A(i) = B(i) * C(i)
enddo
do i = 1, 100
  A(i) = A(i) + C(i)
enddo
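The "(almost)" above matters once an array appears on both sides of a data parallel assignment. A minimal Python sketch, with lists standing in for Fortran arrays and function names that are illustrative only (not from the lecture), contrasts copy-in, copy-out semantics with a naive ascending loop for a shift such as A(2:n) = A(1:n-1):

```python
def shift_data_parallel(a):
    """Copy-in, copy-out: read the whole right-hand side, then write."""
    result = list(a)
    rhs = a[:-1]             # copy-in: snapshot of A(1:n-1)
    result[1:] = rhs         # copy-out: write A(2:n)
    return result


def shift_naive_loop(a):
    """Ascending sequential loop: each write feeds the next read."""
    result = list(a)
    for i in range(1, len(result)):
        result[i] = result[i - 1]   # reads a value written one iteration earlier
    return result


data = [1, 2, 3, 4, 5]
parallel_result = shift_data_parallel(data)   # [1, 1, 2, 3, 4]
loop_result = shift_naive_loop(data)          # [1, 1, 1, 1, 1]
```

The two results differ, so a compiler that turns such an assignment into a loop has to pick a safe loop direction or use a temporary. The example on this slide has no such overlap, which is why the loops shown are equivalent.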
Implementing Data Parallel Languages
- How do we get performance?
  » Exploit parallelism within an operation (vector parallelism)
  » Exploit concurrency across data parallel operations (pipelining) -- still requires compiler support
- Two approaches
  » Deeply pipelined machines (already have these for uniprocessors)
  » Parallel machines, distributing parts of the parallel operations and operating in parallel
  » => Parallel vector processors, parallel microprocessors with some type of interconnect
Exploiting Data Parallelism
- Sequential issue and completion of Data Parallel operations
- Problem: data movement and coordination do not scale well (how much do you need? how fast?)
  » Important in SIMD machines, REALLY IMPORTANT in message passing machines
- More aggressive: pipeline the operations (equivalent of vector chaining); for sequential processors, you would like to do strip mining and loop fusion (registers and cache)
- => Analyze across DP operations, and control data placement
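As a rough illustration of loop fusion and strip mining, the sketch below applies both to the earlier pair of operations A = B * C and A = A + C. It is written in Python for readability; the function names and the strip length of 4 are assumptions chosen for the example, not values from the lecture.

```python
def unfused(b, c):
    """Two data parallel operations, each a full pass over the arrays."""
    a = [bi * ci for bi, ci in zip(b, c)]    # A = B * C
    a = [ai + ci for ai, ci in zip(a, c)]    # A = A + C
    return a


def fused_strip_mined(b, c, strip=4):
    """One pass over the data, processed in strips of `strip` elements."""
    n = len(b)
    a = [0.0] * n
    for start in range(0, n, strip):                    # strip mining: loop over strips
        for i in range(start, min(start + strip, n)):   # work within one strip
            a[i] = b[i] * c[i] + c[i]                   # loop fusion: both operations at once
    return a


b = [float(i) for i in range(10)]
c = [float(2 * i + 1) for i in range(10)]
same = unfused(b, c) == fused_strip_mined(b, c)
```

In the fused form each element of A is produced once while B(i) and C(i) are still in registers or cache, instead of being written by the first operation and reloaded by the second.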
Compiler Analysis for Overlap
- Classical Dependence Testing
- Similar to what’s required for parallelization
- => Data parallelism doesn’t solve the entire problem, but does provide some control
- => Simple programming interface enables compiler analysis, but also restricts expressible parallelism...
leftcols(1:200:1,1:100:1) = A(1:200:1,1:200:2)     ! can these be
rightcols(1:200:1,1:100:1) = A(1:200:1,2:200:2)    ! overlapped?
do i = 1, 200                                      ! can these be parallel/overlapped?
  do j = 1, 100
    leftcols(i,j) = A(i,j*2-1)
    rightcols(i,j) = A(i,j*2)
  enddo
enddo
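One way to see why the two section assignments can be overlapped is to compare their read and write sets. The Python sketch below does this by brute-force enumeration of array elements; a real compiler would use symbolic tests (the GCD test, for example) instead, and the helper names here are hypothetical.

```python
def section(lo, hi, step=1):
    """Indices of a Fortran-style section lo:hi:step (upper bound inclusive)."""
    return set(range(lo, hi + 1, step))


def accesses(array, rows, cols):
    """Set of (array, i, j) elements touched by a 2-D array section."""
    return {(array, i, j) for i in rows for j in cols}


def independent(stmt1, stmt2):
    """True if nothing written by one statement is read or written by the other."""
    (w1, r1), (w2, r2) = stmt1, stmt2
    return not (w1 & (r2 | w2)) and not (w2 & r1)


rows = section(1, 200)
# leftcols(1:200,1:100) = A(1:200,1:200:2), stored as (writes, reads)
s1 = (accesses("leftcols", rows, section(1, 100)),
      accesses("A", rows, section(1, 200, 2)))
# rightcols(1:200,1:100) = A(1:200,2:200:2)
s2 = (accesses("rightcols", rows, section(1, 100)),
      accesses("A", rows, section(2, 200, 2)))
can_overlap = independent(s1, s2)   # True: the statements only share reads of A
```

Both statements only read A and write different arrays, so there is no flow, anti, or output dependence between them.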
Importance of Controlling Data Placement: Locality
- What is the flop/communication balance in most machines?
  » Microprocessors
    - DEC Alpha 21064 ~ 4 Flops/64-bit word
    - DEC Alpha 21164 ~ 12 Flops/64-bit word
    - AMD Opteron ~ 18 Flops/64-bit word
  » Larger Scale Parallel Machines (100+ nodes)
    - IBM SP2 ~ 200/(40/8) = 40
    - T3D ~ 150/(200/8) = 6
    - FWGrid ~ 1800/(120/8) = 120
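The ratios for the larger machines appear to divide a per-node peak rate by a per-node bandwidth converted from bytes to 64-bit words. The Python sketch below reproduces that arithmetic. The units (MFlops and MB/s) are an assumption, since the slide does not state them, and the function name is illustrative.

```python
def flops_per_word(peak_mflops, bandwidth_mbytes_per_s, bytes_per_word=8):
    """Floating point operations available per 64-bit word communicated."""
    mwords_per_s = bandwidth_mbytes_per_s / bytes_per_word   # millions of words per second
    return peak_mflops / mwords_per_s


# (peak rate, bandwidth) pairs taken from the slide; units assumed as noted above
machines = {"IBM SP2": (200, 40), "T3D": (150, 200), "FWGrid": (1800, 120)}
balance = {name: flops_per_word(rate, bw) for name, (rate, bw) in machines.items()}
```

A higher ratio means more computation must be done per word moved to keep the processor busy, which is why data placement matters more on such machines.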
Decomposition Statement
DECOMPOSITION D(N,N)
Data Decomposition - Alignment
- Controls how arrays are aligned with respect to one another
- Enables reducing data movement when operating across arrays
- Array operations between aligned arrays are usually more efficient than array operations between arrays that are not known to be aligned
Alignment Example
REAL A(N,N)
DECOMPOSITION D(N,N)
ALIGN A(I,J) with D(J-2,I+3)
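Read as an index mapping, this ALIGN statement transposes A and shifts it by a constant offset. The Python sketch below spells out that mapping. The value N = 8 and the helper names are assumptions for illustration, and how a compiler treats elements that map outside D is not covered here.

```python
N = 8  # hypothetical array and decomposition extent


def align_a_to_d(i, j):
    """ALIGN A(I,J) with D(J-2,I+3): position in D that A(i,j) is aligned with."""
    return (j - 2, i + 3)


def in_decomposition(pos, n=N):
    """True if a 1-based position lies inside DECOMPOSITION D(n,n)."""
    return all(1 <= p <= n for p in pos)


example = align_a_to_d(1, 3)   # (1, 4): dimensions swapped, then offset
inside = sum(in_decomposition(align_a_to_d(i, j))
             for i in range(1, N + 1) for j in range(1, N + 1))
```

With these bounds only part of A lands inside D(N,N), which shows that the offsets in an alignment interact with the declared size of the decomposition.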
Data Decomposition - Distribution
- The 2nd level of parallelism is distribution (machine mapping), that is, how arrays are distributed across the parallelism of the physical machine
- Choice and performance of distribution is affected by the topology, communication mechanisms, size of local memory, and number of processors on the underlying machine
- Specified by assigning an independent attribute to each dimension.
- Predefined attributes include BLOCK, CYCLIC, and BLOCK_CYCLIC; dimensions marked ":" are not distributed
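The three predefined attributes can be understood as different owner functions for one dimension. The Python sketch below is one common formulation, assuming 1-based global indices, processors numbered from 0, and a BLOCK size of ceil(n/p). These conventions and the function names are assumptions, not definitions from the lecture.

```python
import math


def owner_block(i, n, p):
    """BLOCK: contiguous chunks of ceil(n/p) elements per processor."""
    block = math.ceil(n / p)
    return (i - 1) // block


def owner_cyclic(i, p):
    """CYCLIC: elements dealt out round-robin, one at a time."""
    return (i - 1) % p


def owner_block_cyclic(i, p, b):
    """BLOCK_CYCLIC: blocks of b elements dealt out round-robin."""
    return ((i - 1) // b) % p


# Owners of the first 8 elements of a 100-element dimension on 4 processors
block_owners = [owner_block(i, 100, 4) for i in range(1, 9)]           # all 0
cyclic_owners = [owner_cyclic(i, 4) for i in range(1, 9)]              # 0,1,2,3,0,1,2,3
block_cyclic_owners = [owner_block_cyclic(i, 4, 2) for i in range(1, 9)]  # 0,0,1,1,2,2,3,3
```

BLOCK keeps neighbors together, which suits stencil codes, while CYCLIC spreads work evenly when the cost per element varies along the dimension.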
Partition Analysis
Original program
REAL A(100)
do i = 1, 100
A(i) = 0.
enddo
SPMD node Program
REAL A(25)
do i = 1, 25
A(i) = 0.
enddo
- Converting global to local indices
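For a BLOCK distribution, converting between global and local indices is simple integer arithmetic. The Python sketch below assumes 1-based indices, processors numbered from 0, and a block size of 25 (100 elements on 4 processors, as in the example above). The function names are hypothetical.

```python
def global_to_local(i, block):
    """Map a 1-based global index to (processor, 1-based local index)."""
    return (i - 1) // block, (i - 1) % block + 1


def local_to_global(proc, li, block):
    """Inverse mapping: processor number and local index to global index."""
    return proc * block + li


def local_bounds(proc, lo, hi, block):
    """Local loop bounds on `proc` for the global loop lo..hi, or None if empty."""
    first = max(lo, proc * block + 1)
    last = min(hi, (proc + 1) * block)
    if first > last:
        return None
    return first - proc * block, last - proc * block


# The loop "do i = 1, 100" becomes "do i = 1, 25" on every one of 4 processors
per_proc = [local_bounds(p, 1, 100, 25) for p in range(4)]
```

When the global loop does not cover the whole array (for example 2..99), the first and last processors get shortened local bounds, which is where per-processor bound variables in generated code come from.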
Jacobi Relaxation Code
REAL A(100,100), B(100,100)
DECOMPOSITION D(100,100)
ALIGN A, B with D
DISTRIBUTE D(:,BLOCK)
do k = 1,time
  do j = 2, 99
    do i = 2, 99
S1    A(i,j) = (B(i,j-1)+B(i-1,j)+B(i+1,j)+B(i,j+1))/4
    enddo
  enddo
  do j = 2, 99
    do i = 2, 99
S2    B(i,j) = A(i,j)
    enddo
  enddo
enddo
Jacobi Relaxation Processor Layout
- Compiling for a four-processor machine.
- Both arrays A and B are aligned identically with decomposition D, so they have the same distribution as D.
- Because the first dimension of D is local and the second dimension is block-distributed, the local index set for both A and B on each processor (in local indices) is [1:100,1:25].
Jacobi Relaxation cont.
Generated Jacobi cont.
    do j = lb1,ub
      do i = 2, 99
        B(i,j) = A(i,j)
      enddo
    enddo
enddo
- The only true cross-processor dependences are carried by the k loop, so the messages can be vectorized
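The effect of the block-column layout can be simulated in a few lines. The Python sketch below is not the generated code from the slides: it is a single-process simulation, assuming the processor count divides the array size, in which each "processor" keeps its own columns plus two ghost columns and receives one whole boundary column per neighbor per time step (the vectorized message). All names are illustrative.

```python
def jacobi_sequential(b, steps):
    """Reference version: 4-point average over the interior, then copy back."""
    n = len(b)
    b = [row[:] for row in b]
    for _ in range(steps):
        a = [row[:] for row in b]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                a[i][j] = (b[i][j - 1] + b[i - 1][j] + b[i + 1][j] + b[i][j + 1]) / 4
        b = a
    return b


def jacobi_block_columns(b, steps, nprocs):
    """Same computation with columns BLOCK-distributed over nprocs owners."""
    n = len(b)
    width = n // nprocs                       # assumes nprocs divides n
    # local[p][i][c]: c = 0 and c = width + 1 are ghost columns
    local = [[[0.0] * (width + 2) for _ in range(n)] for _ in range(nprocs)]
    for p in range(nprocs):
        for i in range(n):
            local[p][i][1:width + 1] = b[i][p * width:(p + 1) * width]
    for _ in range(steps):
        # Communication: one whole boundary column per neighbor per time step.
        for p in range(nprocs):
            for i in range(n):
                if p > 0:
                    local[p][i][0] = local[p - 1][i][width]
                if p < nprocs - 1:
                    local[p][i][width + 1] = local[p + 1][i][1]
        # Computation: purely local once the ghost columns are filled.
        for p in range(nprocs):
            old = [row[:] for row in local[p]]
            lb = 2 if p == 0 else 1                        # skip the global first column
            ub = width - 1 if p == nprocs - 1 else width   # skip the global last column
            for c in range(lb, ub + 1):
                for i in range(1, n - 1):
                    local[p][i][c] = (old[i][c - 1] + old[i - 1][c]
                                      + old[i + 1][c] + old[i][c + 1]) / 4
    # Gather the owned columns back into one global array.
    return [[local[j // width][i][j % width + 1] for j in range(n)] for i in range(n)]


grid = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
matches = jacobi_block_columns(grid, 3, 4) == jacobi_sequential(grid, 3)
```

Because the first dimension is local, all communication is between column neighbors, and it happens once per iteration of the outer time loop rather than once per element.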
Controlling Data Layout, HPF Style
- Arrays are the major aggregate data structure
- DISTRIBUTEd over an abstract processor array
- ALIGNed with each other (syntactic convenience)
- BLOCK, CYCLIC distributions
- Basic control => what are the limitations?
Arrays / ALIGNed / DISTRIBUTEd / Implemented
ALIGNed: ALIGN A(I) with B(I); ALIGN A(I) with B(2*I); ALIGN C(*,:) with B(:) (columns with elts)
DISTRIBUTEd: DISTRIBUTE A(block); DISTRIBUTE C(block,cyclic); DISTRIBUTE B(cyclic)
Implemented: various
Other Data Parallel Languages
- Model applies in numerous language settings
  » Data Parallel C, C*, MPC
- Object Parallelism
  » pC++ (parallel over object arrays)
  » ICC++, RWC++
  » Parallel Javas
- Essential elements
  » Aligned parallelism, single-threaded semantics
  » Explicit data placement (?)
- Perspective: which of the application types would you want to write in a data parallel language?
  » forall and independent add some flexibility
Object Parallelism
- Challenge in Data Parallelism is getting Aligned Parallelism
  » Irregularity due to Boundary Conditions
  » Irregularity due to Problem Structure
  » Ex: Jacobi, Finite Element Types, Different Bucket Sizes, Web Pages with different # of outlinks
- Object Parallel Languages
  » Paralation Lisp, HPC++, Parallel Java Dialects
  » Arrays of Objects + Subtyping => Polymorphism
    - Can express irregularity
    - Can Implement it Efficiently on MIMD machines
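The "arrays of objects + subtyping" idea can be sketched as one parallel operation over a collection whose elements differ in type. The Python example below is only an illustration: the Element, Triangle, and Quad classes are hypothetical stand-ins for finite element types, and a thread pool stands in for an object parallel runtime (standard CPython threads show the structure, not a real speedup for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


class Element:
    """Base type: every element knows how to compute its own area."""

    def area(self):
        raise NotImplementedError


class Triangle(Element):
    def __init__(self, base, height):
        self.base, self.height = base, height

    def area(self):
        return 0.5 * self.base * self.height


class Quad(Element):
    def __init__(self, width, height):
        self.width, self.height = width, height

    def area(self):
        return self.width * self.height


def parallel_areas(elements, workers=4):
    """Apply the same method to every object; dispatch picks the right code."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: e.area(), elements))


mesh = [Triangle(2, 3), Quad(2, 3), Triangle(4, 1)]
areas = parallel_areas(mesh)
```

The single call stays data parallel in shape, while polymorphism absorbs the irregularity: each object runs its own version of the method, which maps naturally onto MIMD machines where processors can execute different code.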