



Material Type: Notes; Class: ADVANCED PARALLEL PROGRAMMING; Subject: Computer/Information Sciences; University: University of Delaware; Term: Spring 2008;




CISC879: Software Support for Multicore Architectures Spring 2008
Lecturer: John Cavazos
Scribe: Brice Dobry

Cell Programming Tutorial

Outline:
I. Cell Basics
II. Programming Models
III. Programming Details
IV. Example Code - see slides for example code

I. Cell Basics
- Heterogeneous architecture (9 cores)
  ‣ 1 PPE - General purpose processor
    ✴ In the picture, you can see that the PPE takes up a large amount of space on the die
    ✴ Basically just a slightly modified PowerPC
    ✴ Good for control-plane or “branchy” code
  ‣ 8 SPEs - SIMD processors
    ✴ Good for computational code with few branches
    ✴ On the PS3, one is disabled for yield reasons, and another is used by the game OS
    ✴ We noted that it is strange that the game OS can run well on an SPE, since you would think that OS code would be very “branchy”
- Program Structure
  ‣ PPE code
    ✴ Regular Linux process (main thread)
    ✴ Can spawn SPE threads
  ‣ SPE code
    ✴ Can be embedded in the PPE code
    ✴ Can also be a standalone “SPUlet”
- SPE details
  ‣ All instructions are SIMD
  ‣ 128 128-bit registers for each SPE
  ‣ 256 KB Local Store (LS)
    ✴ Basically a software-managed cache
    ✴ Accessed with load/store instructions (16 bytes at a time)
    ✴ Contains data and instructions
    ✴ Allows “overlaying” functions: if there is not enough room to fit two mutually exclusive functions, they can occupy the same space in the LS and be brought in from main memory when needed
  ‣ DMA transfers are used to move data/instructions to/from main memory
    ✴ High bandwidth - 128 bytes per cycle
    ✴ Need to verify whether DMA transfers can go directly from one thread's LS to another thread's LS
  ‣ Non-deterministic features of standard general purpose processors are removed
    ✴ Out of order execution
    ✴ HW managed cache
    ✴ HW branch prediction
    ✴ Removing these allows for a smaller logic size in the SPEs
  ‣ Register Layout
  ‣ It is inefficient to operate on less than 16-byte “units”; smaller “units” should be packed and operated on with one instruction
  ‣ Since all instructions and data must fit in only 256 KB, a streaming model is often used

II. Programming Models
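Before comparing models, it may help to see what "streaming" through a small local store looks like, since every model below has to move its data this way. The following is a minimal sketch in portable C, not actual Cell code: `dma_get` and `dma_put` are hypothetical stand-ins implemented with `memcpy`, and `CHUNK_FLOATS`, `compute`, and `stream_scale` are names invented for illustration. On a real SPE the transfers would be asynchronous MFC DMA commands, and the code would have to wait for each transfer to complete before touching its buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_FLOATS 1024  /* 4 KB per buffer, far below the 256 KB LS limit */

/* Hypothetical stand-in for a DMA transfer from main memory into the LS. */
static void dma_get(float *ls_dst, const float *mem_src, size_t n) {
    memcpy(ls_dst, mem_src, n * sizeof(float));
}

/* Hypothetical stand-in for a DMA transfer from the LS back to main memory. */
static void dma_put(float *mem_dst, const float *ls_src, size_t n) {
    memcpy(mem_dst, ls_src, n * sizeof(float));
}

/* Number of elements in the chunk that starts at offset `off`. */
static size_t chunk_len(size_t total, size_t off) {
    size_t left = total - off;
    return left < CHUNK_FLOATS ? left : CHUNK_FLOATS;
}

/* Per-chunk computation: scale every element of the buffer in place. */
static void compute(float *buf, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        buf[i] *= k;
    }
}

/* Stream `total` floats from `in` to `out` through two small buffers,
 * so that only 2 * CHUNK_FLOATS elements are ever resident at once. */
void stream_scale(const float *in, float *out, size_t total, float k) {
    float ls[2][CHUNK_FLOATS];  /* the two "local store" buffers */
    int cur = 0;

    assert(in != NULL && out != NULL);
    if (total == 0) {
        return;
    }

    /* Fetch the first chunk before entering the loop. */
    dma_get(ls[cur], in, chunk_len(total, 0));

    for (size_t off = 0; off < total; off += CHUNK_FLOATS) {
        size_t n = chunk_len(total, off);
        size_t next_off = off + CHUNK_FLOATS;
        int nxt = cur ^ 1;

        /* Request the next chunk into the other buffer. With real
         * asynchronous DMA this transfer would overlap the compute
         * below; with memcpy it simply finishes first. */
        if (next_off < total) {
            dma_get(ls[nxt], in + next_off, chunk_len(total, next_off));
        }

        compute(ls[cur], n, k);         /* work on the resident chunk */
        dma_put(out + off, ls[cur], n); /* write results back */
        cur = nxt;                      /* swap buffer roles */
    }
}
```

Calling `stream_scale(in, out, 2500, 2.0f)` would process the array as three chunks (1024, 1024, and 452 elements). Two buffers are used rather than one so that, with real asynchronous transfers, the fetch of chunk c+1 can hide behind the computation on chunk c.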
- Choosing which model to use depends on how the data/computation can be partitioned
  ‣ Program structure
  ‣ Data structures used
- Also need to consider how to DMA in and out efficiently
- Possible models
  ‣ Data parallel
    ✴ A large array of data is fed through the SPEs, which do the same calculation on each data segment
  ‣ Task parallel
    ✴ Pipeline style where each SPE does a computation and then passes its output to the next SPE
    ✴ This model seems to be too inefficient in practice and can usually be morphed into one of the other models
  ‣ Job queue