Cell Programming Tutorial - Lecture Notes | CISC 879, Study notes of Computer Science

Material Type: Notes; Class: ADVANCED PARALLEL PROGRAMMING; Subject: Computer/Information Sciences; University: University of Delaware; Term: Spring 2008;

CISC879: Software Support for Multicore Architectures
Spring 2008
Lecture 9: March 11
Lecturer: John Cavazos
Scribe: Brice Dobry
Cell Programming Tutorial
Outline:
I. Cell Basics
II. Programming Models
III. Programming Details
IV. Example Code - see slides for example code
I. Cell Basics
- Heterogeneous architecture (9 cores)
  • 1 PPE - General purpose processor
    ✴ In the picture, you can see that the PPE takes up a large amount of space on the die
    ✴ Basically just a slightly modified PowerPC
    ✴ Good for control-plane or “branchy” code
  • 8 SPEs - SIMD processors
    ✴ Good for computational code with few branches
    ✴ On the PS3, one is disabled for yield reasons, and another is used by the game OS
    ✴ We noted that it is strange that the game OS can run well on an SPE, since you would think that OS code would be very “branchy”
- Program Structure
  • PPE code
    ✴ Regular Linux process (main thread)
    ✴ Can spawn SPE threads
  • SPE code
    ✴ Can be embedded in the PPE code
    ✴ Also can be standalone “SPUlet”
- SPE details
  • All instructions are SIMD

  • 128 128-bit registers for each SPE
  • 256 KB Local Store (LS)
    ✴ Basically a software-managed cache
    ✴ Accessed with load/store instructions (16 bytes at a time)
    ✴ Contains data and instructions
    ✴ Allows “overlaying” functions: if there is not enough room to fit two mutually exclusive functions, they can occupy the same space in the LS and be brought in from main memory when needed
  • DMA transfers are used to move data/instructions to/from main memory
    ✴ High bandwidth - 128 bytes per cycle
    ✴ Need to verify whether or not DMA transfers can go directly from one LS to another thread's LS
  • Non-deterministic features of standard general purpose processors are removed
    ✴ Out of order execution
    ✴ HW managed cache
    ✴ HW branch prediction
    ✴ Allows for smaller logic size of the SPEs
  • Register Layout
  • It is inefficient to operate on less than 16-byte “units”; smaller “units” should be packed and operated on with one instruction
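The packing advice above can be illustrated with a small sketch in portable C. It is not SPE code: the `vec4f` type and the `vec4f_add` and `add_packed` helpers are hypothetical names made up for this example, and the inner loop only models what a single SIMD add would do on a 128-bit register holding four floats.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit SPE register viewed as four floats. */
typedef struct {
    float lane[4];
} vec4f;

/* Lane-wise add. On an SPE this would be one SIMD instruction; the loop
   here only models its effect in portable C. */
static vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

/* Adds x and y into out, one 16-byte unit (four floats) per step.
   Assumes n is a multiple of 4. */
static void add_packed(const float *x, const float *y, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        vec4f a, b, r;
        memcpy(a.lane, x + i, sizeof a.lane);
        memcpy(b.lane, y + i, sizeof b.lane);
        r = vec4f_add(a, b);
        memcpy(out + i, r.lane, sizeof r.lane);
    }
}
```

Calling `add_packed(x, y, out, 8)` on two 8-element arrays performs two packed adds instead of eight scalar ones, which is the point of filling each 16-byte unit.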

  • Since all instructions and data must fit in only 256 KB, a streaming model is often used: data is brought into the LS a piece at a time, processed, and sent back to main memory

II. Programming Models

- Choosing which model to use depends on how the data/computation can be partitioned
  • Program structure
  • Data structures used
- Also need to consider how to DMA in and out efficiently
- Possible models
  • Data parallel
    ✴ A large array of data is fed through the SPEs, which do the same calculation on each data segment
  • Task parallel
    ✴ Pipeline style where each SPE does a computation and then passes its output to the next SPE
    ✴ This model seems to be too inefficient in practice and can usually be morphed into one of the other models
  • Job queue
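For the data-parallel model listed above, the partitioning step can be sketched in plain C. This is only an illustration under stated assumptions: `NUM_WORKERS`, `segment`, `segment_for`, `process_segment`, and `run_data_parallel` are hypothetical names, the calculation (squaring) is arbitrary, and the workers run one after another here, whereas on Cell each segment would be handed to its own SPE.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WORKERS 8  /* assumption: one worker per SPE */

/* Hypothetical descriptor for the slice of the array given to one worker. */
typedef struct {
    size_t begin;  /* index of the first element */
    size_t count;  /* number of elements */
} segment;

/* Splits n elements into NUM_WORKERS contiguous segments whose sizes
   differ by at most one element. */
static segment segment_for(size_t worker, size_t n)
{
    size_t base = n / NUM_WORKERS;
    size_t extra = n % NUM_WORKERS;
    segment s;
    s.begin = worker * base + (worker < extra ? worker : extra);
    s.count = base + (worker < extra ? 1 : 0);
    return s;
}

/* The same calculation applied by every worker: square each element. */
static void process_segment(float *data, segment s)
{
    for (size_t i = s.begin; i < s.begin + s.count; i++) {
        data[i] = data[i] * data[i];
    }
}

/* Runs the workers sequentially; on Cell each call would run on its own SPE. */
static void run_data_parallel(float *data, size_t n)
{
    for (size_t w = 0; w < NUM_WORKERS; w++) {
        process_segment(data, segment_for(w, n));
    }
}
```

Because the segments are disjoint and every worker runs the same code, no worker needs to communicate with another, which is what makes this model the easiest one to map onto the SPEs.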

✴ 32-bit messages

✴ 2 for sending, 1 for receiving

  • Signals
  • mfc_put(lsaddr,ea,size,tag,tid,rid)
    • Copy memory from my LS to the main memory
    • lsaddr is the address in my local store, ea is the address in main memory, and size is the number of bytes to transfer
    • tag is used by later calls to determine when the transfer has completed
  • mfc_get(lsaddr,ea,size,tag,tid,rid)
    • Copy memory from the main memory to my LS
    • lsaddr is the address in my local store, ea is the address in main memory, and size is the number of bytes to transfer
    • tag is used by later calls to determine when the transfer has completed
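The get/compute/put pattern and the role of the tag can be sketched with a host-side simulation. Nothing below is the Cell SDK: `sim_get`, `sim_put`, `sim_wait`, `scale_block`, and the `main_mem` array are hypothetical stand-ins, the tid and rid arguments are left out, and a queued transfer is modeled as completing only when its tag is waited on. On real hardware the transfer proceeds asynchronously, and the wait is typically done with `mfc_write_tag_mask` followed by `mfc_read_tag_status_all`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAIN_MEM_SIZE 1024
#define MAX_PENDING   8

static unsigned char main_mem[MAIN_MEM_SIZE];  /* simulated main memory */

/* One queued transfer; ea is modeled as an offset into main_mem. */
typedef struct {
    void *dst;
    const void *src;
    uint32_t size;
    uint32_t tag;
} transfer;

static transfer pending[MAX_PENDING];
static int num_pending = 0;

static void enqueue(void *dst, const void *src, uint32_t size, uint32_t tag)
{
    assert(num_pending < MAX_PENDING);
    pending[num_pending++] = (transfer){dst, src, size, tag};
}

/* Models mfc_get: main memory -> local store. */
static void sim_get(void *lsaddr, uint64_t ea, uint32_t size, uint32_t tag)
{
    assert(ea + size <= MAIN_MEM_SIZE);
    enqueue(lsaddr, main_mem + ea, size, tag);
}

/* Models mfc_put: local store -> main memory. */
static void sim_put(const void *lsaddr, uint64_t ea, uint32_t size, uint32_t tag)
{
    assert(ea + size <= MAIN_MEM_SIZE);
    enqueue(main_mem + ea, lsaddr, size, tag);
}

/* Completes every queued transfer that carries this tag. */
static void sim_wait(uint32_t tag)
{
    int kept = 0;
    for (int i = 0; i < num_pending; i++) {
        if (pending[i].tag == tag) {
            memcpy(pending[i].dst, pending[i].src, pending[i].size);
        } else {
            pending[kept++] = pending[i];
        }
    }
    num_pending = kept;
}

/* Fetches count ints, doubles them in a local buffer, and writes them back. */
static void scale_block(uint64_t ea, uint32_t count)
{
    int local[16];  /* stand-in for a buffer in the local store */
    uint32_t bytes = count * (uint32_t)sizeof(int);
    assert(count <= 16);

    sim_get(local, ea, bytes, 3);
    sim_wait(3);  /* the data is not usable until the tag completes */
    for (uint32_t i = 0; i < count; i++) {
        local[i] *= 2;
    }
    sim_put(local, ea, bytes, 3);
    sim_wait(3);  /* local must stay valid until the put completes */
}
```

The two waits in `scale_block` are the part worth noticing: the buffer may not be read after a get, or reused after a put, until the transfer with that tag has finished.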

  • Double-buffering can be used to help hide the DMA latency
    • While doing operation n, put the results of operation n-1 to main memory and get the input for operation n+