














Material Type: Notes; Class: ADVANCED PARALLEL PROGRAMMING; Subject: Computer/Information Sciences; University: University of Delaware; Term: Unknown 2000;















CISC 879 : Software Support for Multicore Architectures
Dept of Computer & Information Sciences
University of Delaware
Conjoined-Core Chip Multiprocessing
• High level: make multithreaded processors more efficient by using fewer transistors to get the same amount of compute power.
• As we know, the number of transistors increases according to Moore’s law.
• (Fairly modern paper) Performance is increasing at a far slower rate, so multithreading is already a growing field.
Single-Chip Multiprocessors (CMP) vs. Simultaneous Multithreading (SMT)
• Near SMT: large processors that possess private copies of critical resources
• Near CMP: pairs of small processors (a little bit more complex) that share a few common components
• Tons of research to improve efficiency near both ends of the spectrum
Objective: Go between the two extremes. What hardware can be shared between cores to cut transistors, but not performance?
• Resources should be large, so that the cost of additional wiring doesn’t outweigh the sharing benefits.
• To keep topology issues to a minimum, investigate sharing between pairs of processors (‘Conjoined-Core Chip Multiprocessing’).
• Components evaluated for sharing:
  • Floating point units
  • Crossbar
  • First-level instruction caches
  • First-level data caches
• Processor similar to Piranha (developed by Compaq in 2000-2001)
  • 8 cores, 64KB L1 cache in each core
  • Private FPU for each core
  • L2 cache shared by all cores
• Processor size: 127.76 sq. mm
  • 64.64 sq. mm: L2 cache (50%)
  • 46.9 sq. mm: Cores (37%)
  • 16.22 sq. mm: Crossbar (13%)
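The area breakdown above can be checked with a few lines of arithmetic. This is only a sketch: the three areas are taken from the slide, while the names `AREAS_MM2` and `area_shares` are made up for illustration. The computed shares land close to the rounded percentages quoted on the slide.

```python
# Component areas in sq. mm, copied from the slide above.
AREAS_MM2 = {"L2 cache": 64.64, "Cores": 46.9, "Crossbar": 16.22}


def area_shares(areas):
    """Return each component's share of the total area, in percent."""
    total = sum(areas.values())
    return {name: 100.0 * area / total for name, area in areas.items()}


total_mm2 = sum(AREAS_MM2.values())
shares = area_shares(AREAS_MM2)

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of {total_mm2:.2f} sq. mm")
```

The three component areas add up to the quoted processor size, so the slide's breakdown is internally consistent.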
FPU Sharing: Performance
• For many benchmarks, barely any performance hit
  • SPECINT (integer): 0.1% FP instructions
  • CFP (floating-point specific): still less than 1/3 of instructions
• Drawback: specialized applications (‘WATER’) can perform very badly without private non-pipelined units (extra 0.4% to have private divide)
• Method: shared fetch path from the ICache to the pipelines of 2 cores (like the FPU, time-shared every other cycle)
• Tested three variations of the shared fetch path:
  • Original fetch width: 4 instructions every other cycle
  • Double fetch width: 8 instructions every other cycle
  • Banked architecture: each thread can access half of the ICache each cycle
    • Perfect case: double fetch width bandwidth
    • Worst case: original fetch width bandwidth
• Core size reduction: 10%
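The three fetch-path variations can be compared by their average per-core fetch bandwidth. The sketch below uses the widths from the slide. The function names and the `conflict_fraction` parameter are hypothetical, and the linear interpolation for the banked case is an assumption made only to connect the slide's stated best and worst cases, not a model from the paper.

```python
ORIGINAL_WIDTH = 4  # instructions per fetch, from the slide


def avg_fetch_bandwidth(width, cycles_per_turn=2):
    """Average instructions per cycle for one core that may fetch
    `width` instructions once every `cycles_per_turn` cycles."""
    return width / cycles_per_turn


original = avg_fetch_bandwidth(ORIGINAL_WIDTH)     # 4 every other cycle
doubled = avg_fetch_bandwidth(2 * ORIGINAL_WIDTH)  # 8 every other cycle


def banked_bandwidth(conflict_fraction):
    """ASSUMPTION: interpolate linearly between the slide's perfect case
    (double-width bandwidth, no bank conflicts) and its worst case
    (original-width bandwidth, a conflict on every access)."""
    if not 0.0 <= conflict_fraction <= 1.0:
        raise ValueError("conflict_fraction must be in [0, 1]")
    return doubled - conflict_fraction * (doubled - original)


print(f"original: {original} instr/cycle")
print(f"doubled:  {doubled} instr/cycle")
print(f"banked at 25% conflicts: {banked_bandwidth(0.25)} instr/cycle")
```

Under this reading, time-sharing halves the average bandwidth at the original width, and doubling the width brings the average back to 4 instructions per cycle.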
ICache Sharing: Performance
• For both workloads, the wider fetch width outperforms the others
  • Without extra latency, no slowdown at all
• Integer workloads are affected more: higher mispredict penalty
• The original structure's fetch width causes ‘fetch limited’ execution
• Not as obvious a candidate for sharing: takes up a huge part of each core, but is very highly utilized
• To share: like the ICache, each core can issue a memory instruction every other cycle.
• Lengthened wires => increase in latency
  • Could be hidden in the pipeline, but conservative estimates are used
• Best core-area savings (22%)
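The every-other-cycle sharing described above can be sketched as a fixed time-slicing arbiter. This is an illustration only: the rule that core 0 owns even cycles and core 1 owns odd cycles is an assumption, and the function names are hypothetical.

```python
def port_owner(cycle):
    """Core (0 or 1) that owns the shared port in this cycle.
    ASSUMPTION: core 0 gets even cycles, core 1 gets odd cycles."""
    return cycle % 2


def issue_cycle(core, ready_cycle):
    """First cycle >= ready_cycle in which `core` owns the shared port."""
    if port_owner(ready_cycle) == core:
        return ready_cycle
    return ready_cycle + 1


def average_wait(core, ready_cycles):
    """Mean number of cycles a core waits for its slot."""
    waits = [issue_cycle(core, c) - c for c in ready_cycles]
    return sum(waits) / len(waits)


# A memory instruction that becomes ready in the other core's slot waits one cycle.
print(issue_cycle(0, 4), issue_cycle(0, 5))
print(average_wait(0, range(10)))
```

With requests spread evenly over cycles, strict slicing adds half a cycle of waiting on average, on top of any latency from the lengthened wires.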
• 1) Reduce the number of links
  • Pairs of processors share input ports to the crossbar interconnect of the L2 cache
• 2) Reduce the width of links
  • Use fewer wires for any point-to-point link (i.e. 1/2, 1/4, or 1/8)
• For each case: remove 50% of the crossbar = 6.5% die area savings
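The 6.5% figure follows from the crossbar's 13% share of the die in the baseline breakdown. A minimal check, with hypothetical names:

```python
CROSSBAR_SHARE_PCT = 13.0  # crossbar share of die area, from the baseline breakdown
FRACTION_REMOVED = 0.5     # half of the crossbar is removed, from the slide above


def die_area_savings_pct(component_share_pct, fraction_removed):
    """Die-area savings (percent) from removing part of one component."""
    return component_share_pct * fraction_removed


savings = die_area_savings_pct(CROSSBAR_SHARE_PCT, FRACTION_REMOVED)
print(f"Die area savings: {savings:.1f}%")
```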
Crossbar Sharing: Performance
• Input port sharing: only one core can issue a request in any cycle
• Crossbar width reduction: increases the latency of the wires (halve the wires => 50% of the speed)
• In all cases, sharing ends up better (the link latency penalty outweighs port contention)
Advanced Techniques: ICache
• Previous method: ignores the parallelism of threads
• Very common for 2 threads to be fetching from the same cache index in a cycle (usually one match = many in a row)
• Fetch combining: if two threads are trying to fetch from the same cache index, they can both get data in the same cycle
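Fetch combining can be sketched as a small change to a strict every-other-cycle arbiter. The code below is an illustration under stated assumptions, not the paper's hardware: core 0 is assumed to own even cycles, a request is represented by its cache index (`None` means no fetch), and the function names are hypothetical.

```python
def cores_served(cycle, requests):
    """Return the set of cores whose fetch is served this cycle.

    `requests` is (index_core0, index_core1); None means no fetch.
    ASSUMPTION: core 0 owns even cycles, core 1 owns odd cycles.
    """
    owner = cycle % 2
    other = 1 - owner
    if requests[owner] is None:
        return set()          # strict slicing: an idle owner's slot is wasted
    if requests[other] == requests[owner]:
        return {0, 1}         # fetch combining: same index, both get data
    return {owner}            # otherwise only the slot owner is served


def fetches_completed(trace):
    """Total fetches served over a trace of per-cycle request pairs."""
    return sum(len(cores_served(c, req)) for c, req in enumerate(trace))


# Both threads hit the same index for two cycles, then diverge.
trace = [(5, 5), (6, 6), (7, 9)]
print(fetches_completed(trace))
```

Without combining, this trace would complete one fetch per cycle. With combining, the two matching cycles each serve both threads.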
Advanced Techniques: DCache
• Assertive DCache access: exact same method as ICache
• Basic sharing: DCache ports allocated by cycle (cycle-by-cycle slicing with assertive access)
• I/O partitioning: statically assign load ports to processors
  • Without partitioning: probability of a port being available is never more than 50%
  • With partitioning: probability depends on the total # of loads
  • For FP: utilization of load ports too high