Conjoined-Core Chip Multiprocessing | CISC 879, Study notes of Computer Science

Material Type: Notes; Class: ADVANCED PARALLEL PROGRAMMING; Subject: Computer/Information Sciences; University: University of Delaware; Term: Unknown 2000;

CISC 879 : Software Support for Multicore Architectures
John Tully
Dept of Computer & Information Sciences
University of Delaware
Conjoined-Core Chip Multiprocessing

Overview / Motivation

• High level: make multithreaded processors more efficient by using fewer transistors to get the same amount of compute power.
• As we know, the number of transistors increases according to Moore's law.
• (Fairly modern paper) Performance is increasing at a far slower rate, so multithreading is already a growing field.

In Between...

Two ends of the design spectrum: Single-Chip Multiprocessors (CMP) and Simultaneous Multithreading (SMT)

• Near SMT: large processors that possess private copies of critical resources
• Near CMP: pairs of small processors (a little bit more complex) that share a few common components
• Tons of research to improve efficiency near both ends of the spectrum

Objective: go between the two extremes. What hardware can be shared between cores to cut transistors, but not performance?

Hardware Specifics

• Resources should be large, so that the cost of additional wiring doesn't outweigh the sharing benefits.
• To keep topology issues to a minimum, investigate sharing between pairs of processors ('Conjoined-Core Chip Multiprocessing').
• Components evaluated for sharing:
  • Floating point units
  • Crossbar
  • First-level instruction caches
  • First-level data caches

Baseline Architecture

• Processor similar to Piranha (developed by Compaq in 2000-2001)
  • 8 cores, 64KB L1 cache in each core
  • Private FPU for each core
  • L2 cache shared by all cores
• Processor size: 127.76 sq. mm
  • 64.64 sq. mm: L2 cache (50%)
  • 46.9 sq. mm: Cores (37%)
  • 16.22 sq. mm: Crossbar (13%)
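The area breakdown of the baseline processor can be checked with a few lines of arithmetic. The following is a minimal sketch; the dictionary and function names are illustrative, and only the area figures come from the slide.

```python
# Area figures (sq. mm) are taken from the baseline slide.
# The names below are illustrative, not from the original material.
BASELINE_AREA_MM2 = {
    "L2 cache": 64.64,
    "Cores": 46.9,
    "Crossbar": 16.22,
}


def area_shares(areas):
    """Return each component's fraction of the total die area."""
    total = sum(areas.values())
    return {name: area / total for name, area in areas.items()}


if __name__ == "__main__":
    total = sum(BASELINE_AREA_MM2.values())
    print(f"Total die area: {total:.2f} sq. mm")
    for name, share in area_shares(BASELINE_AREA_MM2).items():
        print(f"{name}: {share:.1%}")
```

The three areas sum to 127.76 sq. mm, and the computed shares (about 50.6%, 36.7% and 12.7%) agree with the rounded percentages on the slide.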

FPU Sharing: Performance

• For many benchmarks, barely any performance hit
  • SPECINT (integer): 0.1% FP instructions
  • CFP (floating point specific): still less than 1/3 of instructions
• Drawback: specialized applications ('WATER') can perform very badly without private non-pipelined units (extra 0.4% to have a private divide)

ICache Sharing: Methods

• Method: shared fetch path from the ICache to the pipelines of 2 cores (like FPU, time-share every other cycle)
• Tested three variations of the shared fetch path:
  • Original fetch width: 4 instructions every other cycle
  • Double fetch width: 8 instructions every other cycle
  • Banked architecture: each thread can access half of the ICache each cycle
    • Perfect case: double fetch width bandwidth
    • Worst case: original fetch width bandwidth
• Core size reduction: 10%
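One rough way to compare the three shared-fetch variations is the average fetch bandwidth seen by each core. The sketch below assumes strict alternation between the two cores. The function names, the `conflict_rate` parameter and the linear conflict model for the banked case are assumptions made for illustration, not part of the slides.

```python
def avg_fetch_bandwidth(width, cycles_between_fetches=2):
    """Average instructions per cycle for a core that may fetch `width`
    instructions once every `cycles_between_fetches` cycles."""
    return width / cycles_between_fetches


def banked_bandwidth(conflict_rate, width=4):
    """Average instructions per cycle for one core with a banked ICache.

    Assumed model: in a conflict-free cycle the core fetches `width`
    instructions; in a cycle where both cores want the same bank, each
    core wins half of the time. `conflict_rate` is the fraction of
    cycles with a bank conflict (hypothetical parameter).
    """
    if not 0.0 <= conflict_rate <= 1.0:
        raise ValueError("conflict_rate must be between 0 and 1")
    return width * ((1.0 - conflict_rate) + 0.5 * conflict_rate)


if __name__ == "__main__":
    print("original fetch width:", avg_fetch_bandwidth(4))
    print("double fetch width:  ", avg_fetch_bandwidth(8))
    for rate in (0.0, 0.5, 1.0):
        print(f"banked, conflict rate {rate}:", banked_bandwidth(rate))
```

Under these assumptions the original width averages 2 instructions per cycle per core and the double width averages 4. The banked case moves between those two values, matching the perfect and worst cases listed above.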

ICache Sharing: Performance

• For both workloads, wider fetch width outperforms the others
  • Without extra latency, no slowdown at all
• Integer workloads are affected more: higher mispredict penalty
• Original structure fetch width causes 'fetch limited' execution

DCache Sharing: Methods

• Not as obvious a candidate for sharing: takes up a huge part of each core, but is very highly utilized
• To share: like ICache, each core can issue a memory instruction every other cycle.
• Lengthened wires => increase in latency
  • Could be hidden in the pipeline, but conservative estimates used
• Best core-area savings (22%)

Crossbar Sharing: Methods

• 1) Reduce number of links
  • Pairs of processors share input ports to the crossbar interconnect of the L2 cache
• 2) Reduce width of links
  • Use fewer wires for any point-to-point link (i.e. 1/2, 1/4, or 1/8)
• For each case: remove 50% of the crossbar = 6.5% die area savings
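The die-area figure for crossbar sharing follows from the baseline area breakdown. The sketch below redoes that arithmetic; the function name is illustrative, and the two area constants are the values given on the baseline architecture slide.

```python
# Values from the baseline architecture slide (sq. mm).
CROSSBAR_MM2 = 16.22
DIE_MM2 = 127.76


def die_savings(component_area, fraction_removed, die_area):
    """Fraction of the whole die saved by removing part of one component."""
    return component_area * fraction_removed / die_area


if __name__ == "__main__":
    saved = die_savings(CROSSBAR_MM2, 0.5, DIE_MM2)
    print(f"Removing 50% of the crossbar saves {saved:.1%} of the die")
```

With the unrounded areas this comes to roughly 6.3% of the die. The 6.5% on the slide is what results from halving the rounded 13% crossbar share.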

Crossbar Sharing: Performance

• Input port sharing: only one core can issue a request in any cycle
• Crossbar width reduction: increases latency of wires (halve the wires => 50% the speed)
• In all cases, input port sharing ends up better than width reduction (link latency > port contention)
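The cost of narrower links can be illustrated by counting how many cycles a fixed-size transfer needs. This is only a sketch: the 512-bit payload and the 256-bit full-width link are hypothetical values chosen for illustration, not figures from the slides.

```python
import math


def transfer_cycles(payload_bits, link_width_bits):
    """Cycles needed to move a payload over a link, one link-width per cycle."""
    return math.ceil(payload_bits / link_width_bits)


if __name__ == "__main__":
    payload = 512       # hypothetical payload size in bits
    full_width = 256    # hypothetical full link width in bits
    for divisor in (1, 2, 4, 8):
        width = full_width // divisor
        print(f"link width 1/{divisor}: {transfer_cycles(payload, width)} cycles")
```

Each halving of the link width doubles the transfer time in this model, which is the "halve the wires => 50% the speed" effect noted above.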

Advanced Techniques: ICache

• Previous method ignores the parallelism of threads
• Very common for 2 threads to be fetching from the same cache index in a cycle (one match is usually followed by many in a row)
• Fetch combining: if two threads are trying to fetch from the same cache index, they can both get data in the same cycle
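Fetch combining can be sketched as a small trace-driven model. It is a simplified illustration under stated assumptions: each thread presents one cache index per cycle, thread A owns even cycles and thread B owns odd cycles, and an unserved request is simply dropped rather than retried. The function name and the traces are hypothetical.

```python
def served_fetches(indices_a, indices_b, combine):
    """Count how many fetch requests each thread has served.

    indices_a, indices_b: cache index requested by thread A and thread B
    in each cycle (hypothetical traces of equal length).
    combine: when True, the thread that does not own the cycle is also
    served if it requests the same cache index as the owner.
    """
    served = [0, 0]
    for cycle, (a, b) in enumerate(zip(indices_a, indices_b)):
        owner = cycle % 2          # 0 = thread A, 1 = thread B
        served[owner] += 1         # the owner is always served
        if combine and a == b:
            served[1 - owner] += 1  # same index: both get data this cycle
    return tuple(served)


if __name__ == "__main__":
    trace_a = [5, 5, 6, 7]
    trace_b = [5, 5, 9, 7]
    print("plain slicing:  ", served_fetches(trace_a, trace_b, combine=False))
    print("fetch combining:", served_fetches(trace_a, trace_b, combine=True))
```

For these traces, plain cycle slicing serves (2, 2) requests while fetch combining serves (4, 3), since the two threads request the same index in three of the four cycles.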

Advanced Techniques: DCache

• Assertive DCache access: exact same method as ICache
• Basic sharing: DCache ports allocated by cycle (cycle-by-cycle slicing with assertive access)
• I/O partitioning: statically assign load ports to processors
• Without partitioning: probability of a port being available is never more than 50%
• With partitioning: probability depends on total # of loads
• For FP: utilization of load ports too high