Multithreading: Motivation, Architectures, and Performance Implications
Spring 2007 CSE 471 - Multithreading 1
Motivation for Multithreaded Architectures
Processors not executing code at their hardware potential
- late 70’s: performance lost to memory latency
- 90’s: performance not in line with the increasingly complex parallel hardware as well
- increase in instruction issue bandwidth
- increase in number of functional units
- out-of-order execution
- techniques for decreasing/hiding branch & memory latencies
- Still, processor utilization was decreasing & instruction throughput not increasing in proportion to the issue width
Motivation for Multithreaded Architectures
Major cause is the lack of instruction-level parallelism in a single executing thread
Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor
Multithreaded Processors
Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling
- holds processor state for more than one thread of execution
- registers
- PC
- each thread’s state is a hardware context
- execute the instruction stream from multiple threads without software context switching
- utilize thread-level parallelism (TLP) to compensate for a lack in ILP
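The effect of holding several hardware contexts can be sketched with a toy model. This is only an illustration under made-up assumptions: the class `HardwareContext`, the function `run`, and the miss interval and latency are hypothetical, and one instruction issues per cycle. Each context keeps its own PC and registers, so moving to another thread saves and restores nothing.

```python
from dataclasses import dataclass, field

@dataclass
class HardwareContext:
    """Per-thread state kept in hardware: a PC and a register set."""
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 32)
    stalled_until: int = 0  # cycle at which a pending memory miss resolves

def run(contexts, cycles, miss_every, miss_latency):
    """Issue one instruction per cycle from the first ready context.

    Returns the fraction of cycles in which an instruction issued.
    """
    busy = 0
    for cycle in range(cycles):
        ready = [c for c in contexts if c.stalled_until <= cycle]
        if not ready:
            continue  # every context is waiting on memory: the cycle is lost
        ctx = ready[0]
        ctx.pc += 1
        busy += 1
        if ctx.pc % miss_every == 0:  # assumed: every Nth instruction misses
            ctx.stalled_until = cycle + 1 + miss_latency
    return busy / cycles

one_thread = run([HardwareContext()], cycles=100, miss_every=5, miss_latency=5)
two_threads = run([HardwareContext(), HardwareContext()],
                  cycles=100, miss_every=5, miss_latency=5)
print(one_thread, two_threads)
```

With these assumed numbers the second context fills exactly the cycles the first one spends waiting on memory, which is the sense in which TLP compensates for a lack of ILP.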
Comparison of Issue Capabilities
Simultaneous Multithreading (SMT)
Third style of multithreading, different concept
- simultaneous multithreading (SMT)
- issues multiple instructions from multiple threads each cycle
- no hardware context switching
- same-cycle multithreading
- huge boost in instruction throughput with less degradation to individual threads
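Same-cycle issue from several threads can be pictured with a small sketch. The function `issued_per_cycle`, the issue width of 4, and the per-thread counts of ready instructions are assumptions chosen for illustration, not figures from the slides.

```python
def issued_per_cycle(ready_per_thread, issue_width):
    """Fill the issue slots of one cycle from the threads in order.

    ready_per_thread[t] is how many independent instructions thread t
    has ready this cycle. Returns how many slots each thread receives.
    """
    slots_left = issue_width
    issued = []
    for ready in ready_per_thread:
        take = min(ready, slots_left)
        issued.append(take)
        slots_left -= take
    return issued

# Hypothetical ready counts for four threads in a single cycle.
ready = [2, 1, 3, 1]
superscalar = issued_per_cycle(ready[:1], issue_width=4)  # one thread only
smt = issued_per_cycle(ready, issue_width=4)              # four contexts share the slots
print(sum(superscalar), sum(smt))
```

In this sketch the single thread leaves half of the issue slots empty, while the SMT case fills all four from different threads in the same cycle.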
Cray (Tera) MTA
Goals
- the appearance of uniform memory access
- lightweight synchronization
- heterogeneous parallelism
Cray (Tera) MTA
Interesting features
- Trade-off between avoiding memory bank conflicts & exploiting spatial locality for data
- conflicts:
- memory distributed among hardware contexts
- memory addresses are randomized to avoid conflicts
- want to fully utilize all memory bandwidth
- locality:
- run-time system can confine consecutive virtual addresses to a single (close-by) memory unit
- used mainly for the stack
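The conflict side of this trade-off can be sketched as follows. The bank count, the stride, and the multiplicative hash in `hashed_bank` are assumptions picked for illustration; they are not the MTA's actual address-scrambling scheme.

```python
NUM_BANKS = 16  # assumed number of memory banks

def interleaved_bank(addr):
    """Plain low-order interleaving: consecutive addresses go to consecutive banks."""
    return addr % NUM_BANKS

def hashed_bank(addr):
    """Randomized mapping: a 32-bit multiplicative hash, keeping the top 4 bits.

    The constant is a common hashing multiplier, used here only as a stand-in.
    """
    return ((addr * 0x9E3779B1) & 0xFFFFFFFF) >> 28

# A strided access pattern whose stride equals the number of banks.
addresses = [i * NUM_BANKS for i in range(64)]
interleaved = [interleaved_bank(a) for a in addresses]
hashed = [hashed_bank(a) for a in addresses]
print(len(set(interleaved)), len(set(hashed)))
```

With plain interleaving every access in this pattern lands in the same bank, while the hashed mapping spreads them over several banks. The cost is that neighbouring addresses no longer sit together, which is why locality has to be restored explicitly for data such as the stack.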
Cray (Tera) MTA
Interesting features
- tagged memory
- indirectly set full/empty bits to prevent data races
- prevents a consumer from loading a value before the producer has written it, and a producer from overwriting a value before the consumer has read it
- example for the consumer:
- set to empty when producer instruction starts executing
- consumer instructions block if they try to read the producer value
- set to full when producer writes value
- consumers can now read a valid value
- explicitly set full/empty bits for thread synchronization
- primarily used for accessing shared data
- lock: read memory location & set to empty
- other readers are blocked
- unlock: write & set to full
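The full/empty protocol above can be modelled in software. This is a sketch only: the class `TaggedWord` and its method names are hypothetical, and a condition variable stands in for what the MTA does in hardware on each memory word.

```python
import threading

class TaggedWord:
    """Software model of one memory word with a full/empty bit."""

    def __init__(self):
        self._value = None
        self._full = False
        self._cond = threading.Condition()

    def write_full(self, value):
        """Wait until the word is empty, store the value, set the bit to full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_empty(self):
        """Wait until the word is full, read the value, set the bit to empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

word = TaggedWord()
producer = threading.Thread(
    target=lambda: [word.write_full(i) for i in range(5)])
producer.start()
received = [word.read_empty() for _ in range(5)]  # blocks while the word is empty
producer.join()
print(received)
```

The same two operations give the lock described above: `read_empty` acts as lock (later readers block on the empty word) and `write_full` acts as unlock.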
Cray (Tera) MTA
Interesting features
- no paging
- want pages pinned down in memory for consistent latency
- page size is 256MB
- forward bit
- memory contents interpreted as a pointer & dereferenced
- used for GC & null reference checking
- user-mode trap handlers
- lighter weight
- used for fatal exceptions, overflow, normalizing floating point numbers
- not used for protection - user might override the run-time system (RT)
- designed for user-written trap handlers, but too complicated for users
Cray (Tera) MTA
Compiler support
- VLIW instructions
- memory/arithmetic/branch
- load/store architecture
- need a good code scheduler
- memory dependence look-ahead
- field in a memory instruction that specifies the number of independent memory ops that follow
- guarantees nonstalling instruction choice
- improves memory parallelism
- handling branches
- special instruction to store a branch target in a register before the branch is executed
- can start prefetching the target code
Performance Implications
Multiprogramming workload
- 2.5X on SPEC95, 4X on SPEC
Parallel programs
- ~1.7X on SPLASH
Commercial databases
- 2-3X on TPC B; 1.5X on TPC D
Web servers & OS
- 4X on Apache and Digital Unix
Does this Processor Sound Familiar?
Technology transfer =>
- 2-context Intel Hyperthreading
- 4-context IBM Power
- 2-context Sun UltraSPARC on a 4-processor CMP
- 4-context Compaq 21464
- network processor & mobile device start-ups
- others in the wings
An SMT Architecture
Three primary goals for this architecture:
- Achieve significant throughput gains with multiple threads
- Minimize the performance impact on a single thread executing alone
- Minimize the microarchitectural impact on a conventional out-of-order superscalar design
Implementing SMT
From Superscalar to SMT
Per-thread hardware
- small stuff
- all part of current out-of-order processors
- none endangers the cycle time
- other per-thread processor state, e.g.,
- program counters
- return stacks
- thread identifiers, e.g., with BTB entries, TLB entries
- per-thread bookkeeping for, e.g.,
- instruction queue flush
- instruction retirement
- trapping
This is why there is only a 15% increase to Alpha 21464 chip area.
Implementing SMT
Thread-shared hardware:
- fetch buffers
- branch prediction structures
- instruction queues
- functional units
- active list
- all caches & TLBs
- store buffers & MSHRs
This is why there is little single-thread performance degradation (~1.5%).
Architecture Research
Concept & potential of Simultaneous Multithreading
Designing the microarchitecture
- straightforward extension of out-of-order superscalars
I-fetch thread chooser
- 40% faster than round-robin
The lockbox for cheap synchronization
- orders of magnitude faster
- can parallelize previously unparallelizable codes
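The slides do not name the policy behind the I-fetch thread chooser. The sketch below assumes a heuristic from the SMT literature that fetches from the thread with the fewest instructions waiting in the front end (often called ICOUNT) and contrasts it with round-robin. The function names and the in-flight counts are hypothetical.

```python
def round_robin_choice(cycle, num_threads):
    """Baseline: threads take turns regardless of how they are progressing."""
    return cycle % num_threads

def fewest_in_flight_choice(in_flight):
    """Assumed policy: fetch from the thread with the fewest instructions
    currently in the decode, rename, and queue stages."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

# Hypothetical per-thread counts of instructions waiting in the front end.
in_flight = [12, 3, 7, 9]
rr_pick = round_robin_choice(cycle=6, num_threads=len(in_flight))
smart_pick = fewest_in_flight_choice(in_flight)
print(rr_pick, smart_pick)
```

The idea behind such a policy is that a thread with few instructions in flight is moving through the pipeline quickly, so giving it fetch bandwidth keeps the instruction queue from filling with instructions from a stalled thread.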
Architecture Research
Software-directed register deallocation
- large register-file performance w. small register file
Mini-threads
- large SMT performance w. small SMTs
SMT instruction speculation
- don’t execute as far down a wrong path
- speculative instructions don’t get as far down the pipeline
- speculation keeps a good thread mix in the IQ
- most important factor for performance
Others are Now Carrying the Ball
Fault detection & recovery
Thread-level speculation
Instruction & data prefetching
Instruction issue hardware design
Thread scheduling & thread priority
Single-thread execution
Profiling executing threads
SMT-CMP hybrids
Power considerations
SMT Collaborators