Multithreading: Motivation, Architectures, and Performance Implications, Slides of Computer Science

Aligarh Muslim University Computer Science

These slides cover the motivation for multithreaded architectures, explaining why processors fail to execute code at their hardware potential. They introduce multithreaded processors as a way to increase instruction throughput and address multiple causes of processor stalling. They describe two traditional styles of multithreading, coarse-grain and fine-grain, and a third style, simultaneous multithreading (SMT), and discuss the performance implications of multithreading for various workloads.

Typology: Slides

2012/2013

Uploaded on 03/22/2013

dhirendra 🇮🇳


Spring 2007 CSE 471 - Multithreading 1

Motivation for Multithreaded Architectures

Processors not executing code at their hardware potential

• late 70’s: performance lost to memory latency
• 90’s: performance not in line with the increasingly complex parallel hardware as well
  - increase in instruction issue bandwidth
  - increase in number of functional units
  - out-of-order execution
  - techniques for decreasing/hiding branch & memory latencies
• Still, processor utilization was decreasing & instruction throughput not increasing in proportion to the issue width


Major cause is the lack of instruction-level parallelism in a single executing thread.

Therefore, the solution has to be more general than building a smarter cache or a more accurate branch predictor.

Multithreaded Processors

Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling

holds processor state for more than one thread of execution
- registers
- PC
- each thread’s state is a hardware context
execute the instruction stream from multiple threads without software context switching
utilize thread-level parallelism (TLP) to compensate for a lack of ILP


Comparison of Issue Capabilities


Simultaneous Multithreading (SMT)

Third style of multithreading, different concept

simultaneous multithreading (SMT)
- issues multiple instructions from multiple threads each cycle
- no hardware context switching
- same-cycle multithreading
- huge boost in instruction throughput with less degradation to individual threads
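
The effect of issuing from several threads in the same cycle can be sketched with a toy issue-slot model. This is an illustration only: the function names, the 4-wide issue width, and the per-cycle counts of ready instructions are assumptions made for the example, not figures from the slides.

```python
def issue_single_thread(ready_per_cycle, width):
    """Issue slots used each cycle when only one thread may issue."""
    return [min(width, ready) for ready in ready_per_cycle]


def issue_smt(ready_by_thread, width):
    """Issue slots used each cycle when any thread may fill any free slot."""
    used = []
    for ready_now in zip(*ready_by_thread):
        free = width
        for ready in ready_now:      # fixed thread order, chosen for simplicity
            free -= min(free, ready)
        used.append(width - free)
    return used


def utilization(used_per_cycle, width):
    """Fraction of all issue slots that were filled."""
    return sum(used_per_cycle) / (width * len(used_per_cycle))


WIDTH = 4                            # assumed issue width
thread_a = [1, 0, 2, 1, 0, 3]        # assumed ready instructions per cycle
thread_b = [2, 1, 0, 2, 1, 0]

alone = issue_single_thread(thread_a, WIDTH)
together = issue_smt([thread_a, thread_b], WIDTH)
print(utilization(alone, WIDTH), utilization(together, WIDTH))
```

In this model a cycle in which one thread has nothing ready can still be filled by the other thread, which is how TLP stands in for missing ILP.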


Cray (Tera) MTA

Goals

the appearance of uniform memory access
lightweight synchronization
heterogeneous parallelism


Cray (Tera) MTA

Interesting features

Trade-off between avoiding memory bank conflicts & exploiting spatial locality for data
conflicts:
- memory distributed among hardware contexts
- memory addresses are randomized to avoid conflicts
  - want to fully utilize all memory bandwidth
locality:
- run-time system can confine consecutive virtual addresses to a single (close-by) memory unit
  - used mainly for the stack
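
The conflict side of this trade-off can be sketched by comparing a plain interleaved bank mapping with a scrambled one. The sketch is hypothetical: the bank count, the multiplicative hash, and the function names are assumptions for illustration and are not the MTA's actual address-randomization scheme.

```python
N_BANKS = 64                 # assumed number of memory banks (a power of two)
BANK_BITS = 6                # log2(N_BANKS)
HASH_MULTIPLIER = 2654435761 # odd 32-bit constant, chosen for illustration


def interleaved_bank(addr):
    """Low-order interleaving: consecutive addresses go to consecutive banks."""
    return addr % N_BANKS


def scrambled_bank(addr):
    """Hash the address and use the top bits of the 32-bit product as the bank."""
    return ((addr * HASH_MULTIPLIER) & 0xFFFFFFFF) >> (32 - BANK_BITS)


def banks_touched(mapping, addrs):
    """Number of distinct banks a sequence of addresses lands on."""
    return len({mapping(a) for a in addrs})


# A stride equal to the bank count is the worst case for plain interleaving:
# every access falls on the same bank.
strided = range(0, N_BANKS * N_BANKS, N_BANKS)
print(banks_touched(interleaved_bank, strided))
print(banks_touched(scrambled_bank, strided))
```

Scrambling spreads the strided accesses over many banks, but it also scatters neighbouring addresses, which is why the locality case above needs separate support.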

Cray (Tera) MTA

Interesting features

tagged memory
- indirectly set full/empty bits to prevent data races
  - prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it
  - example for the consumer:
    - set to empty when producer instruction starts executing
    - consumer instructions block if they try to read the producer's value
    - set to full when producer writes value
    - consumers can now read a valid value
- explicitly set full/empty bits for thread synchronization
  - primarily used when accessing shared data
    - lock: read memory location & set to empty
    - other readers are blocked
    - unlock: write & set to full
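
The lock/unlock use of full/empty bits can be modelled in software. The following is a minimal sketch, assuming a condition variable stands in for the hardware blocking; the class and method names are hypothetical and do not correspond to MTA instructions.

```python
import threading


class FullEmptyWord:
    """Software model of one memory word tagged with a full/empty bit."""

    def __init__(self, value=0, full=True):
        self._value = value
        self._full = full
        self._cond = threading.Condition()

    def read_and_empty(self):
        """Block until the word is full, then return it and mark it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            return self._value

    def write_and_fill(self, value):
        """Store a value, mark the word full, and wake blocked readers."""
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()


def parallel_increment(n_threads=4, n_iters=1000):
    """Use the word itself as the lock around a shared counter."""
    counter = FullEmptyWord(0, full=True)

    def worker():
        for _ in range(n_iters):
            value = counter.read_and_empty()   # lock: read & set to empty
            counter.write_and_fill(value + 1)  # unlock: write & set to full

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read_and_empty()
```

Because a reader empties the word as it reads, every other reader blocks until the holder writes the updated value back, so no increment is lost.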


Cray (Tera) MTA

Interesting features

no paging
- want pages pinned down in memory for consistent latency
- page size is 256MB
forward bit
- memory contents interpreted as a pointer & dereferenced
- used for GC & null reference checking
user-mode trap handlers
- lighter weight
- used for fatal exceptions, overflow, normalizing floating point numbers
- not used for protection - user might override the RT
- designed for user-written trap handlers, but too complicated for users

Cray (Tera) MTA

Compiler support

VLIW instructions
- memory/arithmetic/branch
- load/store architecture
- need a good code scheduler
memory dependence look-ahead
- field in a memory instruction that specifies the number of independent memory ops that follow
- guarantees nonstalling instruction choice
- improves memory parallelism
handling branches
- special instruction to store a branch target in a register before the branch is executed
- can start prefetching the target code


Performance Implications

Multiprogramming workload
- 2.5X on SPEC95, 4X on SPEC
Parallel programs
- ~1.7X on SPLASH
Commercial databases
- 2-3X on TPC B; 1.5X on TPC D
Web servers & OS
- 4X on Apache and Digital Unix

Does this Processor Sound Familiar?

Technology transfer =>

2-context Intel Hyperthreading
4-context IBM Power
2-context Sun UltraSPARC on a 4-processor CMP
4-context Compaq 21464
network processor & mobile device start-ups
others in the wings


An SMT Architecture

Three primary goals for this architecture:

Achieve significant throughput gains with multiple threads
Minimize the performance impact on a single thread executing alone
Minimize the microarchitectural impact on a conventional out-of-order superscalar design


From Superscalar to SMT

Per-thread hardware

small stuff
all part of current out-of-order processors
none endangers the cycle time
other per-thread processor state, e.g.,
- program counters
- return stacks
- thread identifiers, e.g., with BTB entries, TLB entries
per-thread bookkeeping for, e.g.,
- instruction queue flush
- instruction retirement
- trapping

This is why there is only a 15% increase to Alpha 21464 chip area.

Implementing SMT

Thread-shared hardware:

fetch buffers
branch prediction structures
instruction queues
functional units
active list
all caches & TLBs
store buffers & MSHRs

This is why there is little single-thread performance degradation (~1.5%).
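
The split between small per-thread state and large shared structures can be sketched as a data structure. This is a toy model under stated assumptions: the class names, the 4-byte instruction size, and the tuple format of queue entries are all hypothetical, and only a few of the listed structures are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """Per-thread state: small, and replicated once per hardware context."""
    thread_id: int
    pc: int = 0
    return_stack: list = field(default_factory=list)


@dataclass
class SMTCore:
    """Shared state: one instruction queue used by every context."""
    contexts: list
    instruction_queue: list = field(default_factory=list)

    def fetch(self, thread_id, count):
        """Fetch `count` instructions for one thread into the shared queue."""
        ctx = self.contexts[thread_id]
        for _ in range(count):
            # Entries carry a thread identifier so they can be told apart.
            self.instruction_queue.append((ctx.thread_id, ctx.pc))
            ctx.pc += 4              # assumed fixed 4-byte instructions

    def flush_thread(self, thread_id):
        """Per-thread bookkeeping: drop only one thread's queued entries."""
        self.instruction_queue = [
            entry for entry in self.instruction_queue if entry[0] != thread_id
        ]


core = SMTCore(contexts=[HardwareContext(0), HardwareContext(1)])
core.fetch(0, 2)
core.fetch(1, 1)
core.flush_thread(0)                 # thread 1's entry stays in the queue
```

Tagging shared entries with a thread identifier is what lets a flush for one thread leave the other threads' instructions untouched.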


Architecture Research

Concept & potential of Simultaneous Multithreading

Designing the microarchitecture
- straightforward extension of out-of-order superscalars
I-fetch thread chooser
- 40% faster than round-robin
The lockbox for cheap synchronization
- orders of magnitude faster
- can parallelize previously unparallelizable codes
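
A fetch thread chooser can be contrasted with round-robin in a few lines. The slides do not name the policy; the sketch below assumes an ICOUNT-style heuristic from the SMT literature, which favours the thread with the fewest instructions in the pre-issue pipeline stages. Function names and the example counts are hypothetical.

```python
def choose_round_robin(cycle, n_threads):
    """Baseline: rotate through the threads regardless of their state."""
    return cycle % n_threads


def choose_icount(in_flight):
    """Pick the thread with the fewest not-yet-issued instructions.

    `in_flight[t]` is the number of instructions thread `t` currently holds
    in the decode, rename, and instruction-queue stages. Ties go to the
    lowest thread number.
    """
    return min(range(len(in_flight)), key=lambda t: (in_flight[t], t))


# Thread 1 is draining quickly, so it gets the next fetch slot.
next_thread = choose_icount([5, 2, 7])
```

The design intent is that a thread clogging the queue with stalled instructions is fetched less often, which keeps the shared queue filled with issuable work.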

Architecture Research

Software-directed register deallocation
- large register-file performance with a small register file
Mini-threads
- large SMT performance with small SMTs
SMT instruction speculation
- don’t execute as far down a wrong path
- speculative instructions don’t get as far down the pipeline
- speculation keeps a good thread mix in the IQ
  - most important factor for performance


Others are Now Carrying the Ball

Fault detection & recovery
Thread-level speculation
Instruction & data prefetching
Instruction issue hardware design
Thread scheduling & thread priority
Single-thread execution
Profiling executing threads
SMT-CMP hybrids
Power considerations
