Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Multithreading and Out-of-Order Execution: A Deep Dive into Concurrent Processing, Slides of Computer Architecture and Organization

Alagappa University Computer Architecture and Organization

The concepts of multithreading and out-of-order execution, two fundamental techniques in computer architecture used to improve processor efficiency. Topics covered include thread scheduling policies, pipeline hazards, multithreading design choices, and various hardware implementations such as the cdc 6600 peripheral processors, simple multithreaded pipeline, denelcor hep, and tera mta. The document also discusses the benefits and challenges of coarse-grain multithreading and simultaneous multithreading.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

jutt 🇮🇳

4.5

(154)

75 documents

1 / 35

This page cannot be seen from the preview

Don't miss anything!

Multithreading

Docsity.com

Discover Slides of Computer Architecture and Organization Alagappa University

Partial preview of the text

Download Multithreading and Out-of-Order Execution: A Deep Dive into Concurrent Processing and more Slides Computer Architecture and Organization in PDF only on Docsity!

Multithreading

Outline

Multithreading
Thread scheduling policies
Grain multithreading
Design choices
Multi threading architecture
Out-of-order superscalar processor
Simultaneous multithreading
From superscalar to SMT
SMT design issues

Multithreading

How can we guarantee no dependencies between instructions in a pipeline? - One way is to interleave execution of instructions from different program threads on same pipeline Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

T1: LW r1, 0(r2) T2: ADD r7, r1, r T3: XORI r5, r4, # T4: SW 0(r7), r T1: LW r5, 12(r1)

CDC 6600 Peripheral Processors (Cray,

First multithreaded hardware
10 “virtual” I/O processors
fixed interleave on simple pipeline
pipeline has 100ns cycle time
each processor executes one

instruction every 1000ns

accumulator-based instruction set to

reduce processor state

Multithreading Costs

Appears to software (including OS) as multiple slower CPUs
Each thread requires its own user state
- GPRs
- PC
Also, needs own OS control state
- virtual memory page table base register
- exception handling registers
Other costs?

Thread Scheduling Policies

Fixed interleave (CDC 6600 PPUs, 1965)
- each of N threads executes one instruction every N cycles
- if thread not ready to go in its slot, insert pipeline bubble
Software-controlled interleave (TI ASC PPUs, 1971)
- OS allocates S pipeline slots amongst N threads
- hardware performs fixed interleave over S slots, executing whichever thread is in that slot
Hardware-controlled thread scheduling (HEP, 1982)
- hardware keeps track of which threads are ready to go
- picks next thread to execute based on hardware priority scheme

Multithreading Design Choices

Context switch to another thread every cycle, or on hazard or L1 miss or L2 miss or network request
Per-thread state and context-switch overhead
Interactions between threads in memory hierarchy

Denelcor HEP

(Burton Smith, 1982)

First commercial machine to use

hardware threading in main CPU

120 threads per processor
10 MHz clock rate
Up to 8 processors
precursor to Tera MTA (Multithreaded

Architecture)

MTA Instruction Format

Three operations packed into 64-bit instruction word (short VLIW)
One memory operation, one arithmetic operation, plus one arithmetic or branch operation
Memory operations incur ~150 cycles of latency
Explicit 3-bit “lookahead” field in instruction gives number of subsequent instructions (0-7) that are independent of this one - c.f. Instruction grouping in VLIW - allows fewer threads to fill machine pipeline - used for variable- sized branch delay slots
Thread creation and termination instructions

MTA Multithreading

Each processor supports 128 active hardware threads - 128 SSWs, 1024 target registers, 4096 general-purpose registers
Every cycle, one instruction from one active thread is launched into pipeline
Instruction pipeline is 21 cycles long
At best, a single thread can issue one instruction every 21 cycles - Clock rate is 260MHz, effective single thread issue rate is 260/21 = 12.4MHz

Coarse-Grain Multithreading

Tera MTA designed for supercomputing applications with large data sets and low locality - No data cache - Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
- Few pipeline bubbles when cache getting hits
- Just add a few threads to hide occasional cache miss latencies
- Swap threads on cache misses

MIT Alewife

• Modified SPARC chips

register windows hold different thread

contexts

• Up to four threads per node

• Thread switch on local cache miss

Speculative, Out-of-Order

Superscalar Processor

Out-Of-Order Execution

Why Out-Of-Order Execution?
In-order Processors Stalls

Pipeline may not be full because of the frequent stalls

Example:

Allow In the Out-Of-Order Processors

**- No dependency Move the Instruction for execution

Means the Instructions that are Ready**

Multithreading and Out-of-Order Execution: A Deep Dive into Concurrent Processing, Slides of Computer Architecture and Organization

Related documents

Partial preview of the text

Download Multithreading and Out-of-Order Execution: A Deep Dive into Concurrent Processing and more Slides Computer Architecture and Organization in PDF only on Docsity!

Multithreading

Outline

Multithreading

CDC 6600 Peripheral Processors (Cray,

instruction every 1000ns

reduce processor state

Multithreading Costs

Thread Scheduling Policies

Multithreading Design Choices

Denelcor HEP

hardware threading in main CPU

Architecture)

MTA Instruction Format

MTA Multithreading

Coarse-Grain Multithreading

MIT Alewife

• Modified SPARC chips

contexts

• Up to four threads per node

• Thread switch on local cache miss

Speculative, Out-of-Order

Superscalar Processor

Out-Of-Order Execution