Scope of Parallelism-Parallel Processing-Lecture Slides, Slides of Parallel Computing and Programming

Prof. Bhairav Gupta delivered this lecture at Ankit Institute of Technology and Science for the Parallel Processing course. It includes: Parallelism, Processor, Memory, System, Performance, Latency, Prefetch, Bandwidth, Datapath, Bottlenecks

Typology: Slides

2011/2012

Uploaded on 07/23/2012

1.2 Scope of Parallelism

  • Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
  • Each of these components presents significant performance bottlenecks.
  • Parallelism addresses each of these components in significant ways.
  • It is important to understand each of these performance bottlenecks.
docsity.com
Implicit Parallelism: Trends in Microprocessor Architectures

  • Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
  • Higher levels of device integration have made available a large number of transistors.
  • The question of how best to utilize these resources is an important one.
  • Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
  • The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.

SuperPipeline

  • Pipelining, however, has several limitations.
  • The speed of a pipeline is ultimately limited by the number of stages and the time taken by the slowest stage.
  • For this reason, conventional processors rely on very deep pipelines, or super-pipelines (20-stage pipelines in state-of-the-art Pentium processors).
  • However, a typical pipeline suffers from resource constraints, data dependencies, and branch-prediction issues. Approximately every fifth or sixth instruction is a conditional jump, which requires very accurate branch prediction.
  • The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions have to be flushed.
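
The interaction between pipeline depth and prediction accuracy can be made concrete with a back-of-the-envelope model. The sketch below is not from the lecture: the function name `effective_cpi` is hypothetical, and it assumes that a mispredicted branch is resolved in the last stage, so that `depth - 1` younger instructions are flushed. The branch fraction of 0.2 corresponds to "every fifth instruction"; the accuracy values are illustrative only.

```python
def effective_cpi(depth, branch_fraction, mispredict_rate, base_cpi=1.0):
    """Average cycles per instruction for a simple in-order pipeline.

    Assumption (not from the slides): a mispredicted branch is resolved in
    the final stage, so depth - 1 younger instructions must be flushed.
    """
    flush_penalty = depth - 1
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


if __name__ == "__main__":
    # Roughly one instruction in five is a conditional jump.
    for depth in (5, 10, 20):
        for accuracy in (0.90, 0.99):
            cpi = effective_cpi(depth, branch_fraction=0.2,
                                mispredict_rate=1 - accuracy)
            print(f"depth={depth:2d} accuracy={accuracy:.2f} CPI={cpi:.3f}")
```

Under these assumptions, the same predictor costs noticeably more on a 20-stage pipeline than on a 5-stage one, which is why deeper pipelines demand more accurate prediction.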

Superscalar

  • One simple way of alleviating these bottlenecks is to use multiple pipelines and issue multiple independent instructions simultaneously.
  • Examples: MIPS1000, PowerPC, and Pentium.
  • The question then becomes one of selecting these instructions.

Superscalar Execution: Example (cont'd)

  • In the above example, some resources are wasted because of data dependencies.
  • The example also illustrates that different instruction mixes with identical semantics can take significantly different execution times.
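
The effect of the instruction mix can be reproduced with a toy model. The following sketch is an illustration under stated assumptions, not the example from the slides: it assumes a two-way issue machine, unit latency for every instruction, and a greedy scheduler that may pick any ready instruction from the whole program. The instruction names and the `schedule` function are hypothetical. Both programs compute a + b + c + d.

```python
def schedule(instrs, width=2):
    """Greedy scheduling: each cycle, issue up to `width` instructions whose
    producers all issued in an earlier cycle (unit latency assumed)."""
    done = set()             # names of instructions issued in earlier cycles
    pending = list(instrs)   # (name, [names it depends on]) in program order
    cycles = []
    while pending:
        issued = []
        for name, deps in pending:
            if len(issued) == width:
                break
            if all(d in done for d in deps):
                issued.append(name)
        if not issued:
            raise ValueError("unsatisfiable dependency in instruction list")
        done.update(issued)
        pending = [(n, d) for n, d in pending if n not in done]
        cycles.append(issued)
    return cycles


loads = [("ld_a", []), ("ld_b", []), ("ld_c", []), ("ld_d", [])]

# (a + b) + (c + d): the two inner additions are independent.
balanced = loads + [("t1", ["ld_a", "ld_b"]),
                    ("t2", ["ld_c", "ld_d"]),
                    ("s", ["t1", "t2"])]

# ((a + b) + c) + d: every addition depends on the previous one.
chained = loads + [("t1", ["ld_a", "ld_b"]),
                   ("t2", ["t1", "ld_c"]),
                   ("s", ["t2", "ld_d"])]

if __name__ == "__main__":
    for label, prog in (("balanced", balanced), ("chained", chained)):
        print(label, len(schedule(prog)), "cycles")
```

In this model the balanced form finishes one cycle earlier than the chained form, even though both produce the same sum, because the chained additions leave one issue slot idle in each of the last three cycles.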

Superscalar Execution

  • Scheduling of instructions is determined by a number of factors:
      – True Data Dependency: The result of one operation is an input to the next.
      – Resource Dependency: Two operations require the same resource.
      – Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
  • The scheduler, a piece of hardware, looks at a number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
  • The complexity of this hardware is an important constraint on superscalar processors.

Superscalar Execution: Efficiency Considerations

  • Not all functional units can be kept busy at all times.
  • If, during a cycle, no functional units are utilized, this is referred to as vertical waste.
  • If, during a cycle, only some of the functional units are utilized, this is referred to as horizontal waste.
  • Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
  • Conventional microprocessors typically support four-way superscalar execution.
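
Vertical and horizontal waste can be counted directly from a per-cycle record of how many functional units were busy. The sketch below is illustrative: the function name `waste` and the sample trace are hypothetical, and measuring waste in idle issue slots (rather than in cycles) is an assumption made here for convenience.

```python
def waste(used_per_cycle, width):
    """Return (vertical, horizontal, utilization) for an issue trace.

    used_per_cycle[i] is the number of functional units busy in cycle i.
    Waste is counted in idle issue slots (an assumption of this sketch).
    """
    if not used_per_cycle:
        raise ValueError("trace must contain at least one cycle")
    # Vertical waste: cycles in which no unit did useful work.
    vertical = sum(width for used in used_per_cycle if used == 0)
    # Horizontal waste: idle slots in cycles that were only partly used.
    horizontal = sum(width - used for used in used_per_cycle
                     if 0 < used < width)
    utilization = sum(used_per_cycle) / (width * len(used_per_cycle))
    return vertical, horizontal, utilization


if __name__ == "__main__":
    trace = [2, 2, 1, 0, 1]   # hypothetical two-way machine, five cycles
    v, h, u = waste(trace, width=2)
    print(f"vertical={v} horizontal={h} utilization={u:.0%}")
```

For this made-up trace, four of the ten issue slots are lost: two to a fully idle cycle and two to partly used cycles.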

Very Long Instruction Word (VLIW) Processors

  • The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
  • To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
  • These instructions are packed and dispatched together, hence the name very long instruction word.
  • This concept was used with some commercial success in the Multiflow Trace machine (circa).
  • Variants of this concept are employed in the Intel IA64 processors and TI TMS320 C6XXX DSPs.
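
The compile-time bundling step can be sketched as a small packing pass. This is a simplified illustration, not how any particular VLIW compiler works: the function `pack_bundles` is hypothetical, and it assumes unit latency, registers that are written only once (so only true data dependencies matter), and functional units that can execute any operation. Empty slots are padded with `nop`.

```python
def pack_bundles(instrs, width=4):
    """Pack (op, dest, sources) instructions into fixed-width bundles.

    Assumptions of this sketch: unit latency, each register is written
    once, and any slot can hold any operation.
    """
    bundle_of = {}   # destination register -> index of the bundle writing it
    bundles = []
    for op, dest, srcs in instrs:
        # Earliest legal bundle: one past the latest producer of a source.
        # Sources never written in this program are treated as live-in.
        earliest = max((bundle_of[s] + 1 for s in srcs if s in bundle_of),
                       default=0)
        idx = earliest
        while idx < len(bundles) and len(bundles[idx]) == width:
            idx += 1             # bundle full, try the next one
        if idx == len(bundles):
            bundles.append([])
        bundles[idx].append(f"{op} {dest}")
        bundle_of[dest] = idx
    # Pad every bundle to the full instruction-word width.
    return [b + ["nop"] * (width - len(b)) for b in bundles]


program = [
    ("load", "r1", ["a"]), ("load", "r2", ["b"]),
    ("load", "r3", ["c"]), ("load", "r4", ["d"]),
    ("add", "r5", ["r1", "r2"]), ("add", "r6", ["r3", "r4"]),
    ("add", "r7", ["r5", "r6"]),
]

if __name__ == "__main__":
    for i, bundle in enumerate(pack_bundles(program, width=4)):
        print(i, " | ".join(bundle))
```

The padding makes the cost visible: when the compiler cannot find enough independent instructions, the unused slots of the long instruction word are filled with no-ops.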

Very Long Instruction Word (VLIW) Processors: Considerations

  • Issue hardware is simpler.
  • The compiler has a bigger context from which to select co-scheduled instructions.
  • Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.
  • Branch and memory prediction is more difficult.
  • VLIW performance is highly dependent on the compiler. A number of techniques, such as loop unrolling, speculative execution, and branch prediction, are critical.
  • Typical VLIW processors are limited to 4-way to 8-way parallelism.
