







Prof. Bhairav Gupta delivered this lecture at Ankit Institute of Technology and Science for the Parallel Processing course. It includes: Parallelism, Processor, Memory, System, Performance, Latency, Prefetch, Bandwidth, Datapath, Bottlenecks
Typology: Slides








docsity.com
Implicit Parallelism: Trends in Microprocessor Architectures
- Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
- Higher levels of device integration have made available a large number of transistors.
- The question of how best to utilize these resources is an important one.
- Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
- The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
SuperPipeline
- Pipelining, however, has several limitations.
- The speed of a pipeline is ultimately limited by the number of stages and the time taken by the slowest stage.
- For this reason, conventional processors rely on very deep pipelines, or superpipelines (20-stage pipelines in state-of-the-art Pentium processors).
- However, a typical pipeline faces resource constraints, data dependencies, and branch-prediction issues. Roughly every fifth or sixth instruction is a conditional jump, which requires very accurate branch prediction.
- The penalty of a prediction error grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.
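The last two points can be made concrete with a rough back-of-the-envelope model. The sketch below is only illustrative: the function name `effective_cpi`, the 10% misprediction rate, and the assumption that one misprediction costs `depth - 1` flushed cycles are not from the lecture. Only the branch frequency (about one instruction in five) follows the slide.

```python
def effective_cpi(depth, branch_fraction, mispredict_rate, base_cpi=1.0):
    """Average cycles per instruction under a simple flush model.

    Assumption (not from the lecture): every mispredicted branch
    wastes `depth - 1` cycles because the pipeline must be refilled.
    """
    flush_penalty = depth - 1
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


if __name__ == "__main__":
    # One instruction in five is a conditional jump (branch_fraction = 0.2).
    for depth in (5, 10, 20):
        cpi = effective_cpi(depth, branch_fraction=0.2, mispredict_rate=0.1)
        print(f"depth={depth:2d}  effective CPI={cpi:.2f}")
```

Under these assumptions the same misprediction rate costs far more on the 20-stage pipeline than on the 5-stage one, which is why deeper pipelines demand more accurate predictors.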
Superscalar Execution
Example (cont'd) [figure not recovered]
- Scheduling of instructions is determined by a number of factors:
  - True Data Dependency: The result of one operation is an input to the next.
  - Resource Dependency: Two operations require the same resource.
  - Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
- The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
- The complexity of this hardware is an important constraint on superscalar processors.
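A toy model can show how true data dependencies and resource dependencies shape what issues in each cycle. This is a minimal sketch, not how any real scheduler is built: the `Instr` tuple, the `schedule` function, the unit names `mem` and `alu`, single-cycle latencies, and the absence of branches are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, source registers,
# and the kind of functional unit the instruction needs.
Instr = namedtuple("Instr", "name dest srcs unit")


def schedule(program, width=2, units=None):
    """Greedy issue from an instruction queue, one list of names per cycle.

    Assumptions: every instruction completes in one cycle, there are no
    branches, and registers are renamed so only true dependencies matter.
    """
    units = units or {"mem": 1, "alu": 1}
    # producers[i]: earlier instructions whose results instruction i reads.
    last_writer, producers = {}, []
    for i, ins in enumerate(program):
        producers.append({last_writer[r] for r in ins.srcs if r in last_writer})
        if ins.dest is not None:
            last_writer[ins.dest] = i

    issue_cycle, cycles = {}, []
    while len(issue_cycle) < len(program):
        free = dict(units)          # functional units available this cycle
        issued = []
        for i, ins in enumerate(program):
            if i in issue_cycle or len(issued) == width:
                continue
            # True data dependency: all producers issued in an earlier cycle.
            data_ready = all(p in issue_cycle for p in producers[i])
            # Resource dependency: a suitable functional unit must be free.
            if data_ready and free.get(ins.unit, 0) > 0:
                free[ins.unit] -= 1
                issued.append(i)
        if not issued:
            raise ValueError("no instruction can issue; check unit names")
        for i in issued:
            issue_cycle[i] = len(cycles)
        cycles.append([program[i].name for i in issued])
    return cycles


program = [
    Instr("load r1", "r1", (), "mem"),
    Instr("load r2", "r2", (), "mem"),
    Instr("add r3", "r3", ("r1", "r2"), "alu"),
    Instr("add r4", "r4", ("r1",), "alu"),
    Instr("store r3", None, ("r3",), "mem"),
]

if __name__ == "__main__":
    for cycle, names in enumerate(schedule(program)):
        print(cycle, names)
```

With one memory unit, the two loads cannot issue together (a resource dependency), and `add r3` must wait for both loads (a true data dependency), so five instructions take four cycles even on a two-way machine.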
Superscalar Execution: Efficiency Considerations
- Not all functional units can be kept busy at all times.
- If during a cycle, no functional units are utilized, this is referred to as vertical waste.
- If during a cycle, only some of the functional units are utilized, this is referred to as horizontal waste.
- Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
- Conventional microprocessors typically support four-way superscalar execution.
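The two kinds of waste can be tallied from a per-cycle count of busy functional units. The sketch below assumes waste is measured in unused issue slots. The function name `issue_waste` and the sample trace are made up for illustration.

```python
def issue_waste(slots_used_per_cycle, width):
    """Return (vertical_waste, horizontal_waste, utilization).

    Assumption: waste is counted in issue slots. A completely idle cycle
    contributes `width` slots of vertical waste; a partly used cycle
    contributes its unused slots as horizontal waste.
    """
    vertical = sum(width for used in slots_used_per_cycle if used == 0)
    horizontal = sum(width - used for used in slots_used_per_cycle
                     if 0 < used < width)
    total = width * len(slots_used_per_cycle)
    utilization = sum(slots_used_per_cycle) / total if total else 0.0
    return vertical, horizontal, utilization


if __name__ == "__main__":
    # Hypothetical trace for a four-way superscalar processor.
    trace = [4, 2, 0, 1, 3, 0, 4, 2]
    print(issue_waste(trace, width=4))
```

For this made-up trace, half of the 32 available issue slots go unused, split evenly between vertical and horizontal waste.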
Very Long Instruction Word (VLIW) Processors
- The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design.
- To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
- These instructions are packed and dispatched together, and thus the name very long instruction word.
- This concept was used with some commercial success in the Multiflow Trace machine (circa).
- Variants of this concept are employed in the Intel IA64 processors and TI TMS320 C6XXX DSPs.
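Compile-time bundling can be sketched as a small packing routine. This is an illustrative toy, not the algorithm of any particular VLIW compiler: the function name `pack_bundles`, the `(name, dest, srcs)` operation format, and the assumption that each register is written only once are all hypothetical.

```python
def pack_bundles(ops, width=4):
    """Pack operations into fixed-width bundles, padding with 'nop'.

    Each op is (name, dest, srcs). Assumptions: every register is written
    at most once, and a result is usable in the bundle after the one that
    produces it.
    """
    written_in = {}   # register -> index of the bundle that writes it
    bundles = []
    for name, dest, srcs in ops:
        # Earliest legal bundle: one past the latest producer of any source.
        slot = max((written_in[r] + 1 for r in srcs if r in written_in),
                   default=0)
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        if dest is not None:
            written_in[dest] = slot
    # Unused slots must still be encoded, so fill them with no-ops.
    return [b + ["nop"] * (width - len(b)) for b in bundles]


ops = [
    ("load a", "a", ()),
    ("load b", "b", ()),
    ("mul c", "c", ("a", "b")),
    ("load d", "d", ()),
    ("add e", "e", ("c", "d")),
    ("store e", None, ("e",)),
]

if __name__ == "__main__":
    for bundle in pack_bundles(ops, width=2):
        print(bundle)
```

The `nop` padding in the last two bundles is the static counterpart of the idle issue slots a superscalar scheduler would leave at run time.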
Very Long Instruction Word (VLIW) Processors: Considerations
- Issue hardware is simpler.
- The compiler has a bigger context from which to select co-scheduled instructions.
- Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative.
- Branch and memory prediction is more difficult.
- VLIW performance is highly dependent on the compiler. A number of techniques, such as loop unrolling, speculative execution, and branch prediction, are critical.
- Typical VLIW processors are limited to 4-way to 8-way parallelism.
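Loop unrolling, one of the compiler techniques named above, can be illustrated at the source level. The sketch below shows only the shape of the transformation. Python gains no speed from it, and the function names `dot_rolled` and `dot_unrolled4` and the unroll factor of four are choices made for this example.

```python
def dot_rolled(a, b):
    """Straightforward loop: each iteration depends on the previous total."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same dot product, unrolled by four with independent accumulators.

    The four updates in the loop body do not depend on one another, so a
    VLIW compiler could place them in the same instruction bundle.
    """
    n = len(a)
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    # Remainder loop for lengths that are not a multiple of four.
    while i < n:
        total += a[i] * b[i]
        i += 1
    return total


if __name__ == "__main__":
    a = list(range(10))
    b = list(range(10, 20))
    print(dot_rolled(a, b), dot_unrolled4(a, b))
```

The unrolled body exposes four independent multiply-add chains per iteration, which matches the 4-way parallelism of a typical VLIW processor.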