Fundamental Concepts - Parallel Processing - Lecture Slides, Slides of Parallel Computing and Programming

Some concepts of parallel processing are Anatomy, Cache Access Time, Instruction Formats, Multidimensional Meshes, Network Processors, and Snooping Protocol. The main points of this lecture are: Fundamental Concepts, Taxonomy, Tools for Evaluation, Comparison, Theory, Delineate, Hard Problems, Introduction to Parallelism, Taste of Parallel Algorithms, Parallel Algorithm Complexity.

Typology: Slides

2012/2013

Uploaded on 04/30/2013

devank

Part I

Fundamental Concepts

I Fundamental Concepts

Provide motivation, paint the big picture, introduce the 3 Ts:

  • Taxonomy (basic terminology and models)
  • Tools for evaluation or comparison
  • Theory to delineate easy and hard problems

Topics in This Part

Chapter 1 Introduction to Parallelism

Chapter 2 A Taste of Parallel Algorithms

Chapter 3 Parallel Algorithm Complexity

Chapter 4 Models of Parallel Processing

1.1 Why Parallel Processing?

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore’s Law, shown over the past two decades (extrapolated).

[Figure: processor performance (axis ticks KIPS, MIPS, GIPS, TIPS) versus calendar year (1980-2020). Labeled points: 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II R. Growth rate: ×1.6 / yr. Two projections are shown: circa 1998 and circa 2012.]
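The ×1.6 / yr growth rate marked on Fig. 1.1 can be turned into rough time scales. The following is a minimal sketch, assuming steady compound growth at that rate; the function names are illustrative and do not come from the slides.

```python
import math

# Annual performance growth factor read off Fig. 1.1 (x1.6 per year).
GROWTH_PER_YEAR = 1.6


def years_to_gain(factor, growth=GROWTH_PER_YEAR):
    """Years of compound growth needed to improve performance by `factor`."""
    return math.log(factor) / math.log(growth)


def gain_after(years, growth=GROWTH_PER_YEAR):
    """Performance improvement factor after `years` of compound growth."""
    return growth ** years


# One step on the vertical axis (e.g. MIPS to GIPS) is a factor of 1000.
thousandfold_years = years_to_gain(1000)
decade_gain = gain_after(10)

print(f"Years for a 1000x gain: {thousandfold_years:.1f}")
print(f"Gain over one decade:   {decade_gain:.0f}x")
```

Under this assumption, each step on the vertical axis takes a little under 15 years, and a decade yields roughly a hundredfold improvement.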

The number of cores has grown from a few in 2005 to the current tens and is projected to reach hundreds by 2020.

Evolution of Computer Performance/Cost

[Figure: Mental power in four scales. From: “Robots After All,” by H. Moravec, CACM, pp. 90-97, October 2003.]

NRC Report (2011): The Future of Computing Performance: Game Over or Next Level?

Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores

[Figure: curves labeled Density, Perf’ce, Power, Cores, and Clock. Source: [DANO12] “CPU DB: Recording Microprocessor History,” CACM, April 2012.]

Shares of Technology and Architecture in Processor Performance Improvement

[Figure: overall performance improvement (SPECINT, relative to 386) and gate speed improvement (FO4, relative to 386) versus feature size (μm), with time markers ~1985 and 1995-2000. Annotation: much of arch. improvements already achieved.]

The Speed-of-Light Argument

The speed of light is about 30 cm/ns. Signals travel at 40-70% of the speed of light (say, 15 cm/ns).

If signals must travel 1.5 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.

This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.

How does parallel processing help? Wouldn’t multiple processors need to communicate via signals as well?
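The arithmetic behind the 10 GIPS bound can be checked directly. This is a small sketch under the slide's own assumptions (30 cm/ns for light, a 1.5 cm signal path per instruction, one instruction at a time); the function names are illustrative.

```python
# Figures taken from the slide.
SPEED_OF_LIGHT_CM_PER_NS = 30.0
SIGNAL_PATH_CM = 1.5


def min_instruction_time_ns(distance_cm, signal_speed_cm_per_ns):
    """Lower bound on instruction time if a signal must cover `distance_cm`."""
    return distance_cm / signal_speed_cm_per_ns


def max_gips(distance_cm, signal_speed_cm_per_ns):
    """Upper bound on sequential performance in giga-instructions per second.

    One instruction per nanosecond is 1 GIPS, so the bound is the
    reciprocal of the minimum instruction time in nanoseconds.
    """
    return 1.0 / min_instruction_time_ns(distance_cm, signal_speed_cm_per_ns)


# The slide's representative case: signals at 15 cm/ns.
bound_time = min_instruction_time_ns(SIGNAL_PATH_CM, 15.0)
bound_gips = max_gips(SIGNAL_PATH_CM, 15.0)
print(f"{bound_time:.2f} ns per instruction, at most {bound_gips:.0f} GIPS")

# The full 40-70% range of signal speeds quoted on the slide.
for fraction in (0.4, 0.7):
    speed = fraction * SPEED_OF_LIGHT_CM_PER_NS
    print(f"{fraction:.0%} of c: at most {max_gips(SIGNAL_PATH_CM, speed):.0f} GIPS")
```

Across the 40-70% range, the bound moves between about 8 and 14 GIPS, so the conclusion does not depend strongly on the exact signal speed.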

Why Do We Need TIPS or TFLOPS Performance?

Reasonable running time = fraction of an hour to several hours (10^3-10^4 s). In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.

Example 1: Southern oceans heat modeling (10-minute iterations): 300 GFLOP per iteration × 300 000 iterations per 6 yrs = 10^16 FLOP

Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice): 10^9 lattice points × 1000 FLOP/point × 10 000 time steps = 10^16 FLOP

Example 3: Monte Carlo simulation of nuclear reactor: 10^11 particles to track (for 1000 escapes) × 10^4 FLOP/particle = 10^15 FLOP

Decentralized supercomputing (from Mathworld News, 2006/4/7): A grid of tens of thousands of networked computers discovers 2^(30 402 457) − 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits).
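The operation counts in the three examples can be recomputed from the factors on the slide. This is a sketch that simply multiplies those factors and divides by a 1 TFLOPS rate; the variable names are illustrative. Note that the factors given for Example 1 multiply out to 9 × 10^16, so the slide's 10^16 should be read as an order-of-magnitude figure.

```python
# Sustained rate of a TFLOPS machine, in floating-point operations per second.
TFLOPS = 10**12

# Factors exactly as given on the slide (integers keep the products exact).
workloads = {
    "Example 1: ocean heat modeling": 300 * 10**9 * 300_000,
    "Example 2: fluid dynamics": 10**9 * 1000 * 10_000,
    "Example 3: Monte Carlo reactor": 10**11 * 10**4,
}

# Running time of each workload on a 1 TFLOPS machine.
run_times_s = {name: flop / TFLOPS for name, flop in workloads.items()}

for name, flop in workloads.items():
    hours = run_times_s[name] / 3600
    print(f"{name}: {flop:.1e} FLOP, {run_times_s[name]:.0e} s ({hours:.1f} h)")

# Operations available in the "reasonable" 10^3 to 10^4 second window.
window_ops = [TFLOPS * seconds for seconds in (10**3, 10**4)]
print(f"Operations in the window: {window_ops[0]:.0e} to {window_ops[1]:.0e}")
```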

The ASCI Program

Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.

[Figure: performance (TFLOPS, axis ticks 1, 10, 100, 1000) versus calendar year (1995-2010), with Plan, Develop, and Use phases marked. Machines shown: ASCI Red, ASCI Blue, ASCI White, ASCI Purple, ASCI Q. Milestones shown: 1+ TFLOPS, 0.5 TB; 3+ TFLOPS, 1.5 TB; 10+ TFLOPS, 5 TB; 30+ TFLOPS, 10 TB; 100+ TFLOPS, 20 TB.]

The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L

  • Location: LLNL, California
  • Applications: Material science, nuclear stockpile sim
  • Size: 32,768 proc’s, 8 TB, 28 TB disk storage
  • OS: Linux + custom OS
  • Performance and cost: 71 TFLOPS, $100 M
  • Technology: Dual-proc Power-PC chips (10-15 W power)
  • Notes: Full system: 130k-proc, 360 TFLOPS (est)

2. SGI Columbia

  • Location: NASA Ames, California
  • Applications: Aerospace/space sim, climate research
  • Size: 10,240 proc’s, 20 TB, 440 TB disk storage
  • OS: Linux
  • Performance and cost: 52 TFLOPS, $50 M
  • Technology: 20x Altix (512 Itanium2) linked by Infiniband

3. NEC Earth Sim

  • Location: Earth Sim Ctr, Yokohama
  • Applications: Atmospheric, oceanic, and earth sciences
  • Size: 5,120 proc’s, 10 TB, 700 TB disk storage
  • OS: Unix
  • Performance and cost: 36 TFLOPS*, $400 M?
  • Technology: Built of custom vector microprocessors
  • Notes: Volume = 50x IBM, Power = 14x IBM

* Led the top500 list for 2.5 yrs

The Quest for Higher Performance: 2012 Update

Top Three Supercomputers in November 2012 (http://www.top500.org)

1. Cray Titan

  • Location: ORNL, Tennessee
  • Architecture: XK7 architecture
  • Size and OS: 560,640 cores, 710 TB, Cray Linux
  • Interconnect: Cray Gemini interconn’t
  • Performance: 17.6/27.1 PFLOPS*
  • Processors: AMD Opteron, 16-core, 2.2 GHz, NVIDIA K20x
  • Power: 8.2 MW

2. IBM Sequoia

  • Location: LLNL, California
  • Architecture: Blue Gene/Q arch
  • Size and OS: 1,572,864 cores, 1573 TB, Linux
  • Interconnect: Custom interconnect
  • Performance: 16.3/20.1 PFLOPS*
  • Processors: Power BQC, 16-core, 1.6 GHz
  • Power: 7.9 MW

3. Fujitsu K Computer

  • Location: RIKEN AICS, Japan
  • Architecture: RIKEN architecture
  • Size and OS: 705,024 cores, 1410 TB, Linux
  • Interconnect: Tofu interconnect
  • Performance: 10.5/11.3 PFLOPS*
  • Processors: SPARC64 VIIIfx, 2.0 GHz
  • Power: 12.7 MW

* max/peak performance

In the top 10, IBM also occupies ranks 4-6 and 9-10. Dell and NUDT (China) hold ranks 7-8.
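The max/peak and power figures for the November 2012 machines allow two simple comparisons: the fraction of peak performance actually achieved, and performance per watt. The sketch below uses only the numbers listed above; the dictionary layout and function names are illustrative.

```python
# (max PFLOPS, peak PFLOPS, power in MW), as listed for November 2012.
machines = {
    "Cray Titan": (17.6, 27.1, 8.2),
    "IBM Sequoia": (16.3, 20.1, 7.9),
    "Fujitsu K Computer": (10.5, 11.3, 12.7),
}


def fraction_of_peak(max_pflops, peak_pflops):
    """Share of theoretical peak performance reached in the benchmark run."""
    return max_pflops / peak_pflops


def gflops_per_watt(max_pflops, power_mw):
    """Energy efficiency; 1 PFLOPS per MW equals 1 GFLOPS per W."""
    return max_pflops / power_mw


summary = {
    name: (fraction_of_peak(mx, pk), gflops_per_watt(mx, mw))
    for name, (mx, pk, mw) in machines.items()
}

for name, (share, efficiency) in summary.items():
    print(f"{name}: {share:.0%} of peak, {efficiency:.2f} GFLOPS/W")
```

On these figures, the machine with the highest max performance is not the one that comes closest to its peak, which is one reason both numbers are reported.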

Top 500 Supercomputers in the World

1.2 A Motivating Example

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.

Init.:  2 ← m, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

Pass 1: 2, 3 ← m, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29

Pass 2: 2, 3, 5 ← m, 7, 11, 13, 17, 19, 23, 25, 29

Pass 3: 2, 3, 5, 7 ← m, 11, 13, 17, 19, 23, 29

Any composite number has a prime factor that is no greater than its square root.
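The passes in Fig. 1.3 can be reproduced with a short program. This is a minimal sketch, assuming each pass erases the multiples of the current prime m (other than m itself) and that the process stops once m squared exceeds n, which is what the square-root property above permits. The function name is illustrative.

```python
def sieve_passes(n):
    """Return the list contents after each pass of the sieve of Eratosthenes.

    The first entry is the initial list 2..n; each later entry is the list
    after erasing the multiples of the current prime m.
    """
    if n < 2:
        return [[]]
    remaining = list(range(2, n + 1))
    passes = [list(remaining)]
    position = 0              # position of the current prime m in the list
    m = remaining[position]
    while m * m <= n:         # no composite <= n has all prime factors > sqrt(n)
        remaining = [x for x in remaining if x == m or x % m != 0]
        passes.append(list(remaining))
        position += 1
        m = remaining[position]
    return passes


passes = sieve_passes(30)
for label, contents in zip(["Init.", "Pass 1", "Pass 2", "Pass 3"], passes):
    print(f"{label:7s}", contents)
```

For n = 30 the loop stops when m reaches 7, since 49 > 30, and the list left after the third pass holds the 10 primes of the figure.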

Single-Processor Implementation of the Sieve

Fig. 1.4 Schematic representation of single-processor solution for the sieve of Eratosthenes.

[Figure: processor P, holding the current prime and an index, operating on a bit-vector with positions 1, 2, ..., n.]
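A single-processor sieve along the lines of Fig. 1.4 might look as follows. This is a sketch, not the slides' own code: it assumes one flag per number, uses a bytearray (one byte per flag) in place of a packed bit-vector for simplicity, and starts marking at the square of the current prime, which is an optimization chosen here. The names `current_prime` and `index` mirror the two registers in the figure.

```python
def sieve_bit_vector(n):
    """Return the primes up to n using a flag vector, a current prime, and an index."""
    if n < 2:
        return []
    # flags[k] == 1 means k is still unmarked (a candidate prime).
    flags = bytearray([1]) * (n + 1)
    flags[0] = 0
    flags[1] = 0

    current_prime = 2
    while current_prime * current_prime <= n:
        # Sweep the index over the multiples of the current prime and mark them.
        index = current_prime * current_prime
        while index <= n:
            flags[index] = 0
            index += current_prime
        # Advance to the next unmarked number, which is the next prime.
        current_prime += 1
        while not flags[current_prime]:
            current_prime += 1

    return [k for k in range(2, n + 1) if flags[k]]


primes_up_to_30 = sieve_bit_vector(30)
print(primes_up_to_30)
```

A packed bit-vector would cut memory use by a factor of eight at the cost of some bit manipulation; the control structure stays the same.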