Application Performance on the MIT Alewife Multiprocessor

Frederic T. Chong, Beng-Hong Lim, Ricardo Bianchini, John Kubiatowicz, and Anant Agarwal

Dept. of Computer Science, University of California at Davis (Chong)
IBM T.J. Watson Research Center (Lim)
COPPE Systems Engineering, UFRJ/Brazil (Bianchini)
Lab. for Computer Science, Massachusetts Institute of Technology (Kubiatowicz, Agarwal)

Abstract

This study reports on the performance of several applications on the Alewife machine, focusing on emerging applications and evolving architectural mechanisms. It shows that low-latency cache miss handling mechanisms for both local and remote accesses in Alewife make these emerging applications viable candidates for shared-memory parallel processing. The results show that efficient shared memory is an excellent communication mechanism, even for fine-grain applications that do not re-use data. Such applications are thought to favor message-passing. As expected, traditional coarse-grain applications perform well with Alewife's mechanisms. The results also confirm that hardware support for limited sharing is adequate for a broad range of applications, even on large numbers of processors. Additionally, modeling local cache-miss behavior is important for machines such as Alewife, where remote-miss latencies are only five times longer than local miss latencies. We introduce two novel performance metrics that account for the effect of local misses and are more accurate than previously proposed metrics. We conclude that most applications perform well on Alewife. In particular, fine-grain applications can take advantage of Alewife's high integration and efficiency to achieve a new level of performance on scalable shared-memory machines.

Keywords: distributed shared memory, multiprocessor, performance metrics, applications, fine grain

1 Introduction

Developments in the architecture of parallel machines influence the evolution of the structure of parallel programs; emerging parallel applications, in turn, impact the future directions in parallel machine architecture. Benchmark suites and architectural mechanisms constantly evolve from the dynamics of the architecture-applications symbiosis.

This study reports on the performance of several applications on the Alewife machine [ABC+95] (see Sidebar A), focusing on fine-grain applications and evolving architectural mechanisms. The results show that low-latency miss-handling mechanisms for both local and remote accesses in Alewife make fine-grain applications viable candidates for shared-memory parallel processing. We discover that efficient shared memory is an excellent communication mechanism for fine-grain applications, even without data re-use. This is a very interesting result, given that such applications have long been thought to favor message passing over shared memory.

Not surprisingly, Alewife's mechanisms allow traditional coarse-grain applications from the SPLASH and NAS benchmark suites to perform well. The results confirm that hardware support for limited sharing is adequate for a broad range of applications, even on large numbers of processors. Local cache miss behavior turns out to be important on multiprocessors with low remote miss latencies. To account for the effect of local misses, we introduce two performance metrics that provide more accurate and revealing results for Alewife than previously proposed metrics.

Program | Description | Input Data

Coarse-grain programs
Appbt | Solves multiple independent systems of equations | […] floats
Barnes | Simulates movement of bodies under gravitational forces | […]K bodies, […] iters
Cgrid | Straightforward […]D successive overrelaxation | […] x […] floats, […] iters
Chol | Cholesky factorization of a sparse matrix | BCSSTK[…], order […], […] floats
FFT | […]D Fast Fourier Transform | […] floats
Gauss | Unblocked Gaussian elimination | […] x […] floats
Locus | Routes of wires in a standard cell circuit | […] wires
Mg | […]D Poisson solver using multigrid techniques | […] x […] x […] floats
Msort | Sorts a list of integers | […]K integers
Water | Simulates movement of water molecules | […] molecules, […] iterations

Fine-grain programs
EM3D | Electromagnetic wave propagation through 3D objects | […] nodes, […] remote neighbors
ICCG | Preconditioned conjugate gradient sparse solver | BCSSTK[…], order […], […] doubles
MP3D | Simulates rarefied fluid flow | […] particles, […] iterations
MMP3D | Modified mp3d, reduced sharing | […] particles, […] iterations

Table […]: Applications and Kernels

We conclude that both coarse- and fine-grain applications can benefit significantly from efficient mechanisms such as those available in Alewife. Fine-grain applications can perform well on scalable shared-memory multiprocessors, and they represent an important and emerging class that warrants further study.

The rest of this paper is organized as follows. Section 2 describes the applications in this paper. Section 3 analyzes the performance of the applications on Alewife using several novel performance metrics. Section 4 presents a detailed study of three fine-grain applications and describes the mechanisms that enhance their performance. Section 5 summarizes the results, and Section 6 presents the conclusions drawn from this study.

2 Applications

The applications in this study use the shared-memory programming model. They are written in the C programming language with parallel constructs based on the ANL p[…] library. Most of the programs begin with a master process that allocates and initializes a block of globally shared memory. After initialization, the master spawns a number of slave threads that persist until the end of the computation. Computation is usually partitioned and scheduled statically among the threads, which synchronize with locks and barriers.

Table […] provides a short description of each of the applications and their input parameters. MP3D, Barnes, Locus, Chol, and Water are from the SPLASH suite [SWG]. Appbt and MG are part of the NAS parallel benchmarks [Bai]. The rest of the applications are engineering-type kernels from the University of Rochester, MIT [CA], and Berkeley [CDG+].

We categorize the applications into traditional coarse-grain applications and emerging fine-grain applications. The traditional coarse-grain applications are applications that appear in most studies of shared-memory applications and architectures. These include Appbt, Barnes, Cgrid, Chol, FFT, Gauss, Locus, MG, Msort, and Water. These applications are relatively coarse-grained and perform well on Alewife. The emerging fine-grain applications fail to achieve acceptable performance in most shared-memory multiprocessors because they communicate frequently and are sensitive to memory latencies.
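
The master/worker structure described in this section can be sketched as follows. This is a minimal Python illustration and not the applications' C code: the function name `parallel_sum_of_squares`, the dictionary standing in for the block of globally shared memory, and the use of Python's `threading` module in place of the ANL library constructs are all assumptions made for the example.

```python
import threading


def parallel_sum_of_squares(data, num_workers):
    """Hypothetical example of the master/worker pattern with a lock and a barrier."""
    # Master: allocate and initialize the shared state before spawning workers.
    shared = {"total": 0}
    seen_by_worker = [None] * num_workers
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker(rank):
        # Static partitioning: worker `rank` owns elements rank, rank + P, rank + 2P, ...
        partial = sum(x * x for x in data[rank::num_workers])
        with lock:  # the lock serializes updates to the shared accumulator
            shared["total"] += partial
        barrier.wait()  # no worker proceeds until every partial sum has been added
        seen_by_worker[rank] = shared["total"]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:  # workers persist until the end of the computation
        t.join()
    return shared["total"], seen_by_worker
```

Because every worker reads the total only after the barrier, all workers observe the same completed value, which is the role barriers play between phases of a statically scheduled computation.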

[Figure: five bar charts with one group of bars per application. Applications are listed in plotted order, left to right.
(a) Processor Utilization = Work / Total Time, where Total Time = Work + Synchronization + Memory Stalls. Y-axis: fraction of processor utilization. Order: em3d, msort, mp3d, chol, iccg, mmp3d, appbt, fft, locus, mg, gauss, water, barnes, cgrid.
(b) Local Hit Ratio. Y-axis: hit ratio. Order: em3d, iccg, gauss, mp3d, mmp3d, cgrid, mg, locus, chol, appbt, fft, barnes, msort, water.
(c) Remote Hit Ratio. Y-axis: hit ratio. Order: em3d, msort, mp3d, fft, cgrid, water, barnes, mmp3d, chol, iccg, appbt, locus, gauss, mg.
(d) Ratio of Remote to Local Misses, drawn as two sub-plots. Y-axis: ratio. Order: em3d, iccg, cgrid, msort, mg, gauss, mmp3d; then mp3d, fft, water, appbt, locus, barnes, chol.
(e) Weighted Hit Ratio, defined in terms of remote hits, local hits, remote accesses, and local accesses. Y-axis: hit ratio. Order: em3d, mp3d, iccg, barnes, mmp3d, chol, cgrid, fft, gauss, water, appbt, locus, mg, msort.]

Figure […]: Application characteristics, with applications sorted by average. Bars are for […] processors for each application. ICCG is missing the single-processor bar because its dataset does not fit on a single Alewife processor.

[Figure: three bar charts with one group of bars per application, plus an inset table. Applications are listed in plotted order, left to right.
(a) Processor Utilization = Work / Total Time, where Total Time = Work + Synchronization + Memory Stalls. Y-axis: fraction of processor utilization. Order: msort, em3d, chol, mp3d, iccg, locus, mg, mmp3d, appbt, water, fft, gauss, barnes, cgrid.
(b) Computation Grain = Work / Remote Cache Misses. Y-axis: cycles. Order: chol, mp3d, mg, mmp3d, msort, locus, em3d, iccg, appbt, barnes, fft, water, gauss; cgrid is drawn on a separate, larger scale.
(c) Weighted Computation Grain, defined in terms of Work, Remote Cache Misses, and Local Cache Misses. Y-axis: cycles, drawn as two sub-plots. Order: chol, mp3d, em3d, mg, msort, iccg, mmp3d; then locus, appbt, barnes, fft, water, gauss, cgrid.
Inset table: Correlation of Application Ranking for each Metric Relative to Processor Utilization. Rows: Comp / Remote Miss and Comp / Weighted Miss. Columns: processor counts and average.]

Figure […]: Application ranking by processor utilization and granularity. Bars are for […] processors for each application and are sorted by the […]-processor value. Correlations are computed by comparing application rankings, via the sum of squared differences of ranks, for each metric and number of processors.
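
The caption describes comparing two application rankings through the sum of squared rank differences. One standard way to turn that sum into a correlation is Spearman's rank correlation coefficient; the sketch below assumes that normalization, which the text does not confirm. The function names `ranks` and `rank_correlation` are hypothetical, and ties are assumed not to occur.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest value). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_correlation(metric_a, metric_b):
    """Spearman correlation built from the sum of squared rank differences.

    ASSUMPTION: the Spearman normalization 1 - 6*sum(d^2) / (n*(n^2 - 1))
    is used here for illustration only.
    """
    n = len(metric_a)
    if n != len(metric_b) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 entries")
    sum_sq_diff = sum((a - b) ** 2 for a, b in zip(ranks(metric_a), ranks(metric_b)))
    return 1.0 - 6.0 * sum_sq_diff / (n * (n * n - 1))
```

Here `metric_a` would hold one value per application for a granularity metric and `metric_b` the processor utilization of the same applications in the same order; identical rankings give 1.0 and exactly reversed rankings give -1.0.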

 Computation Granularity

The amount of work per remote cache miss, the computation grain of an application, has traditionally been a good indicator of performance. Figure […] shows that computation grain is well correlated with processor utilization. However, the effect of local cache misses is also very important for machines with low remote-to-local memory latency ratios. Thus, we introduce another metric called weighted computation granularity: the amount of work per weighted cache miss. As Figure […] shows, this weighting produces a better metric, which is more correlated to utilization than granularity derived from remote misses alone.

In particular, applications with high local miss rates and fine granularity, such as EM3D and ICCG, are in more appropriate rank order with weighted granularity than with unweighted granularity. Also note that, for both granularity metrics, the ranking for Msort is considerably different from the utilization ranking. This is because neither granularity metric incorporates any notion of load balance. Section […] examines the effect of load balance.

Together, the granularity and cache hit data show that the low utilization of EM3D, MP3D, and ICCG arises because these applications are naturally fine-grained. Their performance is primarily determined by both local and remote memory access latencies.
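
To make the two granularity metrics concrete, the following Python sketch computes both. The exact weighting used for the weighted metric is not recoverable from this text, so the sketch assumes that a local miss counts as one fifth of a remote miss, based on the statement that remote-miss latencies are about five times local-miss latencies. The function names, the parameter `remote_to_local_latency`, and the counts in the usage note are hypothetical.

```python
def computation_grain(work_cycles, remote_misses):
    """Work per remote cache miss, in cycles."""
    if remote_misses == 0:
        return float("inf")
    return work_cycles / remote_misses


def weighted_computation_grain(work_cycles, remote_misses, local_misses,
                               remote_to_local_latency=5.0):
    """Work per weighted cache miss, in cycles.

    ASSUMPTION: each local miss is weighted as 1/remote_to_local_latency
    of a remote miss; the original weighting is not given in the text.
    """
    weighted_misses = remote_misses + local_misses / remote_to_local_latency
    if weighted_misses == 0:
        return float("inf")
    return work_cycles / weighted_misses
```

For example, with 10,000 work cycles, 10 remote misses, and 200 local misses, `computation_grain` gives 1000 cycles while `weighted_computation_grain` gives 200 cycles, so an application with many local misses ranks as finer-grained once local misses are taken into account.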

[Figure: MP3D Machine Comparison. Speedup versus number of processors. Curves: ideal, Alewife mmp3d, Alewife mp3d, DASH mp3d.]

Figure […]: Comparison of MP3D speedup on Alewife versus on DASH.

4 Emerging Fine-Grain Applications

Applications such as MP3D are often considered too naive and fine-grained to be important benchmarks. This naivete, however, is exactly what needs to be supported for the benefit of both compilers and users. After all, a major motivation for shared memory is its ease of programming over message passing. Furthermore, less naive codes such as EM3D and sophisticated codes such as ICCG are inescapably fine-grained. This section discusses these three applications in detail and demonstrates why they are now emerging as viable applications on shared-memory multiprocessors.

4.1 MP3D

MP3D needs efficient support for migratory data. While the particles in the simulation are represented by data that are statically assigned to processors, the wind tunnel through which the particles move is represented by data that migrate frequently.

Figure […] compares the performance of MP3D on Alewife with Stanford DASH [LLG+]. Alewife achieves substantially higher speedups than DASH on MP3D. The primary reason is Alewife's coherence protocol, which is better suited for migratory data. On DASH, a remote dirty cache line owned by one processor is not invalidated but is downgraded to a remote clean (read-only) copy at the time that a second processor attempts to read it. When the second processor subsequently attempts to write this line, it must first invalidate the first processor's copy. Since Alewife performs the invalidation as part of the read transaction, the subsequent write transaction is faster. This saving becomes especially important when data is primarily read and then written by different processors throughout an application.

MMP3D, with more optimized code and data mapping, performs even better. This suggests that this application needs either efficient mechanisms or intelligent data mapping to perform well. The next two applications need both.
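
The difference between the two read policies can be illustrated with a toy model of a single migratory cache line. The sketch below is an assumption-laden simplification and not a model of either machine's real protocol: it ignores the directory, message latencies, and all other lines, and only counts how many invalidations are left to be done at write time. The function name `write_time_invalidations` and the flag `invalidate_on_read` are hypothetical.

```python
def write_time_invalidations(accessors, invalidate_on_read):
    """Count invalidations performed during write transactions for one line.

    `accessors` lists processor ids; each one reads the line and then writes it,
    which is the migratory pattern. With invalidate_on_read=False the previous
    dirty owner keeps a clean read-only copy when another processor reads
    (downgrade). With invalidate_on_read=True the previous owner's copy is
    invalidated as part of the read.
    """
    holders = set()      # processors currently holding a copy of the line
    dirty_owner = None   # processor holding the line dirty, if any
    invalidations = 0
    for p in accessors:
        # Read phase.
        if dirty_owner is not None and dirty_owner != p:
            if invalidate_on_read:
                holders.discard(dirty_owner)
            dirty_owner = None
        holders.add(p)
        # Write phase: every other copy must be invalidated first.
        invalidations += len(holders - {p})
        holders = {p}
        dirty_owner = p
    return invalidations
```

For a line that migrates through four processors in turn, the downgrade policy leaves three invalidations to be done at write time, while the invalidate-on-read policy leaves none, which is the effect the paragraph above attributes to the faster write transaction.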

4.2 EM3D

EM3D is one of the finest-grain applications in the literature. The input dataset consists of a randomly generated graph where […] percent of the edges are between nodes mapped on different processors. EM3D is thought to favor message passing, since it is a graph computation where nodes in the graph have a producer-consumer relationship.

Alewife exhibits good performance on EM3D in spite of the fact that invalidate protocols are suboptimal for producer-consumer computation. As discussed earlier, local cache behavior accounts for much of the good speedup. On […] Alewife processors, each edge takes […] cycles of computation.

For comparison, only a few results are available in the literature. A Berkeley Split-C study [CDG+] achieved […] cycles per edge with a message-passing implementation on the Thinking Machines CM-5 on […] processors. A Wisconsin study [CLR] simulates shared-memory and message-passing machines with hardware configurations based closely on the CM-5. It finds that an invalidation-based shared-memory implementation ([…] cycles/edge) is twice as slow as a message-passing implementation ([…] cycles/edge).

[Figure: ICCG Shared Memory versus Message Passing. Speedup versus number of processors. Curves: ideal, Alewife SM, Alewife MP.]

Figure […]: Comparison of ICCG with shared memory versus message passing on Alewife.

Alewife performs substantially better than previous shared-memory results and competitively with message-passing implementations. The […]-to-[…] remote-to-local miss ratio allows performance to scale well as the number of processors increases, even with an invalidation-based protocol.
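
The producer-consumer graph computation in EM3D can be sketched as follows. The text does not spell out the kernel, so this sketch assumes the form in which the EM3D benchmark is commonly described: a bipartite graph of E and H nodes in which each node subtracts a weighted sum of the values of the nodes it depends on. The function names and data layout are hypothetical, and the code is sequential; in the parallel version, each edge whose endpoints live on different processors becomes a remote access.

```python
def em3d_step(e_values, h_values, e_deps, h_deps):
    """One relaxation step (assumed kernel form).

    e_deps[i] is a list of (h_index, weight) pairs feeding E node i;
    h_deps[i] is a list of (e_index, weight) pairs feeding H node i.
    E nodes consume H values, then H nodes consume the new E values.
    """
    new_e = [v - sum(w * h_values[j] for j, w in e_deps[i])
             for i, v in enumerate(e_values)]
    new_h = [v - sum(w * new_e[j] for j, w in h_deps[i])
             for i, v in enumerate(h_values)]
    return new_e, new_h


def remote_edge_fraction(deps, consumer_owner, producer_owner):
    """Fraction of edges whose consumer and producer are on different processors."""
    total = 0
    remote = 0
    for i, edges in enumerate(deps):
        for j, _weight in edges:
            total += 1
            if consumer_owner[i] != producer_owner[j]:
                remote += 1
    return remote / total if total else 0.0
```

The second function corresponds to the input parameter mentioned above, the share of edges that connect nodes mapped on different processors; each edge performs only a multiply and a subtract, which is why the application is so sensitive to memory latency.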

4.3 ICCG

ICCG is a difficult application to parallelize and has only recently attained reasonable performance on multiprocessors. With only […] to […] floating-point operations per double word of data communicated, ICCG has historically been unparallelizable.

ICCG is similar to EM3D in that its kernel involves a producer-consumer graph computation. However, ICCG uses real scientific datasets mapped onto the machine with the best-known algorithms. More importantly, the sparse triangular solve kernel in ICCG requires finer-grained synchronization than the relaxation computation in EM3D.

To synchronize, shared-memory ICCG uses read-modify-write operations, where the producer of a value can perform an accumulate to a variable on a remote processor. Consequently, the performance of ICCG is latency-critical: each read-modify-write operation requires three round-trip messages between processors upon a remote miss. In contrast with three messages, an active message implementation needs only a single message. Figure […] makes this comparison. Surprisingly, the shared-memory implementation performs slightly better, since shared-memory references are handled directly in hardware and incur minimal overhead as compared to active messages, which have to be handled in software. Even when added to round-trip network latencies, shared-memory latencies are much lower than those for message passing.
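
The accumulate pattern in the sparse triangular solve can be sketched sequentially. The code below assumes a column-oriented forward substitution for a lower-triangular system, in which each newly computed unknown is pushed into the accumulators of the rows that depend on it; the text does not state which formulation ICCG uses, and the function name and data layout are hypothetical. In a parallel version, each accumulate would be a read-modify-write on a variable that may live on a remote processor.

```python
def sparse_lower_solve(diag, columns, b):
    """Solve L x = b for lower-triangular L stored by columns.

    diag[j] is L[j][j]; columns[j] lists (i, L[i][j]) for the nonzeros
    below the diagonal (i > j). Returns the solution and the number of
    accumulate operations performed.
    """
    n = len(b)
    acc = list(b)       # acc[i] = b[i] minus the contributions received so far
    x = [0.0] * n
    accumulates = 0
    for j in range(n):
        # All producers for row j have already accumulated into acc[j].
        x[j] = acc[j] / diag[j]
        for i, l_ij in columns[j]:
            acc[i] -= l_ij * x[j]   # producer-side accumulate into consumer i
            accumulates += 1
    return x, accumulates
```

Each off-diagonal nonzero costs one multiply, one subtract, and one accumulate to another row, so when rows are spread across processors the ratio of computation to communication is very low, in line with the latency sensitivity described above.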

5 Summary

We summarize by presenting overall speedup results and revisiting the major factors leading to those results. Figure […] presents the familiar overlapping speedup curves for the applications. It also presents a more digestible picture of the fraction of ideal speedup, sorted by average.

Most of the traditional coarse-grain applications perform well: Appbt, Barnes, Cgrid, FFT, Gauss, MG, and Water. Locus causes a large number of LimitLESS traps, which results in high memory latencies. Chol suffers from limited parallelism due to a small dataset size. Msort suffers from load imbalance.

The fine-grain applications exhibit new levels of performance. EM3D performs surprisingly well due to local cache behavior and fast communication. MP3D has the worst efficiency, but Alewife speedups are high when compared to other machines. MMP3D performs even better. ICCG also performs unexpectedly well for a shared-memory multiprocessor.

6 Conclusions

This paper presents the performance of several applications on the Alewife machine, focusing on emerging applications and the architectural mechanisms that allow them to perform well on the machine. The main contributions of this paper are:

- It determines the mechanisms that allow fine-grained applications to perform well on Alewife: low remote-to-local memory access latency ratios, hardware support for shared-memory computing, and a coherence protocol optimized for migratory sharing.

- It characterizes the performance of a large number of applications in terms of several metrics: processor utilization, local and remote miss ratios, computation granularity, load balance, and sharing behavior. In addition to these well-known metrics, it introduces two new metrics that account for local and global misses and their overheads. Previous metrics mispredict performance on machines such as Alewife.

Fine-grain applications can perform well on scalable shared-memory multiprocessors. These applications are an important emerging class that warrants further study. Overall, both coarse- and fine-grain applications can benefit significantly from efficient mechanisms such as those available in Alewife.

 Acknowledgements

Thanks to Fredrik Dahlgren and our anonymous referees.

References

[ABC+95] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Ken Mackenzie, and Donald Yeung. The MIT Alewife machine: Architecture and performance. In Proc. 22nd Annual International Symposium on Computer Architecture, June 1995.

[AG] Anant Agarwal and Anoop Gupta. Memory-Reference Characteristics of Multiprocessor Applications under MACH. In Proceedings of ACM SIGMETRICS […], pages […], May […].

[Bai] D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-[…], NASA Ames Research Center, March […].

[CA] Frederic T. Chong and Anant Agarwal. Shared memory versus message passing for iterative solution of sparse, irregular problems. Technical Report MIT/LCS/TR-[…], MIT Laboratory for Computer Science, September […].

[CDG+] David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick. Parallel programming in Split-C. In Supercomputing, November […].

[CLR] Satish Chandra, James R. Larus, and Anne Rogers. Where is time spent in message-passing and shared-memory programs? In ASPLOS VI, pages […], San Jose, California, […].

[DRPS] F. Darema-Rogers, G. F. Pfister, and K. So. Memory Access Patterns of Parallel Scientific Programs. In Proceedings of ACM SIGMETRICS […], pages […], May […].

[EK] S. J. Eggers and R. H. Katz. A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation. In Proceedings of the […]th International Symposium on Computer Architecture, New York, June […]. IEEE.

[LLG+] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam. The Stanford Dash multiprocessor. Computer, […], March […].

[SWG] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, […], March […].

[WOT+] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In 22nd Annual International Symposium on Computer Architecture, pages […], June […].

Sidebar B: Application Studies

The cycle of influence between shared-memory multiprocessor architecture and applications has been in motion for at least a decade and continues to this day. The early studies, based on trace-driven simulation of shared-memory applications, focused on the memory reference characteristics [AG, DRPS] and coherence-protocol behaviors [EK] for small-scale shared-memory multiprocessor architectures. Later studies [SWG, WOT+], based on execution-driven simulators, characterize in more detail the behavior of shared-memory applications on larger-scale shared-memory architectures. They include the effect of load imbalance, synchronization overhead, and parallelization overhead, in addition to cache and memory behavior. An important result of research on shared-memory applications is a set of SPLASH and NAS benchmarks that has driven much research into shared-memory architectures and cache-coherence protocols.

During the same period, several experimental shared-memory multiprocessors were designed and built. Stanford DASH was the first such machine, followed by MIT Alewife, and most recently by the Wisconsin Typhoon prototypes. Researchers have used these machines as a vehicle to characterize the behavior of shared-memory applications [LLG+, ABC+95].

This study uses the built-in statistics hardware in Alewife to characterize the performance of the shared-memory applications in this paper. It considers emerging fine-grain applications as well as the SPLASH and NAS applications. The analysis focuses on computation granularity, cache miss ratios, sharing patterns, and load balance as primary determinants of application performance. Unlike previous studies that focused on remote cache misses, it considers both local and remote cache misses. It does not provide details on the cause of cache misses in each of the applications; Woo et al. [WOT+] already provide such details.

This study is most closely related to [LLG+], due to the similarities between DASH and Alewife. A comparison of the results from both studies shows that Alewife is more successful at fine-grain or irregular applications, primarily due to its shorter memory latencies. Both machines perform comparably well for coarse-grain applications.

With the availability of both scalable shared-memory machines and simulators, there is a debate over whether to use machines or simulators to characterize shared-memory applications. Real machines can execute more realistic programs and inputs, and their measurements capture all the nuances of a real hardware implementation. The disadvantage is that it is hard to compare different architectures, and some observed effects may be due to an artifact of the machine rather than some fundamental principle. This study uses the machine-based approach. The ideal approach is to use both machines and simulators for application studies and cross-check the results from one with the other.

Fred Chong is an Assistant Professor of Computer Science at the University of California at Davis. He received his BS, MS, and PhD degrees in electrical engineering and computer science from MIT. His current work focuses on architectures and applications for multigrain parallel systems. His research interests include communication, applications, theory, and VLSI for parallel and distributed systems.

Beng-Hong Lim is a research staff member at the IBM T.J. Watson Research Center. He is part of a research team that is designing IBM's next-generation scalable multiprocessor systems. His current research interests are in the architecture, operating systems, and programming models for scalable and reliable high-performance computing. He received BS, MS, and PhD degrees in electrical engineering and computer science from MIT, where he served as one of the principal members of the Alewife project.

Ricardo Bianchini is a research associate at COPPE Systems Engineering/UFRJ, where he co-leads the NCP2 project, a large research effort focused on building hardware support for software-coherent distributed shared-memory systems. He received his PhD degree in Computer Science from the University of Rochester in […]. His research interests include parallel and distributed computing, advanced operating systems, and parallel computer architecture.

John Kubiatowicz is a doctoral candidate in the Department of Electrical Engineering and Computer Science at MIT. His current research interests include parallel computer architecture, high-performance microprocessor design, artificial life, and high-energy particle physics. He received BS degrees in electrical engineering and physics and an MS in electrical engineering from MIT.

Anant Agarwal received his BTech at the Indian Institute of Technology in Madras, India, in […], and his MS and PhD at Stanford University in […]. Currently, he is an Associate Professor of Computer Science and Electrical Engineering at MIT, where he leads the Alewife and RAW Projects.