Exploiting Software Information for an Efficient Memory Hierarchy, Thesis of Network Technologies and TCP/IP

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at the University of Illinois at Urbana-Champaign in 2014. The document focuses on identifying inefficiencies related to coherence, data communication, and data storage in memory hierarchies of today's systems and proposes various techniques to mitigate them by exploiting information from the software. The document proposes DeNovo, a hardware-software co-designed protocol, to address the issues related to coherence and communication. It also extends DeNovo to add two optimizations to address the inefficiencies related to data communication.

EXPLOITING SOFTWARE INFORMATION FOR
AN EFFICIENT MEMORY HIERARCHY
BY
RAKESH KOMURAVELLI
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2014
Urbana, Illinois
Doctoral Committee:
Professor Sarita V. Adve, Director of Research
Professor Marc Snir, Chair
Professor Vikram S. Adve
Professor Wen-mei W. Hwu
Dr. Ravi Iyer, Intel Labs
Dr. Gilles Pokam, Intel Labs
Dr. Pablo Montesinos, Qualcomm Research

ABSTRACT

Power consumption is one of the most important factors in the design of today’s processor chips. Multicore and heterogeneous systems have emerged to address the rising power concerns. Since the memory hierarchy is becoming one of the major consumers of the on-chip power budget in these systems [73], designing an efficient memory hierarchy is critical to future systems. We identify three sources of inefficiencies in memory hierarchies of today’s systems: (a) coherence, (b) data communication, and (c) data storage. This thesis takes the stand that many of these inefficiencies are a result of today’s software-agnostic hardware design. There is a lot of information in the software that can be exploited to build an efficient memory hierarchy. This thesis focuses on identifying some of the inefficiencies related to each of the above three sources, and proposing various techniques to mitigate them by exploiting information from the software.

First, we focus on inefficiencies related to coherence and communication. Today’s hardware-based directory coherence protocols are extremely complex and incur unnecessary overheads for sending invalidation messages and maintaining sharer lists. We propose DeNovo, a hardware-software co-designed protocol, to address these issues for a class of programs that are deterministic. DeNovo assumes a disciplined programming environment and exploits features such as structured parallel control, data-race-freedom, and software information about data access patterns to build a system that is simple, extensible, and performance-efficient compared to today’s protocols. We also extend DeNovo with two optimizations that address the inefficiencies related to data communication, specifically aimed at reducing unnecessary on-chip network traffic. We show that these two optimizations not only added no new states (or transient states) to the protocol but also provided performance and energy gains to the system, thus validating the extensibility of the DeNovo protocol. Together with the two communication optimizations, DeNovo reduces the memory stall time by 32% and the network traffic by 36% (resulting in direct savings in energy) on average compared to a state-of-the-art implementation of the MESI protocol for the applications studied.

Next we address the inefficiencies related to data storage. Caches and scratchpads are two popular

To my parents and my brother


ACKNOWLEDGMENTS

My Ph.D. journey has been long. There are several people whom I would like to thank for supporting me and believing in me throughout the journey. First and foremost, I want to thank my advisor, Sarita Adve, for giving me the opportunity to pursue my dream of getting a Ph.D. I am very grateful for her constant guidance and support over the past six years and for making me a better researcher. Her immense enthusiasm for research, her never-give-up attitude, and her constant striving for the best quality are some of the traits that will inspire and motivate me forever. I am truly honored to have Sarita as my Ph.D. advisor.

I would also like to thank Vikram Adve for his constant guidance on the DeNovo and the stash projects. If there were any formal designation, he would perfectly fit the description of a co-advisor. I thank Nick Carter and Ching-Tsun Chou for their collaborations on the DeNovo project and Pablo Montesinos for his collaboration on the stash project. I also sincerely thank the rest of my Ph.D. committee, Marc Snir, Wen-mei Hwu, Ravi Iyer, and Gilles Pokam, for their insightful comments and suggestions for improvements on my thesis. Special thanks to Bhushan Chitlur, my internship mentor at Intel, for exposing me to real-world architecture problems and experience.

I am thankful to Hyojin Sung and Byn Choi for the collaborations on the DeNovo project and to Matt Sinclair for the collaborations on the stash project. I have learned a lot by working closely with these three folks. In addition, I am also thankful to my other collaborators on these projects, Rob Bocchino, Nima Honarmand, Rob Smolinski, Prakalp Srivastava, Maria Kotsifakou, John Alsop, and Huzaifa Muhammad. I thank my other lab-mates, Pradeep Ramachandran, Siva Hari, Radha Venkatagiri, and Abdulrahman Mahmoud, who played a very important role in not only providing a great and fun research environment but also aiding me in intellectual development.

I thank the Computer Science department at Illinois for providing a wonderful Ph.D. curriculum and flexibility for conducting research. Specifically, I would like to thank the staff members, Molly, Andrea,


CHAPTER 1

INTRODUCTION

1.1 Motivation

Recent advances in semiconductor technology have helped Moore’s law to continue. In the past, when leakage current was minimal, increased chip densities accompanied by supply voltage scaling resulted in constant power consumption for a given area of the chip. Unfortunately, with the recent breakdown of classical CMOS voltage scaling, power has become a first-class problem in the design of processor chips, leading to new research directions in the field of computer architecture.

Multicores are one such attempt to address the rising power consumption problem. Alternatively, heterogeneous systems take a different approach where power-efficient individual components (e.g., GPU, DSP, FPGA, accelerators, etc.) are specialized for various problem domains as opposed to a general-purpose homogeneous multicore system. However, these specialized components differ in many aspects including ISAs, functionality, and underlying memory models and hierarchy. These differences imply difficulty in building a power-efficient heterogeneous system that can be effectively used. Both standalone multicores and a cluster of specialized components have their own advantages and disadvantages. Hence we are increasingly seeing the trend towards hybrid systems which have part multicore and part specialized components [82, 73, 32, 68].

With the rise of such hybrid systems, today’s computer systems, from smartphones to servers, are more complex than ever before. Data movement in these systems is expected to become the dominant consumer of energy as technology continues to scale [73]. For example, a recent study has shown that by 2017 more than 50% of the total energy for a 64-bit GPU floating-point computation will be spent in the memory access (reading three source operands and writing to a destination operand from/to an 8KB SRAM) [73]. This highlights the urgent need for minimizing data movement and for an energy-efficient memory hierarchy in future scalable computer systems.

Shared-memory is arguably the most widely used parallel programming model. Today’s shared-memory hierarchies have several inefficiencies. In this thesis, we focus on homogeneous multicores and heterogeneous SoC systems. In multicores, complex directory-based coherence protocols, inefficient data transfers, and power-inefficient caches make it hard to design performance-, power-, and complexity-scalable hardware. These inefficiencies are exacerbated as more and more cores are added to the system.

Traditionally, memory units of different components in heterogeneous SoC systems are only loosely coupled with respect to one another. Any communication between the components requires interaction through main memory, which incurs unnecessary data movement and latency overheads. Recent designs such as AMD’s Fusion [32] and Intel’s Haswell [68] address this issue by creating more tightly coupled systems with a single unified address space and coherent caches. By tightly coupling the cores, data can be sent from one component to another without needing an explicit transfer through the main memory. However, these architectures have other inefficiencies in the memory hierarchy. For example, these systems provide only partial coherence and local memories are not globally accessible.

Many of these problems of shared-memory systems are because of today’s software-agnostic hardware design. They can be mitigated by having more disciplined programming models and by exploiting the information that is already available in the software. Many of today’s undisciplined programming models allow arbitrary reads and writes for implicit and unstructured communication and synchronization. This results in “wild shared-memory” behaviors with unintended data races, non-determinism, and implicit side effects. The same phenomena result in complex hardware that must assume that any memory access may trigger communication, and performance- and power-inefficient hardware that is unable to exploit communication patterns known to the programmer but obfuscated by the programming model. There is much recent software work on more disciplined shared-memory programming models to address the above problems. We believe that exploiting the guarantees provided by such disciplined programming models will help us alleviate some of the inefficiencies in the memory hierarchy. Applications also have a lot of other information that could be utilized by the hardware to be more efficient. Applications for heterogeneous systems (e.g., using CUDA and OpenCL programming models) have additional information like which data is communicated between the CPU and the accelerator, which parts of the main memory are explicitly assigned to a


1.3 Contributions of this Thesis

In this thesis, we analyze each of the above three types of memory hierarchy inefficiencies, find ways to exploit information available in software, and propose solutions to mitigate them to make hardware more energy-efficient. We limit our focus to deterministic codes in this thesis for multiple reasons: (1) There is a growing view that deterministic algorithms will be common, at least for client-side computing [1]; (2) focusing on these codes allows us to investigate the “best case;” i.e., the potential gain from exploiting strong discipline; (3) these investigations form a basis to develop the extensions needed for other classes of codes (pursued partly for this thesis and partly by other members of the larger project). Synchronization mechanisms involve races and are used in all classes of codes; in this thesis, we assume special techniques to implement them (e.g., hardware barriers, queue based locks, etc.). Their detailed handling is explored by the larger project (some of this work is described below) and is not part of this thesis. The specific contributions of this thesis are as follows.

1.3.1 DeNovo: Addressing Coherence and Communication Inefficiencies

DeNovo [45] addresses the many inefficiencies of today’s hardware-based directory coherence protocols.^1 It assumes a disciplined programming environment and exploits properties of such environments like structured parallel control, data-race-freedom, deterministic execution, and software information about which data is shared and when. DeNovo uses Deterministic Parallel Java (DPJ) [28, 29] as an exemplar disciplined language providing these properties. Two key insights underlie DeNovo’s design. First, structured parallel control and knowing which memory regions will be read or written enable a cache to take responsibility for invalidating its own stale data. Such self-invalidations remove the need for a hardware directory to track sharer lists and to send invalidations and acknowledgements on writes. Second, data-race-freedom eliminates concurrent conflicting accesses and corresponding transient states in coherence protocols, eliminating a major source of complexity. Specifically, DeNovo provides the following benefits.

(^1) I co-led the design and evaluation of the DeNovo protocol with my colleagues, Byn Choi and Hyojin Sung [45]. This work will also appear in Hyojin Sung’s thesis. I was solely responsible for the verification work for the DeNovo protocol [78]. This work also appears in my M.S. thesis and is presented here for completeness.

Simplicity: To provide quantitative evidence of the simplicity of the DeNovo protocol, we compared it with a conventional MESI protocol [108] by implementing both in the Murphi model checking tool [54]. For MESI, we used the implementation in the Wisconsin GEMS simulation suite [94] as an example of a (publicly available) state-of-the-art, mature implementation. We found several bugs in MESI that involved subtle data races and took several days to debug and fix. The debugged MESI showed 15X more reachable states compared to DeNovo, and took 173 seconds to verify versus 8.66 seconds for DeNovo [78]. These results attest to the complexity of the MESI protocol and the relative simplicity of DeNovo.

Extensibility: To demonstrate the extensibility of the DeNovo protocol, we implemented two optimizations addressing inefficiencies related to data communication: (1) Direct cache-to-cache transfer: Data in a remote cache may directly be sent to another cache without indirection to the shared lower level cache (or directory). (2) Flexible communication granularity: Instead of always sending a fixed cache line in response to a demand read, we send a programmer-directed set of data associated with the region information of the demand read. Neither optimization required adding any new protocol states to DeNovo; since there are no sharer lists, valid data can be freely transferred from one cache to another.

Storage overhead: The DeNovo protocol incurs no storage overhead for directory information. But we need to maintain coherence state bits and additional information at the granularity at which we guarantee data-race-freedom, which can be less than a cache line. For low core counts, this overhead is higher than with conventional directory schemes, but it pays off after a few tens of cores and is scalable (constant per cache line). A positive side effect is that it is easy to eliminate the requirement of inclusivity in a shared last level cache (since we no longer track sharer lists). Thus, DeNovo allows more effective use of shared cache space.

Performance and power: In our evaluations, the DeNovo coherence protocol together with the communication optimizations described above reduces memory stall time by 32% on average (up to 77%) and network traffic by 36% on average (up to 71.5%) compared to MESI. The reductions in network traffic translate directly into energy savings.
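The first insight above, that a cache can invalidate its own stale data once it knows which regions were written in a parallel phase, can be illustrated with a toy model. The sketch below is not the DeNovo protocol. The `ToyCache` class, the three state names, and the `end_phase` method are assumptions made for illustration, and fetching data and registering writes at the shared cache are not modeled.

```python
from dataclasses import dataclass, field

# Assumed toy states; the thesis text above does not define them.
INVALID, VALID, REGISTERED = "Invalid", "Valid", "Registered"

@dataclass
class ToyCache:
    # address -> (state, region name); illustrative bookkeeping only
    lines: dict = field(default_factory=dict)

    def read(self, addr, region):
        state, _ = self.lines.get(addr, (INVALID, region))
        if state == INVALID:
            # Miss: the fetch itself is not modeled; keep a Valid copy.
            self.lines[addr] = (VALID, region)
            return "miss"
        return "hit"

    def write(self, addr, region):
        # This core now holds the up-to-date copy of the word.
        self.lines[addr] = (REGISTERED, region)

    def end_phase(self, written_regions):
        # Self-invalidation: drop Valid copies in regions that software
        # says were written this phase. Data this core wrote itself
        # (Registered) is current and is kept. No directory is consulted.
        for addr, (state, region) in list(self.lines.items()):
            if state == VALID and region in written_regions:
                self.lines[addr] = (INVALID, region)

cache = ToyCache()
cache.read(0x10, "A")    # Valid copy of a word in region A
cache.write(0x20, "A")   # this core's own write in region A
cache.read(0x30, "B")    # Valid copy of a word in region B
cache.end_phase({"A"})   # software reports region A as written
```

After `end_phase`, only the possibly stale copy at `0x10` is invalidated. No invalidation or acknowledgment message is exchanged, which is the property the text attributes to self-invalidation.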

1.3.2 Stash: Addressing Storage Inefficiencies

The memory hierarchies of heterogeneous SoCs are often loosely coupled and require explicit communication through main memory to interact. This results in unnecessary data movement and latency overheads. A more tightly coupled SoC memory hierarchy helps address these problems, but doesn’t remove all sources of inefficiency such as power-inefficient cache accesses and scratchpads that are only locally visible. To

lock synchronization, and shows 33% less network traffic on average, implying potential energy savings. My specific contributions to DeNovoND are designing and implementing queue based locks in hardware.

1.5 Outline of the Thesis

This thesis is organized as follows. Chapter 2 describes our solutions to address the coherence and communication inefficiencies. In this chapter, we describe the DeNovo coherence protocol and the two communication optimizations that extend DeNovo. Chapter 3 provides a complexity analysis of DeNovo by formally verifying it and comparing the effort against that of a state-of-the-art implementation of MESI. We provide performance analysis of DeNovo in Chapter 4. In Chapter 5, we introduce stash that addresses the storage inefficiencies. We provide performance evaluation of the stash organization in Chapter 6. Chapter 7 describes the prior work. Finally, Chapter 8 summarizes the thesis and provides directions for future work.

1.6 Summary

On-chip energy has become one of the primary constraints in building computer systems. Today’s complex and software-oblivious systems have several inefficiencies which are hindrances for building future energy-efficient systems. This thesis takes the stand that there is a lot of information in the software that can be exploited to remove these inefficiencies. We focus on three sources of inefficiencies in today’s memory hierarchies: (a) coherence, (b) data communication, and (c) data storage.

Specifically, we propose a simple and scalable hardware-software co-designed DeNovo coherence protocol to address inefficiencies in today’s complex hardware directory based protocols. We extend DeNovo with two optimizations that are aimed at reducing the unnecessary on-chip network traffic addressing the inefficiencies in data communication. Finally, to address several inefficiencies with data storage, we propose a new memory organization, stash, that has the best of both scratchpad and cache organizations.

Together, we show that a true software-hardware co-designed system that exploits information from software makes for an efficient system compared to today’s largely software-oblivious systems.

CHAPTER 2

COHERENCE AND COMMUNICATION

In a shared-memory system, coherence is required when multiple compute units (homogeneous or heterogeneous) replicate and modify the same data. Coherence is usually associated with cache memory organization. But similar to caches, there are other memory organizations like stash, as described in Chapter 5, that hold globally addressable and replicable data, which require coherence too. Shared-memory systems typically implement coherence with snooping or directory-based protocols in the hardware. Although current directory-based protocols are more scalable than snooping protocols, they suffer from several limitations:

Performance and power overhead: They incur several sources of latency and traffic overhead, impacting performance and power; e.g., they require invalidation and acknowledgment messages (which are strictly overhead) and indirection through the directory for cache-to-cache transfers.

Verification complexity and extensibility: They are notoriously complex and difficult to verify since they require dealing with subtle races and many transient states (Section 2.1.2) [103, 60]. Furthermore, their fragility often discourages implementors from adding optimizations to previously verified protocols; additions usually require re-verification due to even more states and races.

State overhead: Directory protocols incur high directory storage overhead to track sharer lists. Several optimized directory organizations have been proposed, but also require considerable overhead and/or excessive network traffic and/or complexity. These protocols also require several coherence state bits due to the large number of protocol states (e.g., ten bits in [115]). This state overhead is amortized by tracking coherence at the granularity of cache lines. This can result in performance/power anomalies and inefficiencies when the granularity of sharing is different from a contiguous cache line (e.g., false sharing).
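To make the state-overhead point concrete, the following back-of-the-envelope sketch computes the sharer-list storage of a full bit-vector directory, which keeps one presence bit per core for every cache line. The 64-byte line size and the function name `full_map_overhead` are assumptions chosen for illustration. The optimized directory organizations mentioned above would lower these numbers at the cost of extra traffic or complexity.

```python
def full_map_overhead(num_cores: int, line_bytes: int = 64) -> float:
    """Sharer-list bits per cache line as a fraction of the line's data bits.

    Assumes a full bit-vector directory: one presence bit per core per line.
    """
    data_bits = line_bytes * 8
    return num_cores / data_bits

# The overhead grows linearly with the number of cores.
for n in (16, 64, 256):
    print(f"{n:4d} cores: {full_map_overhead(n):.1%} of each line's data bits")
```

Under these assumptions the sharer list alone grows in direct proportion to the core count, which is why the text treats directory storage as a scalability concern.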
Researchers continue to propose new hardware directory organizations and protocol optimizations to address one or more of the above limitations (Section 7.1); however, all of these approaches incur one or


DPJ is an extension to Java that enforces deterministic-by-default semantics via compile-time type checking [28, 29]. Using Java is not essential; similar extensions for C++ are possible. DPJ provides a new type and effect system for expressing important patterns of deterministic and non-deterministic parallelism in imperative, object-oriented programs. Non-deterministic behavior can only be obtained via certain explicit constructs. For a program that does not use such constructs, DPJ guarantees that if the program is well-typed, any two parallel tasks are non-interfering, i.e., do not have conflicting accesses. (Two accesses conflict if they reference the same location and at least one is a write.)

DPJ’s parallel tasks are iterations of an explicitly parallel foreach loop or statements within a cobegin block; they synchronize through an implicit barrier at the end of the loop or block. Parallel control flow thus follows a scoped, nested, fork-join structure, which simplifies the use of explicit coherence actions in DeNovo at fork/join points. This structure defines a natural ordering of the tasks, as well as an obvious definition of when two tasks are “concurrent”. It implies an obvious sequential equivalent of the parallel program (for replaces foreach and cobegin is simply ignored). DPJ guarantees that the result of a parallel execution is the same as the sequential equivalent.

In a DPJ program, the programmer assigns every object field or array element to a named “region” and annotates every method with read or write “effects” summarizing the regions read or written by that method. The compiler checks that (i) all program operations are type safe in the region type system; (ii) a method’s effect summaries are a superset of the actual effects in the method body; and (iii) no two parallel statements interfere. The effect summaries on method interfaces allow all these checks to be performed without interprocedural analysis.

For DeNovo, the effect information tells the hardware what fields will be read or written in each parallel “phase” (foreach or cobegin). This enables efficient software-controlled coherence mechanisms discussed in the following sections.

DPJ has been evaluated on a wide range of deterministic parallel programs. The results show that DPJ can express a wide range of realistic parallel algorithms; that its type system features are useful for such programs; and that well-tuned DPJ programs exhibit good performance [28].

In addition to guaranteeing determinism, DPJ was later extended to provide strong safety properties such as data-race-freedom, strong isolation, and composition for non-deterministic code sections [29]. This is achieved by ensuring that conflicting accesses in concurrent tasks are confined to atomic sections and their regions and effects are explicitly annotated as atomic. In this thesis, we focus only on the deterministic codes.
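The non-interference condition can be sketched outside of DPJ. The Python below only illustrates the rule stated above (two accesses conflict if they reference the same location and at least one is a write), applied to region-level effect summaries. The `Effect` type and the `conflicts` and `interfere` functions are hypothetical names. They are neither DPJ syntax nor its type checker, and flat region names stand in for DPJ's richer region system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Effect:
    kind: str    # "read" or "write"
    region: str  # flat region name (a simplification of DPJ regions)

def conflicts(a: Effect, b: Effect) -> bool:
    # Same region and at least one of the two effects is a write.
    return a.region == b.region and "write" in (a.kind, b.kind)

def interfere(task1, task2) -> bool:
    # Two tasks interfere if any pair of their effects conflicts.
    return any(conflicts(a, b) for a in task1 for b in task2)

# Two tasks that write disjoint regions and only read a shared one.
left = [Effect("write", "Left"), Effect("read", "Shared")]
right = [Effect("write", "Right"), Effect("read", "Shared")]
# A task that writes the region the others read.
bad = [Effect("write", "Shared")]

print(interfere(left, right))  # the two tasks may run in parallel
print(interfere(left, bad))    # a parallel composition would be rejected
```

Because the check runs on per-task summaries, it mirrors the point that effect summaries on method interfaces make interprocedural analysis unnecessary.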

2.1.2 Complexity of Traditional Coherence Protocols

To understand the complexity of today’s directory-based protocols [78], we briefly discuss the details of a state-of-the-art, mature, publicly available protocol, the MESI protocol implemented in the Wisconsin GEMS simulation suite (version 2.1.1) [94]. Without loss of generality, we assume a multicore system with n cores, private L1 caches, a shared L2 cache, and a general (non-bus, unordered) interconnect on chip.

[Figure 2.1: state transition diagram]

Figure 2.1: Textbook state transition diagram for the L1 cache of core i for the MESI protocol. Read_i = read from core i, Read_k = read from another core k.
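The textbook L1 transitions can also be written as a small function. The sketch below encodes the commonly taught MESI stable-state transitions and may differ in detail from Figure 2.1. The event names follow the caption's Read_i and Read_k, while `Write_i`, `Write_k`, the `others_have_copy` flag, and the function name are assumptions. Transient states, the directory, and all messages are deliberately left out.

```python
def next_state(state: str, event: str, others_have_copy: bool = False) -> str:
    """Next stable MESI state of core i's L1 line (textbook view only)."""
    if state not in ("M", "E", "S", "I"):
        raise ValueError(f"unknown state: {state}")
    if event == "Read_i":
        if state == "I":
            # A read miss yields Shared if another cache holds the line,
            # otherwise Exclusive.
            return "S" if others_have_copy else "E"
        return state  # read hit: state unchanged
    if event == "Write_i":
        # From E this needs no invalidation traffic; from S or I the
        # other copies must be invalidated (not modeled here).
        return "M"
    if event == "Read_k":
        # Another core reads: a local copy, if any, becomes Shared.
        return "I" if state == "I" else "S"
    if event == "Write_k":
        # Another core writes: the local copy is invalidated.
        return "I"
    raise ValueError(f"unknown event: {event}")

# Walk one line through a short sequence of events.
s = "I"
for ev in ("Read_i", "Write_i", "Read_k", "Write_k"):
    s = next_state(s, ev)
```

Only four stable states appear here. The complexity discussed in this section comes from the transient states and races that an implementation on an unordered interconnect must add to such a table.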

MESI, also known as the Illinois protocol [108], stands for Modified (locally modified and no other cache has a copy), Exclusive (unmodified and no other cache has a copy), Shared (unmodified and some other caches may have a copy), and Invalid. Over the MSI protocol, the Exclusive state has the added advantage of avoiding invalidation traffic on write hits. For scalability, we assume a directory protocol [86]. Given our shared (inclusive) L2 cache based multicore, we assume a directory entry per L2 cache line, referred to as an in-cache directory [39]. We use L2 and directory interchangeably.

Figure 2.1 shows the simple textbook state transition diagram for an L1 cache with the MESI protocol. The L2 cache also has four (textbook) states, L1 Modified (modified in a local L1), L2 Modified
