

























































































































































































Compiler Design for Accelerating Applications on Coarse-Grained Reconfigurable Architectures
by
Mahesh Balasubramanian
A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy
Approved October 2021 by the Graduate Supervisory Committee: Aviral Shrivastava (Chair), Chaitali Chakrabarti, Fengbo Ren, Laura Pozzi
December 2021
Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging accelerators that promise low-power acceleration of compute-intensive loops in applications. The acceleration achieved by a CGRA relies on the CGRA compiler efficiently mapping those loops onto the array. Because the CGRA mapping problem is NP-complete, it is performed in two steps: scheduling and mapping. The scheduling algorithm allocates timeslots to the nodes of the data flow graph (DFG), and the mapping algorithm maps the scheduled nodes onto the processing elements (PEs) of the CGRA. On a mapping failure, the initiation interval (II) is increased, and a new schedule is obtained for the increased II. Most previous mapping techniques use the Iterative Modulo Scheduling (IMS) algorithm to find a schedule for a given II. Since IMS generates a resource-constrained ASAP (as-soon-as-possible) schedule, even with an increased II it tends to generate a similar schedule that is not mappable, and so it does not explore the schedule space effectively. The problems encountered by IMS-based scheduling algorithms are explored, and an improved randomized scheduling algorithm for the application loop to be accelerated is proposed.

When encountering a mapping failure for a given schedule, existing mapping algorithms either exit and retry the mapping anew, or recursively remove the previously mapped node to find a valid mapping (backtracking). Abandoning the mapping is extreme, but even backtracking may not be the best choice, since the root of the problem may not be the previous node. The challenges in existing algorithms are systematically analyzed and a failure-aware mapping algorithm is presented.

The loops in general-purpose applications are often complicated loops, i.e., loops with perfect and imperfect nests and loops with nested if-then-else constructs (conditionals). The existing hardware-software solutions to execute branches and conditions are inefficient.
A co-design approach that efficiently executes complicated loops on CGRA
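The schedule-then-map workflow summarized in the abstract (pick a schedule for a given II, try to map it, and raise the II on failure) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the dissertation's algorithm: it uses plain ASAP/ALAP bounds rather than resource-constrained ones, assumes unit-latency operations, derives the starting II from the resource bound alone, and takes the mapper as a hypothetical callback `try_map`. All function names are invented for illustration.

```python
import math
import random

def asap_times(dfg):
    """Earliest start per node: longest path from any source (unit latency)."""
    times = {}
    def visit(n):
        if n not in times:
            times[n] = 1 + max((visit(p) for p in dfg[n]), default=-1)
        return times[n]
    for n in dfg:
        visit(n)
    return times

def alap_times(dfg, length):
    """Latest start per node such that every successor still fits in `length` steps."""
    succs = {n: [] for n in dfg}
    for n, preds in dfg.items():
        for p in preds:
            succs[p].append(n)
    times = {}
    def visit(n):
        if n not in times:
            times[n] = min((visit(s) for s in succs[n]), default=length) - 1
        return times[n]
    for n in dfg:
        visit(n)
    return times

def random_modulo_schedule(dfg, num_pes, ii, rng):
    """Pick a random slot in [ASAP, ALAP] per node; at most num_pes ops per modulo slot."""
    asap = asap_times(dfg)
    length = max(asap.values()) + 1 + ii      # assumption: allow `ii` steps of slack
    alap = alap_times(dfg, length)
    used = [0] * ii                           # operations already placed in each modulo slot
    sched = {}
    for n in sorted(dfg, key=asap.get):       # predecessors always have a smaller ASAP
        lo = max([asap[n]] + [sched[p] + 1 for p in dfg[n]])
        slots = [t for t in range(lo, alap[n] + 1) if used[t % ii] < num_pes]
        if not slots:
            return None                       # this random attempt ran out of resources
        sched[n] = rng.choice(slots)
        used[sched[n] % ii] += 1
    return sched

def schedule_and_map(dfg, num_pes, try_map, max_ii=16, attempts=50, seed=0):
    """Retry random schedules at each II; increase the II only when all attempts fail."""
    rng = random.Random(seed)
    ii = math.ceil(len(dfg) / num_pes)        # resource-bound minimum II only
    while ii <= max_ii:
        for _ in range(attempts):
            sched = random_modulo_schedule(dfg, num_pes, ii, rng)
            if sched is not None and try_map(sched, ii):
                return ii, sched
        ii += 1
    return None

# Usage: a 5-node DFG (node -> list of predecessors) on a 2x2 array,
# with a stand-in mapper that accepts every schedule.
dfg = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"]}
result = schedule_and_map(dfg, num_pes=4, try_map=lambda sched, ii: True)
```

The point of the sketch is the control structure: several different schedules are tried at the same II before the II is raised, whereas a deterministic ASAP scheduler would hand the mapper essentially the same schedule each time.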
To my parents, Balasubramanian & Rama, to my wife, Shamini, and to all my well-wishers.
I would like to thank my advisor, Dr. Aviral Shrivastava, who has been a great mentor. His support during my toughest times has helped me get through them and strive. This thesis would not have been possible without his guidance. I would like to thank Dr. Chaitali Chakrabarti, Dr. Fengbo Ren, and Dr. Laura Pozzi for supervising my thesis and for giving important ideas to develop my research and fine-tune the thesis.

I am grateful for the summer internships at Lawrence Berkeley National Laboratory (LBNL), where I had a chance to collaborate with Dr. Prabhat and Dr. Kris Bouchard. Their input and mentorship shaped my research direction and this thesis. Thanks to all the people that I have met in Berkeley, especially Brandon Cook, Maximilian Dougherty, Pratik Sachdeva, Dr. Trevor Ruiz, and Dr. Sharmodeep Bhattacharyya. A special thanks to Dr. Grzegorz Muszynski for his friendship and intellectual discussions.

I would like to thank my lab mates, Shail Dave, Moslem Didehban, Dheeraj Lokam, Edward Andert, Mohammadereza Mehrabian, and Mohammad Khayatian, who have been a great support and have provided a great research environment.

Finally, I would like to thank my family: my dad, who has been an emotional and financial support for three decades; my mom, Rama, for her prayers; my in-laws, Radhika and Rajaganesh, for believing in me; and, most importantly, my wife, Shamini, whose patience and love are unparalleled.
Table
3.1 Classification of Compiler Techniques for CGRAs
4.1 Benchmark Characteristics
4.2 Performance (II) Comparison Between IMS-based RAMP and CRIMSON (CRIM.) for Sizes 4×4 and 5×5. “X” Denotes That There Was No Mapping Obtained from RAMP. MII Denotes the Theoretical Minimum II
4.3 Performance (II) Comparison Between IMS-based RAMP and CRIMSON (CRIM.) for Sizes 6×6 and 7×7. “X” Denotes That There Was No Mapping Obtained from RAMP. MII Denotes the Theoretical Minimum II
4.4 Performance (II) Comparison Between IMS-based RAMP and CRIMSON (CRIM.) for 8×8 CGRA. “X” Denotes That There Was No Mapping Obtained from RAMP. MII Denotes the Theoretical Minimum II
5.1 PathSeeker Has a Better Compilation Compared to Graphminor and RAMP. NA Denotes the Loops for Which a Valid Mapping Was Not Obtained Within the 100,000 Seconds Threshold
5.2 Results Continued from Table 5.1
7.1 Performance Analysis Setup for UoI_LASSO and UoI_VAR
7.2 Randomized Data Distribution Design Improves the Data Read and Distribution Time Compared to Conventional Distribution Method. Beyond 1TB Data Set Size the Conventional Method’s Data Read Time Crossed Beyond 5 Hours Whereas Randomized Data Distribution Read Time Was Below 100 Seconds
8.1 R-type Instruction Format for CGRA
8.2 P-type Instruction Format for CGRA
8.3 Input Multiplexer Selection for PEs
8.4 Translation of LLVM IR Opcode to CCF Virtual Opcode
8.5 Translation of CCF Virtual Opcode to CCF Machine Opcode
Figure
3.4 (A) DFG of a Loop with Nodes a, b, e, f Memory Nodes (Load, Store Operations) Denoted by Darker Shade. (B) 2×2 CGRA Architecture with Double-bank Local Memory. (C) Mapping by EMS Causing Bank Conflict. (D) Mapping by High Throughput Mapping Technique Resolving the Bank Conflict
3.5 (A) Kernel Code. (B) 2×2 CGRA Architecture. (C) Corresponding DFG of the Loop. (D) Utilization of Registers for Routing with II=2
3.6 (A) A Simple Loop with If-then-else Conditional. (B) A 2×2 CGRA Target Architecture with 2 Registers in Each PE. (C) Partial Predication Adds Three Operations for e Inside the If-then-else Statement: e_t for the If-path, e_f for the Else-path, and S, a Select Operation to Select Between the If and Else Paths Based on the cmp Result. (D) A Valid Mapping Obtained with II=3
3.7 (A) A Simple Loop with If-then-else Conditional. (B) A 2×2 CGRA Target Architecture with 2 Registers in Each PE. (C) PSB Fuses the If-path, Else-path, and Select Operation from Partial Predication to Form a Single e Operation. (D) A Valid Mapping Obtained with II=2, Where the cmp Outcome Is Communicated to the IFU. (E) To Facilitate the Issue of Only the Correct Path, the Instruction Is Laid out in the Instruction Memory. If the cmp Is True, IFU Slot 2 Instructions Are Issued and Executed, Whereas If the cmp Is False, IFU Slot 2 Is Skipped and IFU Slot 3 Is Issued and Executed
4.2 Overview of Scheduling and Mapping Workflow of Previous Techniques
4.1 (A) DFG of an Application Loop. (B) A 2×2 CGRA Target Architecture. (C) Column 1 Shows the Nodes in the DFG and Column 2 Shows an IMS Schedule for the Nodes at II=MII=3. (D) The Mapping Algorithm Tries to Map the Scheduled Nodes, but Fails Due to Additional Routing Nodes “r” Required to Route Nodes f and i. On Failure to Find a Valid Mapping, the II Is Increased to 4 and IMS Is Called Again to Schedule the Nodes Based on the Workflow Given in Figure 4.2. (E) IMS Schedule for an Increased II (II=4). (F) Even at an Increased II, the Mapping Algorithm Cannot Find a Valid Mapping Due to a Resource Constraint at tI+1 Which Is Not Resolved at II=4 and Will Not Be Resolved on Any Further Increase in II
4.3 An Overview of the CRIMSON Workflow, with Addition of RC_ASAP and RC_ALAP Computation, Randomized Scheduling Algorithm, and a Feasibility Test (Shaded Blocks in the Image Are the Proposed Methods)
4.4 (A) The DFG of the Motivation Example. (B) A 2×2 CGRA Architecture. (C) For Each Node of the DFG, Resource-Constrained ASAP (Column 2) and Resource-Constrained ALAP (Column 3) Are First Calculated. Then a Random Schedule Time Between RC_ASAP and RC_ALAP Is Chosen for Each Node. A Valid Randomized Modulo Schedule Is Shown in Column 4. (D) With the CRIMSON Schedule, a Valid Mapping Is Achieved by the Mapping Algorithm at II=3
6.1 (A) A Simple Loop to Be Accelerated on CGRA. (B) Flattened 2×2 CGRA Where Each PE Has 2 Registers. (C) A Loop with an If-then-else. (D) Data Flow Graph (DFG) of the Loop with Partial Predication. (E) Mapping of DFG on 2×2 CGRA with II=3
6.2 (A) A Loop with Nested Conditional. (B) DFG Using Partial Predication Results in 31 Nodes. Nodes h and g Represent Conditions x%i== and y%i==1
6.3 (A) An Imperfectly Nested Loop with Cond1 and Cond2 Conditions. (B) Flattening Converts (A) into a Single-level Loop with Conditionals with New Cond3 and Cond4
6.4 (A) DFG Obtained from LASER-compiler for Loop of Fig 6.2. Nodes from Multiple If-paths and Else-path to a Single Node. If Such Path Is Absent, Balancing No-ops Are Added and a Node Such as a_o Preserves the Old Value. (B) 2×2 CGRA Where Each PE Has 2 Registers. (C) Mapping with II = 4. (D) Instructions Are Selectively Issued During the Execution of the Kernel
6.5 LASER-Architecture to Accelerate Complex Loops. PEs Do Not Have a Predicate Network. Branch Outcome Is Communicated to the IFU to Issue Instructions Selectively Based on the Path Taken at Runtime
6.6 LASER Reduces Nodes by 43.43%
6.7 LASER Reduces Energy by 46%
6.8 LASER Is a Scalable Solution with 40.91% Cumulative Geomean Reduction in II Compared to Partial Predication
7.1 (A) A Three-tier (T0, T1 and T2) Distribution Strategy for Randomized Distribution of Data Set Across the Number of Samples from the HDF5 Data File to the Cores of KNL. (B) Model Selection: LASSO-ADMM Is Used to ‘solve’ and Intersection Operation Is Used as ‘reduce’ to Select Family of Support S_j. (C) Data Randomization for Cross Validation Where Tier 2 Random Distribution Is Employed to Randomly Reshuffle the Data. (D) Model Estimation: OLS Is Used to ‘solve’ and Union Operation Is Used to ‘reduce’ to Get an Optimally Predictive Model
7.2 UoI_LASSO Runtime Number Using Intel-MKL Linear Algebra Library with B1 = B2 = 5 and q = 8
7.3 Exploiting PB and Pλ Parallelism by Increasing the Data Set and ADMM Cores by a Factor of 2
7.4 Weak Scaling Plot of UoI_LASSO. The Problem Size per Node Was Kept Fixed
7.5 Tmin & Tmax Plot for UoI_LASSO
7.6 Strong Scaling Plot of UoI_LASSO. The Problem Size Was Kept Fixed at 1TB
7.7 UoI_VAR Single Node with B1 = B2 = 5 and q = 8
7.8 Exploiting Algorithmic Parallelism of UoI_VAR
7.9 Weak Scaling Plot of UoI_VAR in Logarithmic Scale. The Problem Size per Node Was Kept Fixed
7.10 Strong Scaling Plot of UoI_VAR. The Problem Size Was Kept Fixed at 1TB
Chapter 1
INTRODUCTION
The advancement of the Internet and of data-collecting devices has increased the demand for high-performance, low-power computing alternatives. All mobile devices collect, process, and communicate data. Analyzing the collected data to extract meaningful information is compute-intensive [1] and often limited by thermal, power, and resource constraints [2]. Efficiency in accelerating the compute-intensive sections is now being achieved through custom accelerators, e.g., Application-Specific Integrated Circuits (ASICs); spatial architectures such as Eyeriss [3], DianNao [4], EIE [5], and MAERI [6] for deep learning applications [7]; NERO [8], SODA [9], and the Associative Processor [10] for accelerating stencil computation; and SODA [11] for Software Defined Radio applications.

Due to the immense influx of data, the performance of the applications analyzing the data is of utmost importance [12]. Existing compilers and architectures exploit data-level parallelism by vectorizing the applications. The resulting parallel code, which follows the Single Instruction Multiple Data (SIMD) model, is executed on vector units of variable width. Theoretically, the speedup achieved by such SIMD vector units should grow in proportion to the vector width. In practice, the speedup is restricted to highly parallel applications, and many applications with control-flow divergence gain little acceleration from SIMD units [13].

Applications must be tuned to a particular architecture framework to achieve maximum performance [12]. Graphics Processing Units (GPUs) are successful in accelerating both graphics-based and general-purpose applications. Extensive research is being carried out to handle the data-dependence and control-flow problems on GPUs. The downside of GPU accelerator technology is that it relies on the programmer's ability to choose the kernel of the application to be accelerated on the GPU and to program it specifically for the hardware. OpenCL [14] and CUDA [15] are the widely used parallel languages for accelerating kernels on GPUs. Accelerating an application using the GPU framework is therefore not simple [16, 17, 18].

Along with GPUs, Field Programmable Gate Arrays (FPGAs) are popular for accelerating applications at low power. For power-critical systems, FPGAs have been a promising accelerator solution. With the increasing use of data centers and edge computing devices, FPGAs have permeated domain-specific acceleration in areas such as artificial intelligence and high-performance computing [19, 20]. Like GPUs, FPGAs suffer from programmability issues: programmers need to write their applications in Hardware Description Languages (HDLs) [21].

Although custom accelerators can arguably achieve the highest acceleration efficiency (performance/power), using them may result in poor code portability. If the accelerator changes in the next generation of hardware, the code base needs to be ported to the new accelerator technology. There has been a surge of application- and domain-specific accelerators such as Eyeriss [3], DianNao [4], and Marvel [22]. Their compiler and hardware are optimized for performance, power, area, and energy for a particular application domain such as Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs). As formulated in Marvel [22], the input application loops must satisfy the following constraints: (1) the loops are perfectly nested and contain no conditional statements; (2) the perfectly nested loops have no anti, flow, or output dependencies; and (3) the loops can be freely reordered for compiler optimizations such as tiling and loop reordering.
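To make the three loop constraints concrete, the sketch below contrasts a loop nest that satisfies them with one that does not. The loop bodies are invented purely for illustration (they are not taken from Marvel or from this dissertation), and the function names are hypothetical.

```python
def conforming(a, b):
    """Perfect nest, no conditionals, no loop-carried dependences."""
    n, m = len(a), len(a[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            out[i][j] = a[i][j] * b[i][j]   # each iteration writes its own element
    return out

def conforming_interchanged(a, b):
    """Same computation with the i and j loops swapped: the result is unchanged."""
    n, m = len(a), len(a[0])
    out = [[0] * m for _ in range(n)]
    for j in range(m):
        for i in range(n):
            out[i][j] = a[i][j] * b[i][j]
    return out

def non_conforming(a, threshold):
    """Imperfect nest with an if-then-else and a loop-carried flow dependence."""
    n, m = len(a), len(a[0])
    row_sums = [0] * n
    running = 0
    for i in range(n):
        running += a[i][0]                  # statement outside the inner loop (imperfect nest);
                                            # `running` carries a value across i iterations
        for j in range(m):
            if a[i][j] > threshold:         # conditional inside the loop body
                row_sums[i] += a[i][j]
            else:
                row_sums[i] -= 1
        row_sums[i] += running
    return row_sums

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(conforming(a, b))            # [[5, 12], [21, 32]]
print(non_conforming(a, 2))        # [-1, 11]
```

The first loop can be interchanged or tiled freely because no iteration reads a value written by another. The second violates constraint (1) through its conditional and imperfect nesting and constraint (2) through the dependence on `running`.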
General-purpose applications have compute-intensive loops that do not conform to these restrictions. Accelerating them is challenging as they may contain conditional