Resource Constrained Scheduling in Hardware Description Languages, Study notes of Design and Analysis of Algorithms

These notes cover the behaviour of operations, assignment statements, and control constructs in hardware description languages (HDLs) such as SystemC, with a focus on scheduling in high-level synthesis: resource constrained scheduling (RCS) and scheduling algorithms such as ASAP, ALAP, and list scheduling. They also cover type functions, resource constraints, and partial orders in scheduling.

Typology: Study notes

2012/2013

Uploaded on 04/29/2013

awais
Module-II
Lecture-I
Introduction to HLS: Scheduling, Allocation and Binding Problem
1. Introduction
In Lecture 1 of Module 1, we discussed that any VLSI design starts with specifications, and that the first step is to obtain the Register Transfer Level (RTL) circuit. The RTL circuit is obtained from the specifications using High Level Synthesis (HLS) algorithms. Since the specifications are processed by HLS algorithms, they need to be represented in some modeling language. In Lecture 2, we discussed the Control and Data Flow Graph (CDFG), which is one of the most widely accepted modeling paradigms for specifications processed by HLS tools. Finally, in Lecture 3 we discussed transformation techniques on CDFGs, which lead to efficient circuit implementations in terms of area, frequency, power, etc. HLS takes the optimized CDFG as input, performs Scheduling, Allocation and Binding, and generates the RTL design. In this module we will study algorithms pertaining to these steps: Scheduling, Allocation, and Binding. To start with, this lecture introduces HLS and the problem definitions of Scheduling, Allocation and Binding.
2. Introduction to HLS
A behavioural description (i.e., a functional specification) is the starting point for HLS. It specifies the behaviour in terms of operations, assignment statements, and control constructs in a Hardware Description Language (HDL) (e.g., SystemC [1], VHDL [2] or Verilog [3]). Figure 1 illustrates a typical HLS flow.


[Figure 1. Steps of HLS. Flow: functional specification, compilation, intermediate representation, scheduling, scheduled design, allocation (of registers and FUs), binding (variables to registers, operations to FU mapping), data-path and controller generation, RTL.]

The first step in HLS is compilation of the HDL description and its transformation into an internal representation. Most HLS techniques use the Control and Data Flow Graph (CDFG) as this representation, because it contains both the data flow and the control flow. This step also includes a series of compiler-like optimizations such as dead code elimination and redundant expression elimination. Further, it applies hardware-library-specific transformations such as the use of incrementers instead of adders and shifters instead of multipliers. It may be noted that we studied these compilation and transformation steps in the last module. These steps are sometimes called the pre-processing phase for HLS, after which the optimized CDFG is provided to the HLS engine. Some of the literature, however, includes these pre-processing steps in the HLS procedure itself.

3. Scheduling Problem

The scheduling problem involves determining the sequence in which the operations are executed to produce a control step schedule, which specifies the operations that execute in each control step.

Let $O$ be the set of all operations to be scheduled, which are obtained from the HDL code. If there is an operation $o_j \in O$ which depends on the result of another operation $o_i \in O$, then $o_i$ must finish its execution before operation $o_j$ can begin. In such a case we say that there is a data dependency between the two operations $o_i$ and $o_j$, and $o_i$ is an immediate predecessor of $o_j$. A data dependency results in a precedence constraint between the two dependent operations in scheduling. In other words, an operation can be scheduled only after all its predecessors are scheduled.

For any HLS platform, there exists a module library comprising circuits for different functionalities such as adders, multipliers, registers, etc. Further, the library also has information regarding different parameters of the modules, namely frequency, area, power, etc. Let $T$ be the set of different types of modules that are available. For a given operation $o$, the type of the operation is determined by a type function $Ty: O \to T$; $Ty(o) = t$ implies that operation $o$ can be executed on a module of type $t$.

Based on the above basic formulations we will discuss the following four types of scheduling problems:
i. Un-Constrained Scheduling (UCS) problem
ii. Time Constrained Scheduling (TCS) problem
iii. Resource Constrained Scheduling (RCS) problem
iv. Time-Resource Constrained Scheduling (TRCS) problem

Now we elaborate on each of these types using the simple example expression “(a+b+c+d)*e”.
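The formulation above can be made concrete for this expression with a small Python sketch. The dictionary names (`Ty`, `preds`) and the dependency structure (o1 and o2 feed o3, which feeds o4) are assumptions read off the running example, not notation defined by the lecture; `respects_partial_order` is a hypothetical helper.

```python
# Illustrative model of (a+b+c+d)*e. All identifiers here are assumptions
# introduced for this sketch.
O = ["o1", "o2", "o3", "o4"]                 # three additions, one multiplication
T = {"t1": "adder", "t2": "multiplier"}      # module types in the library
Ty = {"o1": "t1", "o2": "t1", "o3": "t1", "o4": "t2"}   # type function Ty: O -> T

# Immediate predecessors (data dependencies); this defines the partial order.
preds = {"o1": [], "o2": [], "o3": ["o1", "o2"], "o4": ["o3"]}


def respects_partial_order(step, preds):
    """True if every operation runs in a later control step than all of its
    immediate predecessors."""
    return all(step[p] < step[o] for o in preds for p in preds[o])


# One candidate assignment of control steps to operations.
candidate = {"o1": 1, "o2": 1, "o3": 2, "o4": 3}
print(respects_partial_order(candidate, preds))
```

Keeping the precedence relation as a predecessor map makes the "schedule only after all predecessors" rule a one-line check.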

3.1 Unconstrained Scheduling (UCS) problem

Given: A set of operations $O$, a set $T$ of different types of functional modules, a type function $Ty: O \to T$ and a partial order on $O$ determined by the precedence constraints.
Find: A feasible schedule for all elements in $O$, taking appropriate modules from $T$ and obeying the partial order.

[Figure 1. Unconstrained Scheduling for (a+b+c+d)*e. Step 1: $o_1$ and $o_2$ on adders ($t_1$), producing temp1 and temp2; Step 2: $o_3$ on an adder ($t_1$), producing temp3; Step 3: $o_4$ on the multiplier ($t_2$), producing out.]

Figure 1 illustrates an unconstrained schedule for the expression (a+b+c+d)*e. As the schedule is unconstrained, we only need to ensure that all elements in $O$ are scheduled, appropriate modules from $T$ are taken and the partial order is maintained. In the above example, there are four operations (3 additions denoted as $o_1, o_2, o_3$ and 1 multiplication denoted as $o_4$), all of which are scheduled. Let the library have two types of resources, adders (denoted as $t_1$) and multipliers (denoted as $t_2$). It may be noted that appropriate modules from $T$ are taken: $o_1, o_2, o_3$ are assigned to $t_1$ (i.e., an adder is assigned to the addition operations) and $o_4$ is assigned to $t_2$ (i.e.,

Find: A feasible schedule for all elements in O , taking appropriate modules from T , meeting the resource constraints for each functional module type and obeying the partial order.

[Figure 2. Resource Constrained Scheduling for (a+b+c+d)*e. Step 1: $o_1$ on the adder ($t_1$); Step 2: $o_2$ on the adder ($t_1$); Step 3: $o_3$ on the adder ($t_1$); Step 4: $o_4$ on the multiplier ($t_2$), producing out.]

Figure 2 illustrates a resource-constrained schedule, involving one adder and one multiplier, for the expression (a+b+c+d)*e. As the schedule is resource-constrained, we need to ensure that all elements in $O$ are scheduled, appropriate modules from $T$ are taken, the partial order is maintained and resource utilization does not cross the limit. As there is one adder module (for $o_1, o_2, o_3$) and one multiplier module (for $o_4$), we cannot schedule $o_1$ and $o_2$ in the same control step. So we schedule $o_1$ in step 1 and $o_2$ in step 2. To maintain the partial order, $o_3$ is scheduled in step 3 and $o_4$ is scheduled in step 4; it may be noted that these operations cannot be scheduled earlier. Therefore, the number of control steps is 4. Because the resource constraint must be met, we cannot have a schedule in 3 steps as in Figure 1.

3.4 Time Resource Constrained Scheduling (TRCS) problem

Given: A set of operations $O$, a set $T$ of different types of functional modules, a type function $Ty: O \to T$, a time constraint (deadline) $D$, resource constraints $max_k$, $1 \le k \le |T|$, one for each functional module type $t_k$, and a partial order on $O$ determined by the precedence constraints.
Find: A feasible schedule for all elements in $O$, taking appropriate modules from $T$, meeting the deadline $D$, meeting the resource constraints for each functional module type and obeying the partial order.

In time-resource constrained scheduling, we need to meet both the timing and the resource constraints. For example, the schedule of Figure 1 is a valid time-resource constrained schedule if the deadline is 3 (or more) control steps and the resource constraint is 2 adders and 1 multiplier. Similarly, the schedule of Figure 2 is a valid time-resource constrained schedule if the deadline is 4 (or more) control steps and the resource constraint is 1 adder and 1 multiplier.
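The two feasibility claims above can be checked mechanically. The following is a minimal sketch, assuming the dependency structure of the running example (o1 and o2 feed o3, which feeds o4); `check_schedule` and its parameter names are hypothetical, introduced only for illustration.

```python
from collections import Counter


def check_schedule(step, preds, op_type, max_units=None, deadline=None):
    """Return a list of violated constraints; an empty list means feasible."""
    problems = []
    # Precedence: each predecessor must sit in a strictly earlier step.
    for op, ps in preds.items():
        for p in ps:
            if step[p] >= step[op]:
                problems.append(f"precedence: {p} must finish before {op}")
    # Resources: count modules of each type needed in each control step.
    if max_units is not None:
        usage = Counter((step[op], op_type[op]) for op in step)
        for (s, t), n in sorted(usage.items()):
            if n > max_units[t]:
                problems.append(f"resource: step {s} uses {n} units of {t}")
    # Time: the last occupied control step must not exceed the deadline.
    if deadline is not None and max(step.values()) > deadline:
        problems.append(f"deadline: needs {max(step.values())} steps")
    return problems


preds = {"o1": [], "o2": [], "o3": ["o1", "o2"], "o4": ["o3"]}
op_type = {"o1": "t1", "o2": "t1", "o3": "t1", "o4": "t2"}
fig1 = {"o1": 1, "o2": 1, "o3": 2, "o4": 3}   # schedule of Figure 1
fig2 = {"o1": 1, "o2": 2, "o3": 3, "o4": 4}   # schedule of Figure 2

print(check_schedule(fig1, preds, op_type, {"t1": 2, "t2": 1}, deadline=3))
print(check_schedule(fig2, preds, op_type, {"t1": 1, "t2": 1}, deadline=4))
print(check_schedule(fig1, preds, op_type, {"t1": 1, "t2": 1}, deadline=3))
```

The first two calls report no violations, while the third shows why the Figure 1 schedule cannot be used with a single adder: step 1 would need two adders.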

lowest frequency because system clock frequency depends on the critical path [5]. In general, a multiplier has much higher area and power requirements compared to an adder. Also, the frequency of a multiplier is lower compared to that of an adder [5]. So, in the example of Figure 1, the time period of each control step would be $1/F_{t_2\text{-}S}$.

Now, if we have no area and power constraints, then we would use $t_1\text{-}F$ for $t_1$ and $t_2\text{-}F$ for $t_2$. The time period of each control step is then $1/F_{t_2\text{-}F}$. But, in general, $F_{t_2\text{-}F} < F_{t_1\text{-}S}$; the frequency of a fast multiplier is generally lower than that of even a slow adder. So in spite of allocating fast adders to $o_1, o_2, o_3$ (consuming high area and power), the time period of the control step is $1/F_{t_2\text{-}F}$, which does not depend on $F_{t_1\text{-}S}$ or $F_{t_1\text{-}F}$.

So we can use slow adders without any compromise in the overall time period of operation (i.e., the time period of a control step). However, use of a slow multiplier would compromise the overall time period of operation. Therefore, allocation basically allows tradeoffs between performance (in terms of operating frequency) and area-power overheads.
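The argument above can be illustrated numerically. This is a sketch only: the frequencies are made-up values chosen so that even the fast multiplier is slower than the slow adder, and the names `freq_mhz` and `control_step_period_ns` are hypothetical. The suffixes S and F are read as the slow and fast library versions of a module type, which is how the surrounding text uses them.

```python
# Hypothetical module library. Frequencies (MHz) are invented for this sketch,
# chosen so that F(t2-F) < F(t1-S), as the text says holds in general.
freq_mhz = {"t1-S": 400.0, "t1-F": 800.0, "t2-S": 100.0, "t2-F": 200.0}


def control_step_period_ns(allocation):
    """Control step period is set by the slowest allocated module."""
    slowest = min(freq_mhz[module] for module in allocation.values())
    return 1000.0 / slowest   # period in ns for a frequency in MHz


fast_everything = {"t1": "t1-F", "t2": "t2-F"}
slow_adders = {"t1": "t1-S", "t2": "t2-F"}
slow_multiplier = {"t1": "t1-F", "t2": "t2-S"}

for name, alloc in [("fast adders, fast multiplier", fast_everything),
                    ("slow adders, fast multiplier", slow_adders),
                    ("fast adders, slow multiplier", slow_multiplier)]:
    print(name, control_step_period_ns(alloc), "ns")
```

With these assumed numbers, swapping fast adders for slow ones leaves the period unchanged, whereas a slow multiplier doubles it.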

5. Binding

After all the operations are scheduled and allocation is done, we know the exact types of circuit modules (from the design library) to be used and their numbers. However, this is not enough to generate the RTL. We have seen in the allocation step that operations in the same control step are performed by different modules, whereas modules are shared between operations (of the same type) that are in different control steps. For example, in the schedule of Figure 1, an adder module will be shared between $o_1$ and $o_3$ or between $o_2$ and $o_3$. Due to sharing, in addition to operational modules (adders, multipliers, etc.), we need multiplexers. Further, to store variables (a, b, c, d, e in the example of Figure 1) and intermediate results (temp1, temp2, temp3 in the example of Figure 1) we need registers. Like operational modules, registers can be shared among variables that are not needed in the same control step. All the above-mentioned steps (after scheduling and allocation) fall under Binding.

The binding task (also called the resource-sharing step) assigns the operations and variables to hardware modules. A resource such as an operational module or register can be shared by different operations, data accesses, or data transfers if they are mutually exclusive. For example, two operations assigned to two different control steps are mutually exclusive since they will never execute simultaneously; hence, they can be bound to the same hardware unit. Binding can be classified into three sub-functions:

Storage binding: This step assigns input, output and temporary variables to register units. Two variables that are not alive simultaneously (i.e., not required in overlapping control steps) can be assigned to the same register.

Functional-unit binding: This step assigns operations to operational modules (adders, multipliers, etc.). Two operations of the same type that are not in the same control step can be assigned to the same operational module.

Interconnection binding: This step assigns an interconnection unit such as a multiplexer or a bus to a data transfer.
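The storage-binding rule (variables with non-overlapping lifetimes may share a register) can be sketched with a simple first-fit assignment. This is one possible heuristic chosen for illustration, not a method prescribed by the lecture. The lifetimes below are an assumption: each variable is taken to occupy a register only in the control step where it is read, for the three-step schedule of (a+b+c+d)*e.

```python
def bind_registers(lifetimes):
    """First-fit storage binding.

    lifetimes maps a variable to (first_step, last_step), the inclusive range
    of control steps in which its value must be held in a register.
    Returns a map from variable to a 1-based register index.
    """
    registers = []   # registers[i] holds the (variable, first, last) entries
    binding = {}
    for var, (first, last) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for idx, held in enumerate(registers):
            # The register is usable if no variable already in it overlaps.
            if all(last < f or first > l for _, f, l in held):
                held.append((var, first, last))
                binding[var] = idx + 1
                break
        else:
            registers.append([(var, first, last)])
            binding[var] = len(registers)
    return binding


# Assumed lifetimes for the running example.
lifetimes = {
    "a": (1, 1), "b": (1, 1), "c": (1, 1), "d": (1, 1),
    "temp1": (2, 2), "temp2": (2, 2),
    "temp3": (3, 3), "e": (3, 3),
}
binding = bind_registers(lifetimes)
print(binding)
```

Under these assumptions, first-fit needs four registers for the eight variables. Several valid assignments exist, and which variables end up sharing a register depends on the order in which they are considered.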

Although listed separately, the three sub-functions are intertwined and should be carried out concurrently for optimal results. Now, we will illustrate Binding for the schedule of Figure 1 when the allocation is two modules of type $t_1\text{-}S$ for $o_1, o_2, o_3$ and one module of type $t_2\text{-}F$ for $o_4$.

• Binding of $o_1$ to adder1 and $o_2$ to adder2 (functional unit binding)
• Binding of $o_3$ to adder1 (functional unit binding)
• Binding of a, temp1, temp3 to register1 (storage binding)
• Binding of b, temp2 to register2 (storage binding)
• Binding of c to register3 (storage binding)
• Binding of d, e to register4 (storage binding)
• Binding of $o_4$ to multiplier1 (functional unit binding)

Because registers are shared among variables, we need multiplexers (shown in Figure 3), which is interconnect binding. Figure 4 illustrates the variables that are in the registers and the operations that are present in the functional modules at control step 1. The values of the multiplexer select lines are also shown; these comprise the control signal values at step 1.

[Figure 4. Configuration at Control Step 1 (control1 = 0, control2 = 0): Register1 and Register2 hold a and b, Register3 holds c, Register4 holds d; the adders produce temp1 and temp2.]

• control1 is 0, thereby binding a in register1 and b in register2
• control2 is 0, thereby binding d in register4
• Binding c to register3
• Binding of $o_1$ to adder1
• Binding of $o_2$ to adder2
Under this binding, adder1 generates temp1 and adder2 generates temp2.

[Figure 5. Configuration at Control Step 2 (control1 = 1, control2 = X): Register1 and Register2 hold temp1 and temp2; adder1 produces temp3.]

Figure 5 illustrates the variables that are in the registers and the operations that are present in the functional modules at control step 2. The values of the multiplexer select lines are also shown; these comprise the control signal values at step 2.
• control1 is 1, thereby binding temp1 in register1 and temp2 in register2
• control2 is X and adder2 is not used. In addition, register3 and register4 are not used.
• Binding of $o_3$ to adder1
Under this binding, adder1 generates temp3.

For the scheduling, allocation and binding considered in the running example, we have the following signal sequences for control1 and control2 in the three time steps.
• Step-1: control1 is 0 and control2 is 0
• Step-2: control1 is 1 and control2 is X
• Step-3: control1 is 1 and control2 is 1
We need to develop a sequential circuit having two output bits, "control1" and "control2", which take the values "00", "1X" and "11" at three consecutive clock edges. This circuit can easily be designed using the concept of state machine implementation [4]. This circuit is the controller, and the RTL for the data path is the circuit shown in Figure 3.
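The interplay of controller and data path can be simulated in a few lines. The sketch below is built on assumptions, because the data-path figure is not reproduced in the text: control1 is taken to select between the primary inputs (0) and the adder outputs (1) for register1 and register2, control2 to select between d (0) and e (1) for register4, register3 to be wired directly to c, and X to mean that register4 holds its value. The function name `run_datapath` is hypothetical.

```python
def run_datapath(a, b, c, d, e):
    """Simulate three control steps of an assumed data path for (a+b+c+d)*e."""
    # Controller: a three-state machine emitting (control1, control2).
    # None stands for the don't-care value X.
    control_sequence = [(0, 0), (1, None), (1, 1)]

    r1 = r2 = r3 = r4 = 0      # register contents
    add1 = add2 = mul = 0      # functional-unit outputs
    for control1, control2 in control_sequence:
        # Clock edge: multiplexers steer the register inputs.
        new_r1 = a if control1 == 0 else add1
        new_r2 = b if control1 == 0 else add2
        if control2 == 0:
            r4 = d
        elif control2 == 1:
            r4 = e
        # control2 is None (X): register4 keeps its old value.
        r1, r2, r3 = new_r1, new_r2, c
        # Combinational functional units compute within the control step.
        add1 = r1 + r2      # adder1: o1 in step 1, o3 in step 2
        add2 = r3 + r4      # adder2: o2 in step 1, unused afterwards
        mul = r1 * r4       # multiplier1: o4, meaningful in step 3 only
    return mul


print(run_datapath(1, 2, 3, 4, 5))
```

Under these assumptions, register1 holds a, then temp1, then temp3 across the three steps, so the multiplier output after step 3 equals (a+b+c+d)*e.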

6. Conclusions

In this lecture, we have introduced and formalized the problem of HLS and the sub-tasks therein: scheduling, allocation and binding. Further, we have illustrated each sub-step using examples. From the discussion in this lecture, it may be noted that among the three steps, the allocation step is the one that mainly depends on technology. The other two steps are independent of technology. Further, in terms of the complexity of the algorithms required to solve the three sub-steps, allocation can be accomplished with simpler procedures than the other two. Broadly speaking, for allocation, once the control step duration is fixed, for a given operation we need to select a module of the corresponding functionality from the design library that can do the computation within that duration. If the library has more than one module for a given operation that meets the control step duration, we select the one with minimum area. So algorithms for allocation are simple and based on searching. In the next lectures of this module, we will focus mainly on algorithms that solve the other two sub-steps: scheduling and binding. The next (double) lecture focuses on algorithms to automate scheduling.

Question and Answers

Question: Among the three sub-steps of HLS (scheduling, allocation and binding), which can be done without information regarding the design library?

Answer: Scheduling and binding can be done without information regarding the design library. Scheduling assigns control steps to all operations in the CDFG while satisfying the data dependencies between the operations, subject to constraints such as the number of steps, the number of modules, etc. None of these parameters is related to the design library. In binding, operations and variables are attached to circuit modules, which are selected from the design library during the allocation phase. As the circuit modules have already been selected from the design library during allocation, binding can work without any further information from the design library.

Module-II

Lecture-II and III

Scheduling Algorithms

1. Introduction

In Lecture 1 of Module 2, we discussed that High Level Synthesis (HLS) involves three sub-parts, namely scheduling, allocation and binding. In this lecture, we will discuss scheduling algorithms, which automatically assign control steps to operations subject to design constraints. As discussed in Lecture 1 of this module, the scheduling problem can be of four types, namely unconstrained, time constrained, resource constrained and time-resource constrained.

Many algorithms have been proposed in the literature to solve these four types of scheduling problem. These algorithms can be classified into two types: exact and heuristic. Exact algorithms, such as Integer Linear Programming for scheduling, provide an optimal schedule but consume high processing time. In practical cases, these exact algorithms for HLS take a prohibitive amount of execution time. To address the execution time issue, several algorithms based on greedy strategies have been developed. They make a series of local decisions, selecting at each point the single "best" operation-control step pairing without backtracking or look-ahead. So they may miss the globally optimal solution; however, they produce results quickly, and those results are generally sufficiently close to optimal to be acceptable in practice. Such algorithms are called heuristic algorithms (for HLS). Examples of heuristic algorithms for HLS include As Soon As Possible (ASAP), As Late As Possible (ALAP), List Scheduling (LS) and Force Directed Scheduling (FDS).

ASAP, ALAP and FDS are time constrained scheduling algorithms, while LS is a resource constrained scheduling algorithm. In this (double) lecture, we will discuss all these scheduling algorithms.

2. As Soon As Possible Scheduling

As-Soon-As-Possible (ASAP) scheduling is one of the simplest scheduling algorithms used in HLS. In ASAP scheduling, first the maximum number of control steps that are allowed is determined. Following that, the algorithm schedules each operation, one at a time, into the earliest possible control step. In other words, ASAP algorithm schedules operations in the earliest possible control step, subject to satisfying the partial order, i.e., an operation is scheduled if and only if all its predecessors are scheduled in earlier control steps. If ASAP algorithm can schedule all the operations within the allowed number of control steps, scheduling is successful. It may be noted that ASAP algorithm does not consider any resource constraints. Algorithm for ASAP scheduling is given in Algorithm 1.

Algorithm 1: As Soon As Possible
Input: Operations $O$, maximum number of control steps $M$.
Output: Control step for each operation, status of scheduling.
Steps
for each operation $o_i \in O$ DO
    if $o_i$ has no immediate predecessors (i.e., computation from inputs)
        control_step($o_i$) = 1.   /* control_step($o_i$) indicates the control step into which operation $o_i$ is scheduled */
    else
        control_step($o_i$) = maximum(control_step($o_j$)) + 1, where $o_j \in \{o \mid o$ is an immediate predecessor of $o_i\}$.
END
If value of control_step($o_i$) $\le M$, $\forall o_i \in O$ then
    Status of scheduling is Successful.
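Algorithm 1 can be sketched in Python as follows. The pseudocode presumes that the predecessors of an operation already have control steps when the operation is visited; this sketch ensures that with a memoized recursive visit, which is an implementation choice made here rather than part of the algorithm as stated. The function name `asap_schedule` and the predecessor-map input format are assumptions, and the dependency graph must be acyclic.

```python
def asap_schedule(preds, max_steps):
    """ASAP scheduling sketch.

    preds maps each operation to the list of its immediate predecessors.
    Returns (control_step, successful), where control_step maps each
    operation to its control step and successful says whether every
    operation fits within max_steps.
    """
    control_step = {}

    def visit(op):
        # Schedule op once all of its predecessors have been scheduled.
        if op not in control_step:
            if not preds[op]:
                control_step[op] = 1     # computed directly from inputs
            else:
                control_step[op] = max(visit(p) for p in preds[op]) + 1
        return control_step[op]

    for op in preds:
        visit(op)
    successful = all(step <= max_steps for step in control_step.values())
    return control_step, successful


# Running example (a+b+c+d)*e, deliberately listed out of dependency order.
preds = {"o4": ["o3"], "o3": ["o1", "o2"], "o1": [], "o2": []}
print(asap_schedule(preds, 3))
print(asap_schedule(preds, 2)[1])
```

For the running example this reproduces the three-step schedule of Figure 1, and it reports failure when only two control steps are allowed.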