ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining

Size: px

Start display at page:

Download "ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining"

Heather Hawkins
6 years ago
Views:

1 ECE 5775 (Fall 17) High-Level Digital Design Automation More Pipelining

2 Announcements HW 2 due Monday 10/16 (no late submission) Second round paper 5pm tomorrow on Piazza Talk by Prof. Margaret Martonosi 4:30pm today PHL 203 on End of Moore s Law Challenges and Opportunities: Computer Architecture Perspectives 5pm instructor office hour cancelled Midterm logistics Exam time: Thursday 10/19 in class (75 mins) Open book, open notes, closed Internet In-class recitation on Tuesday 10/17 Instructor office hour next Thursday moved to Wednesday 10/18, 4:30-5:30pm (Rhodes 320) 1

3 Agenda Midterm coverage Modulo scheduling Case studies on pipelining 2

4 Midterm (20%) Topics covered Specialized computing FPGAs C-based synthesis Control flow graph and SSA Scheduling Resource sharing Pipelining 3

5 Key Topics (1) Algorithm basics Time complexity, esp. big-o notation Graphs Trees, DAGs, topological sort BDDs, timing analysis FPGAs LUTs and LUT mapping C-based synthesis Arbitrary precision and fixed-point types Basic C-to-RTL mapping Key HLS optimizations to improve design performance 4

6 Key Topics (2) Compiler analysis Control data flow graph Dominance relation SSA Scheduling TCS & RCS algorithms: ILP, list scheduling, SDC Operation chaining, frequency/latency/resource constraints Ability to devise a simple scheduling algorithm to optimize certain design metric 5

7 Key Topics (3) Resource sharing Conflict and compatibility graphs Left edge algorithm Ability to determine minimum resource usage in # of functional units and/or registers, given a fixed schedule Pipelining Dependence types Ability to determine minimum II given a code snippet Modulo scheduling concepts: MII, RecMII, ResMII 6

8 Review: Dependence Graph Data dependences of a loop often represented by a dependence graph Forward edges: Intra-iteration (loopindependent) dependences Back edges: Inter-iteration (loop-carried) dependences Edges are annotated with distance values: number of iterations separating the two dependent operations involved [1] [0] v 1 [0] v 2 v 3 [0] [0] [2] Recurrence manifests itself as a circuit in the dependence graph v 4 Edges annotated with distance values 7

9 Pipelining through Modulo Scheduling Dependence graph: LD + Schedule: 0 1 LD + II = ST II = 2 Time (cycle) Iteration 0 LD + ST LD + ST ST LD + ST LD + ST LD ST + Initiation Interval (II) slot 0 slot 1 Steady state (II cycles) Steady state determines both performance resource usage 8

10 Modulo Scheduling A regular form of software pipelining technique Also applies to loop (/ function) pipelining for hardware synthesis Loop iterations (/ function) use the same schedule, which are initiated at a constant rate Advantages of modulo scheduling Easy to analyze Steady state determines performance and resource usage Cost efficient No code or hardware replication Optimization objective: (1) Minimize II under resource constraints or (2) minimize resource usage under II constraint NP-hard in general Optimal polynomial time solution exists without recurrences or resource constraints 9

11 Algorithmic Scheme for Modulo Scheduling Common scheme of heuristic algorithms Find MII (a lower bound on II) = max { ResMII, RecMII) Look for a schedule with given II If a feasible schedule not found, increase II and try again Find MII and set II = MII Look for a schedule Found it? No Increase II Yes 10

12 Calculating Lower Bound of Initiation Interval Minimum possible II (MII) MII = max(resmii, RecMII) A lower bound, not necessary achievable Resource constrained MII (ResMII) ResMII = max i éops(r i ) / Limit(r i )ù OPs(r): number of operations that use resource of type r Limit(r): number of available resources of type r Recurrence constrained MII (RecMII) RecMII = max i élatency(c i ) / Distance(c i )ù Latency(c i ): total latency in dependence circuit c i Distance(c i ): total distance in dependence circuit c i 11

13 Minimum II due to Resource Limits (ResMII) Dependence Reservation tables adders a0 a1 a2 a3 time i0 i1 i2 i3 i0 i1 i2 i0 i1 i0 4 5 i4 i5 i3 i4 i2 i3 i1 i2 0, 1, 2, 3, : time (clock cycles) a0, a1, a2, a3 : available adders i0, i1, i2, : loop iterations adders a0 a1 i0 i0 i1 i1 i0 i0 i2 i2 i3 i3 i1 i1 i2 i2 due to limited resources, cannot initiate iterations less than 2 cycles apart Compute ResMII: Max among all types of resources ResMII = max i éops(r i ) / Limit(r i )ù 12

14 Minimum II due to Recurrences (RecMII) Dependence Schedule (II=2) Dependence Schedule (II=1) [0] a b [1] [1] dependence distance = 1 a b a [0] a b [3] [3] dependence distance = 3 b Assume no operation chaining a b a b a b a b Compute Recurrence Minimum II (RecMII): Max among all circuits of: RecMII = max i élatency(c i ) / Distance(c i )ù Latency(c) : sum of operation latencies along circuit c Distance(c) : sum of dependence distances along circuit c 13

15 SDC-Based Modulo Scheduling SDC-based modulo scheduling Unifies intra-iteration and interiteration scheduling constraints in a single SDC Iterative algorithm with efficient incremental SDC update Loop Find Minimum II Model intra-iteration scheduling constraints Model inter-iteration scheduling constraints SDC feasible? Yes Incremental scheduling No Fail Increase II Schedule 14

16 Example: Modeling Loop-Carried Dependence The dependence between two operations from different iterations is termed inter-iteration (loop-carried) dependence Loop-carried dependence u à v with Dist(u, v) = K s u + Lat u s v + K* II A[i] C[i] ld ld v 1 v 2 II = 2 s 5 s 1 + 2*II for (i = 0; i < N-2; i++) { B[i] = A[i] * C[i]; A[i+2] = B[i] + C[i]; } v 3 + st v 4 v 5 ld v 1 v 3 ld v 2 Dist(v 5, v 1 ) = 2 A[i+2] + v 4 ld ld v 1 v 2 st v 5 v 3 15

17 SDC Constraint Graph The constraint inequalities form a system of difference constraints SDC(X, C) X denotes the variable set C denotes the constraint set The constraint graph G c (V c, E c ) of SDC(X, C) is a weighted directed graph x i Î X : v i Î V c x i x j b k Î C : e(v i, v j ) ÎE c and weight(e) = b k 16

18 Feasiblity Checking SDC Constraint Graph G c X: {x 1, x 2, x 3, x 4, x 5 } v 1 0 x 1 x 2 0 x 1 x 5-1 x 2 x 5 1 x 3 x 1-4 x 1 x 4 3 x 4 x 3-1 v 5-1 v v 2 x 3 x 1-4 x 1 x 4 3 x 4 x 3-1 à 0-2 Contradiction! v 3 An SDC(X, C) is feasible if and only if its constraint graph G c does not contain any negative cycles Feasibility check can be done by solving a single-source shortest path problem 17

19 Case Study: Prefix Sum Prefix sum computes a cumulative sum of a sequence of numbers commonly used in many applications such as radix sort, histogram, etc. void prefixsum ( int in[n], int out[n] ) out[0] = in[0]; for ( int i = 1; i < N; i++ ) { #pragma HLS pipeline II=? out[i] = out[i-1]+ in[i]; } } out[0] = in[0]; out[1] = in[0] + in[1]; out[1] = in[0] + in[1] + in[2]; out[1] = in[0] + in[1] + in[2] + in[3]; 18

20 Prefix Sum: RecMII Loop-carried dependence exists between to reads on out[] Assume chaining is not possible on memory reads (i.e., ld) and writes (i.e., st) due to target cycle time RecMII = 3 in[i] ld 2 out[i-1] ld 1 out[0] = in[0]; for ( int i = 1; i < N; i++ ) out[i] = out[i-1]+ in[i]; out[i] + st ld Load st Store cycle 1 cycle 2 cycle 3 cycle 4 i = 0 ld 1 ld 2 + st i = 1 ld 1 II = 1 + st ld 2 Assume chaining is not possible on memory reads (i.e., ld) and writes (i.e., st) due to cycle time constraint 19

21 Prefix Sum: Code Optimization Introduce an intermediate variable tmp to hold the running sum from the previous in[] values Shorter dependence circuit leads to RecMII = 1 in[i] ld + tmp int tmp = in[0]; for ( int i = 1; i < N; i++ ) { tmp += in[i]; out[i] = tmp; } out[i] st ld Load st Store cycle 1 cycle 2 cycle 3 cycle 4 i = 0 ld + st i = 1 ld + st II = 1 20

Case Study: Convolution for Image Processing A common computation of image/video processing is performed over overlapping stencils, termed as

22 Case Study: Convolution for Image Processing A common computation of image/video processing is performed over overlapping stencils, termed as convolution Input image frame x3 convolution Output image frame 21

23 Achieving High Throughput with Pipelining for (r = 1; r < R; r++) for (c = 1; c < C; c++) { #pragma HLS pipeline II=? for (i = 0; i < 3; i++) for (j = 0; j < 3; j++) out[r][c] += img[r+i-1][c+j-1] * f[i][j]; } Inner loops (i & j) are automatically unrolled With a 3x3 convolution kernel, 9 pixels are required for calculating the value of one output pixel If the entire input image is stored in an on-chip buffer with two read ports ResMII =? What about RecMII? 22

two pixels from line buffer New pixel fetched from input stream or frame

24 Achieving II=1 for 3x3 Convolution Pixels in line buffer (2 lines stored) Push three pixels into shift register window: one new pixel plus two pixels from line buffer New pixel fetched from input stream or frame buffer in off-chip memory Output pixel produced by one convolution operation 23

25 Resulting Specialized Memory Hierarchy Memory architecture customized for convolution Input pixel stream Flip- Flops Convolve Output pixel stream Processing window Line buffers Frame buffers On-chip SRAMs Frame n-2 Frame n-1 Frame n Off-chip DDR 24

26 HLS Code Snippet 1 LineBuffer<2,C,pixel_t> linebuf; 2 Window<3,3,pixel_t> window; 3 for (int r = 1; r < R+1; r++) { 4 for (int c = 1; c < C+1; c++) { 5 #pragma HLS pipeline II=1 6 pixel_t new_pixel = img[r][c]; 7 // Update shift window 8 window.shift_left(); 9 if (r < R && c < C) { 10 for (int i = 0; i < 2; i++ ) 11 window.insert(buf[i][c]); 12 } 13 else { // zero padding 14 for (int i = 0; i < 2; i++) 15 window.insert(0); 16 } 17 window.insert(new_pixel); 18 // Update line buffer 19 linebuf.shift_up(c); 20 if (r < R && c < C) 21 linebuf[1].insert(c, new_pixel); 22 else // Zero padding 23 linebuf[1].insert(c, 0); 24 // Perform 3x3 convolution 25 out[r-1][c-1] = convolve(window, weights); 26 } 27 } 25

27 Summary Pipelining is one of the most commonly used techniques in HLS to boost performance of the synthesized hardware Recurrences and resource restrictions limit the pipeline throughput Modulo scheduling A regular form of software pipeline technique Also applies to loop pipelining for hardware synthesis NP-hard problem in general 26

28 Acknowledgements These slides contain/adapt materials developed by Prof. Ryan Kastner (UCSD) Prof. Scott Mahlke (UMich) Dr. Stephen Neuendorffer (Xilinx) 27

Topic 12: Cyclic Instruction Scheduling

Topic 12: Cyclic Instruction Scheduling COS 320 Compiling Techniques Princeton University Spring 2015 Prof. David August 1 Overlap Iterations Using Pipelining time Iteration 1 2 3 n n 1 2 3 With hardware