Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications


1 Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu He, Pei Zhang, and Huazhong Yang 2013/01/23

2 Outline: Introduction; Overview; MILP-Based Solution; Heuristic Solution; Experimental Evaluation; Conclusions and Future Work

3 Introduction: Background. Application complexity increasing; well-developed software libraries; low speed, high power; MPSoCs architecture; hardware design complexity

4 Introduction: Background (cont'd). How to rapidly design hardware from existing software algorithms? This challenge is not new. However, the design gap is ever increasing. [Figure: HW design gap over time (from the mid-1990s): technology capabilities in gates/chip, following Moore's law, versus HW design productivity as the synthesis level of EDA tools progresses from physical level through gate level and RTL level to system level [29]]

5 Introduction: Motivation. C2RTL tools are promising: a number of C2RTL tools exist, with many success stories: a DVB-SH turbo decoder [8], a face detection system [9], 3G/4G MIMO wireless systems [7]


6 Introduction: Motivation (cont'd). However, state-of-the-art C2RTL tools suffer from: low quality of results (QoR) for large C programs; limited system-level optimization options. Examples: a Reed-Solomon decoding [28], a JPEG encoder [10]. [Table: flatten approach vs. hierarchical approach vs. speedup; rows: Cycles (42,475,202 vs. 4,070, x) and Clock (x); remaining values lost in extraction]



8 Overview: Our work. Given: a large C program for a streaming application; system constraints (latency, area, ...). Determine: how to partition the code into pipelined blocks (partition); which blocks should be parallelized (parallelization). The objectives: improve synthesis result quality; provide more system-level optimization options.

9 Overview: Design flow. Inputs: the C programs that need to be synthesized, plus throughput and area constraints. STEP 1: Extract the parameters of the N functions (we use eXCite here). STEP 2: Determine partition and parallelization (optimize partition and block-level parallelization). STEP 3: Synthesize each block with a C2RTL tool (eXCite). STEP 4: Construct the complete system (assemble the modules into a single design). [Figure: functions F1 ... FN are partitioned into Block1 ... Blockm, with Block1 replicated as Block1-1 and Block1-2 by block-level parallelization; structure of the final system: a controller plus processing elements PE1 (Module 1-1, Module 1-2), PE2 (Module 2), ..., PEm (Module m), connected by FIFOs]

10 Overview: An example. Given a C program in the straight-line style: main() { F1(a,b); F2(b,c); F3(c,d); F4(d,e); F5(e,f); F6(f,g); F7(g,h); F8(h,i); } Given constraints: system throughput and area. Partition: which functions should be synthesized together as one pipeline stage. Parallelization: which synthesized modules should be parallelized. Synthesized HDL: Module 1 (from F1, F2, F3), Module 2 (from F4), Module 3 (from F5, F6), Module 4 (from F7, F8), connected by FIFOs.

11 Overview: Challenges. The design space is large: partition has a great impact on throughput and area; parallelization has a great impact on throughput and area. It is important to consider partition and parallelization simultaneously: the constraints are for the system after both partition and parallelization; if optimizing them separately, it is not clear how to apply the constraints to each problem individually. [Figure: the Pareto-optimal solutions for a GSM case, W. BLP and W.O. BLP; x-axis: latency (r_all^-1), y-axis: area (a_all)]
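To get a feel for how large the design space is, the sketch below counts the (partition, parallelization) pairs of a straight-line program. It assumes that blocks are contiguous runs of functions and that every block picks its degree independently from 1 to a cap; the function name `count_designs` and the cap `max_degree` are illustrative and not from the slides.

```python
from itertools import product

def count_designs(n_funcs: int, max_degree: int) -> int:
    """Count (partition, parallelization) pairs for n_funcs functions in a row."""
    total = 0
    # Each of the n_funcs - 1 gaps between adjacent functions is a cut (1) or not (0).
    for cuts in product((0, 1), repeat=n_funcs - 1):
        n_blocks = 1 + sum(cuts)
        # Every block independently picks a degree in 1..max_degree (assumed cap).
        total += max_degree ** n_blocks
    return total

print(count_designs(8, 1))  # partitions only: 128
print(count_designs(8, 4))  # with degrees 1..4 per block: 312500
```

Even for the eight-function example, allowing up to four copies per block already gives several hundred thousand candidates under these assumptions.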

12 Overview: Related work
Work | Application | Input | Target | Partition | Parallelization
A. Hagiescu et al., DAC 2009 [11] | Stream | StreamIT | MSoPC | Manually | Heuristic
J. Cong et al., DATE 2012 [12] | Stream | C | FPGA | Manually | ILP
Y. Liu et al., Intech book [13] | Stream | C | FPGA | Manually | Heuristic
Y. Hara et al., IEICE [14] | General | C | FPGA | ILP | N/A
This work | Stream | C | FPGA | Both MILP and heuristic (considered simultaneously)
A somewhat related line of work is mapping C programs to MPSoCs (software mapping). There, blocks (or tasks) can be assigned to the same processor, and the processor area is given.

13 Overview: Our Contribution. A novel MILP-based formulation: finds a partition and parallelization solution with maximum throughput or minimum area while satisfying a given area or throughput constraint, respectively. An efficient heuristic algorithm: overcomes the scalability challenge facing the MILP formulation. Validation of the proposed methods: developing FPGA-based accelerators for seven streaming applications.


15 MILP-Based Solution: Formulation. Given: function parameters (Para), i.e. the area and throughput of each function. Determine (x_n): which functions should be clustered to form blocks; which blocks should be parallelized. Objective: min. area a_all(x_n, Para) or max. throughput r_all(x_n, Para). Subject to: area constraints (a_all < A_req); throughput constraints (r_all > R_req); connectivity constraints.

16 MILP-Based Solution: Variables. We use {x_n} ∈ Z to represent partition and parallelization. Partition: if x_n = 0, F_n and F_{n+1} are in the same block. Parallelization: if x_n ≠ 0, the parallelism degree of the block with F_n is x_n. We also use {y_{i,j}} ∈ Binary to represent partition: y_{i,j} = 1 means F_i, F_{i+1}, ..., F_j are clustered. [Figure: for F1 to F7, {x_n} = {0, 2, 1, 0, 1, 0, 3} yields the blocks {F1, F2} with two copies, {F3}, {F4, F5}, and {F6, F7} with three copies]
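The encoding can be checked with a few lines of Python. This is a minimal sketch: the helper names `decode` and `to_y` are illustrative, and it assumes the last entry of {x_n} is non-zero so that the final block is closed.

```python
def decode(x):
    """Turn the vector {x_n} into a list of (first, last, degree) blocks, 1-indexed."""
    blocks = []
    first = 1
    for n, degree in enumerate(x, start=1):
        if degree != 0:  # F_n closes a block with this parallelism degree
            blocks.append((first, n, degree))
            first = n + 1
    if first != len(x) + 1:
        raise ValueError("last entry must be non-zero so the final block is closed")
    return blocks

def to_y(blocks):
    """Binary view: y[(i, j)] = 1 iff F_i..F_j are clustered into one block."""
    return {(i, j): 1 for i, j, _ in blocks}

blocks = decode([0, 2, 1, 0, 1, 0, 3])
print(blocks)        # [(1, 2, 2), (3, 3, 1), (4, 5, 1), (6, 7, 3)]
print(to_y(blocks))  # {(1, 2): 1, (3, 3): 1, (4, 5): 1, (6, 7): 1}
```

The result matches the slide's example: F1 and F2 form a block with two copies, F3 stands alone, F4 and F5 share a block, and F6 and F7 form a block with three copies.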

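17 MILP-Based Solution: Details
To calculate throughput r_all(x_n, Para):
$$ r_{all} \le r_{i,j} \quad \text{if } y_{i,j} = 1 \tag{1} $$
$$ r_{i,j} = \begin{cases} x_j y_{i,j} / T_{i,j} & \text{if } x_j y_{i,j} \le P_{i,j} \\ 1 / \max\{T^{in}_{i,j},\, T^{out}_{i,j}\} & \text{otherwise} \end{cases} \tag{2} $$
To calculate area a_all(x_n, Para):
$$ a^{le/mem}_{all} = \sum_{i=1}^{N} \sum_{j=i}^{N} \left( a^{le/mem}_{fifo} + (x_j - 1)\, O^{le/mem} + x_j A^{le/mem}_{i,j} \right) y_{i,j} \tag{3} $$
Connectivity constraints:
$$ x_n \ge 1 \ \text{when } \sum_{i=1}^{n} y_{i,n} \ne 0, \qquad x_n = 0 \ \text{otherwise} \tag{4} $$
$$ \sum_{i=1}^{j-1} y_{i,j-1} = \sum_{i=j}^{N} y_{j,i} \quad \forall j \in [2, N] \tag{5} $$
$$ \sum_{i=1}^{j} y_{i,j} + \sum_{i=j}^{N} y_{j,i} - y_{j,j} \le 1 \quad \forall j \in [1, N] \tag{6} $$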
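As a rough illustration, equations (1) to (3) can be evaluated for one fixed candidate solution. The sketch rests on an assumed reading of the slide: a block's throughput grows as x/T until its degree reaches P, after which it is capped by 1/max(T_in, T_out); the system throughput is that of the slowest block; and the area adds, per block, a FIFO term, an overhead per extra copy, and x copies of the block area. The class `Block`, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    x: int        # parallelism degree of the block
    T: float      # processing time of one copy (cycles per item)
    P: int        # degree beyond which I/O limits the block (assumed meaning of P)
    T_in: float   # input transfer time
    T_out: float  # output transfer time
    A: float      # area of one copy

def block_rate(b: Block) -> float:
    # Assumed reading of equation (2).
    if b.x <= b.P:
        return b.x / b.T
    return 1.0 / max(b.T_in, b.T_out)

def system_rate(blocks) -> float:
    # Equation (1): the slowest selected block bounds the system throughput.
    return min(block_rate(b) for b in blocks)

def system_area(blocks, a_fifo: float, overhead: float) -> float:
    # Assumed reading of equation (3), for one resource type (le or mem).
    return sum(a_fifo + (b.x - 1) * overhead + b.x * b.A for b in blocks)

candidate = [
    Block(x=2, T=100, P=4, T_in=10, T_out=10, A=500),
    Block(x=1, T=40, P=4, T_in=10, T_out=10, A=300),
]
print(system_rate(candidate))
print(system_area(candidate, a_fifo=50, overhead=20))
```

An MILP solver searches over all candidates at once; this evaluator only scores one of them, which is enough to sanity-check a solver's output by hand.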


19 Heuristic Solution: Overview. Motivation: MILP is not scalable; bad feasible regions may incur long running times even when N is small. Consider partition and parallelization separately (constructive algorithm): parallelization before partition to increase throughput: Incx(); partition for the given parallelization to reduce area: Clust(). Implement Incx() and Clust() in a backtracking, iterative way.

20 Heuristic Solution: Algorithm. [Flowchart nodes: "Do Incx() until R_req is satisfied"; "Clust()"; "Calculate r_all and a_all"; decision "Does a_all violate A_req?" (Yes/No); "Incx()"; "Backtrack to last parallelization strategy"; decision "Is this situation considered yet?" (Yes/No); "Done"] Incx(): parallelization before partition to increase throughput. Clust(): partition for the given parallelization to reduce area.
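The first box of the flow, "Do Incx() until R_req is satisfied", might look like the sketch below. It assumes each function n has a known processing time and that its throughput scales linearly with its degree (degree / time), ignoring I/O saturation; the name `incx`, the `max_degree` cap, and the numbers are illustrative only. The backtracking part of the flow is not modeled.

```python
def incx(times, r_req, max_degree=64):
    """Raise the degree of the bottleneck function until the rate target is met."""
    degrees = [1] * len(times)

    def rate(n):
        # Assumed model: throughput scales linearly with the parallelism degree.
        return degrees[n] / times[n]

    while min(rate(n) for n in range(len(times))) < r_req:
        bottleneck = min(range(len(times)), key=rate)  # slowest function
        if degrees[bottleneck] >= max_degree:
            raise ValueError("throughput target unreachable within max_degree")
        degrees[bottleneck] += 1
    return degrees

# Three functions taking 100, 40 and 250 cycles per item; target 0.015 items/cycle.
print(incx([100, 40, 250], r_req=0.015))  # [2, 1, 4]
```

Only the current bottleneck is ever duplicated, so the degrees stay as small as this simple model allows before Clust() is asked to merge functions and recover area.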

21 Heuristic Solution: Algorithm (cont'd). Incx(), parallelization before partition: increase the parallelization degree of the bottleneck function. Clust(), partition under the given parallelization: model the blocks and their connections as a graph; convert the problem to a shortest path problem. [Figure: graph for three functions with a Begin node, nodes B_{i,j} and A_{i,j} for 1 ≤ i ≤ j ≤ 3, and an END node]
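One way to picture the shortest-path view of Clust() is a dynamic program over a small DAG. In this sketch, node k means "F_1..F_k are already placed" and an edge from i-1 to j stands for the block F_i..F_j. That graph is a simplification of the B/A node graph on the slide, and the cost table is made up for illustration; in the real flow the edge weights would come from the synthesized area of each candidate block.

```python
import math

def clust(n_funcs, block_cost):
    """Min-cost split of F_1..F_n into contiguous blocks via a DAG shortest path."""
    best = [math.inf] * (n_funcs + 1)  # best[k]: cheapest way to place F_1..F_k
    prev = [0] * (n_funcs + 1)         # prev[k]: node before the last block
    best[0] = 0.0
    for j in range(1, n_funcs + 1):
        for i in range(1, j + 1):
            cost = best[i - 1] + block_cost(i, j)  # edge (i-1 -> j) is block F_i..F_j
            if cost < best[j]:
                best[j], prev[j] = cost, i - 1
    if math.isinf(best[n_funcs]):
        raise ValueError("no feasible partition")
    blocks, k = [], n_funcs
    while k > 0:  # walk the predecessor links back to the Begin node
        blocks.append((prev[k] + 1, k))
        k = prev[k]
    return best[n_funcs], blocks[::-1]

# Hypothetical block areas for three functions (not taken from the slides).
area = {(1, 1): 5, (2, 2): 4, (3, 3): 6, (1, 2): 7, (2, 3): 12, (1, 3): 20}
print(clust(3, lambda i, j: area[(i, j)]))  # (13.0, [(1, 2), (3, 3)])
```

Returning `math.inf` from `block_cost` for a block that would break the throughput target removes that edge from the graph, which is how a constraint can be folded into the same search.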


23 Experiments: Setup. 7 benchmarks [21]: ADPCM, JPEG encoder/decoder, AES encryption/decryption, GSM, Filter Groups. Environment: C2RTL: eXCite; logic synthesis: Quartus II (Cyclone II); simulation: ModelSim. Flow: eXCite C2RTL tool (modeling); our solution (optimize partition and parallelization); eXCite C2RTL tool (implement hardware); Altera Quartus tool (area evaluation); Mentor ModelSim tool (throughput evaluation).

24 Experiments: Validate proposed method. Min. area for the GSM case: heuristic solutions differ from the MILP results by 2.3% on average.

25 Exp.: Validate proposed method (cont'd). Min. area for 7 benchmarks: heuristic solutions show a difference of 7.5% on average.

26 Experiments: Running time. The heuristic solutions are worse by 7.2% on average.


28 Conclusions and Future Work. Conclusions: our work adopts a hierarchical framework with automatic C-code partition and block-level parallelization; both an MILP-based solution and a heuristic solution are proposed; experimental results obtained from seven real applications show that our approaches are effective. Future work: extend the solution to C programs with feedback; take power into consideration.

29 References. [1]-[27] are listed in the paper. [28] "Comparison of high level design methodologies for algorithmic IPs: Bluespec and C-based synthesis," Ph.D. dissertation, MIT, 2009. [29] ITRS roadmap on Design, 2011 Edition.

30 THANK YOU!

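31 MILP-Based Solution: Linearization
Linearize x_j y_{i,j} by introducing z_{i,j} = x_j y_{i,j}:
$$ -M y_{i,j} \le z_{i,j} \le M y_{i,j} $$
$$ x_j - M(1 - y_{i,j}) \le z_{i,j} \le x_j + M(1 - y_{i,j}) $$
Linearize Equation (1):
$$ r_{all} \le r_{i,j} + M(1 - y_{i,j}) \quad \forall\, 1 \le i \le j \le N $$
Linearize Equation (2):
$$ r_{i,j} \le z_{i,j} / T_{i,j}, \qquad r_{i,j} \le 1 / \max\{T^{in}_{i,j},\, T^{out}_{i,j}\} $$
Linearize Equation (4):
$$ \sum_{i=1}^{n} y_{i,n} \le x_n \le M \sum_{i=1}^{n} y_{i,n}, \qquad x_n \in \mathbb{N},\ y_{i,j} \ \text{binary} $$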
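The product linearization can be sanity-checked by brute force. The sketch assumes the usual big-M reading of the two constraint pairs, -M·y ≤ z ≤ M·y and x - M(1 - y) ≤ z ≤ x + M(1 - y), with 0 ≤ x ≤ M; the helper name `z_bounds` and the value M = 16 are illustrative.

```python
def z_bounds(x, y, M):
    """Interval allowed for z by the two big-M constraint pairs."""
    lo = max(-M * y, x - M * (1 - y))
    hi = min(M * y, x + M * (1 - y))
    return lo, hi

M = 16
for x in range(0, M + 1):   # parallelism degree, bounded by M
    for y in (0, 1):        # block selected or not
        lo, hi = z_bounds(x, y, M)
        # The feasible interval collapses to the single point z = x * y.
        assert lo == hi == x * y
print("z = x * y is the only feasible value for 0 <= x <= M")
```

The check also shows why M must be at least the largest degree considered: for x greater than M the interval becomes empty when y = 1.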

32 Exp.: Validate proposed method (cont'd). Min. area or max. throughput for GSM.
