Lecture 10: Vivado C to IP HLS. James C. Hoe Department of ECE Carnegie Mellon University

Size: px

Start display at page:

Download "Lecture 10: Vivado C to IP HLS. James C. Hoe Department of ECE Carnegie Mellon University"

Oliver Nichols
5 years ago
Views:

1 Lecture 10: Vivado C to IP HLS James C. Hoe Department of ECE Carnegie Mellon University F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017

2 Housekeeping Your goal today: learn how to tell Vivado HLS what you really want and understand what Vivado HLS is telling you Notices Handout #4: lab 2, due noon, 10/6 3.5 weeks to project proposal Readings Ch 15, The Zynq Book (skim Ch 14) Vivado Design Suite User Guide: High Level Synthesis (UG902) F17 L10 S2, James C. Hoe, CMU/ECE/CALCM, 2017

90% of the time Hare only gets to 90% quality delivers the design 10 times faster This

3 Tortoise Tortoise and Hare delivers exact optimal implementation to a fully specified objective (functional + tuning) perfection takes time say last 10% of quality takes up 90% of the time Hare only gets to 90% quality delivers the design 10 times faster This hare doesn t take a nap after one design F17 L10 S3, James C. Hoe, CMU/ECE/CALCM, 2017

4 The Design Race power hey, it works out of time 90% Good Enough Box educated guess best possible 1/perf F17 L10 S4, James C. Hoe, CMU/ECE/CALCM, 2017

5 Why the Hare Wins In real design projects don t always know exact target initially can t land first shot on target anyway good enough really is good enough hitting schedule is everything show at COMDEX in Nov or bust in Dec There are a lot more rabbits than turtles in this world; there are not enough turtles in this world Even more turkeys... but that s a different class F17 L10 S5, James C. Hoe, CMU/ECE/CALCM, 2017 All characters appearing in this story are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.

6 F17 L10 S6, James C. Hoe, CMU/ECE/CALCM, 2017 Vivado HLS

7 Function to IP, not Program to HW **Object of design is an IP module** Designer still in charge (garbage in, garbage out) specify functionality as algorithm (in C) specify structure as pragmas (beyond C) set optimization constraints (beyond C) Offload bit and cycle level design/opt. to tools Vivado HLS (formerly AutoESL; formerly UCLA) never mind all of C (what s main( )? what malloc?) never mind all usages of allowed subset (all loops okay, but static ones actually work well) what else beyond C might a HW designer need (types, interface, structural hints) F17 L10 S7, James C. Hoe, CMU/ECE/CALCM, 2017

8 What does Vivado see? int fibi(int n) { int last=1; int lastlast=0; int temp; if (n==0) return 0; if (n==1) return 1; for(;n>1;n--) { temp=last+lastlast; lastlast=last; last=temp; } } return temp; F17 L10 S8, James C. Hoe, CMU/ECE/CALCM, 2017

9 Function to IP Block n ap_clk ap_rst ap_start Don t look inside yet fibi ap_ready ap_done ap_idle What if I want multiple outputs? F17 L10 S9, James C. Hoe, CMU/ECE/CALCM, 2017 int fibi(int n) {.... return...; }

10 AP_CTRL_HS Block Protocol ap_clk I ap_rst ap_start ap_idle 1 1 O ap_ready ap_done F17 L10 S10, James C. Hoe, CMU/ECE/CALCM, 2017 inputs consumed output valid ready for new ap_start

11 Function Invocation: Latency vs Throughput minimum initiation interval latency start ready done start ready done start ready done F17 L10 S11, James C. Hoe, CMU/ECE/CALCM, 2017

12 Other Block Control Options ap_ctrl_chain separate input producer and output consumer ap_continue:driven by the consumer to backpressure the block and producer IF a block reaches done AND ap_continue is deasserted, the block will hold ap_done and keep output valid until ap_continue is asserted AXI compatible port interfaces software on ARM interacts with the block using fxn call like interfaces (input, output, start, etc.) IP specific.h and routines generated automatically F17 L10 S12, James C. Hoe, CMU/ECE/CALCM, 2017

13 Scalar I/O Port Timing By default (ap_none) input ports should be stable between ap_start and ap_ready output port is valid when ap_done 3 asynchronous handshake options on input ap_vldonly: consumes only if input valid ap_ackonly: signals back when input consumed ap_hs: ap_vld + ap_ack HLS s job to follow protocol F17 L10 S13, James C. Hoe, CMU/ECE/CALCM, 2017 n ap_vld ap_ack

14 Pass by Reference Arguments void fibi(int *n, int *fib) { int last=1; int lastlast=0; int temp; int nn=*n; if (nn==0) { *fib=0; *n=0; return; } if (nn==1) { *fib=1; *n=0; return; } for(;nn>1;nn--) { temp=last+lastlast; lastlast=last; last=temp; } } *fib=last; *n=lastlast; F17 L10 S14, James C. Hoe, CMU/ECE/CALCM, 2017

15 Pass by Reference I/O n_i ap_clk ap_rst ap_start Don t look inside yet fib n_o ap_ready ap_done ap_idle They are not really pointers do not evaluate *(fib+1) or fib except to pretend to be a fifo F17 L10 S15, James C. Hoe, CMU/ECE/CALCM, 2017 void fibi(int *n, int *fib) {.... *n in RHS and LHS; *fib in LHS only.... } used before assigned

16 All I/O Options Fig 1 49, Vivado Design Suite User Guide: High Level Synthesis F17 L10 S16, James C. Hoe, CMU/ECE/CALCM, 2017

17 Array Arguments #define N (1<<10) void D2XPY (double Y[N], double X[N]) { for(i=0; i<n; i++) { Y[i]=2*X[i]+Y[i]; } } X_q0[63:0] X_ce0 X_addr0[9:0] F17 L10 S17, James C. Hoe, CMU/ECE/CALCM, 2017 *could ask to use separate read and write ports Y_q0[63:0] Y_ce0 Y_we0 Y_addr0[9:0]

18 Array Arg Options By default, array args become BRAM ports array must be fixed size can use 2 ports for bandwidth or split read/write If array arg is accessed always consecutively AND only either read or written can become ap_fifo port i.e., no addresses, just push or pop Array args can also become AXI or a generic bus master ports Scheduler handles port sharing and dynamic delays F17 L10 S18, James C. Hoe, CMU/ECE/CALCM, 2017

19 Time to Look Inside n fibi ap_clk ap_rst ap_start ap_ready ap_done ap_idle F17 L10 S19, James C. Hoe, CMU/ECE/CALCM, 2017

20 MMM (yet again) void mmm(char A[N][N], char B[N][N], short C[N][N) { for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { C[i][j]=0; for(int k=0; k<n; k++) { C[i][j] += A[i][k]*B[k][j]; } } } keep it simple } F17 L10 S20, James C. Hoe, CMU/ECE/CALCM, 2017 N 2 by 8b BRAM N 2 by 8b BRAM BRAM Rd BRAM Rd mmm BRAM Rd/Wr N 2 by 8b BRAM Same example as Zynq Book Tutorial 3

21 Structural Pragma: Pipelining Fully elaborate scope (e.g., unroll loops) Find minimum iteration interval (II) schedule II >= num stages a resource instance is used II >= RAW hazard distance E.g., to pipeline C[i][j]+=A[i][k]*B[k][j]; RAW hazard, II>=3 rd0 A rd0 B A*B rd0 C rd0 A rd0 B F17 L10 S21, James C. Hoe, CMU/ECE/CALCM, 2017 accum A*B rd0 C rd0 A rd0 B wr0 C accum A*B rd0 C rd0 A rd0 B wr0 C accum A*B rd0 C wr0 C accum structural conflict, II>=2 (II>=1 if 2 port) wr0 C

k=0; k<5; k++) { #pragma HLS PIPELINE C[i][j] += A[i][k]*B[k][j]; 18

22 HLS Analysis and Visualization // Zynq Book Tutorial 3, Sol#2 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { C[i][j]=0; for(int k=0; k<5; k++) { #pragma HLS PIPELINE C[i][j] += A[i][k]*B[k][j]; F17 L10 S22, James C. Hoe, CMU/ECE/CALCM, 2017 [Vivado HLS Screenshots]

23 Design by Trial and Error // Zynq Book Tutorial 3, Sol#3 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { C[i][j]=0; #pragma HLS PIPELINE for(int k=0; k<5; k++) { C[i][j] += A[i][k]*B[k][j]; F17 L10 S23, James C. Hoe, CMU/ECE/CALCM, 2017 [Vivado HLS Screenshots]

$Design by Trial and Error // Zynq Book Tutorial 3, Sol#4 #program HLS ARRAY_RESHAPE variable=a, dim=2 #program HLS ARRAY_RESHAPE variable=b, dim=1 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) {$

24 Design by Trial and Error // Zynq Book Tutorial 3, Sol#4 #program HLS ARRAY_RESHAPE variable=a, dim=2 #program HLS ARRAY_RESHAPE variable=b, dim=1 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { C[i][j]=0; #pragma HLS PIPELINE for(int k=0; k<5; k++) { C[i][j] += A[i][k]*B[k][j]; F17 L10 S24, James C. Hoe, CMU/ECE/CALCM, 2017 A and B reshaped to read entire row/column at a time? What if N>>5? [Vivado HLS Screenshots]

25 Recall from Last Time for(k= for(i= for(i= for(j= for(j= GET C[i][j] for(k= GET C[i][j] for(i= for(j= GET C[i][j] parallel kernel pipelines fully unrolled inner loops F17 L10 S25, James C. Hoe, CMU/ECE/CALCM, 2017

26 With Algo. Rewrite (Option 1) From here we can play with pragmas to sensibly widen concurrency if needed // assume C initialized to 0 for(int k=0; k<5; k++) for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { #pragma HLS PIPELINE C[i][j]+= A[i][k]*B[k][j]; can fix by disable flattening F17 L10 S26, James C. Hoe, CMU/ECE/CALCM, 2017 [Vivado HLS Screenshots]

27 With Algo. Rewrite (Option 2) for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { short Ctemp=0; for(int k=0; k<5; k++) #pragma HLS PIPELINE Ctemp += A[i][k]*B[k][j]; C[i][j]=Ctemp; can fix by disable flattening F17 L10 S27, James C. Hoe, CMU/ECE/CALCM, 2017 HLS figured out forwarding [Vivado HLS Screenshots]

28 Loop Unroll (full and partial) amortize loop control overhead increase loop body size, hence ILP and scheduling flexibility Loop Merge combine loop bodies of independent loops of same control improve parallelism and scheduling Loop Flatten streamline loop nest control reduce start/finish stutter F17 L10 S28, James C. Hoe, CMU/ECE/CALCM, 2017 Pragma Crib Sheet: Loops 4 iter 2iter (unroll by2) 2x (2 iter) fully unrolled 2+2 iter 2 iter merged 4 iter longer steadystate

29 Map multiple arrays in same BRAM no perf loss if no scheduling conflicts Reshape change BRAM aspect ratio to widen ports higher bandwidth on consecutive addresses Partition Pragma Crib Sheet: Arrays map 1 array to multiple BRAMs multiple independent ports if no bank conflicts addr data F17 L10 S29, James C. Hoe, CMU/ECE/CALCM, 2017 A lot more you can control; must read UG902

Design by Exploration reference algorithm & testbench algorithm for synthesis pragmas When this takes only minutes, a little trial anderror is okay (just a little!

30 Design by Exploration reference algorithm & testbench algorithm for synthesis pragmas When this takes only minutes, a little trial anderror is okay (just a little!!!!) co simulation validation HLS & analysis good enough yes no F17 L10 S30, James C. Hoe, CMU/ECE/CALCM, 2017 RTL RTL backend not good enough after backend

31 Putting it in context (from last time) Why hardware design is hard reason #1: low level abstraction reason #2: unrestricted design freedom reason #3: massive concurrency C to HW (i.e., C to RTL) compiler bridges the gap between functionality and implementation fill in the details below the functional abstraction make good decisions when filling in the details extract parallelism from a sequential specification Vivado does its part fast and without mistakes F17 L10 S31, James C. Hoe, CMU/ECE/CALCM, 2017

32 F17 L10 S32, James C. Hoe, CMU/ECE/CALCM, 2017 Parting Thoughts Vivado doesn t turn program into HW Vivado doesn t turn programmer into HW designer Multifaceted benefits to HW designer algo. development/debug/validate in SW pragma steering (no RTL hacking, machine tuning) fast analysis and visualization data type support it is about more than adding double to Verilog built in, stylized IP interfaces integration with the rest of Vivado and Zynq!! We are entering a new era for FPGAs

33 Vivado Software Defined SoC Screenshot, page 24, SDSoC Environment Getting Started (UG1028) F17 L10 S33, James C. Hoe, CMU/ECE/CALCM, 2017

Lecture 10: Vivado C to IP HLS. Housekeeping

18 643 Lecture 10: Vivado C to IP HLS James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: learn how to tell Vivado