Lecture 10: Vivado C to IP HLS


James C. Hoe, Department of ECE, Carnegie Mellon University
F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017

Housekeeping
- Your goal today: learn how to tell Vivado HLS what you really want, and understand what Vivado HLS is telling you
- Notices
  - Handout #4: lab 2, due noon, 10/6
  - 3.5 weeks to project proposal
- Readings
  - Ch 15, The Zynq Book (skim Ch 14)
  - Vivado Design Suite User Guide: High-Level Synthesis (UG902)


Tortoise and Hare
- Tortoise
  - delivers the exact optimal implementation to a fully specified objective (functional + tuning)
  - perfection takes time: say the last 10% of quality takes up 90% of the time
- Hare
  - only gets to 90% quality
  - delivers the design 10 times faster
- This hare doesn't take a nap after one design

The Design Race
[Figure: design space with axes "power" and "1/perf"; labels: "educated guess", "hey, it works", "out of time", "best possible", "90% Good Enough Box"]

Why the Hare Wins
- In real design projects
  - don't always know the exact target initially
  - can't land the first shot on target anyway
  - good enough really is good enough
  - hitting schedule is everything: show at COMDEX in Nov or bust in Dec
- There are a lot more rabbits than turtles in this world; there are not enough turtles in this world
- Even more turkeys... but that's a different class
All characters appearing in this story are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.

Vivado HLS

Function to IP, not Program to HW
- **Object of design is an IP module**
- Designer still in charge (garbage in, garbage out)
  - specify functionality as algorithm (in C)
  - specify structure as pragmas (beyond C)
  - set optimization constraints (beyond C)
- Offload bit- and cycle-level design/opt. to tools
- Vivado HLS (formerly AutoESL; formerly UCLA)
  - never mind all of C (what's main()? what malloc?)
  - never mind all usages of the allowed subset (all loops okay, but static ones actually work well)
  - what else beyond C might a HW designer need (types, interface, structural hints)

What does Vivado see?

    int fibi(int n) {
      int last=1;
      int lastlast=0;
      int temp;
      if (n==0) return 0;
      if (n==1) return 1;
      for(;n>1;n--) {
        temp=last+lastlast;
        lastlast=last;
        last=temp;
      }
      return temp;
    }
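
The remark that "static" loops work well can be illustrated by rewriting fibi so its loop has a compile-time trip count. This is only a sketch: the name fibi_static, the bound MAX_N, and the clamp on n are assumptions introduced here and are not part of the lecture. The idea is that a fixed bound lets a tool reason about latency and unrolling, while the data-dependent part moves into a condition inside the loop body.

```c
/* Assumption: 46 is the largest n whose Fibonacci number fits in a
 * signed 32-bit int (fib(46) = 1836311903). */
#define MAX_N 46

/* Hypothetical static-bound variant of fibi: the loop always runs
 * MAX_N-1 iterations, and the body is guarded by a condition on n. */
int fibi_static(int n) {
    int last = 1;
    int lastlast = 0;

    if (n <= 0) return 0;
    if (n > MAX_N) n = MAX_N;   /* hypothetical clamp to avoid overflow */

    for (int i = 2; i <= MAX_N; i++) {
        if (i <= n) {           /* only the first n-1 iterations do work */
            int temp = last + lastlast;
            lastlast = last;
            last = temp;
        }
    }
    return last;
}
```

The cost of this style is that every call pays for MAX_N-1 iterations, so it trades worst-case latency for predictability.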

Function to IP Block
[Figure: block "fibi" (don't look inside yet) with input n and control signals ap_clk, ap_rst, ap_start, ap_ready, ap_done, ap_idle]
- What if I want multiple outputs?

AP_CTRL_HS Block Protocol

    int fibi(int n) { .... return ...; }

[Timing diagram: ap_clk, ap_rst, ap_start, ap_idle, ap_ready, ap_done; annotations: "inputs consumed", "output valid", "ready for new ap_start"]

Function Invocation: Latency vs Throughput
[Timing diagram: three overlapped invocations, each marked start, ready, done; labels: "minimum initiation interval", "latency"]

Other Block Control Options
- ap_ctrl_chain: separate input producer and output consumer
  - ap_continue: driven by the consumer to backpressure the block and producer
  - if a block reaches done and ap_continue is deasserted, the block will hold ap_done and keep output valid until ap_continue is asserted
- AXI-compatible port interfaces
  - software on ARM interacts with the block using function-call-like interfaces (input, output, start, etc.)
  - IP-specific .h and routines generated automatically
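
The latency vs throughput distinction above can be made concrete with a little arithmetic. The sketch below assumes an idealized block that accepts a new start every ii cycles and asserts done latency cycles after each start; total_cycles is a hypothetical helper name, and real blocks may stall on I/O.

```c
/* Idealized cycle count for n back-to-back invocations of a block.
 * Assumptions: a new invocation starts every `ii` cycles and each one
 * finishes `latency` cycles after it starts. */
long total_cycles(long latency, long ii, long n) {
    if (n <= 0) return 0;
    /* The first result arrives after `latency` cycles; each further
     * invocation adds one initiation interval. */
    return latency + ii * (n - 1);
}
```

For example, with latency 10, one hundred invocations take 109 cycles at an initiation interval of 1 but 1000 cycles when the interval equals the latency, which is why the two numbers are reported separately.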

Scalar I/O Port Timing
- By default (ap_none)
  - input ports should be stable between ap_start and ap_ready
  - output port is valid when ap_done
- 3 asynchronous handshake options on input
  - ap_vld only: consumes only if input valid
  - ap_ack only: signals back when input consumed
  - ap_hs: ap_vld + ap_ack
- HLS's job to follow protocol
[Figure: input n with ap_vld and ap_ack]

Pass by Reference Arguments

    void fibi(int *n, int *fib) {
      int last=1;
      int lastlast=0;
      int temp;
      int nn=*n;
      if (nn==0) { *fib=0; *n=0; return; }
      if (nn==1) { *fib=1; *n=0; return; }
      for(;nn>1;nn--) {
        temp=last+lastlast;
        lastlast=last;
        last=temp;
      }
      *fib=last;
      *n=lastlast;
    }


Pass by Reference I/O
[Figure: block "fibi" (don't look inside yet) with ports n_i, n_o, fib and control signals ap_clk, ap_rst, ap_start, ap_ready, ap_done, ap_idle]

    void fibi(int *n, int *fib) { .... }

- *n in RHS and LHS; *fib in LHS only
- .... used before assigned
- They are not really pointers
  - do not evaluate *(fib+1) or fib
  - except to pretend to be a fifo

All I/O Options
- Fig 1-49, Vivado Design Suite User Guide: High-Level Synthesis
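
To see how a single pointer argument ends up as both an input and an output, the pass-by-reference fibi can be exercised from plain C. The wrapper call_fibi below is a hypothetical helper introduced only for illustration; it mimics the split of n into the n_i and n_o ports named on the slide.

```c
/* fibi from the slides: *n is read and then written; *fib is only written. */
void fibi(int *n, int *fib) {
    int last = 1;
    int lastlast = 0;
    int temp;
    int nn = *n;                      /* read side of n (n_i) */

    if (nn == 0) { *fib = 0; *n = 0; return; }
    if (nn == 1) { *fib = 1; *n = 0; return; }

    for (; nn > 1; nn--) {
        temp = last + lastlast;
        lastlast = last;
        last = temp;
    }
    *fib = last;
    *n = lastlast;                    /* write side of n (n_o) */
}

/* Hypothetical wrapper: makes the input value and output value of n
 * explicit, the way the generated block exposes them as two ports. */
void call_fibi(int n_i, int *n_o, int *fib) {
    int n = n_i;
    fibi(&n, fib);
    *n_o = n;
}
```

Calling call_fibi with n_i = 10 leaves 55 in fib and 34 (the previous Fibonacci number) in n_o.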

Array Arguments

    #define N (1<<10)
    void D2XPY(double Y[N], double X[N]) {
      for(int i=0; i<N; i++) {
        Y[i]=2*X[i]+Y[i];
      }
    }

[Figure: X ports X_q0[63:0], X_ce0, X_addr0[9:0]; Y ports Y_q0[63:0], Y_ce0, Y_we0, Y_addr0[9:0]]
* could ask to use separate read and write ports

Array Arg Options
- By default, array args become BRAM ports
  - array must be fixed size
  - can use 2 ports for bandwidth or split read/write
- If an array arg is always accessed consecutively AND is only either read or written
  - can become an ap_fifo port
  - i.e., no addresses, just push or pop
- Array args can also become AXI or generic bus master ports
- Scheduler handles port sharing and dynamic delays
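
As a sketch of how these options might be requested, the D2XPY example can carry INTERFACE pragmas. The choice of modes here is an assumption made for illustration: X is only read, in consecutive order, so it is a candidate for ap_fifo, while Y is both read and written and so stays a memory port. The pragma spelling follows my reading of UG902 and should be checked against the guide; an ordinary C compiler simply ignores these pragmas.

```c
#define N (1 << 10)

/* Y = 2*X + Y, element by element. */
void D2XPY(double Y[N], double X[N]) {
/* Assumed interface choices, not taken from the lecture: */
#pragma HLS INTERFACE ap_fifo port=X    /* read-only, consecutive access */
#pragma HLS INTERFACE ap_memory port=Y  /* read and written: keep a BRAM port */
    for (int i = 0; i < N; i++) {
        Y[i] = 2 * X[i] + Y[i];
    }
}
```

Because the pragmas do not change the C semantics, the same function can still be compiled and checked in software before synthesis.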

Time to Look Inside
[Figure: block "fibi" with input n and control signals ap_clk, ap_rst, ap_start, ap_ready, ap_done, ap_idle]

MMM (yet again)

    void mmm(char A[N][N], char B[N][N], short C[N][N]) {
      for(int i=0; i<N; i++) {
        for(int j=0; j<N; j++) {
          C[i][j]=0;
          for(int k=0; k<N; k++) {
            C[i][j] += A[i][k]*B[k][j];
          }
        }
      }
    }

- keep it simple
- Same example as Zynq Book Tutorial 3
[Figure: block "mmm" with two "BRAM Rd" ports and one "BRAM Rd/Wr" port, each connected to a BRAM labeled "N^2 by 8b BRAM"]


Structural Pragma: Pipelining
- Fully elaborate scope (e.g., unroll loops)
- Find minimum iteration interval (II) schedule
  - II >= num stages a resource instance is used
  - II >= RAW hazard distance
- E.g., to pipeline C[i][j]+=A[i][k]*B[k][j];
[Figure: schedule of overlapped iterations with stages rd0 A, rd0 B, rd0 C, A*B, accum, wr0 C; annotations: "RAW hazard, II>=3" and "structural conflict, II>=2 (II>=1 if 2 port)"]

HLS Analysis and Visualization

    // Zynq Book Tutorial 3, Sol#2
    for(int i=0; i<5; i++) {
      for(int j=0; j<5; j++) {
        C[i][j]=0;
        for(int k=0; k<5; k++) {
    #pragma HLS PIPELINE
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

[Vivado HLS Screenshots]
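
The two lower bounds on II from the pipelining slide can be combined into a small calculation. This is a simplified model, not what the Vivado HLS scheduler actually computes: min_ii and its parameters are hypothetical names, and the model considers a single shared resource and a single loop-carried dependence.

```c
/* Ceiling of a/b for positive integers. */
static int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

/* Simplified lower bound on the initiation interval:
 *  - structural bound: uses of a resource per iteration, divided by the
 *    number of instances (e.g., BRAM ports) available
 *  - recurrence bound: RAW hazard distance in cycles
 * The achievable II can be no smaller than the larger of the two. */
int min_ii(int uses_per_iter, int instances, int raw_distance) {
    int structural = ceil_div(uses_per_iter, instances);
    return structural > raw_distance ? structural : raw_distance;
}
```

With the numbers on the slide, one read and one write of C on a single port give a structural bound of 2 (1 with two ports), and the accumulate dependence gives a RAW bound of 3, so min_ii(2, 1, 3) is 3.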


Design by Trial and Error

    // Zynq Book Tutorial 3, Sol#3
    for(int i=0; i<5; i++) {
      for(int j=0; j<5; j++) {
        C[i][j]=0;
    #pragma HLS PIPELINE
        for(int k=0; k<5; k++) {
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

[Vivado HLS Screenshots]

Design by Trial and Error

    // Zynq Book Tutorial 3, Sol#4
    #pragma HLS ARRAY_RESHAPE variable=A, dim=2
    #pragma HLS ARRAY_RESHAPE variable=B, dim=1
    for(int i=0; i<5; i++) {
      for(int j=0; j<5; j++) {
        C[i][j]=0;
    #pragma HLS PIPELINE
        for(int k=0; k<5; k++) {
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

- A and B reshaped to read an entire row/column at a time?
- What if N>>5?
[Vivado HLS Screenshots]
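
For reference, the Sol#4 fragment can be written out as a complete function. This is a sketch under assumptions: the function name mmm_sol4 is made up here, N is fixed at 5 as in the fragment, and the pragmas use the space-separated form with an explicit "complete" type as I understand it from UG902. A plain C compiler ignores the pragmas, so the function can be checked in software first.

```c
#define N 5

/* 5x5 matrix multiply with the Sol#4 directives applied. */
void mmm_sol4(char A[N][N], char B[N][N], short C[N][N]) {
/* Reshape so a whole row of A and a whole column of B can be read at once. */
#pragma HLS ARRAY_RESHAPE variable=A complete dim=2
#pragma HLS ARRAY_RESHAPE variable=B complete dim=1
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
/* Pipelining at this level implies the k loop below is fully unrolled. */
#pragma HLS PIPELINE
            for (int k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

The reshape widens each memory word to N elements, which is what makes the question on the slide about N>>5 pointed: the port width grows with N.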


Recall from Last Time
[Figure: alternative orderings of the for(i=, for(j=, for(k= loop nest around "GET C[i][j]"; labels: "parallel kernel pipelines", "fully unrolled inner loops"]

With Algo. Rewrite (Option 1)
- From here we can play with pragmas to sensibly widen concurrency if needed

    // assume C initialized to 0
    for(int k=0; k<5; k++) {
      for(int i=0; i<5; i++) {
        for(int j=0; j<5; j++) {
    #pragma HLS PIPELINE
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

- can fix by disabling flattening
[Vivado HLS Screenshots]
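
A complete version of the Option 1 loop order might look like the sketch below. The function name mmm_kij and the explicit zeroing loop are additions made here (the slide simply assumes C starts at 0). Moving k to the outermost loop means consecutive pipelined iterations update different C[i][j] elements, which is the point of the rewrite.

```c
#define N 5

/* Option 1: k-outermost matrix multiply. */
void mmm_kij(char A[N][N], char B[N][N], short C[N][N]) {
    /* Added here for self-containment: the slide assumes C is already 0. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
        }
    }

    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
                /* Back-to-back iterations touch different C elements,
                 * so the accumulate no longer forms a tight recurrence. */
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

The result is the same as the i-j-k order because integer addition into each C[i][j] can be done in any order of k.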

With Algo. Rewrite (Option 2)

    for(int i=0; i<5; i++) {
      for(int j=0; j<5; j++) {
        short Ctemp=0;
        for(int k=0; k<5; k++) {
    #pragma HLS PIPELINE
          Ctemp += A[i][k]*B[k][j];
        }
        C[i][j]=Ctemp;
      }
    }

- can fix by disabling flattening
- HLS figured out forwarding
[Vivado HLS Screenshots]

Pragma Crib Sheet: Loops
- Loop Unroll (full and partial)
  - amortize loop control overhead
  - increase loop body size, hence ILP and scheduling flexibility
- Loop Merge
  - combine loop bodies of independent loops of same control
  - improve parallelism and scheduling
- Loop Flatten
  - streamline loop nest control
  - reduce start/finish stutter
[Figure labels: "4 iter", "2 iter (unroll by 2)", "fully unrolled"; "2+2 iter", "2 iter merged"; "2x (2 iter)", "4 iter, longer steady state"]
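
What partial unrolling does to a loop can be shown by writing the transformed loop by hand. In the sketch below, sum_rolled carries an UNROLL pragma (spelled as I understand it from UG902) and sum_unrolled_by_2 is the equivalent hand-unrolled form; both function names and the length LEN are made up for illustration.

```c
#define LEN 4

/* 4 iterations; the pragma asks the tool to unroll by 2. */
int sum_rolled(const int x[LEN]) {
    int s = 0;
    for (int i = 0; i < LEN; i++) {
#pragma HLS UNROLL factor=2
        s += x[i];
    }
    return s;
}

/* Hand-written equivalent: 2 iterations, each with a doubled body.
 * Half the loop-control overhead, and two adds per iteration for the
 * scheduler to arrange. Relies on LEN being a multiple of 2. */
int sum_unrolled_by_2(const int x[LEN]) {
    int s = 0;
    for (int i = 0; i < LEN; i += 2) {
        s += x[i];
        s += x[i + 1];
    }
    return s;
}
```

Both functions return the same sum; only the structure that the scheduler sees differs.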


Pragma Crib Sheet: Arrays
- Map
  - multiple arrays in same BRAM
  - no perf loss if no scheduling conflicts
- Reshape
  - change BRAM aspect ratio to widen ports
  - higher bandwidth on consecutive addresses
- Partition
  - map 1 array to multiple BRAMs
  - multiple independent ports if no bank conflicts
[Figure: BRAM layouts with "addr" and "data" ports]
- A lot more you can control; must read UG902

Design by Exploration
[Flowchart boxes: "reference algorithm & testbench", "algorithm for synthesis", "pragmas", "HLS & analysis", "good enough" (yes/no), "co-simulation validation", "RTL", "RTL backend", "not good enough after backend"]
- When this takes only minutes, a little trial and error is okay (just a little!!!!)
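
The "reference algorithm & testbench" step of the exploration flow can be sketched as a self-checking comparison between a plain reference and the version intended for synthesis. The names mmm_ref, mmm_hw, and count_mismatches are hypothetical, and the input pattern is arbitrary. In a Vivado HLS testbench, main() would typically return such a mismatch count, since a return value of 0 is what signals a passing run.

```c
#define N 5

/* Reference algorithm: straightforward i-j-k matrix multiply. */
static void mmm_ref(char A[N][N], char B[N][N], short C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Algorithm for synthesis: the Option 2 rewrite with a local accumulator. */
void mmm_hw(char A[N][N], char B[N][N], short C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            short Ctemp = 0;
            for (int k = 0; k < N; k++) {
#pragma HLS PIPELINE
                Ctemp += A[i][k] * B[k][j];
            }
            C[i][j] = Ctemp;
        }
    }
}

/* Testbench body: returns the number of elements where the two disagree. */
int count_mismatches(void) {
    char A[N][N], B[N][N];
    short C_ref[N][N], C_hw[N][N];
    int mismatches = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (char)(i + j);     /* arbitrary test pattern */
            B[i][j] = (char)(i * j);
        }

    mmm_ref(A, B, C_ref);
    mmm_hw(A, B, C_hw);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C_ref[i][j] != C_hw[i][j])
                mismatches++;
    return mismatches;
}
```

Keeping the reference untouched while pragmas and rewrites are tried on the synthesis version is what makes quick trial and error safe.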

Putting it in context (from last time)
- Why hardware design is hard
  - reason #1: low-level abstraction
  - reason #2: unrestricted design freedom
  - reason #3: massive concurrency
- C-to-HW (i.e., C-to-RTL) compiler bridges the gap between functionality and implementation
  - fill in the details below the functional abstraction
  - make good decisions when filling in the details
  - extract parallelism from a sequential specification
- Vivado does its part fast and without mistakes

Parting Thoughts
- Vivado doesn't turn program into HW
- Vivado doesn't turn programmer into HW designer
- Multifaceted benefits to HW designer
  - algo. development/debug/validate in SW
  - pragma steering (no RTL hacking, machine tuning)
  - fast analysis and visualization
  - data type support: it is about more than adding double to Verilog
  - built-in, stylized IP interfaces
  - integration with the rest of Vivado and Zynq!!
- We are entering a new era for FPGAs

Vivado Software Defined SoC
- Screenshot, page 24, SDSoC Environment Getting Started (UG1028)
