Lecture 6: Hard vs Soft Logic. James C. Hoe Department of ECE Carnegie Mellon University

1 Lecture 6: Hard vs Soft Logic
James C. Hoe, Department of ECE, Carnegie Mellon University
F17 L06 S1, James C. Hoe, CMU/ECE/CALCM, 2017

2 Housekeeping
- Your goal today: understand the difference between hard and soft logic
- Notices
  - Handout #3: lab 1, due noon, 9/22
- Readings (skim)
  - Kuon and Rose, Measuring the Gap between FPGAs and ASICs, ISFPGA 2006
  - M. Papamichael, et al., CONNECT, ISFPGA 2012
  - Chung, et al., Single-Chip Heterogeneous Computing, MICRO 2010

3 The Project Template
- Pick a compute application
- Pick a metric of merit
- Study implementation options on the Zedboard
  - a good software implementation must be one option
  - the rest is up to you
- Report findings
- Keep in mind, you have optimistically 6 weeks; don't forget you are taking other courses

4 DoF: you pick the application
- The problem could be
  - well studied (expect thoroughness and depth)
  - unproven (credit for honest attempts)
- Convince us it is 6 weeks of effort
- Something there is a reason to do on FPGAs
- Best if it is something you want to or have to do anyway
- Need to find and study (at least) 1 closely relevant research paper as a starting point

5 DoF: you define the metric
- What you can study
  - performance (throughput or latency?)
  - cost (in terms of what?)
  - power and energy (how will you measure?)
  - design effort (what will you measure?)
  - app-specific metrics (e.g., numerical accuracy)
  - composite metrics: energy-delay product, performance/watt, performance/$
- Must commit up front
  - measurement procedure/benchmark
  - testable "good enough" target condition
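The composite metrics named on this slide can be written down as a small sketch. The function names, argument names, and units (watts, seconds, dollars, abstract "work units") are assumptions made for illustration; the slide does not prescribe a measurement procedure.

```python
def energy_joules(avg_power_w, runtime_s):
    """Energy = average power x run time."""
    return avg_power_w * runtime_s


def energy_delay_product(avg_power_w, runtime_s):
    """Energy-delay product in joule-seconds; lower is better."""
    return energy_joules(avg_power_w, runtime_s) * runtime_s


def perf_per_watt(work_units, runtime_s, avg_power_w):
    """Throughput (work units per second) per watt; higher is better."""
    return (work_units / runtime_s) / avg_power_w


def perf_per_dollar(work_units, runtime_s, cost_dollars):
    """Throughput (work units per second) per dollar; higher is better."""
    return (work_units / runtime_s) / cost_dollars


if __name__ == "__main__":
    # Hypothetical measurement: 100 work units in 4 s at an average of 5 W.
    print(energy_delay_product(5.0, 4.0))
    print(perf_per_watt(100, 4.0, 5.0))
```

Whichever metric is chosen, the inputs (what counts as a work unit, how power is averaged) would need to be fixed before any implementation is measured.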

6 DoF: Platform
- You have the Zedboard
- You may substitute a reconfigurable platform you are already using (check with me first)
- You have access to more advanced platforms
  - risky learning curve to fit in 6 weeks
  - only if this plays into what else you are doing in life

7 DoF: Approach
- 1 option must be a good software-only baseline
- This is a study; do more than crank out implementations
  - think about what the design choices are
  - hypothesize the expected effects of your choices
  - corroborate hypotheses by implementation and evaluation
- Implementation approach: no artificial bounds
  - how would you work in real life?
  - if you have access, you can use it (including tools and IPs)
- Convince us it is 6 weeks of effort

8 What makes a good project
- Interesting and/or important
- Not totally obvious (to you)
- You have special insights or interest
- Hard enough for 6 weeks
- Not too hard, too risky
- Most importantly, you should enjoy it
- The above need not be an AND

9 We now return you to our regularly scheduled program

10 FPGA Pros and Cons
- Reasons for FPGAs
  - no manufacturing NRE (non-recurring engineering) cost
  - faster design time: try out increments as you go
  - less validation time: debug as you go at full speed; can also patch after shipping
- The price of FPGAs
  - high unit cost (not for high-volume products)
  - ~10x overhead in area/speed/power/...
  - RTL-level design abstraction (relative to SW)

11 Hard vs Soft Processor Cores

Table 4.2: The Zynq Book
  Processor       Configuration                                             DMIPs
  MicroBlaze      area optimized (3-stage)                                  196
  MicroBlaze      perf. optimized (5-stage) with branch optimizations
  MicroBlaze      perf. optimized (5-stage) without branch optimizations    228
  ARM Cortex-A9   1GHz; both cores combined                                 5000

Table 4.3: The Zynq Book
  Processor       Configuration                                             CoreMark
  MicroBlaze      125MHz; 5-stage (Virtex-5)                                238
  ARM Cortex-A9   1GHz; both cores combined                                 5927
  ARM Cortex-A9   800MHz; both cores combined                               ?? (from book)

MicroBlaze resource usage: 900 LUT / 700 FF / 2 BRAM to 3800 LUT / 3200 FF / 6 DSP / 21 BRAM

12 [Kuon and Rose, 2006]
- FPGA
  - Altera Stratix II FPGA, 90nm
  - Quartus II: balanced, standard fit
  - hard multipliers and memory blocks
- ASIC
  - ST Micro 90nm standard cells
  - Synopsys: high effort, add scan chain
  - ST Micro memory compiler
  - Cadence place and route
- Basic results
  - avg. 21x/40x in area (with/without using hard macros)
  - 3~4x critical path
  - ~12x dynamic power

13 Benchmarking
- Opencores and local designs
- Removed cases where FPGA and ASIC are more than 5% different in FF count (Bias?)
- Metrics evaluated
  - logic density
  - circuit speed
  - power consumption
[Table 1: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]

14 Area Ratios
- Differences attributed to overhead surrounding LUTs and FFs
[Table 2: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]

15 Critical Path Ratios
[Table 3&4: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]

16 Dynamic Power Ratios
[Table 5: Kuon and Rose, Measuring the Gap between FPGAs and ASICs, 2006]

17 Actual Mileage Varies
- Comparisons are strongly affected by exact design, FPGA/ASIC target, and methodology
  - comparing less-than-best-effort designs can bias in either direction
  - a design is not a point; a full comparison would have to be Pareto front to Pareto front
- Either precise in a specific context, or a warm-fuzzy rule of thumb (~10x all around): 2x << ~10x << 100x
- A moving target with architecture and process changes
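To make "a design is not a point" concrete, here is a minimal sketch that extracts the Pareto front from a set of design points. It assumes each design is an (area, delay) pair where lower is better on both axes; the function name and the sample numbers are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (area, delay) points, sorted by area.

    A point is dominated if another point is no worse on both axes
    and strictly better on at least one.
    """
    front = []
    best_delay = float("inf")
    # Sorting by (area, delay) means any earlier point has area <= current,
    # so the current point survives only if it strictly improves delay.
    for area, delay in sorted(set(points)):
        if delay < best_delay:
            front.append((area, delay))
            best_delay = delay
    return front


if __name__ == "__main__":
    # Hypothetical implementations of one design: (area, delay).
    designs = [(1, 10), (2, 8), (3, 9), (4, 4), (4, 5)]
    print(pareto_front(designs))
```

A front-to-front comparison would compute one such front per target (FPGA and ASIC) and compare the curves rather than a single pair of implementations.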

18 Effects of Tuning
- An FPGA design is not a scaled version of an ASIC design
  - different relative cost in logic vs. wires vs. memory
  - different relative speed in logic vs. wires vs. memory
  - also unique usage and operating characteristics
- Designed-for-FPGA RTLs need different tuning

19 FPGA Wire Peculiarities
- Routing architecture is over-provisioned to handle the worst case
- In a typical design, wires appear cheaper relative to other resource types
- Best case is nearest-neighbor, regular grid
- Counterintuitively, you SHOULD use wider busses
  - consume unused free wires
  - compensate for lower frequency
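The "wider busses compensate for lower frequency" point is simple arithmetic: peak bandwidth is width times clock rate. A small sketch follows; the function names and the example clock rates are assumptions for illustration, not figures from the lecture.

```python
import math


def bus_bandwidth_gbps(width_bits, freq_mhz):
    """Peak bandwidth in Gbit/s of a bus that moves width_bits every cycle."""
    return width_bits * freq_mhz / 1000.0


def width_to_match(target_gbps, freq_mhz):
    """Smallest bus width in bits that reaches target_gbps at freq_mhz."""
    return math.ceil(target_gbps * 1000.0 / freq_mhz)


if __name__ == "__main__":
    # Hypothetical: a 64-bit bus at 1000 MHz versus a fabric clocked at 250 MHz.
    target = bus_bandwidth_gbps(64, 1000)
    print(target, width_to_match(target, 250))
```

Under these assumed numbers, a 4x lower clock is offset by a 4x wider bus, which is cheap when routing is over-provisioned.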

20 FPGA Memory Peculiarities
- Large memories are abnormally fast
- Large memories are free until you run out
- Quantized memory options
  - jumps between FF-based vs. LUT RAM vs. BRAMs
  - choose from a fixed menu of sizes and aspect ratios
- Must manage RAM usage
  - don't waste BRAM on small buffers
  - tune buffer sizes to natural granularities, e.g., zero incremental cost to go from 2Kb to 4Kb
  - pack buffers to share the same physical array
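The quantization effect can be sketched as follows. The menu of block shapes below is hypothetical (a made-up 4Kb block), chosen only to mirror the slide's 2Kb-to-4Kb example; real devices have their own menus, and real tools may mix shapes rather than tile a single one.

```python
import math

# Hypothetical menu of (depth, width) shapes for one 4Kb block RAM.
EXAMPLE_MENU = [(4096, 1), (2048, 2), (1024, 4), (512, 8), (256, 16)]


def blocks_needed(depth, width, menu=EXAMPLE_MENU):
    """Fewest blocks needed for a depth x width buffer.

    Assumes the buffer is tiled with copies of a single shape from the menu.
    """
    return min(math.ceil(depth / d) * math.ceil(width / w) for d, w in menu)


if __name__ == "__main__":
    for depth in (128, 256, 257):
        bits = depth * 16
        print(depth, "x 16 =", bits, "bits ->", blocks_needed(depth, 16), "block(s)")
```

With this menu, a 128x16 (2Kb) buffer and a 256x16 (4Kb) buffer both cost one block, while one extra word beyond 256 doubles the cost.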

21 FPGA Logic Peculiarities
- Logic slower than expected
- Sharp aberrations around hard macro use, e.g., faster mult than add in Virtex-II
- CLB-mapped logic is not divisible for pipelining
  - over-pipelining adds cycles without a frequency increase
  - sweet-spot frequency that is easy to reach but hard to exceed
- Design for performance
  - correct and maximal usage of hard macros
  - shallowly pipelined, wide datapath

22 E.g., FPGA vs ASIC tuned NoC on FPGA
- ASIC RTL from nocs.stanford.edu/cgi-bin/trac.cgi/wiki/resources/router
- FPGA RTL from
[Figure: FPGA resource usage in LUTs (same router/NoC configuration) for a single router and a 4x4 mesh NoC; annotation: fixing general FPGA style guidelines]
[Figure: network performance (uniform random traffic, 100MHz), avg. packet latency (in ns) vs. load (in Gbps), compared at same config and at same cost]
[Papamichael, ISFPGA 2012]

23 Soft IPs need not be general purpose
- Reconfigurable fabric provides generality
- Soft IPs should be maximally specialized to usage
[Figure: latency (in cycles) vs. load (in flits/cycle) for Ring, Fat Tree, Mesh, and High Radix networks; panels: Uniform Random Traffic and 90% Neighbor Traffic]

24 It is not just FPGA vs ASIC
- CPU: highest-level abstraction / most general-purpose support (software end)
- Multicore: still high-level abstraction / general parallelism
- GPU: explicitly parallel programs / best for SIMD, regular
- FPGA: ASIC-like abstraction / overhead for reprogrammability
- ASIC: lowest-level abstraction / fixed application and tuning (hardware end)
- Tradeoff between efficiency and effort

25 Case Study [Chung, MICRO 2010]

              CPU                 GPU                 GPU                 FPGA                ASIC
              Intel Core i7 960   Nvidia GTX285       ATI R5870           Xilinx V6 LX760     Std. Cell
  Year
  Node        45nm                55nm                40nm                40nm                65nm
  Die area    263mm^2             470mm^2             334mm^2
  Clock rate  3.2GHz              1.5GHz              1.5GHz              0.3GHz

Single-prec. floating-point apps: MMMult, FFT, Black Scholes
- CPU implementations: MKL multithreaded (MMMult), Spiral.net multithreaded (FFT), PARSEC multithreaded (Black Scholes)
- GPU/FPGA/ASIC implementations: CUBLAS 2.3, CAL++, hand coded, CUFFT /3.1, CUDA 2.3, Spiral.net, hand coded

26 Best Case Performance and Energy
[Table: MMM; columns: Device, GFLOP/s actual, (GFLOP/s)/mm^2 normalized to 40nm, GFLOP/J normalized to 40nm; rows: Intel Core i7 (45nm), Nvidia GTX285 (55nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), same RTL std cell (65nm)]
- CPU and GPU benchmarking is compute bound; FPGA and std cell are effectively compute bound (no off-chip I/O)
- Power (switching+leakage) measurements isolated the core from the system
- For detail see [Chung, et al. MICRO 2010]

27 Less Regular Applications
[Table: FFT 2^10; columns: GFLOP/s, (GFLOP/s)/mm^2, GFLOP/J; rows: Intel Core i7 (45nm), Nvidia GTX285 (55nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), same RTL std cell (65nm)]
[Table: Black Scholes; columns: Mopts/s, (Mopts/s)/mm^2, Mopts/J; rows: Intel Core i7 (45nm), Nvidia GTX285 (55nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), same RTL std cell (65nm)]

28 Tradeoff in Heterogeneity?
- What will you choose to put on it?
[Figure: chip floorplan with a Big Core, many little cores, GPGPU, Custom Logic, and FPGA]

29 Amdahl's Law on Multicore
[Figure: die made up of 16 BCEs]
- A program is rarely completely parallelizable; let's say a fraction f is perfectly parallelizable
- Speedup of n cores over sequential:
  $$\text{Speedup} = \frac{1}{(1-f) + \dfrac{f}{n}}$$
- For small f, die area is underutilized
Base Core Equivalent (BCE) in [Hill and Marty, 2008]
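As a quick check on the Amdahl's Law formula on this slide, a minimal sketch that evaluates the speedup of n BCEs over one BCE for a given parallel fraction f. The function name and the sample values of f are illustrative choices, not from the lecture.

```python
def amdahl_speedup(f, n):
    """Speedup of n cores over one core when a fraction f is perfectly parallel."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    # With 16 BCEs, the speedup falls off quickly as f shrinks.
    for f in (0.99, 0.9, 0.5):
        print(f, amdahl_speedup(f, 16))
```

The sequential term (1 - f) does not shrink with n, which is why most of the die sits idle when f is small.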

30 Asymmetric Multicores
[Figure: die with one Fast Core and the remaining area filled with BCEs]
- Trade pwr/area-efficient slow BCEs for a pwr/area-hungry fast core
  - fast core for sequential code
  - slow cores for parallel sections
- Speedup [Hill and Marty, 2008]:
  $$\text{Speedup} = \frac{1}{\dfrac{1-f}{\mathit{perf}_{seq}} + \dfrac{f}{(n-r) + \mathit{perf}_{seq}}}$$
  - r = cost of fast core in BCE
  - perf_seq = speedup of fast core over BCE
  - solve for optimal die area allocation given f
Base Core Equivalent (BCE) in [Hill and Marty, 2008]
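"Solve for optimal die area allocation given f" can be sketched as a brute-force search over the fast core size r. The slide does not say how perf_seq depends on r; the sketch below assumes perf_seq(r) = sqrt(r), the illustrative choice used by Hill and Marty. Function names are hypothetical.

```python
import math


def perf_seq(r):
    """ASSUMPTION: a fast core built from r BCEs runs sqrt(r) times faster."""
    return math.sqrt(r)


def asym_speedup(f, n, r):
    """Asymmetric multicore speedup: one fast core of r BCEs plus (n - r) BCEs."""
    p = perf_seq(r)
    return 1.0 / ((1.0 - f) / p + f / ((n - r) + p))


def best_fast_core_size(f, n):
    """Integer r in [1, n] that maximizes the speedup for a given f."""
    return max(range(1, n + 1), key=lambda r: asym_speedup(f, n, r))


if __name__ == "__main__":
    n = 256
    for f in (0.5, 0.9, 0.99):
        r = best_fast_core_size(f, n)
        print(f, r, asym_speedup(f, n, r))
```

With r = 1 the fast core is just another BCE and the expression reduces to Amdahl's Law for n cores.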

31 Heterogeneous Multicores [Chung, et al. MICRO 2010]
[Figure: an asymmetric die (Fast Core plus BCEs) next to a heterogeneous die (Fast Core plus U-cores)]
- Asymmetric speedup ([Hill and Marty, 2008], simplified):
  $$\text{Speedup} = \frac{1}{\dfrac{1-f}{\mathit{perf}_{seq}(r)} + \dfrac{f}{n-r}}$$
- Heterogeneous speedup:
  $$\text{Speedup} = \frac{1}{\dfrac{1-f}{\mathit{perf}_{seq}(r)} + \dfrac{f}{\mu\,(n-r)}}$$
  - f is the fraction parallelizable
  - n is total die area in BCE units
  - r is fast core area in BCE units
  - perf_seq(r) is fast core perf. relative to BCE
- For the sake of analysis, break the area for GPU/FPGA/etc. into units of U-cores that are the same size as BCEs. Each U-core type is characterized by a relative performance µ and relative power Φ compared to a BCE

32 Modeling Power and Bandwidth Budgets
- Heterogeneous speedup:
  $$\text{Speedup} = \frac{1}{\dfrac{1-f}{\mathit{perf}_{seq}(r)} + \dfrac{f}{\mu\,(n-r)}}$$
- The above is based on area alone
- Power or bandwidth budget limits the usable die area
  - if P is the total power budget expressed as a multiple of a BCE's power, then usable U-core area is $n - r \le P/\Phi$
  - if B is the total memory bandwidth expressed also as a multiple of BCEs, then usable U-core area is $n - r \le B/\mu$
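The budgeted model can be sketched by capping the usable parallel area at the tightest of the area, power, and bandwidth limits before evaluating the heterogeneous speedup. This is a sketch under stated assumptions: the function and parameter names are hypothetical, perf_seq defaults to sqrt(r) (an assumption; the slide leaves perf_seq unspecified), and the caps are taken as P/Φ and B/µ as read from this slide.

```python
import math


def hetero_speedup(f, n, r, mu, phi, power_budget=None, bw_budget=None,
                   perf_seq=math.sqrt):
    """Heterogeneous speedup with optional power (P) and bandwidth (B) budgets.

    mu, phi: performance and power of one BCE-sized unit relative to a BCE.
    power_budget, bw_budget: budgets in multiples of one BCE's power/bandwidth.
    perf_seq: ASSUMED model of fast core performance as a function of r.
    """
    usable = float(n - r)                       # area bound
    if power_budget is not None:
        usable = min(usable, power_budget / phi)  # power bound: n - r <= P / phi
    if bw_budget is not None:
        usable = min(usable, bw_budget / mu)      # bandwidth bound: n - r <= B / mu
    if usable <= 0:
        raise ValueError("no usable area for the parallel section")
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (mu * usable))


if __name__ == "__main__":
    # mu = 3.41 and phi = 0.74 are the example pair quoted in the lecture;
    # n, r, and the budgets here are made-up numbers.
    print(hetero_speedup(0.99, 64, 4, mu=3.41, phi=0.74))
    print(hetero_speedup(0.99, 64, 4, mu=3.41, phi=0.74, power_budget=20.0))
    print(hetero_speedup(0.99, 64, 4, mu=3.41, phi=0.74, bw_budget=50.0))
```

A high µ helps the area-bound case but tightens the bandwidth cap, which is one way power and bandwidth limits can compress the differences between device types.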

33 Φ and μ example values
[Table: Φ and μ for MMM, Black Scholes, and FFT 2^10; columns: Nvidia GTX285, Nvidia GTX480, ATI R5870, Xilinx LX760, Custom Logic]
- Example reading: on an equal-area basis, 3.41x performance at 0.74x power relative to a BCE
- Other entries legible here: Φ 1.27, μ 8.47
- Nominal BCE based on an Intel Atom in-order processor, 26mm^2 in a 45nm process

34 Combine Model with ITRS Trends
[Table: one column per technology node (40nm, 32nm, 22nm, 16nm, 11nm); rows: Year, Technology, Core die budget (mm^2), Normalized area (BCE) (16x), Core power (W) (1x), Bandwidth (GB/s) (1.4x), Rel. pwr per device (1X, 0.75X, 0.5X, 0.36X, 0.25X)]
- 2011 parameters reflect high-end systems of the day; future parameters extrapolated from ITRS
- mm^2 populated by an optimally sized Fast Core and U-cores of choice

35 Single Prec. MMMult (f=99%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) at f=0.990; curves: symmetric and asymmetric multicore, ASIC, FPGA (LX760), GPU (R5870); markers distinguish power-bound and memory-bound points]

36 Single Prec. MMMult (f=90%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) at f=0.900; curves: symmetric multicore, asymmetric multicore, ASIC, FPGA (LX760), GPU (R5870); markers distinguish power-bound and memory-bound points]

37 Single Prec. MMMult (f=50%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm); curves: symmetric, asymmetric, and ASIC/GPU/FPGA (shown together); markers distinguish power-bound and memory-bound points]

38 Single Prec. FFT 1024 (f=99%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm); curves: symmetric, asymmetric, and ASIC/GPU/FPGA (shown together), with FPGA LX760 and GPUs R5870 and GTX480 in the legend; markers distinguish power-bound and memory-bound points]

39 FFT 1024 (f=99%) if hypothetical 1TB/sec bandwidth
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) at f=0.990; curves: symmetric and asymmetric multicore, ASIC, FPGA (LX760), GPU (GTX480); markers distinguish power-bound and memory-bound points]

40 Parting Thoughts
- FPGAs pay an overhead for reconfigurability
  - significant but reducing
  - power and BW bottlenecks can compress differences
- FPGAs differ from ASICs in more than the reconfiguration overhead
  - require distinct architecture and tuning
- {Multicore/GPU/FPGA} are all midpoints between the CPU and ASIC extremes
  - none is a panacea
  - go with the easier option unless it is not good enough
