Customising Field Programmability

Size: px

Start display at page:

Download "Customising Field Programmability"

Evelyn Morton
5 years ago
Views:

1 Customising Field Programmability Wayne Luk Department of Computing Imperial College London FPL 1 September

2 Overview 1. how it begins 2. customisation + field programmability 3. customisation: fab time 4. customisation: design time 5. customisation: run time 6. research directions 7. customise FPL 8. summary 2

3 1. How it begins 3

4 FPL 91 Attendants. FPL 91: 90 FPL 10: 220 at dinner tonight 4

5 Technical papers Paper on run-time reconfiguration 5

6 Technical papers Remember the RAMP project? 6

7 FPL 91 Bank of England inflation calculator 275 in 1991 = 440 in 2009 = 536 Euros in 2010 Registration fee for FPL 10: 500 Euros! 7

8 1991 Program comparison 24 regular papers 26 poster papers selected papers in submissions 60 regular papers 43 short papers 9 PhD Forum papers 8

Program comparison 1991 24 regular papers 26 poster papers selected papers in visit: Oxford Story

9 Program comparison regular papers 26 poster papers selected papers in visit: Oxford Story submissions 60 regular papers 43 short papers 9 PhD Forum papers visit: The Last Supper 9

10 Commercial impact how many companies formed/benefited from related research published at FPL, ? paper published before or within a year of company start authors may or may not be involved in company answer: 10

11 Commercial impact how many companies formed/benefited from related research published at FPL, ? paper published before or within a year of company start authors may or may not be involved in company answer: at least 8, all still around except one 11

12 Some companies from FPL 91, Page: syn tool Celoxica 95, Gokhale: syn tool Impulse Accel. Tech. 97, Betz: device tool Right Track 99, Kastrup: device Silicon Hive 00, Mencer: syn tool Maxeler 02, de Sousa: IP cores Core Works 03, Teifel: device Achronix 07, Quinton: post-silicon debug Veridae first author of paper 12

13 Some companies from FPL 91, Page: syn tool Celoxica 95, Gokhale: syn tool Impulse Accel. Tech. 97, Betz: device tool Right Track 99, Kastrup: device Silicon Hive 00, Mencer: syn tool Maxeler bought by Altera: hear him tomorrow! 02, de Sousa: IP cores Core Works 03, Teifel: device Achronix 07, Quinton: post-silicon debug Veridae first author of paper 13

14 2. Customisation + field programmability reconfigurable computing reconfigurable technology + custom computing computing with field programmable devices, e.g. Field Programmable Gate Arrays (FPGAs) meet requirements by customisation of parallelism: space reconfiguration: time reconfigurable computing systems: wide range supercomputers portable systems 14

15 Systems with Reconfigurable Technology Supercomputer: SGI Altix 4700 Multi-Paradigm ccnuma FP6 hartes Project Source: SGI, Sony, FP6 hartes, CERN Particle Physics: CERN LHC 15

16 Customisation: when fabrication time: pre-fab options customise field-programmable fabric less post-fab options: less flexible, more efficient design time: pre-execution options customise initial mapping from application to fabric less execution options: less flexible, more efficient run time customise mapping to fabric during execution more execution options: dynamic scheduling 16

17 Customisation: some benefits high-level generic design requirements + context: multiple designs optimise: options + parameters + abstraction levels facilitate design composition help optimisation and verification modularity of building blocks + interfaces platform-based evolution re-use un-verified design: risky automate re-verification after changes 17

18 3. Customisation: fab time FPL 97: VPR FPGA device architecture: homogeneous device-specific mapping tools the most-cited FPL paper so far FPGA 09: VPR from Google Scholar different types of blocks in columns + routing single-driver routing, architecture files compare customised device with real device? new FPGA customised for particular application FPL 07: hybrid FPGA customised for floating-point applications coarse-grained blocks too complex for VPR 18

19 Generic hybrid FPGA architecture Example: floating-point hybrid FPGA coarse-grained units implement datapath bus-based operations, constant bit-width dominated by multiplication and addition fine-grained units implement control logic, state machine adopt existing FPGA fabric 19

20 Coarse-grained fabric D=9, M=4, R=3, F=3, 2 add, 2 mul: best density over benchmarks 20

21 challenges Modelling compare hybrid FPGA with current technology fine-grained units: same as current FPGA existing modelling tools cannot support new approach: Virtual Embedded Blocks synthesis tools: produce delay and area of specialised units model specialised units by arrays of existing FPGA cells with similar delay or area can use existing low-level tools in mapping applications to hybrid FPGA, e.g. place and route 21

FPGA dummy blocks to model coarse-grained block s area or delay timing

22 Virtual Embedded Blocks L L' W ' W tpd Embedded Block in ASIC tpd' WL W' L' tpd tpd' Equivalent VEB using LC Distributed VEBs in a virtual FPGA dummy blocks to model coarse-grained block s area or delay timing analyzer: determine hybrid s performance, e.g. fine-to-coarse routing delay 22

23 Comparison: double precision Area (slices) Floating Point hybrid FPGA Delay (ns) XC2V Area (slices) Delay (ns) Area (times) Delay (times) bfly dscg fir mm ode syn bgm times area reduction 4 times speedup Geometric Mean compared with Virtex-II 23

24 Comparison: energy reduction (FPL 08) Energy reduced by 14 times on average 24

25 4. Customisation: design time FPL 91: program to gates program construct for parallelism + communication syntax directed compiler: simple but not optimised basis for Celoxica s Handel-C tool; now Mentor FPL 95: data parallel C, FCCM 00: Streams-C automate parallelisation, roots of Impulse-C FPL 01: GALADRIEL and NENYA compilers expose parallelism in hierarchical model support multi-memories + region-based scheduling FPL 09, DATE 10: combine two optimisations goal-directed: e.g. geometric programming syntax-directed: e.g. loop tiling 25

26 Customisation approach Sequential design Parser Speed spec Platform spec Pre-transform ( Inlining, Ptr2Array, Loop merge, Loop coalesce... ) Dependence analysis Design exploration and transform ( Data reuse, MapReduce, Pipelining) Power efficient design pre-transformations: enable and facilitate code analysis and design exploration geometric programming model for low power design exploration 26

27 Design space formulation Design parameters: N-level nested loops, R array references, Target performance: execution timet Target platform: F types of resources, each has Resf min { 1,2} ρij 1 kl L 1 ii l Power( ρ, k subject to Speed( ρ, k ij l, ii, freq) ij Resource f l, ii, ( ρ, k ij freq) l, ii) T Res f for i = 1,..., R, j = 1,..., E i, l = 1,..., N, f F ρ ij : data reuse variable k l : MapReduce parallel partition variable ii : pipelining initiation interval freq :system clock frequency 27

28 28 Power model PU Off-chip RAMs FPGA RAMs PU PU RAMs PU PU RAMs PU freq P bram P dsp P par P accesses off P cycles total accesses off I V on dd off on off dynamic = = + = ) # # # _ # ( _ # _ # P P P P Power

29 29 dsp N l l R i E j B ij N l l R i E j C ij x k dsp d bram k par accesses off i ij i ij = = = = = = = = = = log log # # # _ # 2 2 ρ ρ Capturing parallelisation + data re-use freq P bram P dsp P par P accesses off P cycles total accesses off I V on dd off on off dynamic = = + = ) # # # _ # ( _ # _ # P P P P Power Data reuse Parallelization

30 Speed model # total _ cycles = cyc + cyc + cyc + # off s in r _ accesses Num of cycles of statements outside the innermost loop cyc s = v l = S Ws s= 1 l= 1 Num of cycles of the innermost loop after pipelining cyc in N 1 v ( v ii + C + l N data l= 1 i= 1 Num of cycles of the result reduce phase cyc r = Wr l= 1 v l k l I d i + log 2 k + notfull) N 30

31 Resource model computational resource constraint N l=1 k l x dsp Rec dsp memory constraint d R E i i= 1 j= 1 ρ log 2 ij B ij Rec ram 31

32 Experimental results #Refs: number of array references 32

33 Experiment settings FPGA-based platform Xilinx XC4VFX140: 192 DSP blocks and 552 RAM blocks Clock frequency: 1~100MHz Speed constraint: up to 158x Design exploration time: within 2 mins Power results: Xilinx Power Estimator 33

34 Result: matrix multiplication (ρ1j, ρ2j, k1, k2, k3, ii, frequency) minimum energy 34 34

35 FP6 hartes: Harmonic toolchain 35

36 FP6 hartes: bigger picture Holistic Approach to Reconfigurable real Time Embedded Systems Algorithm Exploration Tools.c source code hartes Tool-Chain DSP GPP FPGA 16 partners in 5 European countries: Atmel, FAITAL, Fraunhofer, Imperial, INRIA, Leaff, P. di Bari, P. di Milano, Scaleo, Segula, Thales, Thomson, TU Delft, Uni. Di Ferrara, Uni. P. delle Marche, Uni. D Avignon 36

37 5. Customisation: run time 1. pre-deployment - reconfigure for design/debug, not during deployed 2. in-field upgrade - update and reboot - reconfiguration time/energy overhead: irrelevant 3. infrequent run-time reconfiguration - occasional adaptation to changing environment - reconfiguration time/energy overhead: not critical 4. frequent run-time reconfiguration - frequent reconfiguration: core functionality - reconfiguration time/energy overhead: critical 37

38 Design approaches FPL 91, 93, 00: Dynamic Circuit Switching run-time reconfigurable design applications estimate reconfiguration latency FPL 03, FPL 05: branch/phase-optimised more resources to speed up more frequent branch optimise for program phases ARC 09, FCCM 10 analytical optimisation of parallelism: speed/energy more parallelism: faster execution, slower configuration less execution energy, more configuration energy 38

39 Performance parallelism increases processing throughput parallelism slows down reconfiguration 39

40 Optimising parallelism total processing time for workload n find optimal degree of parallelism 40

41 How parallelism affects performance small workload: reconfiguration penalty drives to serial implementations large workload: benefits from parallelism 41

42 1. derive Φ app, s and n Optimisation steps 2. obtain Φ config from target technology 3. develop one design and determine t p and A 4. calculate t r and t r,e and t p,e 5. calculate p opt and find a feasible value 6. calculate t total and verify throughput 7. implement design with p from step 5 and verify its actual throughput 8. calculate buffer size, check device utilisation 42

43 Case study: GSM channel filter optimising the FIR filter for maximum performance 43

44 44 Extend: parallelism to optimise energy total energy for workload n find optimal degree of parallelism p t P p n s t P n s t P E E E E e r r e p o e p e c R o c total + + = + + =,,,, e r r e p o total t P p n s t P p E, 2, + = e r r e p o opt t P n s t P p,, =

45 CoreGen design E = 33.4 nj p opt = E total,n E p,n E r,n LUTs p = 20 E model = 16.7 nj E = 17.7 nj E [nj/sample] x area -47% -49% compared with parallel design compared with non-reconfig design p 45

46 6. Some research directions theory and models analysis of reconfiguration and parallelism design automation standard optimisation: apply automatically floorplan for parametric relocatable modules scalable design re-use, placement strategies run-time performance tuning and validation benefits vs overhead applications of reconfiguration overcome process variation, autonomous systems 46

47 Autonomous systems control strategy make decisions to optimise itself model of world: planning and action understand trade-offs: e.g. reconfigure or not event-driven just-in-time reconfiguration component meta-data description assemble + tune partially-optimised components hide reconfiguration latency extensions self-optimisation self-verification 47

48 Self-optimisation + self-verification optimise and verify: hardware + software meet requirements efficiently and demonstrably self-optimising and self-verifying design (SOSV) preserve property in design composition self? aware of context capable of planning effective external control 2 stages pre-deployment: building design, compile time post-deployment: operational, run time 48

49 Pre-deployment: example of choices circuit technology: e.g. ASIC or FPGA input/output: options memory: hierarchy + options interconnect: e.g. bus, switch, network-on-chip granularity: configurable unit, custom instruction synchronisation: e.g. clock domains, self-timed parallelism: processors, hardware/software data representation optimisation post-deployment optimisation/verification 49

50 Post-deployment: situation-specific optimization and verification opportunities design upgrade run-time conditions, e.g. noise, process variation program phase optimisation optimisation and verification process light-weight: on-site, e.g. proof-carrying code heavy-weight: remotely, verified by signature run-time system deals with exceptions error diagnosis facilities 50

51 Example: dynamic power optimisation power surge while self-optimising/verifying 51

52 Challenges: theory + practice for: productive automate: evolutionary vs disruptive SOSV design: specify + analyse requirements composable description: design + context multi-level capture: domain-specific constraints open standard: design, optim/verify programs 52

53 aim 7. Customise FPL could only attend one conference? FPL! platform for great research + education cost of attendance: not a barrier provide your opinions: they will count! cost of registration number of pages for papers paper review: authors response? 53

54 8. Summary self * (optimising+verifying) = trusted re-use unify: autonomic, self-test, dynamic optim., RTR improve: design efficiency + designer productivity self-optimising self-verifying design platform customisable: cloud + embedded computing autonomous: system-on-chip + network of ASOCs applications: ubiquitous, dependable, secure, robust new generation of designers building blocks + tools: made smarter specify, analyse, adapt: requirements + search but do not forget... 54

55 FPL is us! lets make FPL even better... better papers better review + feedback on papers better interactions at conference lower cost to attend more companies benefit from FPL research and... 55

56 FPL is us! lets make FPL even better... better papers better review + feedback on papers better interactions at conference lower cost to attend more companies benefit from FPL research and... see you in the next 20 FPLs at least! 56

Architecture. Philip Leong Computer Engineering Laboratory School of Electrical and Information Engineering The University of Sydney

Architecture. Philip Leong Computer Engineering Laboratory School of Electrical and Information Engineering The University of Sydney Architecture Philip Leong Computer Engineering Laboratory School of Electrical and Information Engineering The University of Sydney This course 1. Introduction to Reconfigurable Computing - what is reconfigurable