Natalie Enright Jerger, Jason Anderson, University of Toronto November 5, 2010


1 Next Generation FPGA Research Natalie Enright Jerger, Jason Anderson, and Ali Sheikholeslami University of Toronto November 5, 2010

2 Outline Part (I): Next Generation FPGA Architectures Asynchronous Fabric On-Chip Network for FPGAs High-Speed I/O for FPGAs Part (II): FPGA Application Platform High-Level Synthesis On-Chip Network for Commodity FPGAs 11/5/2010 University of Toronto 2

3 Part I: Next Generation FPGA Architectures Asynchronous Fabric On-Chip Network for FPGAs High-Speed I/O for FPGAs

4 Asynchronous Circuits Concept: Use handshaking instead of a clock to sequence operations. [Figures: synchronous FIFO pipeline; asynchronous FIFO pipeline (MOUSETRAP scheme)] M. Singh and S. M. Nowick, "MOUSETRAP: High-speed transition-signaling asynchronous pipelines," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 15(6), 2007.
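
The handshaking idea on this slide can be illustrated with a small behavioral model. The sketch below is not the MOUSETRAP circuit itself: it only models two-phase (transition) signaling, where a toggle on a request wire announces new data and a stage captures a token once its own output has been acknowledged. The names `pipeline_t`, `pipe_push`, `pipe_step`, `pipe_pop` and the three-stage depth are assumptions made for illustration.

```c
#include <stdbool.h>

#define STAGES 3

/* req[0] is the producer's request, req[1..STAGES] are the stage outputs,
   and req[STAGES + 1] is the consumer's acknowledge (hypothetical layout). */
typedef struct {
    bool req[STAGES + 2];
    int  data[STAGES + 1];   /* data[0]: producer output, data[i]: latch of stage i */
} pipeline_t;

void pipe_init(pipeline_t *p) {
    for (int i = 0; i < STAGES + 2; i++) p->req[i] = false;
    for (int i = 0; i < STAGES + 1; i++) p->data[i] = 0;
}

/* The producer may send once stage 1 has captured the previous token. */
bool pipe_can_push(const pipeline_t *p) { return p->req[0] == p->req[1]; }

/* Caller is expected to check pipe_can_push() first. */
void pipe_push(pipeline_t *p, int value) {
    p->data[0] = value;
    p->req[0] = !p->req[0];   /* a transition, not a level, signals new data */
}

/* One sweep over the stages, last to first, so a token advances at most
   one stage per call. There is no clock: progress depends only on the
   local request/acknowledge state of neighbouring stages. */
void pipe_step(pipeline_t *p) {
    for (int i = STAGES; i >= 1; i--) {
        bool output_acked = (p->req[i] == p->req[i + 1]);
        bool new_token    = (p->req[i - 1] != p->req[i]);
        if (output_acked && new_token) {
            p->data[i] = p->data[i - 1];
            p->req[i]  = p->req[i - 1];
        }
    }
}

bool pipe_can_pop(const pipeline_t *p) {
    return p->req[STAGES] != p->req[STAGES + 1];
}

/* Caller is expected to check pipe_can_pop() first. */
int pipe_pop(pipeline_t *p) {
    p->req[STAGES + 1] = p->req[STAGES];   /* acknowledge by matching the request */
    return p->data[STAGES];
}
```

Tokens leave in the order they were pushed, and a stalled consumer simply stops the stages behind it from capturing new data, which is the latency-insensitive behavior the deck attributes to asynchronous pipelines.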

5 Advantages of Asynchronous Circuits No clock skew (since there is no clock). Low power (no global clock and no glitches); glitches account for 10s of % of FPGA dynamic power. Increased speed (not limited by worst case). Less electromagnetic noise. Robustness to variation in voltage, temperature, and fabrication parameters. Better modularity.

6 Achronix FPGA Start-up spun out of Manohar's research, Cornell University, 2005. Externally synchronous; internally asynchronous. Purported: 1.5 GHz internal throughput. 65 nm TSMC process. November news: next-generation 22 nm chip to be fabricated by Intel.

7 Achronix FPGA [Source: Achronix '10]

8 Achronix Signaling Protocol [Source: Manohar, CICC '06]

9 Achronix FPGA Positives: High speed. Latency insensitive (stages can be added and the circuit is still correct). Less noise; robustness to variation. Negatives: High power consumption. Large silicon area.

10 Next-Generation Asynchronous FPGA Potential Directions: Bundle multiple data bits together with a single handshake signal. Datapath-oriented routing. Lower-overhead handshaking / data encoding: matched delay (as in MOUSETRAP) [Nowick 2007]; makes timing assumptions. Mix of wire lengths in routing fabric: fewer latch/handshake stages for long wires; similar to Xilinx/Altera FPGAs except used in opposite circumstances.

11 Next Gen Asynchronous FPGA Testchip Implementation Testchip to contain key asynchronous circuit structures, built in state-of-the-art CMOS process Measure speed and power Use measured data to drive a broader large-scale architectural study Use measured results to populate area/delay/power models in architecture study.

12 Next Gen Asynchronous FPGA Develop Tools: Develop a set of tools to allow circuits to be implemented on the proposed FPGA (from HDL) Want to present a synchronous look. Re-use existing synthesis/place/route FPGA tools as much as possible Make it easy to port existing synchronous circuits to new (asynchronous) fabric. Develop parameterized architecture model to evaluate architectural trade-offs: LUT size, routing segmentation, etc.

13 Part I: Next Generation FPGA Architectures Asynchronous Fabric On-Chip Network for FPGAs High-Speed I/O for FPGAs

14 On-Chip Networks Transistor scaling along Moore's Law. Trend toward integrating more cores on a single die. Cores require an efficient communication fabric to transmit data and control. System of channels, buffers, switches and routers that transport data between nodes. Data is transferred as packets. Each node is connected to neighbors by short local wires.

15 Aspects of OCN Design Topology: Arrangements of channels and routers # of links per node, hop count, total bandwidth Routing algorithm: Deterministic, oblivious or adaptive Deadlock avoidance Flow control: Governs allocation of resources to messages Router micro-architecture Implementation of routing, flow control
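
To make the topology and routing terms concrete, here is a minimal sketch of one deterministic scheme: dimension-order (XY) routing on a 2D mesh. The slide does not name a topology or algorithm, so the mesh, the port names, and the functions `mesh_hops` and `xy_route` are all assumptions chosen for illustration. XY routing is a common textbook example because always finishing the X dimension before turning into Y rules out cyclic channel dependencies on a mesh.

```c
#include <stdlib.h>

/* Router coordinates in a hypothetical 2D mesh. */
typedef struct { int x, y; } node_t;

typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Minimal hop count between two routers in a mesh (Manhattan distance). */
int mesh_hops(node_t src, node_t dst) {
    return abs(dst.x - src.x) + abs(dst.y - src.y);
}

/* Deterministic XY routing: correct the X offset first, then the Y offset.
   Returns the output port the packet should take at the current router. */
port_t xy_route(node_t cur, node_t dst) {
    if (dst.x > cur.x) return PORT_EAST;
    if (dst.x < cur.x) return PORT_WEST;
    if (dst.y > cur.y) return PORT_NORTH;
    if (dst.y < cur.y) return PORT_SOUTH;
    return PORT_LOCAL;   /* arrived: eject to the attached node */
}
```

An oblivious or adaptive algorithm would replace `xy_route` with a function that can pick among several productive ports, at the cost of needing explicit deadlock avoidance.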

16 On-Chip Networks Beneficial to highly connected systems with many communication flows. Saves routing area by re-using wires for multiple packets. Becoming ubiquitous: servers; embedded systems (heterogeneous system-on-chip designs); experimental platforms (Intel Single-chip Cloud Computer). OCN design involves trade-offs between speed, power and area.

17 OCNs for FPGAs OCN Today Typically implemented on many-core processor architecture Offers scalability for large number of nodes Intel: 48 and 80 core prototypes OCNs implemented on FPGAs as soft IP Router logic in LUTs OCN for Next Generation FPGA Perform feasibility study on benefits of hardening OCN IP in FPGA fabric New trade-offs between flexibility and area/power/speed of dedicated logic Understand communication needs of benchmark circuits

18 Dynamic and Flexible OCN Mismatch between static OCN and dynamic traffic behavior Observe spatial and temporal variation in communication within and across applications No single best OCN across all applications OCN on FPGA should maintain some level of configurability OCN for FPGA needs a mix of fixed and configurable logic Determine best set of OCN primitives for FPGA fabric Channel widths, buffering, allocation, routing logic Provide best performance across a wide variety of synthesized designs

19 OCN for FPGA: CAD Tools Develop CAD tools to support OCN communication Tools to leverage hardware communication capability Effectively partition communication Directly routed communication Communication through OCN Potential to simplify CAD tools and reduce synthesis time Easier to place and route smaller partitioned circuits Leverage high-level programming constructs Message passing paradigm

20 Simulation and Implementation Leverage existing tools/infrastructure Design-space exploration of OCN for FPGA Implement small OCN modules on a testchip to obtain accurate measurements Design in state-of-the-art CMOS process

21 Part I: Next Generation FPGA Architectures Asynchronous Fabric On-Chip Network for FPGAs High-Speed I/O for FPGAs

22 High-Speed I/O High-speed I/O refers to a transmitter and receiver that send digital data serially across a given channel. Physical Media Attachment (PMA): forms the electrical interface to the channel. Physical Coding Sublayer (PCS): performs data encoding and decoding (e.g. 8b/10b, deskewing, etc.). Intellectual property (IP): optional block that performs standard-specific operations (e.g. transaction and media access layer in PCI Express).

23 High-Speed I/O in Current FPGAs Stratix IV and Virtex 6 (40 nm, from Altera and Xilinx). Focused mostly on backplane channels. Support up to ~11 Gbps. At 11.3 Gbps, the Stratix IV PMA consumes 25.3 mW/Gbps and PCS+HIP consumes 7.1 mW/Gbps. Stratix V (28 nm) claims to support up to 28 Gbps. * Image obtained by Google search
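
The per-Gbps efficiency figures on this slide translate into absolute power per transceiver by multiplying by the data rate. The helper below is a minimal sketch of that arithmetic; the function name `link_power_mw` is an assumption for illustration, and the only inputs are the numbers quoted on the slide.

```c
/* Power in mW for one link, given an energy efficiency in mW/Gbps
   and a data rate in Gbps. */
double link_power_mw(double mw_per_gbps, double rate_gbps) {
    return mw_per_gbps * rate_gbps;
}
```

With the slide's numbers, `link_power_mw(25.3, 11.3)` gives about 286 mW for the PMA and `link_power_mw(7.1, 11.3)` about 80 mW for PCS+HIP, so roughly 366 mW per channel at 11.3 Gbps.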

24 FPGA High-Speed I/O: Challenges The main challenges arise from supporting many standards under speed, power, and area constraints: Wide range of data rates XAUI: 1.25G per TX/RX PCI Express Gen. 2: 5G per TX/RX 100G Ethernet: 10G per TX/RX Different channel configurations Number of bonded channels Clock distribution Data encoding/decoding (8b/10b, 64b/66b, etc.) Differential swing and common-mode voltage Different requirements for handling non-idealities Multiple standards can have conflicting requirements (e.g. ESD protection vs. return loss)
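
One reason the encoding schemes listed here matter is that they set how much of the line rate carries payload: 8b/10b sends 10 line bits for every 8 data bits, and 64b/66b sends 66 for every 64. The sketch below captures that ratio; the function name `payload_gbps` and its parameters are assumptions for illustration, not part of any standard's API.

```c
/* Usable payload rate for a line code that maps payload_bits of data
   onto coded_bits on the wire (e.g. 8 and 10 for 8b/10b). */
double payload_gbps(double line_rate_gbps, int payload_bits, int coded_bits) {
    return line_rate_gbps * (double)payload_bits / (double)coded_bits;
}
```

For example, a 5 Gbps PCI Express Gen. 2 lane with 8b/10b carries `payload_gbps(5.0, 8, 10)`, i.e. 4 Gbps of payload, while the same call with 64 and 66 shows the much lower overhead of 64b/66b.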

25 FPGA HSIO: Research Directions (1 of 2) High-speed I/O modeling: system-level model for communication between an FPGA and another chip or between two FPGAs; timing and jitter estimation. Ease of use: What is the estimated latency (delay) from the TX PCS input to the RX PCS output? What are the optimal TX and RX settings (e.g. PLL counter settings, RX equalization, etc.) to minimize jitter? To work with multiple standards: TX/RX reconfiguration through on-chip controller or software tool.

26 FPGA HSIO: Research Directions (2 of 2) Low power and high speed operation Power scaling with data rate Burst-mode operation for low power Bi-directional communication to reduce number of I/O pins on the FPGA package Area reduction High-speed I/O occupies 21% of silicon area in Stratix IV die photo FPGA layout increasingly constrained by I/O blocks instead of core logic Area reduction, uniformity, and modularity across high-speed I/O blocks will improve layout

27 FPGA HSIO Implementation Design test chip containing TX and RX blocks built in state-of-the-art CMOS process. Design high-speed blocks for multiple standards: to be reconfigurable; to support a wide range of data rates. Measure speed and power.

28 Part (II): FPGA Application Platform High-Level Synthesis On-Chip Network for Commodity FPGAs

29 Motivation Hardware has advantages over software: performance, energy efficiency. Hardware design is difficult and skills are rare: 10 software engineers for every hardware engineer*. Need a CAD flow that simplifies hardware design for software engineers. LegUp: high-level synthesis tool: C program to processor/accelerator system. Project about 1.5 years old; initially funded by Altera, but now available publicly. Public release: *US Bureau of Labour Statistics '08

30 LegUp: Top-Level Vision Program code: int FIR(int ntaps, int sum) { int i; for (i=0; i < ntaps; i++) sum += h[i] * z[i]; return (sum); } ... [Figure labels: program code; C compiler; self-profiling processor (MIPS); profiling data: execution cycles, power, cache misses; suggested program segments to target to HW; high-level synthesis; hardened program segments; altered SW binary (calls HW accelerators); μP; FPGA fabric]
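
The FIR kernel on this slide refers to arrays `h` and `z` that are not declared in the excerpt. A self-contained version, under the assumption that they are plain `int` arrays of at least `ntaps` elements (coefficients and delayed samples), could look like the sketch below; passing them as parameters is a choice made here for testability, not something the slide specifies.

```c
/* Multiply-accumulate over ntaps coefficient/sample pairs, starting from
   an initial sum, mirroring the loop shown on the slide. */
int fir(const int *h, const int *z, int ntaps, int sum) {
    for (int i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}
```

A loop like this, with a fixed trip count and independent multiply-accumulates, is the kind of program segment the profiling flow on the slide would suggest for a hardware accelerator.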

31 Initial Target System Architecture FPGA MIPS Processor Hardware Accelerator Hardware Accelerator AVALON BUS Memory Controller On-Chip Memory Off-Chip Memory

32 LegUp for Fujitsu Adapt LegUp for Fujitsu's applications. Gain understanding of key computing applications for Fujitsu: server, financial, biomedical, communications. Identify areas where speed/power is crucial. Develop LegUp-based tools and methodologies specifically for Fujitsu's future computing needs. Enable use of FPGAs in Fujitsu computing apps. Improve ease-of-use, time-to-market.

33 Target Architecture Alternatives to initial bus-based arch: CPUs, accelerators connected via on-chip networks FPGA-based compute accelerators connected to Intel/AMD processor in a different socket Potential target: server market Processor/accelerator architectures specifically tailored to Fujitsu applications: Example: develop streaming interface between accelerators/processors for improved speed/power in media applications.

34 Tool Development Develop specific tools & methodologies that ease future Fujitsu product development Create libraries to simplify and optimize FPGA use in targeted areas Examples: financial risk modeling via Monte Carlo analysis, fixed point arithmetic Patterns/engines optimized for target areas
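
As an illustration of the kind of library routine this slide has in mind, here is a minimal fixed-point sketch. The Q16.16 format (16 integer bits, 16 fractional bits), the type name `q16_16`, and the helper names are assumptions for illustration; the slide only mentions fixed-point arithmetic as a target area.

```c
#include <stdint.h>

/* Q16.16 fixed point: the stored integer is the real value times 65536. */
typedef int32_t q16_16;
#define Q_ONE 65536

/* Convert a small integer to Q16.16 (no overflow checking in this sketch). */
q16_16 q_from_int(int v) { return (q16_16)(v * Q_ONE); }

/* Addition works directly on the stored integers. */
q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }

/* Multiplication needs a 64-bit intermediate, then one scale factor is
   divided back out. Division truncates toward zero. */
q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) / Q_ONE);
}
```

Fixed-point operators like these map onto FPGA multipliers and adders far more cheaply than floating point, which is why they are a natural candidate for an optimized library.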

35 Part (II): FPGA Application Platform High-Level Synthesis On-Chip Network for Commodity FPGAs

36 OCN for Commodity FPGA/Multi-FPGA Soft OCN architectures for FPGAs. Extend OCN design for multi-FPGA systems: FPGA-based supercomputers; FPGA-accelerated server products. Requirements: efficient/seamless interfaces; fairness and quality of service; novel protocols. Short-term: leverage commodity FPGA parts.

37 Soft IP OCN Explore design trade-offs of soft IP-based OCN on commodity FPGAs. Use FPGA as a platform to study innovative OCN design. Explore novel router architectures. Architectural support for various on-chip and chip-to-chip protocols: coherence, message-passing, communication primitives. Create library of FPGA modules. Provide composable OCN building blocks with various area/speed configurations.

38 Multi-FPGA Systems Design FPGA modules to provide off-chip interfaces. Interface with existing I/O or novel I/O designs. Multi-FPGA communication protocols. Architectural techniques to manage data across chip boundaries. OCN fairness and quality of service. Analyze and mitigate communication bottlenecks in system: significant chip-to-chip communication, substantial memory demand. Off-chip interface, memory controllers: communication hotspots.

39 Multi-FPGA Tool Development Extend CAD tools to understand communication costs between chips Latency sensitive communication kept on-chip Effective circuit-partitioning
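
One simple way a partitioner can account for communication costs between chips is to sum the weight of every connection whose endpoints land on different FPGAs. The sketch below shows that cost function only; the `edge_t` layout, the integer `weight` (for example, bandwidth or latency sensitivity), and the `chip_of` assignment array are hypothetical and not taken from the deck.

```c
#include <stddef.h>

/* A connection between two circuit blocks, with a hypothetical weight. */
typedef struct { int src, dst; int weight; } edge_t;

/* Total weight of connections that cross a chip boundary.
   chip_of[b] is the FPGA that block b is assigned to. */
long cut_cost(const edge_t *edges, size_t n_edges, const int *chip_of) {
    long cost = 0;
    for (size_t i = 0; i < n_edges; i++) {
        if (chip_of[edges[i].src] != chip_of[edges[i].dst])
            cost += edges[i].weight;
    }
    return cost;
}
```

Giving latency-sensitive connections a large weight makes any partition that cuts them expensive, which nudges the tool toward keeping them on-chip.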

40 Implementation Modify relevant benchmark circuits to leverage OCN. Synthesize in commodity FPGA to measure performance. Implement FPGA modules for chip-to-chip protocol. Utilize results to motivate interface and I/O implementation in next-generation FPGA designs.

41 Research Resources/Timeline Faculty members committed to this research: Ali Sheikholeslami (lead member), Natalie Enright Jerger, Jason Anderson. Graduate students involved: Clifford Ting. Graduate students to be recruited beginning May 2011. Access to CMOS process: initial access to Fujitsu's 65 nm CMOS; tapeout dates to be determined. Phone Conferences/Meetings: to be arranged as needed (at least 4 times in a year). Updates via and full annual report

42 Summary Research in next-generation FPGA architectures spans the following directions: Asynchronous FPGA architecture: explore reduction in power & area, increase in speed. On-Chip Network for FPGA: explore reduction in area and increase in speed. Use of OCN for Asynchronous FPGA: although not discussed in this document, there is a possibility that the OCN developed under this proposal will be useful to asynchronous FPGAs. Research in commodity FPGA includes: synthesis with Fujitsu's application as main target; explore implementation of OCN as a soft block.

43 Acknowledgement We would like to acknowledge the technical contributions of Clifford Ting to the preparation and discussions of this proposal.
