High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

Similar documents
Copyright 2017 Xilinx.

UltraScale Architecture and Product Overview

Zynq AP SoC Family

UltraScale Architecture: Highest Device Utilization, Performance, and Scalability

Reconfigurable Computing

High Performance Memory in FPGAs

Zynq-7000 All Programmable SoC Product Overview

FPGA architecture and design technology

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

Stacked Silicon Interconnect Technology (SSIT)

7-Series Architecture Overview

Xilinx SSI Technology Concept to Silicon Development Overview

Beyond Moore. Beyond Programmable Logic.

Basic FPGA Architecture Xilinx, Inc. All Rights Reserved

Simplify System Complexity

Introduction to Field Programmable Gate Arrays

Field Programmable Gate Array (FPGA)

Abbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University

Signal Conversion in a Modular Open Standard Form Factor. CASPER Workshop August 2017 Saeed Karamooz, VadaTech

UltraScale Architecture and Product Overview

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

International Training Workshop on FPGA Design for Scientific Instrumentation and Computing November 2013.

Field Programmable Gate Array (FPGA) Devices

Simplify System Complexity

EITF35: Introduction to Structured VLSI Design

COSMOS Architecture and Key Technologies. June 1 st, 2018 COSMOS Team

CS310 Embedded Computer Systems. Maeng

Understanding Peak Floating-Point Performance Claims

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)

Advancing high performance heterogeneous integration through die stacking

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

The Xilinx UltraScale Architecture

Programmable Logic. Any other approaches?

Placement Strategies for 2.5D FPGA Fabric Architectures

HES-7 ASIC Prototyping

UltraScale Architecture Integrated IP Core for Interlaken v1.3

Introduction to Partial Reconfiguration Methodology

Maximum Capability Artix-7 Family Kintex-7 Family Virtex-7 Family

Graduate Institute of Electronics Engineering, NTU FPGA Design with Xilinx ISE

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

7 Series FPGAs Overview

Vivado HLx Design Entry. June 2016

Spiral 2-8. Cell Layout

Introduction to Modern FPGAs

Pushing Performance and Integration with the UltraScale+ Portfolio

Topics. Midterm Finish Chapter 7

How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

Introduction to FPGAs. H. Krüger Bonn University

7 Series FPGAs Clocking Resources

Altera FLEX 8000 Block Diagram

S2C K7 Prodigy Logic Module Series

All Programmable: from Silicon to System

UltraScale Architecture and Product Data Sheet: Overview

Achieving Breakthrough Performance with Virtex-4, the World s Fastest FPGA

EE219A Spring 2008 Special Topics in Circuits and Signal Processing. Lecture 9. FPGA Architecture. Ranier Yap, Mohamed Ali.

BittWare s XUPP3R is a 3/4-length PCIe x16 card based on the

INTRODUCTION TO FPGA ARCHITECTURE

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Power Consumption in 65 nm FPGAs

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas

UltraScale Architecture Integrated Block for 100G Ethernet v1.4

VLSI Programming 2016: Lecture 3

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Institute of Automation, Chinese Academy of Sciences. The Design of Snap2 System

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

FPGA VHDL Design Flow AES128 Implementation

The Nios II Family of Configurable Soft-core Processors

High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS

An Introduction to Programmable Logic

Xynergy It really makes the difference!

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

Optimizing latency in Xilinx FPGA Implementations of the GBT. Jean-Pierre CACHEMICHE

Xilinx 7 Series FPGA Power Benchmark Design Summary

Opportunities & Challenges: 28nm & 2.5/3-D IC Design and Manufacturing

Topics. Midterm Finish Chapter 7

Synthesis Options FPGA and ASIC Technology Comparison - 1

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Pushing the Boundaries of Moore's Law to Transition from FPGA to All Programmable Platform Ivo Bolsens, SVP & CTO Xilinx ISPD, March 2017

High Performance DDR4 interfaces with FPGA Flexibility. Adrian Cosoroaba and Terry Magee Xilinx, Inc.

GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray

Enabling Technology for the Cloud and AI One Size Fits All?

Hot Chips 2017 Xilinx 16nm Datacenter Device Family with In-Package HBM and CCIX Interconnect Gaurav Singh Sagheer Ahmad, Ralph Wittig, Millind

XA Artix-7 FPGAs Data Sheet: Overview

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

Outline. Field Programmable Gate Arrays. Programming Technologies Architectures. Programming Interfaces. Historical perspective

Stratix vs. Virtex-II Pro FPGA Performance Analysis

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture

Synthesis of VHDL Code for FPGA Design Flow Using Xilinx PlanAhead Tool

SUBMITTED FOR PUBLICATION TO: IEEE TRANSACTIONS ON VLSI, DECEMBER 5, A Low-Power Field-Programmable Gate Array Routing Fabric.

High-Speed NAND Flash

L2: FPGA HARDWARE : ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA

EE178 Lecture Module 2. Eric Crabill SJSU / Xilinx Fall 2007

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures

Zynq Ultrascale+ Architecture

LogiCORE IP Soft Error Mitigation Controller v4.0

Transcription:

High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014

Not a Complete Product Overview Page 2

Outline Page 3

Petabytes per month Increasing Bandwidth Global IP Traffic Growth 2011 2016 1 Growth of Xilinx FPGA Serial Bandwidth 6 Tb/s 5 Tb/s 4 Tb/s 3 Tb/s 2 Tb/s 1 Tb/s 1: Cisco Visual Networking Index, Forecast and Methodology, 2011 2016 130nm 90nm 65nm 40nm 28nm 20nm N Page 4

Higher Bandwidth IO 1 2 3 Page 5

High Throughput Demands Wider, Faster Data Paths 1,000 Gb/s Page 6

Addressing Interconnect Bottlenecks 1 2 3 Page 7

Clocking Challenge 1 2 3 Page 8

Benefits of UltraScale Clock Architecture Clock Domain 1 Clock Domain 1 32 Clock Domain 2 Clock Domain 2 Clock Domain 3 Global Clock Network Leaf Cell Clocks Clocking 32 centrally located global clock buffers Root always in center of device I/O Logic GT Clock Root Distribution Clocks Routing Clocks 192 720 distributed global buffers Root can be in any clock region Skew accumulates from center to edge Balanced skew per clock network Page 9

Routing Challenge 1 2 3 Page 10

CLB Innovations Enable Tighter Packing Removed slice boundary & added MUX Wider functions per block 2x distributed RAM density Wider carry chain Wider functions per block CLB = Slice Slice 0 UltraScale 7 Series CLB carry out Flip-flops sharing same CE are same color Dedicated inputs for each flip-flop Higher performance Reduces LUT utilization UltraScale 7 Series LUT U independent direct route Slice 1 U carry in carry out Bypass CE ignore and RST ignore, RST inversion Higher performance Eliminates synchronous reset bottlenecks 2x the number of CE s Improves CLB packing RST ignore, inversion CE ignore D IN RST CE Flip-flops sharing same CE are same color carry in Page 11

Total Wire Length Design Use Fewer CLBs, Shorter Wire Length Re-architected for flexible connectivity and efficient logic packing Better local routability, more control set flexibility, improved P&R algorithms Tighter packing results in shorter net delays, less wire switching Translates into higher utilization, greater performance, less power Total Wirelength UltraScale & Vivado vs Previous Architecture (Normalized to 7 series) 10 Average 7 series UltraScale Fully Used CLB Partially Used CLB Individual Designs Page 12

Additional Metal Layers 1 2 3 Page 13

Routing complexity Routability Benchmark Suite Metrics: Design Routing Complexity and Design Size Congestion Metrics displayed in Vivado No routing congestion High routing congestion Cannot route No congestion Design Size Page 14

Routing complexity Routing complexity UltraScale Results Vivado routes more complex designs on UltraScale UltraScale shows lower congestion on complex designs As a result, timing closure is accelerated Delivers 1 speedgrade higher Fmax No routing congestion High routing congestion Cannot route Page 15

2nd Generation Stacked Silicon Feature ~20,000 registered routing lines between die Clocking Architecture Spans SLR boundaries Foot-print compatibility between SSI and non-ssi devices Benefit Enables >500 MHz datapath performance between SLRs Deterministic, predictable timing Device size up to 44 million logic cells Abundant clock resources to meet demanding application Ability to seamlessly migrate from monolithic to 3D-IC devices SLR0 SLR1 SLR2 passive interposer Substrate Page 16

Power Optimizations Transceiver I/O up to 60% up to 50% Architectural optimizations Low power mode I/O multi-mode control (cont d from 28nm) DDR4 voltage reduction Dynamic up to 30% CLB packing & reduced wire length HW based clock gating on leaf cells Static Spartan-6/Virtex-6 (45nm/40nm) up to 50% up to 65% Transceiver I/O Dynamic Static 7 Series (28nm) up to 30% up to 40% up to 50% up to 40% 25-45% Transceiver I/O Dynamic Static UltraScale (20nm/16nm) BRAM hardened data cascading BRAM dynamic power gating DSP hardened features MMCM & PLL lower supply voltage Process node Power binning & lower voltage scaling 3D IC static power binned slices Page 17

Kintex UltraScale FPGAs Logic Resources Memory Resources Clock Resources I/O Resources Integrated IP Resources Speed Grades Footprint Compatible with Virtex UltraScale Page 18 Part Number XCKU035 XCKU040 XCKU060 XCKU075 XCKU100 XCKU115 Logic Cells 355,474 424,200 580,440 756,000 958,440 1,160,880 CLB Flip-Flops 406,256 484,800 663,360 864,000 1,095,360 1,326,720 CLB LUTs 203,128 242,400 331,680 432,000 547,680 663,360 Maximum Distributed RAM (Kb) 5,908 7,050 9,180 7,290 12,825 18,360 Block RAM/FIFO w/ecc (36 Kb each) 540 600 1,080 1,188 1,680 2,160 Block RAM/FIFO (18 Kb each) 1,080 1,200 2,160 2,376 3,360 4,320 Total Block RAM (Mb) 190 211 380 418 591 759 CMT (1 MMCM, 2 PLLs) 8 8 12 14 24 24 I/O DLL 40 40 48 64 64 64 Maximum Single-Ended HP I/Os 416 416 520 624 676 676 Maximum Differential HP I/O Pairs 192 192 240 288 312 312 Maximum Single-Ended HR I/Os 104 104 104 104 156 156 Maximum Differential HR I/O Pairs 48 48 48 48 72 72 DSP Slices 1,700 1,920 2,760 2,592 4,200 5,520 System Monitor 1 1 1 1 2 2 PCIe Gen1/2/3 2 3 3 4 6 6 Interlaken 0 0 0 2 0 0 100G Ethernet 0 0 0 1 0 0 GTH 16 Gb/s Transceivers 16 20 32 52 64 64 Commercial -1-1 -1-1 -1-1 Extended -2-3 -2-3 -2-3 -2-3 -2-3 -2-3 Industrial -1-1L -2-1 -1L -2-1 -1L -2-1 -1L -2-1 -1L -2-1 -1L -2 Package Package Footprint Dimensions (mm) HR I/O, HP I/O, GTH 16 Gb/s A676 27x27 104, 208, 16 104, 208, 16 A900 31x31 104, 364, 16 104, 364, 16 A1156 35x35 104, 416, 16 104, 416, 20 104, 416, 28 104, 416, 28 A1517 40x40 104, 520, 32 104, 520, 48 104, 520, 48 104, 520, 48 B1517 40x40 104, 260, 64 104, 260, 64 A1760 104, 624, 52 425x425 A1760 104, 624, 52 104, 624, 52 D1924 45x45 156, 676, 52 156, 676, 52 F1924 45x45 104, 624, 64 104, 624, 64

Virtex UltraScale FPGAs Memory Resources Clock Resources I/O Resources Integrated IP Resources Speed Grades Footprint Compatible with Kintex UltraScale Devices Page 19 Part Number XCVU065 XCVU080 XCVU095 XCVU125 XCVU160 XCVU190 XCVU440 Logic Cells 626,640 780,000 940,800 1,253,280 1,621,200 1,879,920 4,407,480 CLB Flip-Flops 716,160 891,424 1,075,200 1,432,320 1,852,800 2,148,480 5,037,120 CLB LUTs 358,080 445,712 537,600 716,160 926,400 1,074,240 2,518,560 Maximum Distributed RAM (Kb) 4,230 3,980 4,800 8,460 10,710 12,690 28,710 Block RAM/FIFO w/ecc (36 Kb each) 1,260 1,421 1,728 2,520 3,276 3,780 2,520 Block RAM/FIFO (18 Kb each) 2,520 2,842 3,456 5,040 6,552 7,560 5,040 Total Block RAM (Mb) 443 500 608 886 1152 1329 886 CMT (1 MMCM, 2 PLLs) 10 16 16 20 26 30 30 I/O DLL 40 64 64 80 104 120 120 Fractional PLL 5 8 8 10 13 15 0 Maximum Single-Ended HP I/Os 468 780 780 936 988 988 1,404 Maximum Differential HP I/O Pairs 216 360 360 432 456 456 648 Maximum Single-Ended HR I/Os 52 52 52 104 52 52 52 Maximum Differential HR I/O Pairs 24 24 24 48 24 24 24 DSP Slices 600 672 768 1,200 1,560 1,800 2,880 System Monitor 1 1 1 2 3 3 3 PCIe Gen1/2/3 2 4 4 4 4 6 6 Interlaken 3 6 6 6 9 9 0 100G Ethernet 3 4 4 6 7 9 3 GTH 16 Gb/s Transceivers 20 32 32 40 52 60 48 GTY 33 Gb/s Transceivers 20 32 32 40 52 60 0 Commercial -1-1 -1-1 -1-1 -1 Extended -2-3 -2-3 -2-3 -2-3 -2-3 -2-3 -2-3 Industrial -1-1L -2-1 -1L -2-1 -1L -2-1 -1L -2-1 -1L -2-1 -1L -2-1 -1L -2 Package Package Footprint (1) Dimensions (mm) HR I/O, HP I/O, GTH 16 Gb/s, GTY 33 Gb/s C1517 40x40 52, 468, 20, 20 52, 468, 24, 24 52, 468, 24, 24 B1517 40x40 52, 312, 32, 32 52, 312, 32, 32 52, 312, 40, 32 A1760 425x425 52, 676, 32, 16 52, 676, 32, 16 52, 676, 36, 16 D1924 45x45 52, 780, 28, 24 52, 780, 28, 24 52, 780, 28, 24 E1924 45x45 52, 624, 32, 32 52, 624, 32, 32 52, 624, 36, 36 475x475 (2) 52, 624, 36, 36 52, 624, 36, 36 J1924 45x45 52, 312, 32, 32 52, 312, 40, 40 475x475 (2) 52, 312, 52, 52 52, 312, 52, 52 A2377 50x50 104, 936, 28, 24 52, 988, 28, 24 52, 988, 28, 24 B2377 50x50 52, 1248, 36, 0 C2377 50x50 0, 520, 60, 60 A2892 55x55 52, 1404, 48, 0

Key Takeaways Increased IO bandwidth Up to 120 transceivers Up to 33 Gb/s per channel Routing architecture and Vivado designed to reduce congestion and improve performance Clocking architecture for flexibility and performance 2 nd generation SSI delivers "more than Moore" Up to 44 million logic cells in largest device Breakthrough architecture to address the most complex designs

Thank You! Page 21