Achieving Breakthrough Performance with Virtex-4, the World s Fastest FPGA

Similar documents
High-Performance Memory Interfaces Made Easy

The Next Generation 65-nm FPGA. Steve Douglass, Kees Vissers, Peter Alfke Xilinx August 21, 2006

H100 Series FPGA Application Accelerators

The DSP Primer 8. FPGA Technology. DSPprimer Home. DSPprimer Notes. August 2005, University of Strathclyde, Scotland, UK

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Field Programmable Gate Array (FPGA) Devices

Field Programmable Gate Array (FPGA)

Stratix II vs. Virtex-4 Performance Comparison

Altera Product Overview. Altera Product Overview

7-Series Architecture Overview

Stratix vs. Virtex-II Pro FPGA Performance Analysis

Creating High-Speed Memory Interfaces with Virtex-II and Virtex-II Pro FPGAs Author: Nagesh Gupta, Maria George

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)

Introduction to Field Programmable Gate Arrays

Basic FPGA Architecture Xilinx, Inc. All Rights Reserved

Enabling success from the center of technology. Interfacing FPGAs to Memory

Hardware Design with VHDL PLDs IV ECE 443

International Training Workshop on FPGA Design for Scientific Instrumentation and Computing November 2013.

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture

APEX II The Complete I/O Solution

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

Virtex-4 Family Overview

8. Migrating Stratix II Device Resources to HardCopy II Devices

Interfacing RLDRAM II with Stratix II, Stratix,& Stratix GX Devices

EE178 Lecture Module 2. Eric Crabill SJSU / Xilinx Fall 2007

Rich Sevcik Executive Vice President, Xilinx APAC: RS _January 05

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

Summary. Introduction. Application Note: Virtex, Virtex-E, Spartan-IIE, Spartan-3, Virtex-II, Virtex-II Pro. XAPP152 (v2.1) September 17, 2003

FPGA architecture and design technology

Dynamic Phase Alignment for Networking Applications Author: Tze Yi Yeoh

Stratix. High-Density, High-Performance FPGAs. Available in Production Quantities

INTRODUCTION TO FPGA ARCHITECTURE

Digital Integrated Circuits

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today.

Synthesis Options FPGA and ASIC Technology Comparison - 1

Introduction to FPGAs. H. Krüger Bonn University

Stratix. Introduction. Features... 10,570 to 114,140 LEs; see Table 1. FPGA Family. Preliminary Information

Parallel FIR Filters. Chapter 5

The Nios II Family of Configurable Soft-core Processors

DSP Resources. Main features: 1 adder-subtractor, 1 multiplier, 1 add/sub/logic ALU, 1 comparator, several pipeline stages

Altera FLEX 8000 Block Diagram

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

Stratix. Introduction. Features... Programmable Logic Device Family. Preliminary Information

XA Spartan-6 Automotive FPGA Family Overview

TSEA44 - Design for FPGAs

1. Stratix IV Device Family Overview

Optimal Management of System Clock Networks

High Performance DDR4 interfaces with FPGA Flexibility. Adrian Cosoroaba and Terry Magee Xilinx, Inc.

Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

System-on Solution from Altera and Xilinx

FPGA VHDL Design Flow AES128 Implementation

Chapter 2. Cyclone II Architecture

The Ultimate System Integration Platform. VIRTEX-5 FPGAs


Zynq-7000 All Programmable SoC Product Overview

XA Artix-7 FPGAs Data Sheet: Overview

VHX - Xilinx - FPGA Programming in VHDL

Introduction to Partial Reconfiguration Methodology

APEX Devices APEX 20KC. High-Density Embedded Programmable Logic Devices for System-Level Integration. Featuring. All-Layer Copper.

EITF35: Introduction to Structured VLSI Design

White Paper The Need for a High-Bandwidth Memory Architecture in Programmable Logic Devices

The Xilinx XC6200 chip, the software tools and the board development tools

High-Performance DDR3 SDRAM Interface in Virtex-5 Devices Author: Adrian Cosoroaba

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures

FPGA IO Resources. FPGA pin buffers can be configured for. IO columns include Xilinx SelectIO resources. Slew rate. Serializer/Deserializer (IOSERDES)

The S6000 Family of Processors

Synthesis of VHDL Code for FPGA Design Flow Using Xilinx PlanAhead Tool

1. Overview for the Arria V Device Family

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

SigmaRAM Echo Clocks

High-Performance Integer Factoring with Reconfigurable Devices

TECHNOLOGY BRIEF. Double Data Rate SDRAM: Fast Performance at an Economical Price EXECUTIVE SUMMARY C ONTENTS

Xilinx DSP. High Performance Signal Processing. January 1998

White Paper. ORSPI4 Field-Programmable System-on-a-Chip Solves Design Challenges for 10 Gbps Line Cards

NEW FPGAs BOOST SOFTWARE DEFINED RADIO PERFORMANCE

Gate Estimate. Practical (60% util)* (1000's) Max (100% util)* (1000's)

Synthesizable CIO DDR RLDRAM II Controller for Virtex-II Pro FPGAs Author: Rodrigo Angel

EECS150 - Digital Design Lecture 16 - Memory

7 Series FPGAs Overview

Lecture 41: Introduction to Reconfigurable Computing

Power vs. Performance: The 90 nm Inflection Point

EECS150 - Digital Design Lecture 17 Memory 2

ispxpld TM 5000MX Family White Paper

Interfacing FPGAs with High Speed Memory Devices

Power Considerations in High Performance FPGAs. Abu Eghan, Principal Engineer Xilinx Inc.

FPGA Implementation and Validation of the Asynchronous Array of simple Processors

EE219A Spring 2008 Special Topics in Circuits and Signal Processing. Lecture 9. FPGA Architecture. Ranier Yap, Mohamed Ali.

FPGAs: FAST TRACK TO DSP

EECS150 - Digital Design Lecture 16 Memory 1

ISSN: [Bilani* et al.,7(2): February, 2018] Impact Factor: 5.164

L2: FPGA HARDWARE : ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA

ispgdx2 vs. ispgdx Architecture Comparison

MAX 10 FPGA Device Overview

Programmable Logic. Any other approaches?

In-chip and Inter-chip Interconnections and data transportations for Future MPAR Digital Receiving System

Peter Alfke, Xilinx, Inc. Hot Chips 20, August Virtex-5 FXT A new FPGA Platform, plus a Look into the Future

Section I. Cyclone II Device Family Data Sheet

Section I. Cyclone II Device Family Data Sheet

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

Transcription:

Achieving Breakthrough Performance with Virtex-4, the World s Fastest FPGA Xilinx 90nm Design Seminar Series: Part I Xilinx - #1 in 90 nm

We Asked our Customers: What are your challenges? Shorter design time, faster obsolescence More competition, increasing cost pressure Demanding complexity and performance, Signal integrity problems caused by faster I/O Power consumption and thermal issues Today s seminar addresses Performance

90 nm Technology Proof that Moore s Law is still alive Every two years, twice the number of transistors Every five years, twice the speed Half the size = 2x chips/wafer = half the cost, Smaller distances = less capacitance = higher speed Less C and lower V = less dynamic power = fcv 2 Xilinx has extensive design and manufacturing experience Spartan-3 since 2003, Virtex-4 since mid-2004 Xilinx has shipped 100 times more 90nm devices than all other FPGA manufacturers combined

500 MHz Virtex-4 1. 90 nm technology is the foundation More transistors, faster speed and lower cost 2. Clever circuit design builds on that foundation Smaller size, lower power, and higher performance 3. Architectural innovation, novel features Hard-coded IP: PowerPC, BRAM-FIFO, DSP-MAC Source-synchronous I/O with delay-line on every pin, Dedicated Multi-Gigabit Transceivers up to 10 Gbps 4. Continuously improving design tools, easier to use the fastest FPGA family in the world

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

Virtex-4: Highest FPGA Performance Performance 480 Gbps 10 Gbps 260 Gbps 702 DMIPS Benchmark 702 DMIPS Breakthrough Performance I/O LVDS Bandwidth I/O Memory Bandwidth High-speed Serial I/O On-chip RAM Speed DSP: 32-Tap Filter Embedded Processing Logic Fabric Performance

Better Functions = Higher Speed The biggest performance boost comes from: Versatile clocking Parallel I/O and memory interfaces Serial I/O, multi-gigabit transceivers Memory (distributed RAM, BlockRAM, FIFOs) Expandable DSP slices Fast synchronous counters Embedded processing Fabric logic resources and interconnect structure Functionality can triple the performance

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

500 MHz Versatile Clocking 32 Global clock lines can cover the whole chip Fast rise and fall times, low skew, differential Optimal for DDR operation, avoids duty-cycle distortion (skew, transition time and distortion would lower design margins) Distributed local clocks: ional clocks avoid the need for slow clock routing Very fast I/O clocks, up to = 1Gbps Rugged clocks, the backbone of any design

Precise Clock Management Phase-aligned frequency control Eliminates on-chip clock delay Of vital importance on larger chips Can also eliminate PC-board clock delay Simultaneous multiplication and division Frequency synthesis Dynamic Clock Phase adjust (servo-controlled) in 256 steps with 50 ps granularity 500 MHz Picosecond precision and granularity

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

High-Speed Inputs and Outputs Traditional System-Synchronous @ <200 MHz One central clock arrives everywhere simultaneously Source-Synchronous Clocking @ >200 MHz A dedicated clock always travels with the data Avoids all PC-board delay issues Receiver must precision-align clock or strobe with the data eye I/O data rate often exceeds internal data rate

Parallel I/O Optimized for Source-Synchronous Interfaces Programmable precision delay on every input pin Can align data to strobe or clock, or compensate pc board Serial-to-parallel converter on every input pin Parallel-to-serial converter on every output pin Does not rely on global clocks Speed and Bandwidth Networking Interfaces: 1+ Gbps differential Memory Interfaces: 600 Mbps single ended All I/O pins are created equal 500 MHz

Precise Timing Adjustment CLK DATA ChipSync IDELAY INC/DEC Virtex-4 4 FPGA Fabric State Machine 200 MHz (calibration) ISERDES IDELAY CNTL 500 MHz Independent IDELAY on every input pin Automatic calibration with 64 taps of 78 ps each

ChipSync Precision Timing Data valid (<1/3 ns) Data 533 Mbps Clock 267 MHz Leading and trailing edge uncertainty Example: 533 Mbps DDR2 SDRAM memory interface ChipSync IDELAY Automatically aligns data and clock 78 ps resolution for precise clock-to-data centering Increases design margins for higher system reliability Not available in any other FPGA, ASIC or ASSP

1 Gbps S/P on Every Input Pin 500 MHz ChipSync ISERDES n FPGA Fabric CLK BUFIO CLKDIV BUFR Virtex-4 FPGA Serial-to-Parallel Converter on every input pin Frequency division by 2, 3, 4, 5, 6, 7, 8 or 10 Works in conjunction with IDELAY block Dynamic signal alignment for bit, word, or clock Supports Dynamic Phase Alignment (DPA) Requires no resources in the fabric

1 Gbps P/S on Every Output Pin 500 MHz ChipSync OSERDES n FPGA Fabric CLK BUFIO CLKDIV BUFR Virtex-4 FPGA Parallel-to-Serial Converter on every output pin Frequency multiplication by 2, 3, 4, 5, 6, 7, 8 or 10 Requires no resources in the fabric

Wide Memory Interfaces Source-synchronous for highest performance All Virtex-4 pins have the same capabilities Wide memory interfaces on all 4 sides Highest bandwidth, up to 432 bits @ 600 Mbps DDR Evaluation board demonstrates interfaces to: DDR, DDR2, QDR-II SRAM, RLDRAM-II, FCRAM-II Supported by multi-standard evaluation board ML461

ML461 Advanced Memory Development System Available Now DDR2 DDR QDR II RLDRAM II FCRAM II Data rate 533 Mbps 400 Mbps 1.2 Gbps 600 Mbps 600 Mbps CLK Rate 267 MHz 200 MHz 300 MHz 300 MHz 300 MHz Data Width 144 bit (DIMM) 144 bit (DIMM) (72+72) bit 36 bit 36 bit I/O Standard SSTL18 SSTL 2 HSTL HSTL SSTL18

ML450 Source Synchronous I/F Development System Available Now Data rate 1+ Gbps Standards SPI-4.2 SFI-4 XSBI RapidIO Hyper Transport NPSI(CSIX) Utopia IV GFP

Fastest Serial I/O RocketIO transceivers Full-duplex w/ integrated SERDES, FIFOs, and CDR 8-24 channels per chip, Data rate: 622Mbps-10Gbps Transmit pre-emphasis, receive equalization Fastest I/O in any FPGA

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

DSP Building Block FPGAs do high-performance DSP extremely well Thanks to massive parallelism (500+ engines) Fast clock rate, pipelined, high throughput Needs many fast Multiplier-Accumulators (MACs) Dedicated, repetitive (systolic) structure Slightly programmable, must be expandable Can be the fastest DSP, if done right

Traditional DSP Algorithm Implementation, 32-Tap FIR Filter 4-deep MAC blocks need fabric support Cascaded adder tree becomes the performance bottleneck Example: 32-tap FIR Filter 235.7 MHz in Slow grade 347.9 MHz in Fast grade Traditional Implementation Mult/MAC 29-32 M 32 M 31 M 30 M 29 Mult/MAC 25-28 Mult/MAC 21-24 Mult/MAC 17-20 Mult/MAC 13-16 Mult/MAC 9-12 Mult/MAC 5-8 Mult/MAC 1-4 FPGA Logic When you are fast, every single nanosecond hurts

Unbeatable DSP Performance ASMBL advantage: Cascade throughout vertical column without any help from the fabric Speed not affected by external adder tree 57% faster than the traditional approach Example: Systolic 32-tap FIR Filter DSP Column M32 M31 M30 M29 M28 M27 M26 M10 M9 M8 M7 M6 M5 M4 M3 M2 M1 500 MHz Virtex-4 FPGA 400 MHz in Slow grade in Fast grade It helps to do things right!

Expansion Without Extra Logic Data In C0 C1 C2 C3 C4 C5 C6 C7 C30 C31 X X X X X X X X X X + + + + + + + + + + Data Out unlimited pipelined cascade No ripple delay for any filter size 32-tap FIR filter runs at Uses 32 DSP Slices and no resources in the fabric

Expandable DSP Slice (MAC) z H M 0 50

Virtex-4 MUX Solution 500 MHz AB1[35:0] C1[47:0] C AB2[35:0] AB AB3[35:0] AB C2[47:0] C AB4[35:0] AB 0 0 0 0 0 ZMUX XMUX 0 ZMUX XMUX 0 ZMUX XMUX 0 ZMUX XMUX MUX[47:0] Six-to-one: 36-bit-wide MUX in only 4 DSP Slices Five-to-one: 36-bit-wide MUX in only 3 DSP Slices Two-level pipeline assures performance Innovative use of DSP slices, not just for filters

500 MHz + other Applications Counters: 32-bit synchronous loadable counter @ or in 1 slice (high speed thanks to smart carry structure) 48-bit synchronous non-loadable counter @ 18-bit Barrel shifter, using the multiplier 32-bit Direct Digital Synthesis @ Phase accumulator can achieve virtual 8 GHz operation with the help of one BlockRAM and 16 IDELAYs Each chip has between 32 and 512 such DSP slices

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

500 MHz Fast 18Kb-BlockRAM Pipelined 500MHz synchronous operation Width-adjustable per port Read before write, or write before read Built-in FIFO controller in each BlockRAM Built-in Hamming error correction for 64 bits Ideal for Microprocessor data / instructions and fast state machines 48 to 552 BlockRAMs per chip

Fast State Machine Branch Control 8 bits 8 + 1 bits 2 bits BlockROM 1K x 9 Output Unused 256 states, 4-way branch, 350 MHz guaranteed or 128 states, 8-way branch, same speed or 64 states, 16-way branch, same speed

plus 36 More Outputs BlockROM Branch Control 8 bits 8 + 1 bits 2 bits 1K x 9 Output 8 bits 256 x 36 36 bits 36 additional parallel outputs from the other half of the BlockRAM

...and Many Control Inputs 3, 2, or 1 bits 6, 7, or 8 bits BlockROM Branch Control Up to 32 bits 4, 3, or 2 bits 1K x 9 9 bits Output Output Control 2, 1, or 0 bits 256 x 36 36 bits 64, 128, or 256 states with multi- branch capability 36 freely assigned + 8 encoded outputs Optional multiplexed control inputs All in one BlockRAM plus 2 CLBs

Dual-Clock FIFO Conceptually very simple: Use dual-port RAM + two counters + comparators Problem: Decoding FULL and EMPTY when using unrelated clocks for write and read Requires Gray-coded counters Complicates partial full/empty decoding Synchronize EMPTY flag to read clock Trailing edge causes metastability concerns Requires good understanding of asynchronous issues

500 MHz Dual-Clock FIFO A FIFO controller in every Virtex-4 BlockRAM Tested at with 10 14 flag cycles Testing details: Write at 200 MHz, read asynchr. at Goes EMPTY after every write, stops reading 200 M asynchronous arbitrations per second No errors for 10 14 cycles Don t worry about metastability, we did it for you! Complete solution in every Virtex-4 BlockRAM FIFO

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

Embedded Microprocessors Many designs combine fast logic and arithmetic with much slower functions Fast logic is executed in the fabric nanosecond logic Slower, but more complex functions are best executed by µp or µcontroller microsecond or millisecond logic Slow functions benefit from µp design flexibility

Soft Microprocessor Cores Soft processors use fabric + BlockRAMs PicoBlaze, MicroBlaze Flexible and versatile, but more limited than a hard µp like PowerPC Soft µps use 100 to 1K slices plus RAMs MHz clock speed and up to 200 DMIPS Soft processors for simple and slow tasks

Highest-Performance FPGA Microprocessor High-performance PowerPC Core Up to two cores per Virtex-4 FX device 702 DMIPS per core Integrated 16Kbyte Instruction Cache Integrated 16Kbyte Data Cache Acceleration through Auxiliary Processor Unit (APU) Interface Provides direct access from FPGA fabric to PowerPC core High-performance coprocessor support More than 3 times faster than any soft µp

Architectural Innovation Logic Array Differential Clocking >1 Gbps diff I/O with ChipSync 450 MHz PowerPC with APU 622 Mbps to over 10 Gbps RocketIO XtremeDSP Slice BRAM and FIFO

Fabric Resources 500 MHz 1 CLB = 4 Slices = 8 VLUTs Variable Look-Up-Table for: Logic of 4, 5, 6 or 7 inputs Distributed memory 16-bit shift register, SRL16 No shared inputs Can pack unrelated functions Eight Flip-flops per CLB Fast carry Logic and routing support operation

Benchmarks Should Guide and Help the User Meaningful and realistic benchmarks must: Compare equivalent speed grades Use the best available software options Constrain both synthesis and place and route Use meaningful design examples Evaluate real functions, not just LUTs Utilize available embedded functions

Benchmarks Should Guide and Help the User "We agree that Xilinx' constraint-based benchmarking methodology delivers a robust and accurate assessment of device performance for a broad variety of real-world customer designs. "This methodology for measuring performance has enabled us to tune Synplify Pro to achieve continuously superior results for Xilinx devices. With the new version of the Synplify Pro software and the Xilinx ISE Software, Synplicity and Xilinx continue to offer the industry great results." Ken McElvain, CTO, Synplicity

Benchmarks to Measure Fabric Performance 100.0 80.0 60.0 40.0 20.0 Virtex-4 FPGAs Average: 15% advantage 0.0 0 10 20 30 40 50 60-20.0-40.0 90nm Competitor -60.0 Logic performance testing of fastest speed grades using real customer designs and the best available tools

Performance is not Determined by Traditional Benchmarks Alone Old designs obviously are not optimized for the new architectural features Performance should be measured as a composite of typical system requirements: Fabric, I/O, on-chip memory, memory interfaces, DSP, embedded processing, etc.

Virtex-4 Leads in 7 of 7 Performance Criteria 1.6x 3x 1.2x 1.4x 3x 1.1x Performance N/A I/O LVDS Bandwidth I/O Memory Bandwidth High-speed Serial I/O On-chip RAM DSP: 32-Tap Filter Embedded Processing Logic Fabric Performance Xilinx Virtex-4 FPGAs 90 nm competitor Data based on competitor s published datasheet numbers Performance up to 3x higher than competing FPGAs

Summary Xilinx is First in 90nm FPGAs 100x more devices shipped than by all other PLD vendors combined Virtex-4 is the world s fastest FPGA Traditional benchmarks show performance advantage in fabric New functions enable breakthrough performance + reduced power + lower cost Virtex-4 : Product of the Year

a a boost in speed,, reduction in power, and significant reduction in cost