Basic FPGA Architecture


/25/2 CS4 Digital System Design
Dr. Arshad Aziz

Technology Timeline

Basic FPGA Architecture

The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright 2004 Mentor Graphics Corp. (www.mentor.com)

Major FPGA vendors
SRAM-based FPGAs
- Xilinx Inc. www.xilinx.com
- Altera Corp. www.altera.com
- Atmel Corp. www.atmel.com
- Lattice Semiconductor Corp. www.latticesemi.com
Antifuse and flash-based FPGAs
- Actel Corp. www.actel.com
- QuickLogic Corp. www.quicklogic.com

Feature                                    SRAM                     Antifuse                        E2PROM / FLASH
Technology node                            State-of-the-art         One or more generations behind  One or more generations behind
Reprogrammable                             Yes (in system)          No                              Yes (in-system or offline)
Reprogramming speed (inc. erasing)         Fast                     ----                            3x slower than SRAM
Volatile (must be programmed on power-up)  Yes                      No                              No (but can be if required)
Requires external configuration file       Yes                      No                              No
Good for prototyping                       Yes (very good)          No                              Yes (reasonable)
Instant-on                                 No                       Yes                             Yes
IP security                                Acceptable*              Very good                       Very good
Size of configuration cell                 Large (six transistors)  Very small                      Medium-small (two transistors)
Power consumption                          Medium                   Low                             Medium
Rad hard                                   No                       Yes                             Not really
* Especially when using bitstream encryption.

The Programmable Marketplace (Q1 Calendar Year 2005)
[Figure: pie charts of market share by vendor (Xilinx, Altera, Lattice, Actel, QuickLogic, other) for the PLD segment and the FPGA sub-segment; the FPGA sub-segment chart shows Xilinx at 58%.]
Low-cost FPGA families
- Xilinx: Spartan 3, Spartan 3E, Spartan 3L
- Altera: Cyclone II
High-performance FPGA families
- Xilinx: Virtex 4 LX / SX / FX, Virtex 5 LX
- Altera: Stratix II, Stratix II GX
Source: Company reports. Latest information available; computed on a 4-quarter rolling basis.

Xilinx
- Primary products: FPGAs and the associated CAD software
  - Programmable Logic Devices
  - ISE Alliance and Foundation Series Design Software
- Main headquarters in San Jose, CA
- Fabless* semiconductor and software company
  - UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
  - Seiko Epson (Japan)
  - TSMC (Taiwan)
Source: [Xilinx Inc.]

Xilinx FPGA Families
- Old families
  - XC3000, XC4000, XC5200
  - Old 0.5 µm, 0.35 µm and 0.25 µm technology. Not recommended for modern designs.
- Low-cost family
  - Spartan/XL: derived from XC4000
  - Spartan-II: derived from Virtex
  - Spartan-IIE: derived from Virtex-E
  - Spartan-3 (90 nm)
  - Spartan-3E (90 nm)
  - Spartan-3A (90 nm)
- High-performance families
  - Virtex (220 nm)
  - Virtex-E, Virtex-EM (180 nm)
  - Virtex-II, Virtex-II PRO (130 nm)
  - Virtex-4 (90 nm)
  - Virtex 5 (65 nm)
Source: [Xilinx Inc.]

General structure of an FPGA
[Figure: general structure of an FPGA]

Xilinx FPGA
[Figure labels: Configurable Logic Blocks, Block RAMs, I/O Blocks]

Generic FPGA architecture
[Figure labels: Configurable Logic Block (CLB), Connection Block, Switch Block, Wire segments, Routing Channels, I/O pad]

Xilinx CLB: Point of Reference
- A Xilinx CLB has FOUR slices
- Each slice has TWO logic cells, and therefore TWO LUTs
- Each logic cell has a LUT plus other logic (carry and control) plus a flip-flop/latch
- For SLICEL slices, these LUTs can be configured as:
  1. LUT
- For SLICEM slices, these LUTs can be configured as:
  1. LUT
  2. 16 x 1 Distributed RAM (16 words x 1 bit/word)
  3. 16-bit Shift Register

CLB Structure of Spartan 3
[Figure: Spartan-3 CLB structure]

Simplified view of a Xilinx Logic Cell
[Figure: simplified logic cell]

CLB structure
- Each Virtex-II CLB contains four slices
- Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
- A switch matrix provides access to general routing resources
[Figure labels: Switch Matrix, Slices S0 to S3, BUFT, COUT, CIN, SHIFT, Local Routing]

Simplified Slice Structure
- Each slice has four outputs
  - Two registered outputs, two non-registered outputs
- Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
- Carry logic runs vertically, up only
  - Two independent carry chains per CLB
[Figure labels: Slice, LUT, Carry, flip-flop with D, CE, PRE, CLR, Q]

Detailed Slice Structure
The next few slides discuss the slice features:
- LUTs
- MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
- Carry Logic
- MULT_ANDs
- Sequential Elements

SRAM Cell (Pass Transistor)
- An SRAM cell can drive the gate (G) terminal of an NMOS transistor. If SRAM (M) = 1, the signal passes from S to D.
- An SRAM cell can be attached to the select line of a MUX to control it.

Look-Up Tables
- Combinatorial logic is stored in Look-Up Tables (LUTs)
  - Also called Function Generators (FGs)
  - Capacity is limited by the number of inputs, not by the complexity
- Delay through the LUT is constant
[Figure: combinatorial logic with inputs A, B, C, D and output Z, next to its truth table]

Look-Up Table (LUT)
- The LUT is used to realize any Boolean function.
- Assume the function to be realized is y = (a&b) | !c
- This could be achieved by loading the LUT with the appropriate output values.

LUT (Look-Up Table) Functionality
- Look-Up Tables are the primary elements for logic implementation
- Each LUT can implement any function of 4 inputs
[Figure: 4-input LUT with inputs x1 to x4 and output y, with example truth tables]

5-Input Functions implemented using two LUTs
- One CLB slice can implement any function of 5 inputs
- The logic function is partitioned between two LUTs
- The F5 multiplexer selects the LUT
[Figure: two 4-input LUTs (inputs A1 to A4, usable as ROM or RAM) feeding the F5 multiplexer, with BX as the select; inputs X1 to X5, output OUT]
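The two LUT slides above can be sketched in a few lines of Python. This is an illustrative model, not vendor code: the helper names (`make_lut`, `lut_read`, `make_lut5`, `lut5_read`) are hypothetical, and the example function is read as y = (a & b) | !c, assuming the operator lost from the slide text was an OR. The first part loads a 3-input LUT with the output column of the truth table; the second partitions a 5-input function between two 4-input LUTs and lets a model of the F5 multiplexer pick one.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Return the LUT contents: one stored bit per input combination."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]

def lut_read(table, *inputs):
    """The inputs form the address into the table (first input is the MSB)."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]

# 3-input LUT loaded with y = (a & b) | !c
lut3 = make_lut(lambda a, b, c: (a & b) | (1 - c), 3)

def make_lut5(func):
    """Partition a 5-input function between two 4-input LUTs, one per value of x5."""
    lut_f = make_lut(lambda x4, x3, x2, x1: func(0, x4, x3, x2, x1), 4)
    lut_g = make_lut(lambda x4, x3, x2, x1: func(1, x4, x3, x2, x1), 4)
    return lut_f, lut_g

def lut5_read(luts, x5, x4, x3, x2, x1):
    """Model of the F5 multiplexer: x5 selects which LUT output is used."""
    lut_f, lut_g = luts
    return lut_read(lut_g if x5 else lut_f, x4, x3, x2, x1)
```

Because the table holds one bit per input combination, the lookup cost does not depend on how complicated the Boolean expression is, which is the point behind "capacity is limited by the number of inputs, not by the complexity".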

Dedicated Expansion Multiplexers
- MUXF5 combines 2 LUTs to create
  - Any 5-input function (LUT5)
  - Or selected functions up to 9 inputs
  - Or a 4x1 multiplexer
- MUXF6 combines 2 slices to form
  - Any 6-input function (LUT6)
  - Or selected functions up to 19 inputs
  - Or an 8x1 multiplexer
- Dedicated muxes are faster and more space efficient
[Figure: CLB with two slices, each containing two LUTs and a MUXF5, joined by a MUXF6]

Connecting Look-Up Tables
- MUXF5 combines the LUTs in each slice
- MUXF6 combines slices S0 and S1
- MUXF6 combines slices S2 and S3
- MUXF7 combines the two MUXF6 outputs
- MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
[Figure: CLB with slices S0 to S3 and the F5, F6, F7, F8 multiplexers]

Programmable Logic Block
- Early devices were based on the concept of a programmable logic block, which comprised a 3-input lookup table (LUT), a register that could act as a flip-flop or a latch, and a multiplexer, along with a few other elements.

3-, 4-, 5-, or 6-input LUTs?
- The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function.
- Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
- The first FPGAs were based on 3-input LUTs.
- FPGA vendors and researchers studied the relative merits of 3-, 4-, 5- and even 6-input LUTs.
- The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.
- In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization.
- However, current logic synthesis tools prefer uniformity and regularity.

FPGA Function Generators
Example: implement the function F = AB + BC + ABC using:
- 2-input LUTs
- 3-input LUTs
- 4-input LUTs
[Figure: the three implementations, with inputs A, B, C and output F]

Fast Carry Logic
- Each CLB contains separate logic and routing for the fast generation of sum & carry signals
  - Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
- Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing running from LSB to MSB]
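As a rough illustration of the expansion multiplexers and of the "every added input doubles the SRAM cells" remark above, the sketch below builds an 8x1 multiplexer the way the slide describes: each LUT acts as a 2x1 mux, MUXF5 joins the two LUTs of a slice into a 4x1 mux, and MUXF6 joins two slices. All function names are hypothetical and the model ignores timing.

```python
def lut_sram_cells(n_inputs):
    """An n-input LUT stores one bit per input combination."""
    return 2 ** n_inputs

def mux2(sel, d0, d1):
    """2x1 multiplexer: returns d1 when sel is 1, else d0."""
    return d1 if sel else d0

def mux4_slice(data, s1, s0):
    """4x1 mux in one slice: two LUTs used as 2x1 muxes, joined by MUXF5."""
    lut_f = mux2(s0, data[0], data[1])   # first LUT (3 of its 4 inputs used)
    lut_g = mux2(s0, data[2], data[3])   # second LUT
    return mux2(s1, lut_f, lut_g)        # MUXF5

def mux8_two_slices(data, s2, s1, s0):
    """8x1 mux across two slices: two 4x1 muxes joined by MUXF6."""
    low = mux4_slice(data[0:4], s1, s0)
    high = mux4_slice(data[4:8], s1, s0)
    return mux2(s2, low, high)           # MUXF6
```

Going from 4 to 6 LUT inputs raises the cell count from `lut_sram_cells(4)` to `lut_sram_cells(6)`, a factor of four, which is one reason dedicated muxes are attractive for wide functions.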

Fast Carry Logic
- Simple, fast, and complete arithmetic logic
  - Dedicated XOR gate for single-level sum completion
  - Uses dedicated routing resources
  - All synthesis tools can infer carry logic
[Figure: two carry chains per CLB. First carry chain: CIN into slice S0, through slice S1, COUT to S0 of the next CLB. Second carry chain: CIN into slice S2, through slice S3, COUT to CIN of S2 of the next CLB.]

Accessing Carry Logic
- All major synthesis tools can infer carry logic for arithmetic functions
  - Addition (SUM <= A + B)
  - Subtraction (DIFF <= A - B)
  - Comparators (if A < B then ...)
  - Counters (count <= count + 1)

Flexible Sequential Elements
- Either flip-flops or latches
- Two in each slice; eight in each CLB
- Inputs come from LUTs or from an independent CLB input
- Separate set and reset controls
  - Can be synchronous or asynchronous
- All controls are shared within a slice
  - Control signals can be inverted locally within a slice
[Figure: FDRSE_1, FDCPE and LDCPE primitives with D, CE, S, R, PRE, CLR, G and Q pins]

Shift Register LUT
- Each LUT can be configured as a shift register
  - Serial in, serial out
- Dynamically addressable delay up to 16 cycles
- For programmable pipelines
- Cascade for greater cycle delays
- Use CLB flip-flops to add depth
[Figure: chain of D flip-flops with CE and CLK, input IN, output OUT selected by DEPTH[3:0]]

Shift Register LUT Example
- A register-rich FPGA allows for the addition of pipeline stages to increase throughput
- Data paths must be balanced to keep the desired functionality
- Unbalanced: Operation A (4 cycles) followed by Operation B (8 cycles) takes 12 cycles, while the parallel Operation C takes 3 cycles, a 9-cycle imbalance on the 64-bit paths
- Balanced: Operation D, a 9-cycle NOP, is added after Operation C so that both paths take 12 cycles
- Paths are statically balanced
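The pipeline-balancing example above (4 + 8 cycles on one path, 3 on the other) can be checked with a small behavioral model. This is a sketch under stated assumptions: the `SRL16` class and `delay_line` helper are hypothetical names, the model tracks one bit lane rather than the 64-bit bus, and the tap address `depth` is taken to give a delay of `depth + 1` clock cycles.

```python
from collections import deque

class SRL16:
    """Behavioral model of a LUT used as a 16-stage serial-in/serial-out shift register."""

    def __init__(self):
        self.stages = deque([0] * 16, maxlen=16)

    def clock(self, data_in):
        """Shift one value in on a clock edge; the oldest stage falls off the end."""
        self.stages.appendleft(data_in)

    def read(self, depth):
        """depth is the 4-bit tap address: 0 is the newest stage, 15 the oldest."""
        return self.stages[depth]

def balancing_delay(path_a_cycles, path_b_cycles):
    """Number of NOP cycles needed on the shorter path."""
    return abs(sum(path_a_cycles) - sum(path_b_cycles))

def delay_line(samples, cycles):
    """Delay a sample stream by `cycles` clock cycles (1 to 16) using one SRL16."""
    if not 1 <= cycles <= 16:
        raise ValueError("one SRL16 covers delays of 1 to 16 cycles")
    srl = SRL16()
    out = []
    for sample in samples:
        out.append(srl.read(cycles - 1))  # value visible before this clock edge
        srl.clock(sample)
    return out

# Operation A (4) + Operation B (8) versus Operation C (3)
nop_cycles = balancing_delay([4, 8], [3])
```

For delays beyond 16 cycles the slide's advice applies: cascade several shift-register LUTs, or add CLB flip-flops for the remainder.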

Distributed RAM
- A CLB LUT is configurable as Distributed RAM
  - A LUT equals 16x1 RAM
  - Cascade LUTs to increase RAM size
- Synchronous write
- Asynchronous read
  - The distributed RAM read is naturally asynchronous
  - A synchronous read can be created by using extra flip-flops
- Two LUTs can make
  - 32 x 1 single-port RAM
  - 16 x 2 single-port RAM
  - 16 x 1 dual-port RAM
[Figure: RAM32X1S (D, WE, WCLK, A0 to A4, O), RAM16X1S (D, WE, WCLK, A0 to A3, O), RAM16X2S (D0, D1, WE, WCLK, A0 to A3, O0, O1) and RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]

Xilinx Multipurpose LUT
[Figure: multipurpose LUT]

RAM Blocks and Multipliers in Xilinx FPGAs
[Figure: columns of RAM blocks and multipliers]

Embedded RAM Blocks
- A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
- Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
- These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, FIFOs, etc.

Block RAM
[Figure: Spartan-3 dual-port block RAM with Port A and Port B]
- Most efficient memory implementation
  - Dedicated blocks of memory
- Ideal for most memory requirements
  - 4 to 104 memory blocks
  - 18 kbits = 18,432 bits per block (16 k without parity bits)
  - Use multiple blocks for larger memories
- Builds both single and true dual-port RAMs
- Synchronous write and read (different from distributed RAM)
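A minimal behavioral sketch of the distributed RAM described above: writes happen only on a clock edge with write-enable set, reads are combinational, and two 16x1 LUT RAMs are cascaded into a 32x1 RAM with the fifth address bit choosing the LUT. The class and method names are hypothetical and only loosely follow the RAM16X1S / RAM32X1S pin lists.

```python
class Ram16x1:
    """One LUT used as a 16 x 1 RAM: synchronous write, asynchronous read."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, addr):
        """Asynchronous read: the output follows the address without a clock."""
        return self.cells[addr & 0xF]

    def clock(self, addr, data, we):
        """Synchronous write: the cell changes only on a clock edge with WE = 1."""
        if we:
            self.cells[addr & 0xF] = data & 1

class Ram32x1:
    """Two cascaded 16 x 1 LUT RAMs; address bit A4 selects the LUT."""

    def __init__(self):
        self.banks = [Ram16x1(), Ram16x1()]

    def read(self, addr):
        return self.banks[(addr >> 4) & 1].read(addr)

    def clock(self, addr, data, we):
        self.banks[(addr >> 4) & 1].clock(addr, data, we)
```

Registering the value returned by `read` on the next clock edge would model the "synchronous read by using extra flip-flops" option.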

Spartan-3 Block RAM Amounts
[Figure: block RAM amounts per Spartan-3 device]

Block RAM Port Aspect Ratios
Block RAM can have various configurations (port aspect ratios):
- 16K x 1
- 8K x 2
- 4K x 4
- 2K x (8+1)
- 1024 x (16+2)

Single-Port Block RAM
[Figure: single-port block RAM]

Dual-Port Block RAM
[Figure: dual-port block RAM]

Dual-Port Bus Flexibility
- Each port can be configured with a different data bus width
- Provides easy data width conversion without any additional logic
[Figure: Port A in with 1K-bit depth and Port A out with 18-bit width (WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[17:0], DOA[17:0]); Port B in with 2K-bit depth and Port B out with 9-bit width (WEB, ENB, RSTB, CLKB, ADDRB[10:0], DIB[8:0], DOB[8:0])]
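The aspect ratios listed above all describe the same 18 kbit block. As a sketch of the arithmetic, assuming 16K data bits per block plus one parity bit per data byte on the 8-bit and wider ports (the function name `aspect_ratio` is hypothetical):

```python
DATA_BITS = 16 * 1024     # data bits per block, parity not counted
BLOCK_BITS = 18 * 1024    # 18,432 bits per block including parity

def aspect_ratio(data_width):
    """Return (depth, address bits, parity bits) for one port of the given data width."""
    depth = DATA_BITS // data_width
    addr_bits = depth.bit_length() - 1    # depth is a power of two
    parity_bits = data_width // 8         # one parity bit per data byte
    return depth, addr_bits, parity_bits

for width in (1, 2, 4, 8, 16):
    depth, addr_bits, parity = aspect_ratio(width)
    used = depth * (width + parity)
    print(f"{depth:>6} x ({width}+{parity})  address bits: {addr_bits}  bits used: {used}")
```

The narrow ports (x1, x2, x4) cannot reach the parity bits, which is why they expose 16,384 bits while the x9 and x18 ports expose the full 18,432.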

Two Independent Single-Port RAMs
- Added advantage of True Dual-Port
  - No wasted RAM bits
- Can split a dual-port 16K RAM into two single-port 8K RAMs
  - Simultaneous independent access to each RAM
- To access the lower RAM
  - Tie the MSB address bit to logic low
- To access the upper RAM
  - Tie the MSB address bit to logic high
[Figure: Port A in with 8K-bit depth, ADDR[12:0], Port A out with 1-bit width (WEA, ENA, RSTA, CLKA, ADDRA[12:0], DIA[0], DOA[0]); Port B in with 8K-bit depth, ADDR[12:0], Port B out with 1-bit width (WEB, ENB, RSTB, CLKB, ADDRB[12:0], DIB[0], DOB[0])]

Embedded Multipliers
- Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
- Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic-based applications).

18 x 18 Embedded Multiplier
- Fast arithmetic functions
  - Optimized to implement multiply / accumulate modules

18 x 18 Multiplier
- Embedded 18-bit x 18-bit multiplier
  - 2's complement signed operation
- Multipliers are organized in columns
- 18 x 18 signed multiplier
  - Fully combinational
  - Optional registers with CE & RST (pipeline)
  - Independent from adjacent block RAM
[Figure: Data_A (18 bits) and Data_B (18 bits) into the 18 x 18 multiplier, Output (36 bits)]

Positions of Multipliers
[Figure: positions of the multipliers on the device]

Asynchronous 18-bit Multiplier
[Figure: asynchronous 18-bit multiplier]
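To see why two 18-bit operands need a 36-bit output, here is a small sketch of a two's complement 18 x 18 multiply. It models only the arithmetic, not the hard block's registers or timing, and the helper names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mult18x18(a, b):
    """Multiply two 18-bit two's complement operands; return the 36-bit result pattern."""
    product = to_signed(a, 18) * to_signed(b, 18)
    return product & ((1 << 36) - 1)

# Largest-magnitude case: (-2**17) * (-2**17) = 2**34, which still fits in 36 signed bits
worst_case = to_signed(mult18x18(0x20000, 0x20000), 36)
```

In general an n-bit by m-bit signed product fits in n + m bits, so 18 + 18 = 36 output bits never overflow.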

18-bit Multiplier with Register
[Figure: 18-bit multiplier with register]

A simple clock tree
[Figure: a simple clock tree]

Digital Clock Manager (DCM)
[Figure: clock manager generating daughter clocks]

Digital Clock Managers (DCM)
- The clock pin is usually connected to a special hard-wired function called a clock manager, which generates daughter clocks.
- The daughter clocks may be used to drive internal clock trees or external output pins that provide clocking services to other devices on the host circuit board.
- There might be multiple clock managers, each supporting only a subset of features (jitter removal, frequency synthesis, ...).

DCM: Jitter Removal
- In the real world, clock edges may arrive a little early or a little late.
- The result of this varying delay is a fuzzy clock (jitter).
- The FPGA clock manager can be used to detect and correct for this jitter and provide a clean daughter clock signal for use inside the device.

DCM: Frequency Synthesis
- The frequency of the clock signal presented to the FPGA from the outside world might not be exactly what the design engineer wishes for.
- The clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.
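The frequency-synthesis idea above boils down to f_out = f_in x M / D for an integer multiplier M and divider D. The sketch below searches for the pair that best hits a target frequency. The function names are hypothetical, and the default limits of 32 for M and D are an assumption made for illustration; check the data sheet of the actual device for its permitted ranges.

```python
from fractions import Fraction

def synthesize(f_in_hz, multiply, divide):
    """Daughter clock frequency obtained by multiplying and dividing the input clock."""
    return Fraction(f_in_hz) * multiply / divide

def find_ratio(f_in_hz, f_target_hz, max_mult=32, max_div=32):
    """Return the (multiply, divide) pair whose output is closest to the target.

    max_mult and max_div are illustrative limits, not taken from the slides.
    """
    best = None
    for m in range(1, max_mult + 1):
        for d in range(1, max_div + 1):
            error = abs(synthesize(f_in_hz, m, d) - f_target_hz)
            if best is None or error < best[0]:
                best = (error, m, d)
    return best[1], best[2]

# Example: derive a 75 MHz daughter clock from a 50 MHz board oscillator
m, d = find_ratio(50_000_000, 75_000_000)
```

Exact rational arithmetic (`Fraction`) avoids rounding noise when comparing candidate ratios.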

DCM: Phase Shifting
- Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other.
- Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme).

Basic I/O Block Structure
[Figure: IOB with a three-state path (three-state control, FF enable, clock, set/reset), an output path (output, FF enable) and an input path (direct input, registered input, FF enable); each flip-flop has D, EC, SR and Q pins]

IOB Functionality
- The IOB provides the interface between the package pins and the CLBs
- Each IOB can work as a uni- or bi-directional I/O
- Outputs can be forced into high impedance
- Inputs and outputs can be registered
  - Advised for high-performance I/O
- Inputs can be delayed

Configurable I/O Impedances
- The signals used to connect devices on today's circuit boards often have fast edge rates.
- To prevent signals from reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
- In the past, these resistors were applied as discrete components (outside the FPGA).
- Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.

FPGA Nomenclature
[Figure: FPGA part-number nomenclature]

Spartan 3 Family Attributes
[Figure: Spartan-3 family attributes]
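A phase shift is simply a fraction of the clock period, so the 120° and 240° values mentioned above translate into time delays as follows. This is a small worked calculation; the 100 MHz clock is an arbitrary example value and the function name is hypothetical.

```python
def phase_delay_ns(freq_mhz, phase_deg):
    """Time delay, in nanoseconds, that corresponds to a phase shift at the given frequency."""
    period_ns = 1000.0 / freq_mhz
    return period_ns * (phase_deg % 360) / 360.0

# Three-phase clocking scheme on a 100 MHz clock (10 ns period)
three_phase = [phase_delay_ns(100.0, phase) for phase in (0, 120, 240)]
```

At 100 MHz the three clocks are offset by 0 ns, about 3.33 ns and about 6.67 ns, one third of a period apart.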

Spartan-3 FPGA Family Members
[Figure: Spartan-3 family members]

Virtex-II FPGA Family
- Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
  - 18x18 multipliers & 18 kbit block RAMs introduced
  - Gbit serial I/O communications & PowerPC processors introduced
  - Complex floating-point algorithm implementation now possible
- Virtex-II / Pro
  - 44,000 logic slices
  - 444 18 Kbit BRAMs
  - 444 18x18 multipliers
  - 2 PowerPC processors
  - 20 Gbit I/O
  - 1164 max user I/O

Virtex-II Pro Floorplan
[Figure: floorplan showing PowerPCs, logic cells, up to 16 serial transceivers]

Virtex-II Pro (Selection)
- Up to 4 PowerPCs
- 4 to 16 multi-gigabit transceivers, 622 Mbps to 3.125 Gbps
- 12 to 216 multipliers
- 3,000 to 50,000 logic cells
- 200k to 4M bits RAM
- 204 to 852 I/Os

Embedded Processor Cores (Hard and Soft)
- The majority of designs make use of microprocessors.
- Traditionally these appeared as discrete devices on the circuit board.
- Lately, high-end FPGAs have become available that contain one or more embedded microprocessors (referred to as microprocessor cores).
- There are two types of cores:
  - A hard microprocessor core is implemented as a dedicated predefined block (there are two approaches to placing it)
  - A soft microprocessor core is implemented by configuring a group of programmable logic blocks to act as a microprocessor.

Embedded Core (Inside)
- Xilinx and Altera tend to embed one or more microprocessor cores directly into the main FPGA fabric (for example, the PowerPC).
- In this case the design tools have to be able to take account of the presence of these blocks in the fabric (any memory used by the core is formed from the embedded RAM blocks).
- The main advantage of this scheme is the speed gained from having the processor core in close proximity to the FPGA fabric.

Soft Core
- As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
- Soft cores are simpler (more primitive) and slower than their hard-core counterparts.
- Advantages:
  1. The user need only implement a core if he/she needs it.
  2. The user can instantiate as many cores as they require until they run out of resources!

Virtex Architectures
- Built for high-performance applications
- Other families include
  - Virtex-II Pro
  - Virtex-4
  - Virtex-5
- The latest family is Virtex-6

Virtex-II Pro Architecture
- Contains embedded processors and multi-gigabit transceivers
[Figure labels:
  High-performance True Dual-Port RAM: 8 Mb
  SelectIO-Ultra Technology: 1164 I/O
  Advanced FPGA Logic: 99k logic cells
  XtremeDSP Functionality: embedded multipliers
  RocketIO and RocketIO X High-speed Serial Transceivers: 622 Mbps to 3.125 Gbps
  PowerPC Processors: 400+ MHz clock rate
  XCITE Digitally Controlled Impedance: any I/O
  DCM Digital Clock Management: 12
  130 nm, 9-layer copper in 300 mm wafer technology]

Virtex-4 Family
- Advanced Silicon Modular BLock (ASMBL) architecture
- Optimized for logic, embedded, and signal processing

Resource        LX              FX              SX
Logic           14K-200K LCs    12K-140K LCs    23K-55K LCs
Memory          0.9-6 Mb        0.6-10 Mb       2.3-5.7 Mb
DCMs            4-12            4-20            4-8
DSP Slices      32-96           32-192          128-512
SelectIO        240-960         240-896         320-640
RocketIO        N/A             0-24 channels   N/A
PowerPC         N/A             1 or 2 cores    N/A
Ethernet MAC    N/A             2 or 4 cores    N/A

Virtex-4 Architecture
[Figure labels:
  RocketIO Multi-Gigabit Transceivers: 622 Mbps to 10.3 Gbps
  Advanced CLBs: 200K logic cells
  XtremeDSP Technology Slices: 256 18x18 GMACs
  PowerPC 405 with APU interface: 450 MHz, 680 DMIPS
  Smart RAM: new block RAM/FIFO
  Xesium Clocking Technology: 500 MHz
  Tri-Mode Ethernet MAC: 10/100/1000 Mbps
  1 Gbps SelectIO: ChipSync source synchronous, XCITE active termination]

Virtex-5 Family
- Optimized for logic, embedded, signal processing, and high-speed connectivity

Virtex-5 Platforms
- LX: Logic
- LXT: Logic/Serial
- SXT: DSP/Serial
- FXT: Emb./Serial
[Table: the four platforms compared on logic, on-chip RAM, DSP capabilities, parallel I/Os, serial I/Os and PowerPC processors; the cell contents were images and are not recoverable.]

Virtex-5 Architecture
[Figure labels:
  New, most advanced high-performance real 6-LUT logic fabric
  Enhanced 36 Kbit dual-port block RAM / FIFO with integrated ECC
  550 MHz Clock Management Tile with DCM and PLL
  SelectIO with ChipSync Technology and XCITE DCI
  Advanced configuration options
  25x18 DSP slice with integrated ALU
  Tri-mode 10/100/1000 Mbps Ethernet MACs
  PCI Express endpoint block
  System Monitor function with built-in ADC
  Next-generation PowerPC embedded processor
  RocketIO transceiver options: low-power GTP up to 3.75 Gbps, high-performance GTX up to 6.5 Gbps]

The Spartan-3 Family
- Built for high-volume, low-cost applications
[Figure labels:
  18x18 bit embedded pipelined multipliers for efficient DSP
  Up to eight on-chip Digital Clock Managers to support multiple system clocks
  Configurable 18K block RAMs + distributed RAM
  4 I/O banks (Bank 0 to Bank 3), support for all I/O standards including PCI, DDR333, RSDS, mini-LVDS]

Spartan-3 Family
- Based upon the Virtex-II architecture, optimized for lower cost
- Smaller process = lower core voltage
  - .09 micron versus .15 micron
  - Vccint = 1.2V versus 1.5V
- Logic resources
  - Only one-half of the slices support RAM or SRL16s (SLICEM)
  - Fewer block RAMs and multiplier blocks
- Clock resources
  - Fewer global clock multiplexers and DCM blocks
- I/O resources
  - Fewer pins per package
  - No internal 3-state buffers
  - Support for different standards
    - New standards: 1.2V LVCMOS, 1.8V HSTL, and SSTL
    - Default is LVCMOS, versus LVTTL

SLICEM and SLICEL
- Each Spartan-3 CLB contains four slices
  - Similar to the Virtex-II
- Slices are grouped in pairs
  - Left-hand SLICEM (Memory): LUTs can be configured as memory or SRL16
  - Right-hand SLICEL (Logic): LUTs can be used as logic only
[Figure: switch matrix; left-hand SLICEM pair with SHIFTIN, SHIFTOUT, COUT, CIN; right-hand SLICEL pair with COUT, CIN; fast connects]

Multiple Domain-optimized Platforms
[Figure: domain-optimized platforms]

Spartan-3E Features
- More gates per I/O than Spartan-3
- Removed some I/O standards
  - Higher-drive LVCMOS
  - GTL, GTLP
  - SSTL2_II
  - HSTL_II_18, HSTL_I, HSTL_III
  - LVDS_EXT, ULVDS
- DDR cascade
  - Internal data is presented on a single clock edge
- 16 BUFGMUXes on left and right sides
  - Drive half the chip only
  - In addition to eight global clocks
- Pipelined multipliers
- Additional configuration modes
  - SPI, BPI
  - Multi-Boot mode

Spartan-3A DSP Features
- Increased amount of block memory (BRAM)
  - 1512K of S3A1800 vs 648K of S3E1600
- More XtremeDSP DSP48A slices
  - Replaces the embedded multiplier of Spartan-3E
  - 3400A: 126 DSP48As
  - 1800A: 84 DSP48As

Spartan-3A DSP: Tuning DSP Performance
- Integrated XtremeDSP slice
  - Application-optimized capacity
  - Integrated pre-adder optimized for filters
  - 250 MHz operation, standard speed grade
  - Compatible with Virtex-DSP
- Increased memory capacity and performance
  - Also important for embedded processing, complex IP, etc.

XtremeDSP DSP48A Slice
[Figure: DSP48A slice]

DSP48 Comparison
Each entry lists DSP48 / DSP48E / DSP48A, then the benefit.
- Multiplier: 18 x 18 / 25 x 18 / 18 x 18. Reduces FPGA resource needs for DSP algorithms.
- Pre-Adder: No / No / Yes. Reduces the critical path timing in FIR filter applications, giving better performance. Important in FIR filter construction.
- Cascade Inputs: One / Two / One. Enables fast data path chaining of DSP48 blocks for larger filters.
- Cascade Output: Yes / Yes / Yes. Enables fast data path chaining of DSP48 blocks for larger filters.
- Dedicated C input: No / Yes / Yes. The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition, and the very valuable rounding of multiplication away from zero.
- Adder: 3-input 48-bit / 3-input 48-bit / 2-input 48-bit. Supports simple add and accumulate functions.
- Dynamic Opmodes: Yes / Yes / Yes. One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc.
- ALU Logic Functions: No / Yes / No. Similar to the ALU of a microprocessor. Enables the selection of the ALU function on a clock-cycle basis, so that multiple functions (add, subtract, or compare) can be selected.
- Pattern Detect: No / Yes / No. Supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators.
- SIMD ALU Support: No / Yes / No. Enables parallel ALU operations on multiple data sets.
- Carry Signals: Carry In / Carry In & Out / Carry In & Out. Supports fast carry functions between DSP blocks. Often a speed-limiting path.

Spartan-3A Device Table
                                 Spartan-3A   Spartan-3A DSP   Spartan-3A DSP
                                 XC3S1400A    XC3SD1800A       XC3SD3400A
XtremeDSP DSP48A slices          -            84               126
Dedicated multipliers            32           DSP48As          DSP48As
Block RAM blocks                 32           84               126
Block RAM (Kb)                   576          1,512            2,268
Distributed RAM (Kb)             176          260              373
FFs/LUTs                         22,528       33,280           47,744
Logic cells                      25,344       37,440           53,712
DCMs                             8            8                8
Max diff I/O pairs               227          227              213
CS484 19x19 mm (0.8 mm pitch)    -            309              309
*FG676 27x27 mm (1.0 mm pitch)   502          519              469

Latest Families: Architecture Alignment
- Virtex-6 FPGAs: 760K logic cell device
  - FIFO logic
  - Tri-mode EMAC
  - System Monitor
- Common resources
  - LUT-6 CLB
  - Block RAM
  - DSP slices
  - High-performance clocking
  - Parallel I/O
  - HSS transceivers*
  - PCIe interface
- Spartan-6 FPGAs: 150K logic cell device
  - Hardened memory controllers
  - 3.3 volt compatible I/O
*Optimized for target application in each family
Enables IP portability, protects design investments
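The DSP48 comparison above notes that the DSP48A pre-adder helps FIR filter construction. A sketch of why: when the coefficients are symmetric, the two samples that share a coefficient can be added first, which halves the number of multiplications. The code is an arithmetic illustration only; the function names are hypothetical and nothing here models the DSP48A ports or pipeline.

```python
def fir_direct(window, coeffs):
    """Reference FIR output: one multiplication per tap."""
    return sum(c * x for c, x in zip(coeffs, window))

def fir_preadder(window, half_coeffs):
    """Symmetric FIR output using a pre-adder: one multiplication per tap pair.

    Assumes an even number of taps with coeffs[k] == coeffs[n - 1 - k];
    half_coeffs holds the first n // 2 coefficients.
    """
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must hold twice as many samples as half_coeffs")
    acc = 0
    for k, c in enumerate(half_coeffs):
        pre = window[k] + window[n - 1 - k]   # pre-adder
        acc += c * pre                        # multiplier, then accumulate
    return acc

window = [1, 2, 3, 4, 5, 6]
coeffs = [1, 2, 3, 3, 2, 1]
```

For this 6-tap example the direct form needs six multiplications and the pre-adder form needs three, and both give the same output.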

Addressing the Broad Range of Technical Requirements
[Figure: market size by application market segment; labels: More Designers, Eccentrics]
- Spartan-6 LX: lowest cost logic + DSP
- Spartan-6 LXT: lowest cost logic + high-speed serial
- Virtex-6 LXT: high logic density + serial connectivity
- Virtex-6 SXT: DSP + logic + serial connectivity
- Virtex-6 HXT: ultra high-speed serial connectivity + logic

- Higher system performance
  - More design margin to simplify designs
  - Higher integrated functionality
- Lower system cost
  - Reduce BOM
  - Implement design in a smaller device and lower speed grade
- Lower power
  - Help meet power budgets
  - Eliminate heat sinks and fans
  - Prevent thermal runaway

Virtex-6 Family

Virtex Product & Process Evolution
- 1st generation: Virtex, 220 nm
- 2nd generation: Virtex-E, 180 nm
- 3rd generation: Virtex-II, 150 nm, and Virtex-II Pro, 130 nm
- 4th generation: Virtex-4, 90 nm
- 5th generation: Virtex-5, 65 nm
- 6th generation: Virtex-6, 40 nm
Delivering balanced performance, power, and cost

Virtex-6 Base Platform

Strong Focus on Power Reduction
- Virtex-6 logic fabric
  - Static power reduction: higher distribution of low-leakage transistors
  - Dynamic power reduction: reduced capacitance through device shrink
- Reduced core voltage devices lower overall power
  - VCCINT = 0.9V option allows a power/performance tradeoff
- I/O power improvements
  - Dynamic termination
- System Monitor
  - Allows sophisticated monitoring of temperature and voltage
- Up to 50% power reduction vs. previous generation

Virtex-6 Configurable Logic Block (CLB)
- Each CLB contains two slices
- Each slice contains four 6-input Lookup Tables (LUT6)
  - Slices that implement logic functions (slice_l)
  - Slices for memories and shift registers (slice_m)
- A LUT6 implements
  - All functions of up to 6 variables
  - Two functions of up to 5 variables each
  - Shift registers up to 32 stages long
  - Memories of 64 bits
- Power consumption, performance, and cost benefits
  - Multiple configurations within a slice
  - Shift register mode greatly reduces power consumption compared with a flip-flop implementation
  - Increased ratio of slice_m: memories are available closer to the source or target logic
  - Logic and memory functions can be packed more efficiently
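The claim that one 6-input LUT can implement any function of up to 6 variables, or two functions of up to 5 variables, follows from the LUT being a 64-bit memory addressed by its inputs. The sketch below models that idea in Python. The names `make_lut`, `lut_read`, and `dual_lut5_read` are hypothetical helpers for illustration, and the convention that the two 5-input functions occupy the lower and upper 32 bits is an assumption of this model, not a statement about the vendor primitive.

```python
def make_lut(func, n_inputs=6):
    """Return the truth table of func as an integer with 2**n_inputs bits."""
    init = 0
    for addr in range(2 ** n_inputs):
        # Bit i of the address is the value of input i.
        bits = [(addr >> i) & 1 for i in range(n_inputs)]
        if func(*bits):
            init |= 1 << addr
    return init


def lut_read(init, *inputs):
    """Look up one output bit: the inputs form the address into the table."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> addr) & 1


def dual_lut5_read(init64, *inputs5):
    """Assumed model: two 5-input functions share one 64-bit table.

    The lower 32 bits hold the first function, the upper 32 bits the second.
    """
    low = lut_read(init64 & 0xFFFFFFFF, *inputs5)
    high = lut_read(init64 >> 32, *inputs5)
    return low, high


# Any 6-variable function fits in 64 bits, for example AND and parity.
and6 = make_lut(lambda a, b, c, d, e, f: a & b & c & d & e & f)
xor6 = make_lut(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)

# Two 5-variable functions packed into the same 64 bits.
and5 = make_lut(lambda a, b, c, d, e: a & b & c & d & e, n_inputs=5)
or5 = make_lut(lambda a, b, c, d, e: a | b | c | d | e, n_inputs=5)
packed = and5 | (or5 << 32)
```

Reading `lut_read(and6, 1, 1, 1, 1, 1, 1)` returns the AND of six ones, and `dual_lut5_read(packed, 1, 0, 0, 0, 0)` returns the AND and OR results for the same five inputs.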

Higher DSP Performance
- Most advanced DSP architecture
  - New optional pre-adder for symmetric filters
  - 25x18 multiplier
    - High-resolution filters
    - Efficient floating point support
  - ALU-like second stage enables mapping of advanced operations
    - Programmable op-code
    - SIMD support
    - Addition / subtraction / logic functions
  - Pattern detector
- Lowest power consumption
- Highest DSP slice capacity
  - Up to 2K DSP slices
Virtex-6 LXT / SXT FPGAs

Spartan-6 Family

Spartan-6: Next Generation 45nm Spartan Family
- Increased performance and density
- Evolutionary feature enhancements
- Dramatic cost and power reductions
- Two silicon platforms
  - LX: cost-optimized logic, memory
  - LXT: LX features plus high-speed serial connectivity
- More unified and integrated with Virtex
Delivering the optimal balance of cost, power, and performance

Spartan-6 Logic Evolution: Higher Performance, Increased Utilization
- Spartan-3A series and earlier: LUT / FF pair
  - LUT4: great general-purpose logic
- Spartan-6: LUT / dual FF pair (new, efficient design)
  - LUT6: 6-input LUT and a 2nd flip-flop for higher utilization
  - Modified Virtex 6-input LUT
  - 4 additional flip-flops per slice
  - Higher utilization for register-intensive designs
- Efficient and capable logic
  - Arithmetic functions
  - Distributed RAM and shift registers
  - Interconnect
- Up to 25% higher performance

Spartan-6 CLB Logic Slices
- SliceM (25%): LUT6, 8 registers, carry logic, wide function muxes, distributed RAM / SRL logic
- SliceL (25%): LUT6, 8 registers, carry logic, wide function muxes
- SliceX (50%): LUT6 optimized for logic, 8 registers
Slice mix chosen for the optimal balance of cost, power, and performance

Spartan-6: Lowest Total Power
- Static power reductions
  - Process and architectural innovations
- Dynamic power reduction
  - Lower node capacitance and architectural innovations
- More hard IP functionality
  - Integrated transceivers and other logic reduce power
  - Hard IP uses less current and power than soft IP
- Lower I/O power
- Low power option
  - -L reduces power even further
- Fewer supply rails reduce power

Spartan-6 Hard Memory Controller
- New hard block memory controller
  - Up to 4 controllers per device
- Why a hard memory block?
  - Very common design component
  - Multiple customer benefits
- Customer requests and hard block memory controller benefits
  - Higher performance: up to 800 Mbps
  - Lower cost: saves soft logic, smaller die
  - Lower power: dedicated logic
  - Easier designs: timing closure no longer an issue; configurable multi-port user interface; CoreGen/MIG wizard and EDK support

Spartan-6 Memory Controller
- Only low-cost FPGA with a hard memory controller
- Guaranteed memory interface performance, reducing engineering and board design time
- DDR, DDR2, DDR3 and LPDDR support
- Up to 12.8 Gbps bandwidth for each memory controller
- Automatic calibration features
- Multiport structure for the user interface
  - Six 32-bit programmable ports from the fabric
  - Controller interface to 4-, 8- or 16-bit memory devices
[Figure: Spartan-6 memory controller connecting to DRAM, SRAM, FLASH, and EEPROM; DRAM types DDR, DDR2, DDR3, LPDDR]

Integrated DSP Slice: XtremeDSP DSP48A Slice
- 250 MHz implementation
  - Fast multiplier and 48-bit adder
  - ASIC-like performance
  - Input and output registers for higher speed
- Optimizes FIR filter applications

Better, More BRAM
- More block RAMs
  - 2x higher BRAM to logic cell ratio than the Spartan-3A platform
- More port flexibility
  - An 18K BRAM can be split into two 9K BRAM blocks that can be addressed independently
- Improves buffering, caching, and data storage
  - Excellent for embedded processing and communication protocols
  - Enables DSP blocks to provide more efficient video and surveillance algorithms
- Lower static power

Compare to Spartan-3A: Twice the Capabilities, Half the Power, Hard Blocks!
Feature | Extended Spartan-3A (90nm) | Spartan-6 (45nm)
Logic Cells | Up to 55K | Up to 150K
LUT Design | 4-input LUT + FF | 6-input LUT + 2FF
Block RAM (Mbit) | Up to 2 Mbit | Up to 5 Mbit
Transceiver Count / Speed | No | Up to 8 / up to 3.125 Gbps
Voltage Scaling | No (1.2V only) | Yes (1.2V, 1.0V)
Static Power (typ mW) | mW (smallest density) | Up to 6% less!
Memory Interface | 400 Mbps | DDR3 800 Mbps
Max Differential IO | 640 Mbps | 1050 Mbps
Multipliers/DSP | Up to 126 multipliers / DSP | Up to 84 DSP48 blocks
Memory Controllers | No | Up to 4 hard blocks
Clock Management | DCM only | DCM & PLL
PCI Express Endpoint | No | Yes, Gen 1
Security | Device DNA only | Device DNA & AES

Spartan-6 LX / LXT FPGAs
** All memory controllers support the x16 interface, except in the CS225 package, where only x8 is supported

FPGA Design Flow

Specification
Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds.

    library IEEE;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;

    entity RC5_core is
      port(
        clock, reset, encr_decr: in  std_logic;
        data_input:  in  std_logic_vector(31 downto 0);
        data_output: out std_logic_vector(31 downto 0);
        out_full:    in  std_logic;
        key_input:   in  std_logic_vector(31 downto 0);
        key_read:    out std_logic  -- no semicolon after the last port
      );
    end RC5_core;  -- must match the entity name

Design process (1)
- Specification
- Verilog description (your Verilog source files)
- Functional simulation
- Synthesis
- Post-synthesis simulation

Design process (2)
- Implementation (mapping, placing and routing)
- Timing simulation
- Configuration
- On-chip testing

Design process control from Active-HDL

Logic Synthesis
VHDL description to circuit netlist:

    architecture MLU_DATAFLOW of MLU is
      signal A1: STD_LOGIC;
      signal B1: STD_LOGIC;
      signal Y1: STD_LOGIC;
      signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
    begin
      -- Optional inversion of the inputs and of the output
      A1 <= A when (NEG_A = '0') else not A;
      B1 <= B when (NEG_B = '0') else not B;
      Y  <= Y1 when (NEG_Y = '0') else not Y1;
      -- The four candidate logic functions
      MUX_0 <= A1 and B1;
      MUX_1 <= A1 or B1;
      MUX_2 <= A1 xor B1;
      MUX_3 <= A1 xnor B1;
      -- Select one function with the two-bit code L1 & L0
      with (L1 & L0) select
        Y1 <= MUX_0 when "00",
              MUX_1 when "01",
              MUX_2 when "10",
              MUX_3 when others;
    end MLU_DATAFLOW;
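The dataflow description of the MLU above can be checked against a small behavioral model before synthesis. The Python sketch below is such a model under stated assumptions: the function name `mlu` is hypothetical, all signals are single bits (0 or 1), and the select code is read as L1 & L0 with "00" = AND, "01" = OR, "10" = XOR, and anything else = XNOR.

```python
def mlu(a, b, neg_a, neg_b, neg_y, l1, l0):
    """Behavioral model of the MLU: optional input inversion, one of four
    logic functions chosen by (l1, l0), then optional output inversion."""
    a1 = a ^ neg_a            # invert A when NEG_A = 1
    b1 = b ^ neg_b            # invert B when NEG_B = 1
    candidates = [
        a1 & b1,              # select 00: AND
        a1 | b1,              # select 01: OR
        a1 ^ b1,              # select 10: XOR
        1 - (a1 ^ b1),        # select 11: XNOR
    ]
    y1 = candidates[(l1 << 1) | l0]
    return y1 ^ neg_y         # invert the result when NEG_Y = 1
```

For example, `mlu(1, 1, 0, 0, 1, 0, 0)` selects AND and inverts the output, so the unit behaves as a NAND gate for that setting.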

Synthesis Tools
- XST and others

Features of synthesis tools
- Interpret RTL code
- Produce a synthesized circuit netlist
  - Synplify Pro: produces the netlist in the standard EDIF (.edf) format; can optionally produce a .vhm file (VHDL code merged into one file) for post-synthesis simulation
  - XST: produces the netlist in NGC format
- The netlist is composed of gates in the particular Xilinx implementation library
  - http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has information on libraries
- Give preliminary performance estimates
- Some can display circuit schematics corresponding to the EDIF netlist

Timing report after synthesis

    Performance Summary
    *******************
    Worst slack in design: -0.924

    Starting Clock | Requested Frequency | Estimated Frequency | Requested Period | Estimated Period | Slack  | Clock Type | Clock Group
    exam clk       | 85.0 MHz            | 78.8 MHz            | 11.765           | 12.688           | -0.924 | inferred   | Inferred_clkgroup_
    System         | 85.0 MHz            | 86.4 MHz            | 11.765           | 11.572           | 0.193  | system     | default_clkgroup

Implementation
After synthesis, the entire implementation process is performed by the FPGA vendor tools.

Mapping
[Figure: netlist mapped onto LUTs and flip-flops]
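The slack column in the synthesis timing report is the requested clock period minus the estimated clock period, so a negative slack means the estimated circuit is slower than requested. The sketch below shows that arithmetic, assuming a requested clock of 85 MHz and the estimated frequencies 78.8 MHz and 86.4 MHz read from the report. The function names are hypothetical, and because the report's frequencies are rounded, the computed slack only approximates the printed values.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(requested_mhz, estimated_mhz):
    """Slack = requested period - estimated period (negative means too slow)."""
    return period_ns(requested_mhz) - period_ns(estimated_mhz)


requested = 85.0                       # MHz, about 11.765 ns
user_clock = slack_ns(requested, 78.8) # estimated clock is slower: negative
system = slack_ns(requested, 86.4)     # estimated clock is faster: positive

print(f"requested period: {period_ns(requested):.3f} ns")
print(f"user clock slack: {user_clock:.3f} ns")
print(f"system slack:     {system:.3f} ns")
```

A design with negative slack after synthesis may still meet timing after place and route, or may not, since synthesis estimates do not include real routing delays.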

Placing
[Figure: CLB slices placed on the FPGA]

Routing
[Figure: programmable connections on the FPGA]

Map report header

    Release 7..3i Map H.4
    Xilinx Mapping Report File for Design 'exam'

    Design Information
    ------------------
    Command Line   : c:\xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k 4 -cm area -c -tx off exam.ngd exam.pcf
    Target Device  : xc2s200
    Target Package : fg256
    Target Speed   : -6
    Mapper Version : spartan2 -- $Revision:.26.6.4 $
    Mapped Date    : Wed Nov 2 :5:5 25

Map report

    Design Summary
    --------------
    Number of errors:
    Number of warnings:
    Logic Utilization:
      Number of Slice Flip Flops:        144 out of 4,704    3%
      Number of 4 input LUTs:            173 out of 4,704    3%
    Logic Distribution:
      Number of occupied Slices:         145 out of 2,352    6%
      Number of Slices containing only related logic:   145 out of 145  100%
      Number of Slices containing unrelated logic:        0 out of 145    0%
        *See NOTES below for an explanation of the effects of unrelated logic
    Total Number 4 input LUTs:           210 out of 4,704    4%
      Number used as logic:              173
      Number used as a route-thru:         5
      Number used as 16x1 RAMs:           32
    Number of bonded IOBs:                74 out of 176     42%
    Number of GCLKs:                       1 out of 4       25%
    Number of GCLKIOBs:                    1 out of 4       25%

Place & route report

    Timing Score: 0

    Asterisk (*) preceding a constraint indicates it was not met.
    This may be due to a setup or hold violation.

    Constraint                                           | Requested | Actual   | Logic Levels
    TS_clk = PERIOD TIMEGRP "clk" 11.765 ns HIGH 50%     | 11.765ns  | 11.622ns | 3
    OFFSET = OUT 11.765 ns AFTER COMP "clk"              | 11.765ns  | .49ns    |
    OFFSET = IN 11.765 ns BEFORE COMP "clk"              | 11.765ns  | .442ns   | 2

Post layout timing report

    Timing summary:
    ---------------
    Timing errors: 0  Score: 0
    Constraints cover 4292 paths, nets, and 38 connections

    Design statistics:
      Minimum period: 11.622ns (Maximum frequency: 86.044MHz)
      Minimum input required time before clock: .442ns
      Minimum output required time after clock: .49ns
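Two pieces of arithmetic sit behind the map and post-layout reports above: the utilization percentages (used resources divided by available resources, shown as a whole percent) and the maximum frequency (the reciprocal of the minimum period). The sketch below reproduces both, assuming the counts read 144 flip-flops, 173 + 5 + 32 LUTs out of 4,704, 145 slices out of 2,352, 74 IOBs out of 176, and a minimum period of 11.622 ns. The helper names are hypothetical, and truncating to a whole percent is an assumption that happens to match these figures.

```python
def utilization_percent(used, available):
    """Whole-percent utilization, truncated (assumed to match the report)."""
    return used * 100 // available


def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a minimum period in nanoseconds."""
    return 1000.0 / min_period_ns


# LUTs are counted whether used as logic, as route-throughs, or as RAM.
luts_total = 173 + 5 + 32

report = {
    "slice flip-flops": utilization_percent(144, 4704),
    "4-input LUTs (total)": utilization_percent(luts_total, 4704),
    "occupied slices": utilization_percent(145, 2352),
    "bonded IOBs": utilization_percent(74, 176),
}

for name, percent in report.items():
    print(f"{name}: {percent}%")
print(f"max frequency: {max_frequency_mhz(11.622):.3f} MHz")
```

The total LUT count matters for capacity planning because route-through and RAM uses consume LUTs that are then unavailable for logic.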

Post-place-and-route simulation
- After place-and-route has been performed, you can run a post-place-and-route simulation
  - Real timing information is now available
- Static timing analysis is also possible: it shows the worst-case critical path in the circuit

Configuration
- Once a design is implemented, you must create a file that the FPGA can understand
  - This file is called a bit stream: a BIT file (.bit extension)
- The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file that stores the programming information

Configuration of SRAM based FPGAs

System Gates vs. Real Gates
- One common metric used to measure the size of a device in the ASIC world is the equivalent gate (e-gate)
  - Convention used: a 2-input NAND function represents one equivalent gate
- An equivalent gate consists of an arbitrary number of transistors
  - Different vendors provide different functions in their cell libraries, and each implementation of each function requires a different number of transistors, so capacity and complexity are difficult to compare
- Solution: assign each function an equivalent gate value and sum all these values
- How can we establish a basis for comparison between FPGAs and ASICs?
  - Can an ASIC of 500,000 equivalent gates that needs to be migrated into an FPGA fit into a particular FPGA?
Source: The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright 2004 Mentor Graphics Corp. (www.mentor.com)

FPGAs: System Gates
- System gates: a 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates
- Rule of thumb: divide the system gates value by three, so three million FPGA system gates would equate to one million ASIC equivalent gates
- However, to compare two different implementations on an FPGA (e.g. a floating point adder vs. a fixed point adder), designers should use the resources available in an FPGA:
  - Number of 4-input LUTs used
  - Number of embedded multipliers
  - Number of embedded RAM blocks

State-of-the-Art FPGAs
- 65-90 nm process on 300 mm wafers
  - Lower cost per function (LUT + register)
  - Smaller and faster transistors: higher speed
- System speed up to 500 MHz
  - Mainly through smart interconnects, clock management, dedicated circuits, flexible I/O
- Integrated transceivers running at gigabits per second

More Logic and Better Features
- >100,000 LUTs & flip-flops
- >200 embedded RAMs, and the same number of 18 x 18 multipliers
- 1156 pins (balls) with >800 GP I/O
- 50 I/O standards, incl. LVDS with internal termination
- 16 low-skew global clock lines
- Multiple clock management circuits
- On-chip microprocessor(s) and multi-Gbps transceivers
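The system-gate rule of thumb from the slides (divide the FPGA system gates value by three to estimate ASIC equivalent gates) is simple enough to write down directly. The sketch below is a rough illustration only: the function names are hypothetical, the factor of three is the slide's rule of thumb rather than a vendor specification, and the 500,000-gate ASIC is an assumed example value.

```python
SYSTEM_TO_ASIC_RATIO = 3  # rule of thumb quoted in the slides, not a vendor figure


def asic_equivalent_gates(system_gates):
    """Estimate ASIC equivalent gates from FPGA system gates."""
    return system_gates // SYSTEM_TO_ASIC_RATIO


def asic_design_fits(asic_gates, fpga_system_gates):
    """Rough check: does an ASIC design of asic_gates fit the given FPGA?"""
    return asic_gates <= asic_equivalent_gates(fpga_system_gates)


print(asic_equivalent_gates(3_000_000))        # three million system gates
print(asic_design_fits(500_000, 1_000_000))    # assumed example sizes
print(asic_design_fits(500_000, 2_000_000))
```

Because a single LUT can stand for anywhere from one to more than twenty primitive gates, such an estimate is only a first filter; comparing LUT, multiplier, and RAM block counts is more reliable.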

Latest Devices: Capacity & Features
- Xilinx Virtex-5
  - 65nm process
  - Up to 96 I/Os
  - >2 logic cells
  - Up to 552 18kb block RAMs (~10Mb RAM)
  - 45 DSP slices (18x18 multiplier-accumulator)
  - 2 digital clock managers (DCM)
  - 24 high-speed serial transceivers (622Mb/s to .gb/s)
  - Up to four PowerPC 405 cores
- Altera Stratix-II
  - 90nm process
  - Up to 1170 I/Os
  - 79 logic elements
  - 9.6Mb embedded RAM
  - 96 DSP blocks: 38 18x18 multipliers
  - 12 PLLs
  - Serial I/O up to 1 Gb/s
  - No hard processor cores

FPGAs Becoming More Attractive
[Figure: capacity, speed, and price trends by year; labels: 2 X bigger capacity, 5.5 X faster, 5 X less expensive. Source: Xilinx]

FPGA Shortcomings
- Circuit delay: delay increases due to programmable switches in the FPGA routing architecture
- Area: configuration cells and programmable resources incur a substantial area penalty
- Power: typically not suited for low-power applications
- Need to improve
[Figure: ASIC vs. FPGA comparison of performance, cost, and time to market]

Conclusion
- FPGAs are the main enabler of Reconfigurable Computing Systems (RCS)
- FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs
  - Advantages: flexible, programmable
  - Disadvantages: power dissipation, performance with respect to ASICs
- Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera
- RCS can be realized with several technologies:
  - FPGAs: fine/medium grain
  - Coarse Grain Reconfigurable Architectures (CGRAs): coarse grain