The Virtex FPGA and Introduction to design techniques

SMD098 Computation Structures, Lecture 6

Simple Programmable Logic Devices

Programmable Array Logic (PAL)
AND-OR arrays are common blocks in SPLD and CPLD architectures.
They implement two-level logic functions like: F = ABC + B + C

Simple Programmable Logic Devices
[Figure: Vantis PALV6V8 block diagram with a 32 x 64 programmable AND array and eight macrocells (MC0 to MC7), plus a detailed view of one macrocell.]

Complex Programmable Logic Devices
CPLDs have much higher capacity than SPLDs, but the architecture is similar.
[Figure: Xilinx XC9500 architecture. A JTAG controller, an in-system programming controller, I/O blocks and function blocks 1 to N (18 macrocells each) are connected through the FastCONNECT switch matrix. The function block detail shows 36 inputs from the FastCONNECT switch matrix, the programmable AND array, the product term allocators and the macrocells.]

Field Programmable Gate Arrays - Xilinx XC4000

Virtex Architecture
SRAM based, needs external configuration memory.
Two main configurable elements: configurable logic blocks (CLBs) and input/output blocks (IOBs).
CLBs interconnect through a general routing matrix (GRM).
The VersaRing interface provides additional routing resources around the periphery of the device.
The Virtex architecture also includes the following circuits that connect to the GRM:
    Dedicated block memories of 4096 bits each
    Clock DLLs for clock-distribution delay compensation and clock domain control
    3-state buffers (BUFTs) associated with each CLB that drive dedicated segmentable horizontal routing resources
[Figure: Virtex architecture overview showing the CLBs, block RAMs (BRAMs), DLLs, IOBs and the VersaRing.]

Virtex routing resources
A view from FPGA Editor.
    Blue boxes are slices (2 slices = 1 CLB).
    Grey lines are local interconnect.
    Red lines are long lines.
    Green lines are pin wires.
    There are three switch boxes per CLB.

Virtex clock distribution
There are four primary global clock nets that are driven by four global buffers. If these clock nets are used, clock skew will not be a problem.
[Figure: global clock distribution network with the pads GCLKPAD0 to GCLKPAD3, the buffers GCLKBUF0 to GCLKBUF3, the global clock spine, the global clock rows and the global clock column.]

Virtex IOB
The Virtex IOBs are configurable to support several different high-speed standards.
[Figure: Virtex IOB with registers for the input, output and 3-state paths, programmable delay, weak keeper, IBUF, OBUFT, Vref and the pad.]

Virtex CLB
Xilinx definitions:
    Logic cell (LC): a 4-input LUT, carry logic and a storage element
    A slice consists of two LCs
    A CLB consists of 4.5 LCs. The extra 1/2 LC comes from the fact that some additional logic is available for implementing functions with more than 4 inputs.
[Figure: two-slice Virtex CLB. Each slice contains two LUTs (F and G), carry and control logic, and two storage elements, with carry in (CIN) and carry out (COUT).]

Virtex slice - detailed view
The additional logic consists of the F5 and F6 multiplexers.
[Figure: detailed view of a Virtex slice.]

Virtex - look-up tables
The Virtex LUTs can be configured to implement:
    4-input LUTs
    16x1-bit synchronous RAM
    16x2-bit or 32x1-bit synchronous RAM, or 16x1-bit dual-port synchronous RAM, by combining the two LUTs in one slice
    16-bit shift register
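As a rough illustration of the 16x1-bit synchronous RAM mode, the sketch below describes such a memory behaviorally. The entity name DistRam16x1 and all port names are hypothetical and do not come from the lecture. Whether a synthesis tool actually maps this description onto a single LUT depends on the tool and its settings, so treat that mapping as an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity and port names, for illustration only.
entity DistRam16x1 is
  port (
    Clk  : in  std_logic;
    We   : in  std_logic;                     -- Write enable
    Addr : in  std_logic_vector(3 downto 0);  -- 16 locations
    Din  : in  std_logic;
    Dout : out std_logic);
end DistRam16x1;

architecture RTL of DistRam16x1 is
  type ram_t is array (0 to 15) of std_logic;
  signal ram : ram_t := (others => '0');
begin  -- RTL

  -- Synchronous write: a location changes only on a rising clock edge.
  process (Clk)
  begin
    if rising_edge(Clk) then
      if We = '1' then
        ram(to_integer(unsigned(Addr))) <= Din;
      end if;
    end if;
  end process;

  -- Combinational read: the output follows the addressed location.
  Dout <= ram(to_integer(unsigned(Addr)));

end RTL;
```

The 16 one-bit locations correspond to the 16 configuration bits of a 4-input LUT, which is why this storage can live in the LUT instead of in flip-flops.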

Virtex slice - FPGA Editor view
[Figure: a Virtex slice as shown in FPGA Editor.]

Example 1

    library ieee;
    use ieee.std_logic_1164.all;

    entity Example1 is
      port (
        A, B, C, D     : in  std_logic;   -- Inputs
        Reset, Clk, En : in  std_logic;   -- Reset, Clock, Clock enable
        Y              : out std_logic);  -- Output
    end Example1;

    architecture RTL of Example1 is
    begin  -- RTL
      process (Clk)
      begin
        if rising_edge(Clk) then
          if Reset = '1' then
            Y <= '0';                     -- Synchronous reset
          elsif En = '1' then
            Y <= A xor B xor C xor D;     -- 4-input function
          end if;
        end if;
      end process;
    end RTL;

How will this be implemented? How many slices?

Example 1
[Figure: implementation of Example 1.]

Example 2
8-bit adder with carry input and output. How can this be implemented in a Virtex? How many slices?

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity Example2 is
      port (
        A, B : in  unsigned(7 downto 0);
        Cin  : in  std_logic;
        R    : out unsigned(7 downto 0);
        Cout : out std_logic);
    end Example2;

    architecture RTL of Example2 is
    begin  -- RTL
      process (A, B, Cin)
        variable r_tmp   : unsigned(8 downto 0);  -- Sum including carry out
        variable cin_tmp : integer range 0 to 1;  -- Carry in as an integer
      begin
        if Cin = '0' then
          cin_tmp := 0;
        else
          cin_tmp := 1;
        end if;
        r_tmp := ('0' & A) + B + cin_tmp;
        R     <= r_tmp(7 downto 0);
        Cout  <= r_tmp(8);
      end process;
    end RTL;

Example 2
Four slices. The carry chain is the highlighted (red) net. The next slide shows one of these slices.
[Figure: FPGA Editor view of the four slices with the carry chain highlighted.]

Example 2
One full adder per logic cell, that is, two per slice.
[Figure: FPGA Editor view of one slice of the adder.]

Example 3 - shift register

    library ieee;
    use ieee.std_logic_1164.all;

    entity Example3 is
      port (
        A          : in  std_logic;
        Clk, Reset : in  std_logic;
        Y1, Y2     : out std_logic);
    end Example3;

    architecture RTL of Example3 is
      signal S1, S2 : std_logic_vector(15 downto 0);
    begin  -- RTL

      -- Shift register with asynchronous reset
      Shift1 : process (Clk, Reset)
      begin
        if Reset = '1' then
          S1 <= (others => '0');
        elsif rising_edge(Clk) then
          S1 <= S1(14 downto 0) & A;
        end if;
      end process;

      -- Shift register without reset
      Shift2 : process (Clk)
      begin
        if rising_edge(Clk) then
          S2 <= S2(14 downto 0) & A;
        end if;
      end process;

      Y1 <= S1(15);
      Y2 <= S2(15);

    end RTL;

Shift1 is implemented with 16 FFs, which takes 8 slices. Shift2 is implemented with one SRL16, which takes 1/2 slice. The SRL16 has no reset input, so only the shift register without reset can be mapped to it.
[Figure: synthesized schematic of Example 3, with a chain of 16 flip-flops driving Y1 and one SRL16 driving Y2.]

Virtex Block RAM
Each Block RAM is a synchronous dual-ported 4096-bit RAM with independent control signals for each port. Data widths may be configured independently.
[Figure: RAMB4_S#_S# symbol with the ports WEA, ENA, RSTA, CLKA, ADDRA[#:0], DIA[#:0], DOA[#:0] and WEB, ENB, RSTB, CLKB, ADDRB[#:0], DIB[#:0], DOB[#:0].]
You have actually already used the block RAM in one lab.

    Virtex Device   # of Blocks   Total Block SelectRAM Bits
    XCV50                 8               32,768
    XCV100               10               40,960
    XCV150               12               49,152
    XCV200               14               57,344
    XCV300               16               65,536
    XCV400               20               81,920
    XCV600               24               98,304
    XCV800               28              114,688
    XCV1000              32              131,072
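A behavioral description of one port of such a memory might look like the sketch below, here in a 512 x 8 organization (512 x 8 = 4096 bits, one block). The entity name BlockRam512x8 and the port names are hypothetical. The sketch assumes that written data also appears on the output during a write, and that the synthesis tool infers a block RAM from this style; neither is guaranteed by the lecture material, and instantiating the RAMB4 primitive or using CoreGenerator are alternatives.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity and port names, for illustration only.
entity BlockRam512x8 is
  port (
    Clk  : in  std_logic;
    En   : in  std_logic;                     -- Port enable
    We   : in  std_logic;                     -- Write enable
    Addr : in  std_logic_vector(8 downto 0);  -- 512 words
    Din  : in  std_logic_vector(7 downto 0);
    Dout : out std_logic_vector(7 downto 0));
end BlockRam512x8;

architecture RTL of BlockRam512x8 is
  type ram_t is array (0 to 511) of std_logic_vector(7 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin  -- RTL

  -- Both write and read are synchronous to Clk.
  process (Clk)
  begin
    if rising_edge(Clk) then
      if En = '1' then
        if We = '1' then
          ram(to_integer(unsigned(Addr))) <= Din;
          Dout <= Din;  -- Assumed behavior: written data is mirrored on the output
        else
          Dout <= ram(to_integer(unsigned(Addr)));
        end if;
      end if;
    end if;
  end process;

end RTL;
```

Unlike a LUT-based RAM, the read here is registered, so the data appears one clock cycle after the address is applied.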

Virtex DLLs
A Delay-Locked Loop (DLL) can align internal and external clocks. It effectively eliminates on-chip clock distribution delay, which maximizes the achievable speed.
[Figure: two chips, each with a DLL, exchanging clock and data, and a DLL block diagram with comparator, error signal, delay and clock distribution.]
Virtex devices have four DLLs. The DLLs can also be used to divide or double the incoming clock frequency internally. The output of the DLL can drive the global clock routing resources, and clock skew can be eliminated.

Virtex compared to Virtex-E

Virtex

    Device    System Gates  CLB Array  Logic Cells  Maximum Available I/O  Block RAM Bits  Maximum SelectRAM+ Bits
    XCV50           57,906  16x24            1,728                    180          32,768                   24,576
    XCV100         108,904  20x30            2,700                    180          40,960                   38,400
    XCV150         164,674  24x36            3,888                    260          49,152                   55,296
    XCV200         236,666  28x42            5,292                    284          57,344                   75,264
    XCV300         322,970  32x48            6,912                    316          65,536                   98,304
    XCV400         468,252  40x60           10,800                    404          81,920                  153,600
    XCV600         661,111  48x72           15,552                    512          98,304                  221,184
    XCV800         888,439  56x84           21,168                    512         114,688                  301,056
    XCV1000      1,124,022  64x96           27,648                    512         131,072                  393,216

Virtex-E

    Device     System Gates  Logic Gates  CLB Array  Logic Cells  Differential I/O Pairs  User I/O  BlockRAM Bits  Distributed RAM Bits
    XCV50E           71,693       20,736  16 x 24          1,728                      83       176         65,536                24,576
    XCV100E         128,236       32,400  20 x 30          2,700                      83       196         81,920                38,400
    XCV200E         306,393       63,504  28 x 42          5,292                     119       284        114,688                75,264
    XCV300E         411,955       82,944  32 x 48          6,912                     137       316        131,072                98,304
    XCV400E         569,952      129,600  40 x 60         10,800                     183       404        163,840               153,600
    XCV600E         985,882      186,624  48 x 72         15,552                     247       512        294,912               221,184
    XCV1000E      1,569,178      331,776  64 x 96         27,648                     281       660        393,216               393,216
    XCV1600E      2,188,742      419,904  72 x 108        34,992                     344       724        589,824               497,664
    XCV2000E      2,541,952      518,400  80 x 120        43,200                     344       804        655,360               614,400
    XCV2600E      3,263,755      685,584  92 x 138        57,132                     344       804        753,664               812,544
    XCV3200E      4,074,387      876,096  104 x 156       73,008                     344       804        851,968             1,038,336

How to find the best implementation?
You have to know the target architecture in order to make efficient design implementations.
Synthesis tools will not always provide the optimal solution. Structural coding can aid the synthesis tool, provided that the designer knows a better solution.
Use vendor-specific module generation tools, such as Xilinx CoreGenerator. CoreGenerator can generate optimized cores such as arithmetic functions, FFTs, FIR filters, etc.

CoreGenerator flow
[Figure: CORE Generator flow. The CORE Generator outputs (VHO and VEO instantiation templates, behavioral simulation models, a schematic symbol and an EDN netlist) feed the HDL editor, the schematic editor, the synthesizer and the implementation tools, with separate functional and timing simulation flows.]

What is best - what are the requirements?
Some requirements can be:
    Short time to market
    Low resource usage (area)
    High operating frequency
    Low power consumption (Mikael will talk about this next lecture)
Depending on which requirement is most important, different design solutions will be optimal.

Time to market
If time to market is the most important requirement, your boss will not be satisfied if you try to optimize other requirements that are already met. You will not get a raise if you manage to save 5 CLBs because you spent two days optimizing a counter.
This is probably how most of you work in the lab. You try to meet the lab requirements before the deadline but don't care much whether your solution is the most efficient in terms of speed or area. Am I right?

Resource usage
If you are optimizing for area you should consider:
    Sequential execution instead of parallel execution
    Bit-serial implementation of data paths
    Scheduling of data paths, interleaving of resources in time
    Choosing the algorithm that minimizes area
    ...

Speed
If you are optimizing for speed you should consider:
    Parallel execution
    Pipelining
    Choosing the fastest algorithm
    ...
In the next and last lecture I will give you a practical example of how one algorithm, FIR filtering, can be implemented in hardware. We will optimize it for area and for speed, and we will come up with two separate implementations.
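To make the pipelining option concrete, here is a minimal sketch of a two-stage pipelined sum of four 8-bit operands. It is not the FIR example from the next lecture; the entity name PipelinedSum4 and its ports are hypothetical and chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity and port names, for illustration only.
entity PipelinedSum4 is
  port (
    Clk        : in  std_logic;
    A, B, C, D : in  unsigned(7 downto 0);
    Sum        : out unsigned(9 downto 0));  -- A + B + C + D, two cycles later
end PipelinedSum4;

architecture RTL of PipelinedSum4 is
  signal ab, cd : unsigned(8 downto 0);      -- Stage 1 registers
begin  -- RTL

  process (Clk)
  begin
    if rising_edge(Clk) then
      -- Stage 1: two partial sums computed in parallel
      ab <= resize(A, 9) + resize(B, 9);
      cd <= resize(C, 9) + resize(D, 9);
      -- Stage 2: adds the partial sums registered on the previous edge
      Sum <= resize(ab, 10) + resize(cd, 10);
    end if;
  end process;

end RTL;
```

Without the stage 1 registers, the longest register-to-register path would pass through two adders in series. With them, each path contains only one adder, at the cost of extra flip-flops and one more clock cycle of latency. An area-oriented alternative would reuse a single adder over several clock cycles.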

Final question
Which of these two implementations is optimal?
[Figure: two implementations of the same function with output F, built from blocks (including Max and Decoder blocks) that are annotated with their delay in units of T and their area in units of A.]
    First implementation: critical path = 3T, area = 8A
    Second implementation: critical path = 4T, area = 6A