The Virtex FPGA and Introduction to design techniques

SMD098 Computation Structures, Lecture 6

Simple Programmable Logic Devices

Programmable Array Logic (PAL)
AND-OR arrays are common building blocks in SPLD and CPLD architectures.
They implement two-level logic functions such as F = ABC + B + C.

Simple Programmable Logic Devices
[Figure: Vantis PALLV16V8 block diagram: a 32 x 64 programmable AND array feeding eight output macrocells (MC0 to MC7), with a detail view of one macrocell.]

Complex Programmable Logic Devices
CPLDs have much higher capacity than SPLDs, but the architecture is similar.
[Figure: Xilinx XC9500 architecture. The FastCONNECT switch matrix feeds 36 signals to each function block. Each function block contains a programmable AND array, product term allocators and macrocells 1 to 18. Detail views show one function block and one macrocell (product term allocator, product term set, clock, reset and OE; global set/reset and global clocks). The device also contains a JTAG controller and an in-system programming controller.]

Field Programmable Gate Arrays - Xilinx XC4000

Virtex Architecture
- SRAM based, needs external configuration memory.
- Two main configurable elements: configurable logic blocks (CLBs) and input/output blocks (IOBs).
- CLBs interconnect through a general routing matrix (GRM).
- The VersaRing interface provides additional routing resources around the periphery of the device.

The Virtex architecture also includes the following circuits that connect to the GRM:
- Dedicated block memories of 4096 bits each
- Clock DLLs for clock-distribution delay compensation and clock domain control
- 3-state buffers (BUFTs) associated with each CLB that drive dedicated segmentable horizontal routing resources

[Figure: Virtex floorplan: the CLB array in the center, block RAM (BRAM) columns beside it, the VersaRing and the IOBs around the periphery, and four DLLs.]

Virtex routing resources
A view from FPGA Editor:
- Blue boxes are slices (2 slices = 1 CLB).
- Grey lines are local interconnect.
- Red lines are long lines.
- Green lines are pin wires.
- There are three switch boxes per CLB.

Virtex clock distribution
There are four primary global clock nets, driven by four global buffers. If these clock nets are used, clock skew will not be a problem.
[Figure: global clock distribution: pads GCLKPAD0 to GCLKPAD3 drive buffers GCLKBUF0 to GCLKBUF3, which feed the global clock spine, the global clock rows and the global clock column.]

Virtex IOB
The Virtex IOBs are configurable to support several different high-speed standards.
[Figure: Virtex IOB: pad, input buffer (IBUF) with programmable delay and Vref input, 3-state output buffer (OBUFT), weak keeper, and registers with clock enable (CE) and set/reset (SR).]

Virtex CLB
Xilinx definitions:
- A logic cell (LC) consists of a 4-input LUT, carry logic and a storage element.
- A slice consists of two LCs.
- A CLB is counted as 4.5 LCs. It contains two slices (four LCs); the extra 1/2 LC comes from the additional logic that is available for implementing functions with more than 4 inputs.

[Figure: Virtex CLB with two slices (Slice 0 and Slice 1). Each slice contains two LUTs (F and G), carry and control logic, and two storage elements, with CIN and COUT on the carry chain.]

Virtex slice - detailed view
The additional logic consists of the F5 and F6 multiplexers.
[Figure: detailed view of a Virtex slice: the F and G LUTs, the carry logic (CY) between CIN and COUT, the F5 and F6 multiplexers, and the two storage elements with CLK, CE and SR.]

Virtex - look-up tables
The Virtex LUTs can be configured to implement:
- 4-input LUTs
- 16x1-bit synchronous RAM

Two LUTs in one slice can be combined to implement:
- 16x2-bit or 32x1-bit synchronous RAM
- 16x1-bit dual-port synchronous RAM

A LUT can also implement a 16-bit shift register.
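To make the idea of a 4-input LUT concrete, here is a minimal behavioral sketch: a LUT is simply a 16-entry truth table indexed by its four inputs. The entity name Lut4Sketch and the generic INIT are names chosen for this illustration, not Xilinx primitives, and the default value x"6996" (the 4-input XOR function) is only an example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical model of one 4-input LUT (illustration only).
entity Lut4Sketch is
  generic (
    -- Truth table: bit i is the output for input value i.
    -- x"6996" is 1 whenever an odd number of inputs is 1 (4-input XOR).
    INIT : std_logic_vector(15 downto 0) := x"6996");
  port (
    I : in  std_logic_vector(3 downto 0);  -- the four LUT inputs
    O : out std_logic);                    -- the LUT output
end Lut4Sketch;

architecture RTL of Lut4Sketch is
begin
  -- The inputs select one of the 16 stored bits.
  O <= INIT(to_integer(unsigned(I)));
end RTL;
```

Any function of four variables fits in one LUT in this way; only the 16 stored bits change.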

Virtex slice - FPGA Editor view

Example

    library ieee;
    use ieee.std_logic_1164.all;

    entity Example is
      port (
        A, B, C, D     : in  std_logic;   -- Inputs
        Reset, Clk, En : in  std_logic;   -- Reset, Clock, Clock enable
        Y              : out std_logic);  -- Output
    end Example;

    architecture RTL of Example is
    begin  -- RTL
      process(Clk)
      begin
        if rising_edge(Clk) then
          if Reset = '1' then
            Y <= '0';
          elsif En = '1' then
            Y <= A xor B xor C xor D;
          end if;
        end if;
      end process;
    end RTL;

How will this be implemented? How many slices?

Example

Example 2
8-bit adder with carry input and output.
How can this be implemented in a Virtex? How many slices?

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity Example2 is
      port (
        A, B : in  unsigned(7 downto 0);
        Cin  : in  std_logic;
        R    : out unsigned(7 downto 0);
        Cout : out std_logic);
    end Example2;

    architecture RTL of Example2 is
    begin  -- RTL
      process(A, B, Cin)
        variable r_tmp   : unsigned(8 downto 0);  -- 9 bits to hold the carry out
        variable cin_tmp : integer range 0 to 1;
      begin
        if Cin = '0' then
          cin_tmp := 0;
        else
          cin_tmp := 1;
        end if;
        r_tmp := ('0' & A) + B + cin_tmp;
        R     <= r_tmp(7 downto 0);
        Cout  <= r_tmp(8);
      end process;
    end RTL;

Example 2
Four slices. The carry chain is the highlighted (red) net. The next slide shows this slice.

Example 2
One full adder per logic cell, so each slice holds two bits of the adder.

Example 3 - shift register

    library ieee;
    use ieee.std_logic_1164.all;

    entity Example3 is
      port (
        A          : in  std_logic;
        Clk, Reset : in  std_logic;
        Y1, Y2     : out std_logic);
    end Example3;

    architecture RTL of Example3 is
      signal S1, S2 : std_logic_vector(15 downto 0);
    begin  -- RTL

      -- Shift register with asynchronous reset
      Shift1 : process(Clk, Reset)
      begin
        if Reset = '1' then
          S1 <= (others => '0');
        elsif rising_edge(Clk) then
          S1 <= S1(14 downto 0) & A;
        end if;
      end process;

      -- Shift register without reset
      Shift2 : process(Clk)
      begin
        if rising_edge(Clk) then
          S2 <= S2(14 downto 0) & A;
        end if;
      end process;

      Y1 <= S1(15);
      Y2 <= S2(15);
    end RTL;

Shift1 has a reset, so it is built from 16 flip-flops (FDC) and occupies 8 slices. Shift2 has no reset and maps onto a single SRL16, which occupies 1/2 slice.
[Figure: synthesized schematics: a chain of 16 FDC flip-flops s[0] to s[15] driving Y1, and one SRL16 driving Y2.]

Virtex Block RAM
Each Block RAM is a synchronous dual-ported 4096-bit RAM with independent control signals for each port. The data widths of the two ports can be configured independently.
[Figure: RAMB4_S#_S# symbol. Port A: WEA, ENA, RSTA, CLKA, ADDRA[#:0], DIA[#:0], DOA[#:0]. Port B: WEB, ENB, RSTB, CLKB, ADDRB[#:0], DIB[#:0], DOB[#:0].]
You have actually already used the block RAM in one lab.

    Virtex Device   # of Blocks   Total Block SelectRAM Bits
    XCV50                8              32,768
    XCV100              10              40,960
    XCV150              12              49,152
    XCV200              14              57,344
    XCV300              16              65,536
    XCV400              20              81,920
    XCV600              24              98,304
    XCV800              28             114,688
    XCV1000             32             131,072
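As an illustration of the block RAM organization described above, the following is a behavioral sketch of a 512 x 8 memory (4096 bits, the size of one block) with a write port and a separately clocked read port. The entity and port names are invented for this example. Whether a synthesis tool actually maps such a description onto a RAMB4 primitive depends on the tool, so treat this as a simulation model under that assumption, not as a guaranteed inference template.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: 512 words x 8 bits = 4096 bits.
entity BlockRamSketch is
  port (
    ClkA  : in  std_logic;                      -- port A clock (write port)
    WeA   : in  std_logic;                      -- port A write enable
    AddrA : in  unsigned(8 downto 0);           -- port A address
    DiA   : in  std_logic_vector(7 downto 0);   -- port A data in
    ClkB  : in  std_logic;                      -- port B clock (read port)
    AddrB : in  unsigned(8 downto 0);           -- port B address
    DoB   : out std_logic_vector(7 downto 0));  -- port B data out
end BlockRamSketch;

architecture RTL of BlockRamSketch is
  type ram_t is array (0 to 511) of std_logic_vector(7 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- Port A: synchronous write.
  PortA : process(ClkA)
  begin
    if rising_edge(ClkA) then
      if WeA = '1' then
        ram(to_integer(AddrA)) <= DiA;
      end if;
    end if;
  end process;

  -- Port B: synchronous read, independent clock.
  PortB : process(ClkB)
  begin
    if rising_edge(ClkB) then
      DoB <= ram(to_integer(AddrB));
    end if;
  end process;
end RTL;
```

The real primitive is more general than this sketch: both ports can read and write, and each port has its own enable and reset.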

Virtex DLLs
- A Delay-Locked Loop (DLL) can align internal and external clocks.
- It effectively eliminates on-chip clock distribution delay, which maximizes the achievable speed.
- Virtex devices have four DLLs.
- The DLLs can also be used to divide or double the incoming clock frequency internally.
- The output of a DLL can drive the global clock routing resources, so clock skew can be eliminated.
[Figure: DLL principle: a comparator produces an error signal that adjusts a delay element in front of the clock distribution network. Two chips, each with a DLL, share clock and data.]

Virtex compared to Virtex-E

Virtex
    Device    System Gates  CLB Array  Logic Cells  Max Available I/O  Block RAM Bits  Max SelectRAM+ Bits
    XCV50           57,906    16x24        1,728          180               32,768            24,576
    XCV100         108,904    20x30        2,700          180               40,960            38,400
    XCV150         164,674    24x36        3,888          260               49,152            55,296
    XCV200         236,666    28x42        5,292          284               57,344            75,264
    XCV300         322,970    32x48        6,912          316               65,536            98,304
    XCV400         468,252    40x60       10,800          404               81,920           153,600
    XCV600         661,111    48x72       15,552          512               98,304           221,184
    XCV800         888,439    56x84       21,168          512              114,688           301,056
    XCV1000      1,124,022    64x96       27,648          512              131,072           393,216

Virtex-E
    Device     System Gates  Logic Gates  CLB Array  Logic Cells  Differential I/O Pairs  User I/O  BlockRAM Bits  Distributed RAM Bits
    XCV50E           71,693       20,736   16 x 24        1,728             83               176         65,536          24,576
    XCV100E         128,236       32,400   20 x 30        2,700             83               196         81,920          38,400
    XCV200E         306,393       63,504   28 x 42        5,292            119               284        114,688          75,264
    XCV300E         411,955       82,944   32 x 48        6,912            137               316        131,072          98,304
    XCV400E         569,952      129,600   40 x 60       10,800            183               404        163,840         153,600
    XCV600E         985,882      186,624   48 x 72       15,552            247               512        294,912         221,184
    XCV1000E      1,569,178      331,776   64 x 96       27,648            281               660        393,216         393,216
    XCV1600E      2,188,742      419,904   72 x 108      34,992            344               724        589,824         497,664
    XCV2000E      2,541,952      518,400   80 x 120      43,200            344               804        655,360         614,400
    XCV2600E      3,263,755      685,584   92 x 138      57,132            344               804        753,664         812,544
    XCV3200E      4,074,387      876,096  104 x 156      73,008            344               804        851,968       1,038,336

How to find the best implementation?
- You have to know the target architecture in order to make efficient design implementations.
- Synthesis tools will not always provide the optimal solution.
- Structural coding can aid the synthesis tool, provided that the designer knows a better solution.
- Use vendor-specific module generation tools, such as Xilinx CoreGenerator. CoreGenerator can generate optimized cores such as arithmetic functions, FFTs, FIR filters etc.

CoreGenerator flow
[Figure: CORE Generator design flow. CORE Generator produces VHDL and Verilog instantiation templates (VHO, VEO), behavioral simulation models, a schematic symbol and an EDIF netlist (EDN). These feed the HDL editor and test bench, the synthesizer, the schematic editor, the functional and timing simulation flows (Unisim, simprim, VITAL and Verilog models, SDF) and the implementation tools.]

What is best - what are the requirements?
Some requirements can be:
- Short time to market
- Low resource usage (area)
- High operating frequency
- Low power consumption (Mikael will talk about this next lecture)

Depending on which requirement is most important, different design solutions will be optimal.

Time to market
- If time to market is the most important requirement, your boss will not be satisfied if you try to optimize other requirements that are already met.
- You will not get a raise if you manage to save 5 CLBs because you spent two days optimizing a counter.
- This is probably how most of you work in the lab. You try to meet the lab requirements before the deadline but don't care much whether your solution is the most efficient in terms of speed or area. Am I right?

Resource usage
If you are optimizing for area you should consider:
- Sequential execution instead of parallel execution
- Bit-serial implementation of data paths
- Scheduling of data paths, interleaving of resources in time
- Choosing the algorithm that minimizes area
- ...

Speed
If you are optimizing for speed you should consider:
- Parallel execution
- Pipelining
- Choosing the fastest algorithm
- ...

In the next and last lecture I will give you a practical example of how one algorithm, FIR filtering, can be implemented in hardware. We will optimize it for area and for speed, and we will come up with two separate implementations.
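As a small illustration of the bit-serial option listed under resource usage, the sketch below adds two numbers one bit per clock cycle, least significant bit first. It needs only one full adder and one carry flip-flop regardless of word length, at the cost of N clock cycles for an N-bit addition. The entity name, port names and the synchronous Clear input are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical bit-serial adder (illustration only).
entity SerialAdderSketch is
  port (
    Clk   : in  std_logic;
    Clear : in  std_logic;   -- synchronous clear of the carry before a new word
    A, B  : in  std_logic;   -- operand bits, least significant bit first
    Sum   : out std_logic);  -- sum bit for the current cycle
end SerialAdderSketch;

architecture RTL of SerialAdderSketch is
  signal carry : std_logic := '0';  -- carry saved from the previous bit
begin
  -- Full adder sum for the current bit position.
  Sum <= A xor B xor carry;

  -- Store the carry out for use in the next clock cycle.
  process(Clk)
  begin
    if rising_edge(Clk) then
      if Clear = '1' then
        carry <= '0';
      else
        carry <= (A and B) or (A and carry) or (B and carry);
      end if;
    end if;
  end process;
end RTL;
```

A parallel adder spends one full adder per bit and finishes in a single cycle; this version trades that speed for area.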

Final question
Which of these two implementations is optimal?
[Figure: two implementations of the same data path, with each block annotated with its delay (T, 2T) and area (A, 3A).]
- First implementation: critical path = 3T, area = 8A
- Second implementation: critical path = 4T, area = 6A