EN2911X: Reconfigurable Computing Topic 01: Programmable Logic Prof. Sherief Reda School of Engineering, Brown University Fall 2012 1
FPGA architecture. [Figure labels: programmable interconnect, programmable logic blocks, programmable logic element; Maxfield 04] Objective of this lecture: study the organization of programmable logic blocks and interconnects. 2
Basic logic element (BLE) [Rose 04] [Maxfield 04]. How many bits does a K-input lookup table (LUT) hold? How many Boolean functions can a K-input LUT implement? What is the best LUT size? 3
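The first two questions on this slide have closed-form answers: a K-input LUT stores one bit per input combination, so it holds 2^K bits, and since each bit can be set independently it can implement 2^(2^K) Boolean functions. A minimal Python sketch of that arithmetic (the function names are illustrative, not from the lecture):

```python
def lut_config_bits(k: int) -> int:
    """Configuration bits in a K-input LUT: one bit per input combination."""
    return 2 ** k


def lut_function_count(k: int) -> int:
    """Distinct Boolean functions of K inputs: each stored bit is chosen freely."""
    return 2 ** lut_config_bits(k)


# Print the growth for a few LUT sizes.
for k in range(2, 7):
    print(f"K={k}: {lut_config_bits(k)} bits, {lut_function_count(k)} functions")
```

The third question (the best LUT size) is an empirical trade-off between area and delay and is not answered by this arithmetic.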
A closer look at the BLE 4
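Example [from J. Zambreno]: $F = A_0 A_1 A_3 + A_1 A_2 \bar{A}_3 + \bar{A}_0 \bar{A}_1 \bar{A}_2$. Storage cost by LUT size: 4-input LUT, 16 bits (1 LUT × 16 bits); 3-input LUTs, 24 bits (3 LUTs × 8 bits); 2-input LUTs, 28 bits (7 LUTs × 4 bits). 5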
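As a sketch of the single 4-input LUT case, the 16 stored bits are just the truth table of F. The code below assumes F = A0·A1·A3 + A1·A2·A3' + A0'·A1'·A2' (a prime denotes complement) and an arbitrary bit ordering in which the LUT address is A0 + 2·A1 + 4·A2 + 8·A3; both the helper names and the ordering are assumptions for illustration.

```python
def f(a0: int, a1: int, a2: int, a3: int) -> int:
    """The example function, written directly from its sum-of-products form."""
    return (a0 & a1 & a3) | (a1 & a2 & (1 - a3)) | ((1 - a0) & (1 - a1) & (1 - a2))


def build_lut(func, k: int) -> list:
    """Fill a 2**k-entry table; bit i of the address is input i (assumed ordering)."""
    table = []
    for address in range(2 ** k):
        inputs = [(address >> i) & 1 for i in range(k)]
        table.append(func(*inputs))
    return table


def lut_read(table: list, *inputs: int) -> int:
    """Evaluate the LUT: the inputs form the address of the stored bit."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


LUT_F = build_lut(f, 4)
print(len(LUT_F), "configuration bits:", LUT_F)
```

Once the table is filled, evaluating F is a single memory read regardless of how complex the original expression was.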
Logic block clusters (logic array block, LAB; configurable logic block, CLB). Assume a K-input LUT in each BLE, N BLEs per logic cluster, and I inputs to the cluster. The BLEs in each logic cluster are fully connected or nearly fully connected. What are the best values for I, K, and N? [Betz-Rose 97] 6
To be implemented in FPGAs, designs need to be decomposed and mapped to logic blocks (LBs): each piece of logic maps to a LUT in a CLB. [Figure from Cong FPGA 01] 7
Heterogeneous reconfigurable logic. A reconfigurable fabric might contain non-reconfigurable elements that interface to the logic blocks through the programmable interconnect fabric. Examples: embedded memory; embedded multipliers, adders, and MACs; embedded processors. 8
Embedded memory blocks. Memory is costly to implement with configurable logic blocks, so FPGAs add hard blocks of RAM. Position and size vary depending on the FPGA device; capacity ranges from a few thousand to tens of thousands of bits per RAM block [Maxfield 04]. Each block can be used independently or combined with others to form larger RAM blocks. Blocks can be single-port or dual-port RAMs. 9
Embedded multipliers and adders. Multipliers are inherently slow if implemented by connecting a large number of programmable logic blocks, so FPGAs add hard-wired multiplier blocks. These are typically located close to the embedded RAM blocks. Some FPGAs use multiply-and-accumulate (MAC) blocks, which are useful in DSP applications. 10
Programmable routing. Wires provide the communication fabric that routes the output of one computational node to the inputs of another. Why is routing so crucial? Routing resources occupy a larger area than logic resources in an FPGA. Wire delay grows quadratically with wire length. Technology scaling reduces device delay but increases wire delay. 11
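The quadratic growth of wire delay can be illustrated with the standard Elmore approximation for an unbuffered, distributed RC wire, delay ≈ 0.5·r·c·L², where r and c are resistance and capacitance per unit length. The sketch below uses made-up per-millimeter values purely for illustration; they are assumptions, not figures from the lecture.

```python
def wire_delay(length_mm: float,
               r_per_mm: float = 100.0,      # ohms per mm (assumed value)
               c_per_mm: float = 0.2e-12     # farads per mm (assumed value)
               ) -> float:
    """Elmore delay in seconds of an unbuffered distributed RC wire."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2


# Doubling the length quadruples the delay, whatever r and c are.
for length in (1, 2, 4, 8):
    print(f"{length} mm: {wire_delay(length) * 1e12:.1f} ps")
```

This is why long connections are usually broken up by buffers or switches: resistance and capacitance both scale with length, so their product scales with its square.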
General routing definitions. [Figure: four CLBs with a track, a channel, and a segment labeled] A wire segment is a wire unbroken by programmable switches. A track is a sequence of one or more wire segments in a line; the segments can be connected by switches at their ends. A routing channel is a group of parallel tracks. The channel width is the number of tracks in the channel. 12
Connection blocks: formed where CLB input or output pins connect to the routing channels. Life would be easy if only logic blocks within the same column or row needed to communicate! 13
Segment-to-segment switch design for bidirectional wires. [Figure: four CLBs with a track, a channel, and a segment labeled; Lemieux 04] 14
Switch blocks: formed wherever horizontal and vertical channels intersect. [Figure: switch box] Switch box size grows quadratically with the number of its input wires. 15
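One way to see the quadratic growth is to assume a fully populated switch box in which every pair of incident wires gets its own programmable switch. That is an assumption made here for illustration (real switch boxes are usually sparser), and the function name is hypothetical.

```python
def full_switch_count(num_wires: int) -> int:
    """Switches needed so that every pair of wires can be connected directly."""
    return num_wires * (num_wires - 1) // 2


# Each doubling of the wire count roughly quadruples the switch count.
for wires in (4, 8, 16, 32):
    print(f"{wires} wires -> {full_switch_count(wires)} switches")
```

Under this assumption the count is n(n-1)/2, which is why practical designs restrict which tracks can connect to which.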
Bidirectional switch details [Lemieux 04, Tessier] 16
Segmented and hierarchical routing. Segmented routing: short wires accommodate local traffic; short wires can be connected together using switch boxes to emulate longer wires; the fabric also contains long wires to allow efficient communication without passing through switches. Hierarchical routing: routing within a group of logic blocks occurs at the local level; longer hierarchical wires connect different groups. 17
Programming the FPGA. [Figure labels: configuration data in, configuration data out; legend: I/O pin/pad, SRAM cell] The configuration memory determines how the logic blocks and interconnects are programmed. 18
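The "configuration data in" and "configuration data out" labels suggest that the configuration cells can be loaded serially, like one long shift register. The sketch below models that reading; the class and method names are hypothetical, and real devices offer several configuration schemes that this does not attempt to capture.

```python
class ConfigChain:
    """Configuration cells modeled as a serial shift register (assumed scheme)."""

    def __init__(self, length: int):
        self.cells = [0] * length

    def shift_in(self, bit: int) -> int:
        """Shift one bit in at the head; return the bit pushed out at the tail."""
        out = self.cells[-1]
        self.cells = [bit] + self.cells[:-1]
        return out

    def load(self, bitstream: list) -> list:
        """Shift a whole bitstream in; return the bits that came out."""
        return [self.shift_in(bit) for bit in bitstream]


chain = ConfigChain(4)
chain.load([1, 0, 1, 1])
print(chain.cells)  # the first bit shifted in ends up farthest down the chain
```

Shifting a second bitstream in pushes the previous configuration out of the tail, which is how a "data out" pin can be used to read back or daisy-chain configuration data.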
Programmable switch technology
Anti-fuse: switch is OFF by default; when programmed it is ON. Advantages: negligible delay; small area overhead. Disadvantages: not really reconfigurable; one-time programmable.
Flash: switch is ON by default; when programmed it is OFF. Advantages: programming is not lost when the device is turned off. Disadvantages: requires more manufacturing steps.
SRAM: an SRAM bit cell stores the programming of the device. Advantages: can be reconfigured quickly and as often as required; no special fabrication steps. Disadvantages: takes more area; loses its contents when the device is turned off. 19
Case study: Altera's Cyclone II device. Two-dimensional array of Logic Array Blocks (LABs), with 16 Logic Elements (LEs) in each LAB. Embedded memory blocks (M4K) and multipliers (18x18). PLLs (phase-locked loops) are used to generate clock signals over a range of frequencies. The EP2C35 (on the DE2 board) has 60 columns and 45 rows for a total of 33216 LEs, 105 M4K blocks, and 35 embedded multipliers. 20
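The figures on this slide can be cross-checked with a little arithmetic: 33216 LEs at 16 LEs per LAB gives the number of LABs, which is smaller than the 60 × 45 grid. A quick Python check follows; interpreting the leftover grid sites as positions taken by non-LAB resources such as memory and multiplier columns is an assumption, not something the slide states.

```python
LES_TOTAL = 33216          # from the slide
LES_PER_LAB = 16           # from the slide
GRID_COLUMNS, GRID_ROWS = 60, 45

lab_count, leftover_les = divmod(LES_TOTAL, LES_PER_LAB)
grid_sites = GRID_COLUMNS * GRID_ROWS
non_lab_sites = grid_sites - lab_count  # assumed: sites used by other resources

print(f"LABs: {lab_count} (leftover LEs: {leftover_les})")
print(f"Grid sites: {grid_sites}, sites not occupied by LABs: {non_lab_sites}")
```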
Logic element organization (normal mode). The LE has two operating modes: normal and arithmetic. Normal mode is suitable for general logic implementation. [Figure labels: 4-input LUT; 6 input connections; 3 output connections; LAB-wide synchronous/asynchronous clear and load signals; clock signal] 21
Logic element organization (arithmetic mode). Arithmetic mode is suitable for implementing adders, counters, accumulators, and comparators. The LUT is split into two 3-input LUTs (ideal for implementing 2-bit full adders) with a basic carry chain. 22
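The slide does not say how the two 3-input LUTs are used. One common arrangement, assumed here for illustration, has one LUT produce the sum bit and the other the carry-out, with the carry passed along the chain to the next LE. The LUT contents, the address ordering (a + 2·b + 4·carry_in), and the function names below are all part of that assumption.

```python
# 8-entry truth tables; address = a | (b << 1) | (carry_in << 2) (assumed ordering).
SUM_LUT = [0, 1, 1, 0, 1, 0, 0, 1]    # a XOR b XOR carry_in
CARRY_LUT = [0, 0, 0, 1, 0, 1, 1, 1]  # majority(a, b, carry_in)


def lut3(table: list, a: int, b: int, c: int) -> int:
    """Read a 3-input LUT."""
    return table[a | (b << 1) | (c << 2)]


def ripple_add(x: int, y: int, width: int) -> tuple:
    """Add two width-bit numbers one bit per stage; return (sum, carry_out)."""
    carry = 0
    total = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        total |= lut3(SUM_LUT, a, b, carry) << i
        carry = lut3(CARRY_LUT, a, b, carry)  # travels on the carry chain
    return total, carry


print(ripple_add(5, 3, 4))
```

The dedicated carry chain matters because the carry signal would otherwise have to cross the general routing fabric between every pair of bits.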
Logic array block organization. Each LAB consists of the following: 16 LEs, LAB control signals, LE carry chains, register chains, and a local interconnect. The local interconnect transfers signals between LEs in the same LAB and is driven by column and row interconnects and by LE outputs within the same LAB. Neighboring LABs, PLLs, M4K RAM blocks, and multipliers from the left and right can also drive a LAB's local interconnect. Each LE can drive 48 LEs through fast local and direct-link interconnects. 23
Register/carry chain connections within a LAB 24
Multi-track interconnects. The multi-track interconnect consists of row interconnects (direct link, R4, R24) and column interconnects (register chain, C4, C16). R4/C4 interconnects span 4 blocks (right or left / up or down). R24/C16 interconnects span 24/16 blocks and connect to R4/C4 interconnects. R4/C4 interconnects can drive each other to extend their range. 25
C4 interconnections. C4 interconnects drive local and R4 interconnects for up to 4 rows. C16 column interconnects span 16 LABs and provide long column connections. C16 column interconnects drive LAB local interconnects indirectly, via C4 and R4 interconnects. 26
Embedded RAMs and multipliers
M4K RAM blocks: 4608 RAM bits (with or without parity); 250 MHz performance; either single-port or dual-port memory; can also be configured as a FIFO.
Embedded multipliers: ideal for DSP applications; 250 MHz performance; configured as either one 18-bit multiplier or two independent 9-bit multipliers. 27
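Combining these per-block figures with the EP2C35 block counts from the case-study slide (105 M4K blocks, 35 embedded multipliers) gives the device totals. A small Python check of that arithmetic, with variable names chosen for illustration:

```python
M4K_BLOCKS = 105           # EP2C35, from the case-study slide
BITS_PER_M4K = 4608        # RAM bits per block, from this slide
MULTIPLIER_BLOCKS = 35     # EP2C35, from the case-study slide

total_ram_bits = M4K_BLOCKS * BITS_PER_M4K
max_18bit_multipliers = MULTIPLIER_BLOCKS       # one 18-bit multiplier per block
max_9bit_multipliers = MULTIPLIER_BLOCKS * 2    # or two 9-bit multipliers per block

print(f"Total embedded RAM: {total_ram_bits} bits")
print(f"Multipliers: {max_18bit_multipliers} x 18-bit or {max_9bit_multipliers} x 9-bit")
```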
IO Element (IOE) structure (allows bidirectional signals). 5 IOEs per row I/O block. Row I/O blocks drive C4, R4, R24, or direct-link interconnects. Column I/O blocks drive C4 and C16 interconnects. 28