Logic Emulation with Virtual Wires

Size: px

Start display at page:

Download "Logic Emulation with Virtual Wires"

Erika Simon
6 years ago
Views:

1 Page 1 Logic Emulation with Virtual Wires Jonathan Babb, Russell Tessier, Matthew Dahl, Silvina Hanono, David Hoki, and Anant Agarwal MIT Laboratory for Computer Science Cambridge, MA Abstract Logic emulation enables designers to functionally verify complex integrated circuits prior to chip fabrication. However, traditional FPGA-based logic emulators have poor inter-chip communication bandwidth, commonly limiting gate utilization to less than 20 percent. Global routing contention mandates the use of expensive crossbar and PC-board technology in a system of otherwise low-cost, commodity parts. Even with crossbar technology, current emulators only use a fraction of potential communication bandwidth because they dedicate each FPGA pin (physical wire) to a single emulated signal (logical wire). Virtual Wires overcome pin limitations by intelligently multiplexing each physical wire among multiple logical wires and pipelining these connections at the maximum clocking frequency of the FPGA. The resulting increase in bandwidth allows effective use of low dimension, direct interconnect. The size of the FPGA array can be decreased as well, resulting in low cost logic emulation. This paper covers major contributions of the MIT Virtual Wires project. In the context of a complete emulation system, we analyze phase-based static scheduling and routing algorithms, present Virtual Wires synthesis methodologies, and overview an operational prototype with 20Kgate boards. Results, including in-circuit emulation of a SPARC microprocessor, indicate that Virtual Wires eliminate the need for expensive crossbar technology while increasing FPGA utilization beyond 45 percent. Theoretical analysis predicts that Virtual Wires emulation scales with FPGA size and average routing distance, while traditional emulation does not. 1 Introduction Field Programmable Gate Array (FPGA) based logic emulators are capable of emulating complex logic designs at clock speeds four to six orders of magnitude faster than software simulators. This performance is achieved by partitioning a logic design, described by a netlist, across an interconnected Minute Hour Day Week Month Compilation Time Logic Simulation Accelerated Simulation Year Month Week Day Hour Logic Emulation Figure 1: Verification Alternatives Final Silicon Execution Time Minute array of FPGAs. The netlist partition on each FPGA, configured directly into logic circuitry, is then executed at near hardware speeds. Figure 1 compares logic emulation to other prototyping methods, including simulation and accelerated simulation, as well as to final silicon. The y-axis measures relative time for compiling or constructing a hypothetical design, while the x-axis measures relative time for executing one set of test vectors on this design. As an example, consider final silicon which takes months to construct and runs a set of vectors in less than one minute. The same design and vector set could be compiled for a logic simulator on the order of minutes, but would take years to execute. Logic emulation fills a wide gap between simulation and actual silicon. With both a moderately fast compile time and a fast execution time, emulation offers a compromise between the programmability of software and the fast execution speed of hardware. Logic emulators are further characterized by interconnection topology, target FPGA, and supporting software. The interconnection topology describes the arrangement of FPGA devices and routing resources. Example interconnects include full crossbars and two-dimensional meshes. Impor-

2 Page 2 Not Limited unused FPGA pins unused FPGA gates Pin Limited no unused pins some unused gates Gate Limited some unused pins no unused gates Balanced no unused pins no unused gates Figure 2: Partition Limitation Scenarios tant target FPGA properties include gate count, pin count, and mapping efficiency. Supporting software is extensive, combining netlist translators, logic optimizers, technology mappers, global and FPGA-specific partitioners, placers, and routers. Traditional emulators are gate inefficient due to inherent pin limitations in the FPGA devices. To reduce pin limitations, these emulators supplement FPGAs with custom crossbars chips and expensive PC-board and backplane technology, further increasing the per-gate cost of emulation. This paper suggests an alternative solution to pin limitations based on multiplexing of FPGA resources. 1.1 Virtual Wires In existing emulator architectures, both the logic configuration and the network connectivity remain fixed for the duration of the emulation. Every emulated partition of the input design, one per FPGA, consists of a set of gates and a set of signals communicating to other partitions. Each emulated gate is mapped to one or more FPGA equivalent gates and each inter-partition emulated signal is allocated to a pair of pins between two FPGAs. Thus for a partition to be feasible, the partition gate and pin requirements must be no greater that the available FPGA resources. These constraints yield four possible scenarios (Figure 2). When typical circuits are mapped onto available FPGA devices, partitions are predominately pin limited. That is, all available FPGA gates cannot be utilized due to lack of pin resources to support them. We demonstrate this resulting bandwidth gap with a set of partitionings of the Sparcle and CMMU benchmarks (see Section 5.1) for various gate counts. Figure 3 shows the resulting curves, plotted on a log-log scale. Partition gate count is scaled by a factor of two to get FPGA equivalent gates with an assumed mapping efficiency of 50%. On the same curve we plot the pin and gate capacity of target FPGAs: the Xilinx 3000 and 4000 series [40], the Altera Flex 8000 series [3], and the Atmel 6000 series [5]. For equal average gate counts in the benchmark partitions and FPGA devices, the required average pin counts for partitions are much greater than the available pin capacity of the FPGAs. FPGA / Partiton Pin Count Alewife Cache Controller partitions Sparcle partitions Xilinx 3000 & 4000 FPGAs Xilinx 4000H FPGAs Atmel FPGAs Altera FPGAs Altera MCM Bandwidth Gap FPGA / Partition Gate Count Figure 3: Pin Count as a Function of FPGA Partition Size Pin limits set a hard upper bound on the maximum usable gate count any FPGA gate count can provide. Low utilization of gate resources increases both the number of FPGAs needed for emulation and the time required to emulate a particular design. This discrepancy will only get worse as technology scales; current trends indicate that available gate counts are increasing faster than available pin counts. Future breakthroughs in area I/O [27] may partially address this problem for FPGA packaging, but will leave open the more difficult issues of inter-board and system-level communication. Additionally, any new technology will be challenged to keep up as minimum feature size decreases faster than required bonding area. Virtual Wires eliminate the pin limitation problem of previous emulators by intelligently multiplexing each physical wire among multiple logical wires and pipelining these connections at the maximum clocking frequency of the FPGA. 1 A Virtual Wire represents a simple connection between a logical output on one FPGA and a logical input on another FPGA. Established via a pipelined, statically-routed communication network, these Virtual Wires increase available off-chip communication bandwidth by multiplexing the use of FPGA pin resources (physical wires) among multiple emulation signals (logical wires). Without Virtual Wires, one to one allocation of logical wires to physical wires does not exploit available pin bandwidth because: emulation clock frequencies are one or two orders of magnitude lower than the potential FPGA frequency; all logical wires are not active simultaneously. 1 Although this paper focuses on logic emulation, Virtual Wires can be applied to any multi-chip system.

3 Page 3 FPGA #1 FPGA #2 FPGA #1 FPGA #2 Logical Outputs Logical Outputs Logical Inputs Logical Outputs Logical Inputs Shift Loops Mux Logical Inputs Physical Wire Physical Wire Figure 4: Hard Wire Interconnect Figure 5: Virtual Wire Interconnect However, by clocking physical wires at the maximum frequency of the FPGA technology, several logical connections can share the same physical resource. Figure 4 shows an example of six logical wires allocated to six physical wires. Figure 5 shows the same example with the six logical wires sharing a single physical wire. The physical wire is multiplexed between two pipelined shift loops (Section 3). Each register in the pipeline carries a single bit of information from one logical output to the corresponding logical input in the neighboring FPGA. Systems based on Virtual Wires exploit several properties of digital circuits to boost bandwidth from available pins. In a logic design, evaluation flows from system inputs to system outputs. In a synchronous design with no combinational loops, this flow can be represented as a directed acyclic graph. Thus, through analysis of the underlying logic circuit, logical values between circuit partitions only need to be transmitted once. Furthermore, since circuit communication is inherently static, communication patterns will repeat in a predictable fashion. By exploiting this predictability, communications can be scheduled to increase pin utilization. 1.2 Emulation Software Software for logic emulation with Virtual Wires roughly follows the standard emulation tool flow (Figure 6). The input, a netlist of the logic design to be emulated, is transformed into a multi-fpga configuration bitstream to be downloaded onto the emulator. Not shown are the technology libraries, target FPGA characteristics, and FPGA interconnect topology needed to make the correct transformations. We next describe the standard steps. Translator: The input netlist to be emulated is typically generated with a hardware description language or a schematic capture program. The netlist must be syntactically translated into a format readable by the emulation software. Commercial and public domain tools are available for generic source-to-source translation. At MIT we used both Verilog and LSI logic formats. Tech Mapper: The translated netlist is still specified in terms of the source technology library for example LSI Logic s LCA100K technology [26]. Before emulation, the netlist must be mapped to a target library of FPGA primitives. Although commercial and public domain tools are also available for mapping, our simple and fast technique is to create a mapping library which describes each source primitive in terms of primitives in the target library. The inefficiency of this mapping can be largely recovered with a following logic optimization pass. Partitioner: After mapping the netlist to the target technology, the netlist is divided into partitions, each of which can fit into a single target FPGA. Without Virtual Wires, each partition must have both fewer gates and fewer pins than the target device. With Virtual Wires, the total gate count, including the overhead of Virtual Wires multiplexing logic, must be no greater than the target FPGA gate count. In the MIT implementation, we used the InCA Concept Silicon partitioner [19]. This partitioner performs K-way partitioning with min-cut and clustering techniques. Logic Netlist Translator Tech Mapper Partitioner Global Placer Global Router FPGA APR FPGA Configuration Bitstream VW Scheduler VW Synthesizer Figure 6: Emulation Software Flow

4 Page 4 FPGA Host Workstation Emulation System Target System Figure 7: Virtual Wires Emulation System Global Placer: Individual circuit partitions must be placed into specific FPGAs. An ideal placement minimizes system communication, thus requiring less routing resources. We wrote a simple placer based on simulated annealing [21] to minimize total Manhattan wire length. Global Router: In traditional emulation, inter-fpga communication is established with a global routing phase. If crossbars are employed, this phase must also determine the routing configuration for each crossbar as well as pinassignments of partition I/Os to FPGA pins. For Virtual Wires emulation, there are no direct physical connections between partitions, and this phase is completely replaced with new virtualization software to be described in this paper. FPGA APR: Once routing is complete, there is one netlist for each FPGA. Each netlist must be processed with FPGA specific automated place-and-route (APR) software to produce configuration bitstreams. We used the XACT [40] software for Xilinx FPGAs. With Virtual Wires, we replace the global router of traditional software with modules created to specifically support automatic pin multiplexing: the Virtual Wires Scheduler and the Virtual Wires Synthesizer (Figure 6). Together we refer to the transformation performed by these two components as virtualization. Although each emulation step is an intriguing aspect of CAD research, this paper focuses on these novel virtualization components, described below. Virtual Wires Scheduler: The resulting set of netlist partitions mapped to each FPGA, in conjunction with the routing resource constraints of the emulation system, is used to determine an appropriate schedule of logical wires onto physical wires. This schedule establishes a feasible timespace route for every logical wire, while guaranteeing that all multi-fpga combinational paths are correctly ordered. Schedule optimizations include minimizing the total time needed to execute the circuit, as well as minimizing the Virtual Wires logic overhead. While this scheduling problem is similar to those encountered in high level synthesis, it is complicated by inter-fpga routing constraints and the need to account for multiplexing overheads. In Section 2, we describe the phase-based scheduling algorithm implemented at MIT. Virtual Wires Synthesizer: This step implements the chosen routing schedule by synthesizing special multiplexers and registers that are added to the circuit partition in each FPGA. This logic is effectively a pipelined, statically routed network in the FPGA technology itself. For maximum efficiency, the synthesizer takes into account the underlying idiosyncrasies of the target FPGA technology. For example, FPGA pin assignment and allocation of internal tri-state buses are carefully optimized. The resulting synthesized architectures provide insight into Virtual Wires implementation. Section 3 compares three different architectures for the Xilinx 4000 series. 1.3 Low Cost Emulation System Although virtualization can be used to map input designs to any FPGA-based logic emulator, the process is most valuable when enabling the use of inexpensive, direct interconnect and cheap, low pin count FPGAs. To demonstrate this advantage, we have constructed FPGA boards composed of sixteen mesh-connected FPGAs and commodity SRAMs. These boards may themselves be mesh-connected, leading to straightforward software mapping and simplified system scalability. This system (Figure 7), described in Section 4 has demonstrated the following functionality: In-circuit emulation: FPGA array mimics one or more components of the target system and is pod-connected to the chip sockets of those missing components. Simulation acceleration: FPGA array replaces a piece of a simulation model and connects to the software simulation environment by remote calls through the host interface. Hardware subroutines: FPGA array implements a Verilog version of a subroutine in a C program and connects to the software by remote calls through the host interface [9].

5 Page 5 Section 5 describes our results for both in-circuit emulation and simulation acceleration of the Sparcle benchmark on our system, including booting a multiprocessor operating system. We leave the exploration of hardware subroutines to future reports. 1.4 Scalable Technology Not only can Virtual Wires be used to compose low-cost systems of gigantic numbers of FPGAs, but this technology also scales as FPGA sizes increase. To demonstrate this scalability, Section 6 uses Rent s Rule to derive theoretical models of emulation gate overheads for systems with and without Virtual Wires. This model accounts for the mismatch between circuit communication and FPGA communication in the hard-wired case and includes a topological factor that explains why a mesh topology does not scale without Virtual Wires. With this model we show how the derived Virtual Wire utilization scales with increasing FPGA device size and average routing distance, while hard-wired utilization may not. 1.5 Overview The rest of this paper is organized as follows: Section 2 describes the Virtual Wires scheduling and routing algorithms. Section 3 then compares three Virtual Wires synthesis architectures. After Section 4 describes our demonstration hardware system, Section 5 then present results for both simulation acceleration and in-circuit emulation on this system. Section 6 analyzes the overhead and scalability of Virtual Wires versus hard-wires. Finally, Section 7 describes related work in the field and Section 8 makes concluding remarks. 2 Scheduling Algorithms Virtualization replaces the inter-fpga routing steps of traditional emulation with software that synthesizes a routing network into the netlist partition on each FPGA. This network establishes global routes via statically scheduled bits rather than hard-wired interconnections. The first phase of this approach is a scheduling and routing algorithm. Our phase-based methodology suffices to prove the concept of Virtual Wires scheduling and is within a factor of two of more optimal algorithms presented in recent literature [32]. Before describing the scheduling algorithms, let us first introduce the basic operating principles of Virtual Wires. 2.1 Phase-Based Operating Principles Emulation Clock phase 1 phase 2 phase 3 phase 4 Evaluation microcycle Communication Figure 8: Clocking Framework uclk uenable this period into a number of microcycles determined by a free-running CLK (Figure 8). In this scheme, a microcycle is the shortest distinguishable unit of time. All routing is scheduled in discrete microcycle increments. These microcycles are grouped into sequential phases to support combinational paths that extend across multiple chips. The advantage of this approach is a decoupling of logic execution speed from inter-chip communication speed, allowing highspeed communication cycles to co-exist with a long-latency emulation clock period. A Enable signal divides each phase into an evaluation time span and a communication time span. Within a phase, a given number of microcycles are dedicated to the evaluation of the FPGA logic, followed by a set of cycles to communicate the results to other partitions in destination FPGAs. Evaluation takes place at the beginning of a phase, with logical inputs being propagated through each circuit partition to determine logical outputs for that phase. Not all inputs are available at the beginning of each phase, and not all outputs are produced. For inputs which are available, all logic is evaluated and subsequent outputs are produced. Each input and output transmission will be assigned to a single phase such that signal precedences are observed. At the end of the phase, the produced outputs are communicated to other circuit partitions at the microcycle clock rate. All necessary phases must be executed by the end of the emulation clock period. For simplicity, we limited our approach to synchronous logic with a single global emulation clock. Any asynchronous signals cannot be statically-routed and therefore must be hard-wired to dedicated FPGA pins. Virtual Wires can be extended to multiple clocks [32] and gated clocks, as well as certain types of asynchronous logic, such as multiple asynchronous clock domains. The emulation clock period is the clock period of the logic design being emulated. To facilitate multiplexing we break

6 Page 6 In1 In2 AND OR Out1 External Inputs W3 Partition1 W1 W2 W5 Partition2 W4 Partition3 W6 External Outputs In3 AND Out2 Interpartition Wire Clk D REG CLK Out3 Dependence Relation Figure 10: Depth Calculation Example Figure 9: Dependence Calculation Example 2.2 Definition of Dependence and Depth Two timing analysis computations, dependence and depth, aid in Virtual Wires scheduling. Both dependence and depth apply to inter-partition wires. To analyze input to output Dependence, we scan the logic in each partition to determine the set of outputs to which a combinational path exists from each input. An output is said to be a dependent (or a child) of an input if a change in that input can combinationally change the output. The dependence relationships between inputs and outputs for a given partition are derived recursively from those of its constituent logic elements. In determining dependence, we assume that all outputs of a combinational library primitive are dependents of all the inputs of that primitive. Similarly, no outputs are dependents of any of the inputs for sequential primitives. Let Depend[i] denote the set of outputs of a given partition that are dependents of an input of the same partition connected to an inter-partition wire i. Similarly, let D ;1 [i] represent the set of inputs that are ancestors to an output driving an inter-partition wire i. By our definition, inputs to storage elements and external outputs will have no dependents: Depend[i] =, and outputs of storage elements as well as external inputs will have no ancestors: D ;1 [i] =. Figure 9 shows an example circuit partition containing four interconnected primitive logic elements with three inputs (not including the clock) and three outputs. The dependence relationships for this partition are as follows: Depend[In1] =fout1g, Depend[In2] =fout1out2g, Depend[In3] =fout1out2g. Likewise, the ancestors are as follows: D ;1 [Out1] =fin1in2in3g, D ;1 [Out2] =fin2in3g, D ;1 [Out3] = (REG is a storage element). The depth calculations use the dependence relationships. The depth of inter-partition wire i is the largest number of partitions in a forward combinational path starting at that wire. Depth is computed recursively from the wire dependence sets such that for each wire i: ( 0 ifdepend[i] = Depth[i] = 1 + max Depth[j] j2depend[i] otherwise (1) Figure 10 shows an example design with three partitions and six inter-partition wires. The dashed-lines denote inputoutput dependence relationships. In this example, wires are at the following depths: depth 0: W4, W6 depth 1: W2,W3,W5 depth 2: W1 Our phase assignment algorithm uses depth to prioritize routing of critical paths. During scheduling, although both W1 and W2 have no ancestors, W1 has a greater depth and will be given priority. 2.3 Phase Assignment Algorithm The goal of the phase assignment algorithm is to determine an appropriate schedule of logic wires between design partitions onto physical wires between FPGAs. The resulting schedule must establish a feasible time-space route for every logical wire, while observing FPGA routing resources constraints and guaranteeing that all multi-fpga combinational paths are correctly ordered. The core scheduling algorithm consists of a shortest path router inside a greedy phase assignment loop. Within the

7 Page 7 Given: I : set of inter-partition design wires ComCycles : communication cycles per phase c : total micro-cycles per phase Produce: S : output schedule which assigns each wire i 2 I to a shiftgroup in a particular phase Procedure PhaseAssign call Depend CalcDependents(I) call Depth CalcDepth(Depend) initialize Done array to false for each wire i 2 I for each dependent wire j 2 Depend[i] DependCount[j] DependCount[j] + 1 n 0 /*phase counter*/ loop forever call RouteInit W wires with Done[i]=false and DependCount[i]=0 if W is empty exit loop n n+1 sort W by Depth[i], greatest depth first for each wire i 2 W src FPGA partition where source of i is placed dest FPGA partition where dest of i is placed path Route(src,dest) if path exists then distance = length(path) - 1 maxsignals ComCycles - distance L i, and up to maxsignals additional wires from W that have the same src and dest as i for each j 2 L delete j from W Done[j] true for each k 2 Depend[j] DependCount[k] DependCount[k] - 1 shiftgroup fn, path, L g add shiftgroup to schedule S endif endloop save n, c in schedule S /*final phase and cycle count*/ end Procedure main loop of the phase assignment algorithm, Figure 11, as many wires as possible are scheduled and routed. Once no more wires are available to schedule, the algorithm advances to the next phase. All unscheduled wires are thus pushed to the following phase when either their ancestors have not been scheduled, or there is no remaining routing path available in that phase. The phases are processed sequentially and no attempt is made to go back and optimize previously scheduled phases. Given enough phases and at least one potential path between all pairs of FPGAs, any design can be scheduled. This is easier than hard-wired routing problems, in which various rip-up and re-try strategies may be needed to find a feasible route. The algorithm starts by first calculating the dependence and depth arrays for all wires as described in the previous section. An additional array, Done[i], is set to false to mark each wire i as unscheduled. The algorithm then initializes a DependCount[i] array from the dependence information of each wire. When this counter reaches zero, as the algorithm progresses, wire i will be ready to schedule. The algorithm proceeds by assigning wires to phases until all wires have been scheduled. Advancement to following phases occurs when no wires can be scheduled in the current phase. Within each phase, ready wires with the greatest depth are iterated first, guaranteeing that critical paths are given priority. The routing algorithm is successively called to route as many ready wires as possible. Once a successful route is established from a source to destination FPGA, as many as ComCycles ; distance additional ready wires between the same source and destination are formed into a shiftgroup, where ComCyclesis the number of communication cycles in a phase and distance is the number of FPGA crossings in the routing path. For example, if their are eight cycles per phase and the distance is three, a total of five wires can be routed in the same shiftgroup. The additional wires are also prioritized by depth. For each routed wire j in the shiftgroup, Done[j] is set, and the set of wires k 2 Depend[j] are iterated to decrement DependCount[k]. IfDependCount[k] =0, wire k can be scheduled in a following phase. Any ready signals that are not successfully routed in a phase are automatically delayed to following phases. As long as the delayed signals are not on the critical path, the total number of phases will not be affected. The ComCycles parameter specifies the number of microcyles to spend in communication during each phase. This number must be greater than the routing diameter of the topology to guarantee that all signals can route. For the results in Section 5, it turns out that eight communication cycles match the eight-way tri-state bussing of the Xilinx architecture. Figure 11: Phase Assignment Algorithm

8 Page Route Algorithm Given: T : emulator topology src : source FPGA in T dest : destination FPGA in T Produce: path : list of FPGAs along shortest route from src to dest Procedure RouteInit for each FPGA src 2 T for each FPGA dest 2 T Reserve[src,dest] connections in T from src to dest Avail[src,dest] (Reserve[src,dest] 6= 0) end Procedure Procedure Route(src, dest) path ShortestPath(src,dest,Avail) /* Dijkstra s algorithm */ if path exists then for each FPGA f 2 path if f = src then /* first FPGA in path */ src1 f else /* following FPGAs in path */ dest1 f Reserve[src1,dest1] Reserve[src1,dest1]-1 Avail[src1,dest1] (Reserve[src1,dest1] 6= 0) src1 f endif return path else return null path endif end Procedure Figure 12: Routing Algorithm The goal of the routing algorithm (Figure 12) is to find a shortest available path, in terms of FPGAs, between the source and destination FPGA of a set of inter-partition wires. The algorithm keeps track of the reserved and available physical connections between FPGAs in the emulator topology and is repeatedly called from the inner loop of the phase assignment algorithm. Route uses shortest path analysis with a cost function based on channel availability. Shortest path routing minimizes both the number of microcycles needed per phase and intermediate hop logic overhead. Before the beginning of each phase, a reservation matrix, Reserve[i j], is initialized to the number of physical connections between FPGAs i and j in the emulator topology. Route applies Dijkstra s shortest path algorithm [34] to channel availability, Avail[i j] =(Reserve[i j] 6= 0), to determine the shortest path between the source and destination FPGAs. If the shortest path exists, then the reservation matrix is updated by subtracting one from each element along that path and route returns with this path; else, route returns unsuccessfully. After each successful route, PhaseAssign forms a new shiftgroup data structure (Figure 13). This data structure includes the phase number, FPGA path, and set of logical wires in that group. This information is written to the schedule file, to be passed to the synthesis phase of virtualization. 2.5 Execution Speed Analysis Before proceeding, let us compute the execution speed of Virtual Wires emulation. Based on our phased operating principles, the emulation clock cycle time will be determined by the total number of microcycles needed: v = n c (2) where n is the number of phases and c the number of cycles per phase as previously defined. If c is the same across all phases, then we can immediately recognize that: n L c D (3) Shiftgroup f phase number source FPGA, 2nd FPGA, 3rd FPGA,..., dest. FPGA logical wire 1, logical wire 2,..., logical wire N g Figure 13: Shiftgroup Data Structure where D is the maximum distance of any shiftgroup route (in the worst-case D is the network diameter of the FPGA topology), and L is the length of the critical path in the design netlist, equivalent to the maximum depth. That is, there must be enough cycles in each phase to route a signal across the diameter of the network as well as enough total phases to schedule the longest combinational path between circuit partitions. Additionally, we recognize that the total number of microcycles is also constrained by the maximum multiplexing performed at each FPGA: v P C =P f (4)

9 Page 9 where P C is the maximum circuit communication requirement, including partition pins and any additional pins required for through hops, and P f is the pin count of each FPGA 2. Combining these two observations and assuming that the number of microcycles per phase is constant across all phases, we get the following best-case speed result: Best-case microcycles: The cumulative microcycle count for all phases within a scheduled emulation clock period is bounded below by the following equation: Design Partition Design Partition Input Registers Output Registers VW FSM Hops v max( latency bound z } { L D bandwidth bound z } { P C =P f ): (5) where L is critical path length, D is network diameter, P C is the maximum circuit partition pin count including through hops, and P f is the FPGA pin count. In our practical experience, design emulation speed is determined predominately by the latency bound. 2.6 Improvements We proposed the preceding algorithms to demonstrate the feasibility of Virtual Wires and for ease of implementation of the synthesis structures described in the following section. These algorithms can be improved by scheduling at the granularity of a single microcycle and eliminating the phase barriers altogether. The advantages of such improvements [32] include: possible initiation of computation and subsequent routing as early as one microcycle after a signal arrives at a destination rather than waiting for the following phase, potential overlapping of computation with communication in different parts of the system rather than execution in exclusive time spans, support of different propagation delays for individual wires rather than observing a worst-case delay for all computation in a phase, flexible scheduling of wires at the microcycle granularity rather than scheduling of dedicating pipeline paths per phase. This scheduling also eliminates costly pipeline filling overhead at the beginning and end of each phase. We continue by describing the synthesis architectures designed at MIT. These architectures implement the virtualized routing network produced by the phase assignment and routing procedures. 2 Note that we have ignored pipeline startup overhead associated with each shiftgroup. FPGA Figure 14: Virtualized FPGA Composition 3 Synthesis Architectures Although it would be possible to design an FPGA with multiplexed pins, we implemented Virtual Wires without custom hardware support. That is, the virtualization process synthesizes the required multiplexing components directly into the FPGA netlists, to be downloaded with the original design partitions. Thus, any existing FPGA-based logic emulation system can take advantage of Virtual Wires. After discussing the synthesis algorithms, this section proceeds to describe three of many possible synthesis architectures based on shift registers in Xilinx FPGAs. 3.1 Synthesis Algorithm The Virtual Wires Synthesizer flowchart component in Figure 6, takes the following input: emulator topology, external design I/O pin assignment, routing schedule, design partition netlists, and produces: one virtualized netlist for each FPGA. As shown in Figure 14, these new netlists contain the original design partition along with all necessary Virtual Wires communication logic. Synthesized logic includes input and output shift registers, through hops, and the VW-FSM finite state machine control logic. After reading in the appropriate inputs from previous compilation stages, the Synthesizer algorithm ( Figure 16) proceeds to synthesize the VW-FSM control logic for each FPGA. For the most part, this logic is identical for each

10 Page 10 u C L K u E n a b l e V W F S M C o n t r o l S i g n a l s Figure 15: VW-FSM Control Logic Given: T : emulator topology E : external I/O pin assignment S : routing schedule D : set of design partition netlists Produce: X : set of virtualized FPGA netlists Procedure Synthesize N number of phases in S C cycles per phase in S for each FPGA f 2 T given (N,C) synthesize control logic vwfsm[f] virtuallogic[f] vwfsm[f] for each phase P 2 S for each shiftgroup G 2 P R inter-fpga routing path for G L number of logical wires 2 shiftgroup G for each FPGA f 2 R: if f is first FPGA in R then logic synthesize output shifter of length L assign each logical output in G to logic assign physical FPGA output pin to logic else if f is an intermediate FPGA in R then logic synthesize intermediate hop assign physical FPGA I/O pins to logic else /* if f is last FPGA in R */ logic synthesize input shifter of length L assign each logical input in G to logic assign physical FPGA input pin to logic endif assign control nets for vwfsm[f] to logic virtuallogic[f] virtuallogic[f] + logic for each external I/O e 2 E assign e to its specified FPGA physical pin for each FPGA f 2 T partition[f] design partition in D placed on f X[f] virtuallogic[f] + partition[f] end Procedure Figure 16: Synthesis Algorithm FPGA, determined solely by the number of phases and microcycles per phases in the schedule. The VW-FSM (Figure 15) logic takes as input the CLK and the Enable signals, distributed to each FPGA, and generates the appropriate control signals during each microcycle. As described in Section 2, the CLK is the free running pipeline clock, while the Enable clock is synchronized to the emulation clock and determines when to start the communication sequence for each emulation phase. The output control signals are responsible for strobing logical wires into the appropriate registers and controlling multiplexer selection. The algorithm iterates through the shiftgroups in each phase to construct the input, intermediate hop, and output architectures. Each shiftgroup data structure contains the logical wires assigned to that group as well as the group s phase and FPGA path. As the architectures are synthesized, they are connected between the partition logical wires and FPGA physical wires as well as to the appropriate control signals. Not shown in the algorithm, the Synthesizer also makes low level implementation decisions at this time to optimize the use of limited FPGA resources, including tristate busses and combinational logic blocks. In addition, we have added a simple pin permutation algorithm which minimizes the use of on-chip routing resources for hops. The Synthesizer lastly assigns any external connections to corresponding periphery FPGA pins. Some of these pins connect to external interface hardware for communication with a logic simulator or other control programs. Additional pins provide global clocks and sequencing signals. The remaining pins may be connected to external pods to support in-circuit emulation. The accumulated logic synthesized for each FPGA is then merged with the original design partition for that FPGA and a final virtualized netlist file is output in FPGA format (XNF for Xilinx). These files are input to the FPGA-specific placeand-route stage which creates the emulator bitstream. 3.2 Shift Register Architectures We now compare three shift register architectures synthesizable to Xilinx 4000 FPGAs. Full Shift Register The full shift register architecture was originally proposed as a proof-of-concept Virtual Wires implementation [7]. This architecture consists of identical input and output shift loops (Figure 17). In output mode, shift loops load emulated signal states at the beginning of each phase and shift these states out serially onto a routed physical connection at the microcycle rate. For connections requiring multiple hops, a one-bit shift register is placed in each intermediate FPGA (Figure 18), forming a shift register pipeline between source and destination. At the end of the pipeline, corresponding

11 Page 11 Phase Enable Logical Outputs Shift/Rotate Physical Input Load uclk si ld D so si D so si D so si D ld ld ld so Physical Output Logical Inputs Figure 17: Full Shift Register Architecture Phase Enable Unused Pad Unused Pad Physical Input D Physical Output Shift/Rotate Physical Input si so si so si so uclk uclk Shift/Hold Figure 18: Intermediate Hop Architecture Logical Inputs input mode shift loops de-multiplex and latch the emulated signals and drive them into the emulated logic. Note that the input shift loops must store their state so that all emulated logic inputs are available for subsequent evaluation. Output logic, however, can be reused for multiple groups of emulation signals in different phases. To support per-phase routing, each inter-fpga I/O pad is preceded by a multiplexer that selects the appropriate shift loop output during its active phase. Pads are bidirectional with the pad driver enable signal asserted during phases in which that pad is an output. To minimize associated pad logic, the Synthesizer groups inputs and outputs separately when possible. Gated Shift Register To reduce the Virtual Wires consumption of core FPGA resources, the Synthesizer can utilize architecture-specific FPGA features. In low-cost, low-pincount FPGA parts, many of the I/O pads are not connected to pins, and the Synthesizer can concatenate their registers to form Virtual Wires shift registers (Figure 19). Due to pad configuration constraints, these shift registers cannot be parallel-loaded, so they are not usable for output shiftgroups. However, the Synthesizer can place input shiftgroups and intermediate hop shift registers here. Since input shiftgroups must hold the emulated signal state after receiving it, and these I/O registers do not have clock enables, the Synthesizer generates and distributes a gated CLK. During the portions of the Virtual Wires cycle in which the emulated logic is being evaluated, this clock is frozen. In addition, the length of the input shiftgroups is adjusted to divide evenly into the number of Figure 19: Gated Shift Registers Using Pad Registers CLKs between evaluation periods so that the state in these registers can recirculate without change. In the 84-pin PLCC Xilinx 4005, this approach recovered 102 input and hop shift register bits. However, clock gating and the slower timing of the I/O pad registers reduced the achievable CLK rate. Addressable Shift Register A further variation is to replace the output shiftgroup shift registers with tristate multiplexers available in the Xilinx architecture (Figure 20). The Synthesizer creates an additional set of global control signals, labeled cycle enables, to enable each bit of the multiplexer during the appropriate micro-clock tick of each Virtual Wires phase. The Synthesizer also replaces the input shift registers with sets of individual register bits whose clock enables are controlled by the phase signal as before, but whose clocks are successive cycle enables. This architecture considerably reduces the cost of the Virtual Wires shift loops in terms of logic resources, but the additional control signals and the use of the tri-state multiplexers add routing overhead. This overhead is reduced somewhat by placing many of the additional signals on global clock nets. Also, strategic use of the I/O pad registers for pipelining recovers speed. Finally, this architecture can support the more flexible Virtual Wires scheduling methods described in [32].

12 Page 12 Phase Enable Logical Outputs Cycle Enables Output Shiftgroup Cycle Enables Physical Input Phase Enable Physical Output Input Shiftgroup Figure 20: Addressable Shift Loop Architecture D ena D ena D ena D ena Logical Inputs Resource Design Total Logic Logic (Virtual Overhead) Fig 17 Fig 19 Fig 20 Packed CLBs (Total) (32%) (33%) (17%) Lookup Tables (Combinational) (30%) (31%) (12%) Registers (Sequential) (48%) (27%) (36%) Xilinx PIPs (Routing) (42%) (48%) (37%) Average Resource Usage per FPGA CLK Speed 33MHz 24MHz 25MHz Emulation Speed 1.2MHz.89MHz.93MHz Maximum Clock Speed Comparisons Table 1: Architectural Comparison Table 1 compares each architecture in implementing the smallest benchmark circuit, Palindrome (see Section 5.1), on the 16-FPGA demonstration hardware presented in Section 5. Speed is measured in terms of the CLK speed. We calculated overhead as a percentage of consumed resources taken up by Virtual Wires. This Virtual Wire resource consumption is computed by subtracting the emulation logic resource consumption from total resource consumption. We measured emulation logic resource consumption by compiling un-virtualized partitions onto high pin count FPGAs. CLB refers to the basic Xilinx Combinational Logic Block, which includes both combinational lookup tables and sequential registers. We list the Programmable Interconnect Points (PIPs) as reported by the Xilinx router. Note that the reported numbers are for hardware emulation and do not include any additional speed and resource overheads that may be attributed to simulation acceleration. The full shift register implementation is relatively fast but has significant overhead. The gated shift register architecture using the I/O pad registers is somewhat slower due to the reduced speed of these registers. This version does use fewer of the core registers, but routing overhead is higher because of the greater wiring distances covered between the pad registers and the core logic. Finally, the addressable scheme, has generally lower logic and routing overhead while maintaining moderate speed. The results in the remainder of this paper are based on the basic full shift register scheme, although we believe the addressable scheme to be the best of the three schemes because it can support more sophisticated scheduling algorithms as described in [32]. 4 Demonstration Hardware System Our demonstration hardware building block is a scalable emulation board [36] which is inexpensive to manufacture and easy to build (Figure 21). One or more boards are interfaced to a host workstation. Each board contains sixteen Xilinx FPGAs [40] interconnected in a two-dimensional nearest-neighbor mesh. The board is six layers, uses only through-hole devices, and is ten inches square in size. System size is scaled by attaching additional boards on any of the four sides of the current system boards, without the need for crossbars or esoteric backplane technology. On board SRAM supports emulation of large design memories. At the present time we use a SparcStation 10 as the host interface although the emulation board may be reconfigured to interface to virtually any host computer. Any emulation board can communicate with this host workstation through either a serial or S-bus communication port. These interfaces are used to both observe in-circuit emulation status and to provide circuit inputs and outputs for simulation acceleration. The following sections further detail the important features of the demonstration system. 4.1 FPGA Array To emphasize the utility of Virtual Wires for inter-chip communication, we used no expensive crossbar chips and only low pin-count FPGAs (84-pin PLCCs). These FPGAs may be clocked at speeds approaching 40MHz thus resulting in small inter-chip delays and high emulation throughput. Figure 22 shows the board schematic. Each FPGA communicates with its four nearest neighbors (logically North, South, East, and West) through eight bidirectional I/O signals. To minimize multi-fpga routing resources, these I/O signals are distributed along the chip package in an alternating pattern (Figure 23). Thus I/O port signals are physically allocated so that adjacent I/O pins are assigned to signals from differing ports. With this permutation a signal passing through the chip need only be routed the length of several pins rather than across the body of the entire chip. Two

Page 13 Figure 21: Virtual Wires Emulation Board NSEWNSEW RS 232 Serial 68HC11 8 8 8 8 North Connector ( 3 2 ) Sbus Interface S N W ES N W E 84 Pin PLCC Package SNWESNWE N SE W NS E W Pin Group West

13 Page 13 Figure 21: Virtual Wires Emulation Board NSEWNSEW RS 232 Serial 68HC North Connector ( 3 2 ) Sbus Interface S N W ES N W E 84 Pin PLCC Package SNWESNWE N SE W NS E W Pin Group West ( 3 2 ) East ( 3 2 ) Figure 23: FPGA Pin Permutation South Connector ( 3 2 ) Figure 22: Emulation Board Schematic pin groups are located along each side of the package. This connection scheme is analyzed in detail in [18]. 4.2 Scaling to Multiple-Board Systems Each Virtual Wires prototype board can function as either a stand-alone system or as part of a larger group of boards. Multiple-board systems are constructed by connecting individual boards together to form a two-dimensional mesh (Figure 24). A clock driver chip and remote leads for clock cables allow one board to serve as a single global source for the other boards. CLK signals are fanned out to lo-

14 Page 14 Board 0 Board 1 words of data between the Sbus card [16] and the eight bit North port of the Xilinx chip in the upper right position of the array. 4.4 Application Board 2 Board 3 Figure 24: Multi-Board Emulation System cal logic at the destination boards using a clock distribution chip. Bi-directional FPGA I/O signals along the periphery of each board extend across connectors in each of the four directions. FPGA configuration information is transferred in a serial chain to all FPGAs in the system starting at the board connected to the download cable in the upper rightmost corner of the mesh. System size is currently constrained to a total of ten boards (160 chips) by the Xilinx-imposed limit on the configuration bitstream, although this limit may be overcome with multiple download cables. 4.3 Memory and Host Interfaces Each FPGA in the mesh has twenty-two dedicated I/O lines which interface to a 64K4 SRAM chip. These chips can be used to emulate sections of on-chip memory and are populated as needed. Virtual Wires software is used to multiplex address, data, and control signals for the SRAM so that a number of individual memory accesses to the same SRAM chip may take place during each emulation clock cycle. Twenty nanosecond SRAMs are used in the current prototype. SRAM interface signals have been allocated to dedicated FPGA pins to reduce capacitive loading on inter- FPGA signal lines and to simplify system software. A low bandwidth serial interface via an embedded microcontroller provides access to the array for data transfer, configuration, and FPGA state readback. Data signals from the controller interface directly to the North port of the FPGA in the upper left corner of the array. The embedded controller has the capability to download configuration information to the array thus eliminating the need for an additional download cable from Xilinx. To provide a higher bandwidth interface to the host, a seventeenth Xilinx chip serves as an intermediary between an Sbus interface card located in the host SparcStation and the Xilinx array. This chip is capable of transferring The prototype system allows the logical behavior of one circuit component to be emulated while the rest of the system is simulated. It contains a simulation interface to both the LSI Logic LSIM and Cadence Verilog simulators. At a given simulated clock edge, software drivers transfer data representing inputs to the host workstation which subsequently forwards this data to the emulation system via the serial or S-bus host interface. Once output results are generated, the drivers return them for display or further simulation. As an alternative to simulation acceleration, a target system may be interfaced directly to the emulator with a prototyping pod. This pod plugs into the chip socket in the target system. After FPGA configuration, the emulation system exchanges data with the target system at each emulation clock while performing internal evaluation at FPGA device speeds. In both modes of operation, simulation and emulation, the usability of the system is enhanced by our InnerView Hardware Debugger [17]. This tool consists of host software, embedded controller software, and FPGA circuitry which extracts the emulation state from all FPGAs and coordinates this state with the internal register names of the design under emulation. This tool takes advantage of the serial interface s capability to perform readback from FPGAs on a chip by chip basis. The controller can be programmed to trigger a readback bitstream from any FPGA in the system, and subsequently transfer the values back to the host workstation for evaluation. 5 Results We have successfully compiled designs up to 18K gates onto the demonstration system. In conjunction with the scheduling and synthesis algorithms described in this paper, we used the Synopsys Design Compiler [33] for translation and mapping, the InCA Concept Silicon partitioner [19] for partitioning, our simulated-annealing-based placer, and the standard Xilinx-provided tools for FPGA-specific place and route. Compile time is roughly 3 to 4 hours on a Sparc- Station 10, with 90 percent of this time consumed by the vendor specific FPGA compile. This compilation can thus be accelerated by processing independent FPGA compiles in parallel. The following sections describe our benchmarks and report simulation and emulation results.

Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators

Page 1 Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators Jonathan Babb, Russell Tessier, and Anant Agarwal MIT Laboratory for Computer Science Cambridge, MA 02139 Abstract Existing