Logic Emulation with Virtual Wires

Size: px
Start display at page:

Download "Logic Emulation with Virtual Wires"

Transcription

1 Page 1 Logic Emulation with Virtual Wires Jonathan Babb, Russell Tessier, Matthew Dahl, Silvina Hanono, David Hoki, and Anant Agarwal MIT Laboratory for Computer Science Cambridge, MA Abstract Logic emulation enables designers to functionally verify complex integrated circuits prior to chip fabrication. However, traditional FPGA-based logic emulators have poor inter-chip communication bandwidth, commonly limiting gate utilization to less than 20 percent. Global routing contention mandates the use of expensive crossbar and PC-board technology in a system of otherwise low-cost, commodity parts. Even with crossbar technology, current emulators only use a fraction of potential communication bandwidth because they dedicate each FPGA pin (physical wire) to a single emulated signal (logical wire). Virtual Wires overcome pin limitations by intelligently multiplexing each physical wire among multiple logical wires and pipelining these connections at the maximum clocking frequency of the FPGA. The resulting increase in bandwidth allows effective use of low dimension, direct interconnect. The size of the FPGA array can be decreased as well, resulting in low cost logic emulation. This paper covers major contributions of the MIT Virtual Wires project. In the context of a complete emulation system, we analyze phase-based static scheduling and routing algorithms, present Virtual Wires synthesis methodologies, and overview an operational prototype with 20Kgate boards. Results, including in-circuit emulation of a SPARC microprocessor, indicate that Virtual Wires eliminate the need for expensive crossbar technology while increasing FPGA utilization beyond 45 percent. Theoretical analysis predicts that Virtual Wires emulation scales with FPGA size and average routing distance, while traditional emulation does not. 1 Introduction Field Programmable Gate Array (FPGA) based logic emulators are capable of emulating complex logic designs at clock speeds four to six orders of magnitude faster than software simulators. This performance is achieved by partitioning a logic design, described by a netlist, across an interconnected Minute Hour Day Week Month Compilation Time Logic Simulation Accelerated Simulation Year Month Week Day Hour Logic Emulation Figure 1: Verification Alternatives Final Silicon Execution Time Minute array of FPGAs. The netlist partition on each FPGA, configured directly into logic circuitry, is then executed at near hardware speeds. Figure 1 compares logic emulation to other prototyping methods, including simulation and accelerated simulation, as well as to final silicon. The y-axis measures relative time for compiling or constructing a hypothetical design, while the x-axis measures relative time for executing one set of test vectors on this design. As an example, consider final silicon which takes months to construct and runs a set of vectors in less than one minute. The same design and vector set could be compiled for a logic simulator on the order of minutes, but would take years to execute. Logic emulation fills a wide gap between simulation and actual silicon. With both a moderately fast compile time and a fast execution time, emulation offers a compromise between the programmability of software and the fast execution speed of hardware. Logic emulators are further characterized by interconnection topology, target FPGA, and supporting software. The interconnection topology describes the arrangement of FPGA devices and routing resources. Example interconnects include full crossbars and two-dimensional meshes. Impor-

2 Page 2 Not Limited unused FPGA pins unused FPGA gates Pin Limited no unused pins some unused gates Gate Limited some unused pins no unused gates Balanced no unused pins no unused gates Figure 2: Partition Limitation Scenarios tant target FPGA properties include gate count, pin count, and mapping efficiency. Supporting software is extensive, combining netlist translators, logic optimizers, technology mappers, global and FPGA-specific partitioners, placers, and routers. Traditional emulators are gate inefficient due to inherent pin limitations in the FPGA devices. To reduce pin limitations, these emulators supplement FPGAs with custom crossbars chips and expensive PC-board and backplane technology, further increasing the per-gate cost of emulation. This paper suggests an alternative solution to pin limitations based on multiplexing of FPGA resources. 1.1 Virtual Wires In existing emulator architectures, both the logic configuration and the network connectivity remain fixed for the duration of the emulation. Every emulated partition of the input design, one per FPGA, consists of a set of gates and a set of signals communicating to other partitions. Each emulated gate is mapped to one or more FPGA equivalent gates and each inter-partition emulated signal is allocated to a pair of pins between two FPGAs. Thus for a partition to be feasible, the partition gate and pin requirements must be no greater that the available FPGA resources. These constraints yield four possible scenarios (Figure 2). When typical circuits are mapped onto available FPGA devices, partitions are predominately pin limited. That is, all available FPGA gates cannot be utilized due to lack of pin resources to support them. We demonstrate this resulting bandwidth gap with a set of partitionings of the Sparcle and CMMU benchmarks (see Section 5.1) for various gate counts. Figure 3 shows the resulting curves, plotted on a log-log scale. Partition gate count is scaled by a factor of two to get FPGA equivalent gates with an assumed mapping efficiency of 50%. On the same curve we plot the pin and gate capacity of target FPGAs: the Xilinx 3000 and 4000 series [40], the Altera Flex 8000 series [3], and the Atmel 6000 series [5]. For equal average gate counts in the benchmark partitions and FPGA devices, the required average pin counts for partitions are much greater than the available pin capacity of the FPGAs. FPGA / Partiton Pin Count Alewife Cache Controller partitions Sparcle partitions Xilinx 3000 & 4000 FPGAs Xilinx 4000H FPGAs Atmel FPGAs Altera FPGAs Altera MCM Bandwidth Gap FPGA / Partition Gate Count Figure 3: Pin Count as a Function of FPGA Partition Size Pin limits set a hard upper bound on the maximum usable gate count any FPGA gate count can provide. Low utilization of gate resources increases both the number of FPGAs needed for emulation and the time required to emulate a particular design. This discrepancy will only get worse as technology scales; current trends indicate that available gate counts are increasing faster than available pin counts. Future breakthroughs in area I/O [27] may partially address this problem for FPGA packaging, but will leave open the more difficult issues of inter-board and system-level communication. Additionally, any new technology will be challenged to keep up as minimum feature size decreases faster than required bonding area. Virtual Wires eliminate the pin limitation problem of previous emulators by intelligently multiplexing each physical wire among multiple logical wires and pipelining these connections at the maximum clocking frequency of the FPGA. 1 A Virtual Wire represents a simple connection between a logical output on one FPGA and a logical input on another FPGA. Established via a pipelined, statically-routed communication network, these Virtual Wires increase available off-chip communication bandwidth by multiplexing the use of FPGA pin resources (physical wires) among multiple emulation signals (logical wires). Without Virtual Wires, one to one allocation of logical wires to physical wires does not exploit available pin bandwidth because: emulation clock frequencies are one or two orders of magnitude lower than the potential FPGA frequency; all logical wires are not active simultaneously. 1 Although this paper focuses on logic emulation, Virtual Wires can be applied to any multi-chip system.

3 Page 3 FPGA #1 FPGA #2 FPGA #1 FPGA #2 Logical Outputs Logical Outputs Logical Inputs Logical Outputs Logical Inputs Shift Loops Mux Logical Inputs Physical Wire Physical Wire Figure 4: Hard Wire Interconnect Figure 5: Virtual Wire Interconnect However, by clocking physical wires at the maximum frequency of the FPGA technology, several logical connections can share the same physical resource. Figure 4 shows an example of six logical wires allocated to six physical wires. Figure 5 shows the same example with the six logical wires sharing a single physical wire. The physical wire is multiplexed between two pipelined shift loops (Section 3). Each register in the pipeline carries a single bit of information from one logical output to the corresponding logical input in the neighboring FPGA. Systems based on Virtual Wires exploit several properties of digital circuits to boost bandwidth from available pins. In a logic design, evaluation flows from system inputs to system outputs. In a synchronous design with no combinational loops, this flow can be represented as a directed acyclic graph. Thus, through analysis of the underlying logic circuit, logical values between circuit partitions only need to be transmitted once. Furthermore, since circuit communication is inherently static, communication patterns will repeat in a predictable fashion. By exploiting this predictability, communications can be scheduled to increase pin utilization. 1.2 Emulation Software Software for logic emulation with Virtual Wires roughly follows the standard emulation tool flow (Figure 6). The input, a netlist of the logic design to be emulated, is transformed into a multi-fpga configuration bitstream to be downloaded onto the emulator. Not shown are the technology libraries, target FPGA characteristics, and FPGA interconnect topology needed to make the correct transformations. We next describe the standard steps. Translator: The input netlist to be emulated is typically generated with a hardware description language or a schematic capture program. The netlist must be syntactically translated into a format readable by the emulation software. Commercial and public domain tools are available for generic source-to-source translation. At MIT we used both Verilog and LSI logic formats. Tech Mapper: The translated netlist is still specified in terms of the source technology library for example LSI Logic s LCA100K technology [26]. Before emulation, the netlist must be mapped to a target library of FPGA primitives. Although commercial and public domain tools are also available for mapping, our simple and fast technique is to create a mapping library which describes each source primitive in terms of primitives in the target library. The inefficiency of this mapping can be largely recovered with a following logic optimization pass. Partitioner: After mapping the netlist to the target technology, the netlist is divided into partitions, each of which can fit into a single target FPGA. Without Virtual Wires, each partition must have both fewer gates and fewer pins than the target device. With Virtual Wires, the total gate count, including the overhead of Virtual Wires multiplexing logic, must be no greater than the target FPGA gate count. In the MIT implementation, we used the InCA Concept Silicon partitioner [19]. This partitioner performs K-way partitioning with min-cut and clustering techniques. Logic Netlist Translator Tech Mapper Partitioner Global Placer Global Router FPGA APR FPGA Configuration Bitstream VW Scheduler VW Synthesizer Figure 6: Emulation Software Flow

4 Page 4 FPGA Host Workstation Emulation System Target System Figure 7: Virtual Wires Emulation System Global Placer: Individual circuit partitions must be placed into specific FPGAs. An ideal placement minimizes system communication, thus requiring less routing resources. We wrote a simple placer based on simulated annealing [21] to minimize total Manhattan wire length. Global Router: In traditional emulation, inter-fpga communication is established with a global routing phase. If crossbars are employed, this phase must also determine the routing configuration for each crossbar as well as pinassignments of partition I/Os to FPGA pins. For Virtual Wires emulation, there are no direct physical connections between partitions, and this phase is completely replaced with new virtualization software to be described in this paper. FPGA APR: Once routing is complete, there is one netlist for each FPGA. Each netlist must be processed with FPGA specific automated place-and-route (APR) software to produce configuration bitstreams. We used the XACT [40] software for Xilinx FPGAs. With Virtual Wires, we replace the global router of traditional software with modules created to specifically support automatic pin multiplexing: the Virtual Wires Scheduler and the Virtual Wires Synthesizer (Figure 6). Together we refer to the transformation performed by these two components as virtualization. Although each emulation step is an intriguing aspect of CAD research, this paper focuses on these novel virtualization components, described below. Virtual Wires Scheduler: The resulting set of netlist partitions mapped to each FPGA, in conjunction with the routing resource constraints of the emulation system, is used to determine an appropriate schedule of logical wires onto physical wires. This schedule establishes a feasible timespace route for every logical wire, while guaranteeing that all multi-fpga combinational paths are correctly ordered. Schedule optimizations include minimizing the total time needed to execute the circuit, as well as minimizing the Virtual Wires logic overhead. While this scheduling problem is similar to those encountered in high level synthesis, it is complicated by inter-fpga routing constraints and the need to account for multiplexing overheads. In Section 2, we describe the phase-based scheduling algorithm implemented at MIT. Virtual Wires Synthesizer: This step implements the chosen routing schedule by synthesizing special multiplexers and registers that are added to the circuit partition in each FPGA. This logic is effectively a pipelined, statically routed network in the FPGA technology itself. For maximum efficiency, the synthesizer takes into account the underlying idiosyncrasies of the target FPGA technology. For example, FPGA pin assignment and allocation of internal tri-state buses are carefully optimized. The resulting synthesized architectures provide insight into Virtual Wires implementation. Section 3 compares three different architectures for the Xilinx 4000 series. 1.3 Low Cost Emulation System Although virtualization can be used to map input designs to any FPGA-based logic emulator, the process is most valuable when enabling the use of inexpensive, direct interconnect and cheap, low pin count FPGAs. To demonstrate this advantage, we have constructed FPGA boards composed of sixteen mesh-connected FPGAs and commodity SRAMs. These boards may themselves be mesh-connected, leading to straightforward software mapping and simplified system scalability. This system (Figure 7), described in Section 4 has demonstrated the following functionality: In-circuit emulation: FPGA array mimics one or more components of the target system and is pod-connected to the chip sockets of those missing components. Simulation acceleration: FPGA array replaces a piece of a simulation model and connects to the software simulation environment by remote calls through the host interface. Hardware subroutines: FPGA array implements a Verilog version of a subroutine in a C program and connects to the software by remote calls through the host interface [9].

5 Page 5 Section 5 describes our results for both in-circuit emulation and simulation acceleration of the Sparcle benchmark on our system, including booting a multiprocessor operating system. We leave the exploration of hardware subroutines to future reports. 1.4 Scalable Technology Not only can Virtual Wires be used to compose low-cost systems of gigantic numbers of FPGAs, but this technology also scales as FPGA sizes increase. To demonstrate this scalability, Section 6 uses Rent s Rule to derive theoretical models of emulation gate overheads for systems with and without Virtual Wires. This model accounts for the mismatch between circuit communication and FPGA communication in the hard-wired case and includes a topological factor that explains why a mesh topology does not scale without Virtual Wires. With this model we show how the derived Virtual Wire utilization scales with increasing FPGA device size and average routing distance, while hard-wired utilization may not. 1.5 Overview The rest of this paper is organized as follows: Section 2 describes the Virtual Wires scheduling and routing algorithms. Section 3 then compares three Virtual Wires synthesis architectures. After Section 4 describes our demonstration hardware system, Section 5 then present results for both simulation acceleration and in-circuit emulation on this system. Section 6 analyzes the overhead and scalability of Virtual Wires versus hard-wires. Finally, Section 7 describes related work in the field and Section 8 makes concluding remarks. 2 Scheduling Algorithms Virtualization replaces the inter-fpga routing steps of traditional emulation with software that synthesizes a routing network into the netlist partition on each FPGA. This network establishes global routes via statically scheduled bits rather than hard-wired interconnections. The first phase of this approach is a scheduling and routing algorithm. Our phase-based methodology suffices to prove the concept of Virtual Wires scheduling and is within a factor of two of more optimal algorithms presented in recent literature [32]. Before describing the scheduling algorithms, let us first introduce the basic operating principles of Virtual Wires. 2.1 Phase-Based Operating Principles Emulation Clock phase 1 phase 2 phase 3 phase 4 Evaluation microcycle Communication Figure 8: Clocking Framework uclk uenable this period into a number of microcycles determined by a free-running CLK (Figure 8). In this scheme, a microcycle is the shortest distinguishable unit of time. All routing is scheduled in discrete microcycle increments. These microcycles are grouped into sequential phases to support combinational paths that extend across multiple chips. The advantage of this approach is a decoupling of logic execution speed from inter-chip communication speed, allowing highspeed communication cycles to co-exist with a long-latency emulation clock period. A Enable signal divides each phase into an evaluation time span and a communication time span. Within a phase, a given number of microcycles are dedicated to the evaluation of the FPGA logic, followed by a set of cycles to communicate the results to other partitions in destination FPGAs. Evaluation takes place at the beginning of a phase, with logical inputs being propagated through each circuit partition to determine logical outputs for that phase. Not all inputs are available at the beginning of each phase, and not all outputs are produced. For inputs which are available, all logic is evaluated and subsequent outputs are produced. Each input and output transmission will be assigned to a single phase such that signal precedences are observed. At the end of the phase, the produced outputs are communicated to other circuit partitions at the microcycle clock rate. All necessary phases must be executed by the end of the emulation clock period. For simplicity, we limited our approach to synchronous logic with a single global emulation clock. Any asynchronous signals cannot be statically-routed and therefore must be hard-wired to dedicated FPGA pins. Virtual Wires can be extended to multiple clocks [32] and gated clocks, as well as certain types of asynchronous logic, such as multiple asynchronous clock domains. The emulation clock period is the clock period of the logic design being emulated. To facilitate multiplexing we break

6 Page 6 In1 In2 AND OR Out1 External Inputs W3 Partition1 W1 W2 W5 Partition2 W4 Partition3 W6 External Outputs In3 AND Out2 Interpartition Wire Clk D REG CLK Out3 Dependence Relation Figure 10: Depth Calculation Example Figure 9: Dependence Calculation Example 2.2 Definition of Dependence and Depth Two timing analysis computations, dependence and depth, aid in Virtual Wires scheduling. Both dependence and depth apply to inter-partition wires. To analyze input to output Dependence, we scan the logic in each partition to determine the set of outputs to which a combinational path exists from each input. An output is said to be a dependent (or a child) of an input if a change in that input can combinationally change the output. The dependence relationships between inputs and outputs for a given partition are derived recursively from those of its constituent logic elements. In determining dependence, we assume that all outputs of a combinational library primitive are dependents of all the inputs of that primitive. Similarly, no outputs are dependents of any of the inputs for sequential primitives. Let Depend[i] denote the set of outputs of a given partition that are dependents of an input of the same partition connected to an inter-partition wire i. Similarly, let D ;1 [i] represent the set of inputs that are ancestors to an output driving an inter-partition wire i. By our definition, inputs to storage elements and external outputs will have no dependents: Depend[i] =, and outputs of storage elements as well as external inputs will have no ancestors: D ;1 [i] =. Figure 9 shows an example circuit partition containing four interconnected primitive logic elements with three inputs (not including the clock) and three outputs. The dependence relationships for this partition are as follows: Depend[In1] =fout1g, Depend[In2] =fout1out2g, Depend[In3] =fout1out2g. Likewise, the ancestors are as follows: D ;1 [Out1] =fin1in2in3g, D ;1 [Out2] =fin2in3g, D ;1 [Out3] = (REG is a storage element). The depth calculations use the dependence relationships. The depth of inter-partition wire i is the largest number of partitions in a forward combinational path starting at that wire. Depth is computed recursively from the wire dependence sets such that for each wire i: ( 0 ifdepend[i] = Depth[i] = 1 + max Depth[j] j2depend[i] otherwise (1) Figure 10 shows an example design with three partitions and six inter-partition wires. The dashed-lines denote inputoutput dependence relationships. In this example, wires are at the following depths: depth 0: W4, W6 depth 1: W2,W3,W5 depth 2: W1 Our phase assignment algorithm uses depth to prioritize routing of critical paths. During scheduling, although both W1 and W2 have no ancestors, W1 has a greater depth and will be given priority. 2.3 Phase Assignment Algorithm The goal of the phase assignment algorithm is to determine an appropriate schedule of logic wires between design partitions onto physical wires between FPGAs. The resulting schedule must establish a feasible time-space route for every logical wire, while observing FPGA routing resources constraints and guaranteeing that all multi-fpga combinational paths are correctly ordered. The core scheduling algorithm consists of a shortest path router inside a greedy phase assignment loop. Within the

7 Page 7 Given: I : set of inter-partition design wires ComCycles : communication cycles per phase c : total micro-cycles per phase Produce: S : output schedule which assigns each wire i 2 I to a shiftgroup in a particular phase Procedure PhaseAssign call Depend CalcDependents(I) call Depth CalcDepth(Depend) initialize Done array to false for each wire i 2 I for each dependent wire j 2 Depend[i] DependCount[j] DependCount[j] + 1 n 0 /*phase counter*/ loop forever call RouteInit W wires with Done[i]=false and DependCount[i]=0 if W is empty exit loop n n+1 sort W by Depth[i], greatest depth first for each wire i 2 W src FPGA partition where source of i is placed dest FPGA partition where dest of i is placed path Route(src,dest) if path exists then distance = length(path) - 1 maxsignals ComCycles - distance L i, and up to maxsignals additional wires from W that have the same src and dest as i for each j 2 L delete j from W Done[j] true for each k 2 Depend[j] DependCount[k] DependCount[k] - 1 shiftgroup fn, path, L g add shiftgroup to schedule S endif endloop save n, c in schedule S /*final phase and cycle count*/ end Procedure main loop of the phase assignment algorithm, Figure 11, as many wires as possible are scheduled and routed. Once no more wires are available to schedule, the algorithm advances to the next phase. All unscheduled wires are thus pushed to the following phase when either their ancestors have not been scheduled, or there is no remaining routing path available in that phase. The phases are processed sequentially and no attempt is made to go back and optimize previously scheduled phases. Given enough phases and at least one potential path between all pairs of FPGAs, any design can be scheduled. This is easier than hard-wired routing problems, in which various rip-up and re-try strategies may be needed to find a feasible route. The algorithm starts by first calculating the dependence and depth arrays for all wires as described in the previous section. An additional array, Done[i], is set to false to mark each wire i as unscheduled. The algorithm then initializes a DependCount[i] array from the dependence information of each wire. When this counter reaches zero, as the algorithm progresses, wire i will be ready to schedule. The algorithm proceeds by assigning wires to phases until all wires have been scheduled. Advancement to following phases occurs when no wires can be scheduled in the current phase. Within each phase, ready wires with the greatest depth are iterated first, guaranteeing that critical paths are given priority. The routing algorithm is successively called to route as many ready wires as possible. Once a successful route is established from a source to destination FPGA, as many as ComCycles ; distance additional ready wires between the same source and destination are formed into a shiftgroup, where ComCyclesis the number of communication cycles in a phase and distance is the number of FPGA crossings in the routing path. For example, if their are eight cycles per phase and the distance is three, a total of five wires can be routed in the same shiftgroup. The additional wires are also prioritized by depth. For each routed wire j in the shiftgroup, Done[j] is set, and the set of wires k 2 Depend[j] are iterated to decrement DependCount[k]. IfDependCount[k] =0, wire k can be scheduled in a following phase. Any ready signals that are not successfully routed in a phase are automatically delayed to following phases. As long as the delayed signals are not on the critical path, the total number of phases will not be affected. The ComCycles parameter specifies the number of microcyles to spend in communication during each phase. This number must be greater than the routing diameter of the topology to guarantee that all signals can route. For the results in Section 5, it turns out that eight communication cycles match the eight-way tri-state bussing of the Xilinx architecture. Figure 11: Phase Assignment Algorithm

8 Page Route Algorithm Given: T : emulator topology src : source FPGA in T dest : destination FPGA in T Produce: path : list of FPGAs along shortest route from src to dest Procedure RouteInit for each FPGA src 2 T for each FPGA dest 2 T Reserve[src,dest] connections in T from src to dest Avail[src,dest] (Reserve[src,dest] 6= 0) end Procedure Procedure Route(src, dest) path ShortestPath(src,dest,Avail) /* Dijkstra s algorithm */ if path exists then for each FPGA f 2 path if f = src then /* first FPGA in path */ src1 f else /* following FPGAs in path */ dest1 f Reserve[src1,dest1] Reserve[src1,dest1]-1 Avail[src1,dest1] (Reserve[src1,dest1] 6= 0) src1 f endif return path else return null path endif end Procedure Figure 12: Routing Algorithm The goal of the routing algorithm (Figure 12) is to find a shortest available path, in terms of FPGAs, between the source and destination FPGA of a set of inter-partition wires. The algorithm keeps track of the reserved and available physical connections between FPGAs in the emulator topology and is repeatedly called from the inner loop of the phase assignment algorithm. Route uses shortest path analysis with a cost function based on channel availability. Shortest path routing minimizes both the number of microcycles needed per phase and intermediate hop logic overhead. Before the beginning of each phase, a reservation matrix, Reserve[i j], is initialized to the number of physical connections between FPGAs i and j in the emulator topology. Route applies Dijkstra s shortest path algorithm [34] to channel availability, Avail[i j] =(Reserve[i j] 6= 0), to determine the shortest path between the source and destination FPGAs. If the shortest path exists, then the reservation matrix is updated by subtracting one from each element along that path and route returns with this path; else, route returns unsuccessfully. After each successful route, PhaseAssign forms a new shiftgroup data structure (Figure 13). This data structure includes the phase number, FPGA path, and set of logical wires in that group. This information is written to the schedule file, to be passed to the synthesis phase of virtualization. 2.5 Execution Speed Analysis Before proceeding, let us compute the execution speed of Virtual Wires emulation. Based on our phased operating principles, the emulation clock cycle time will be determined by the total number of microcycles needed: v = n c (2) where n is the number of phases and c the number of cycles per phase as previously defined. If c is the same across all phases, then we can immediately recognize that: n L c D (3) Shiftgroup f phase number source FPGA, 2nd FPGA, 3rd FPGA,..., dest. FPGA logical wire 1, logical wire 2,..., logical wire N g Figure 13: Shiftgroup Data Structure where D is the maximum distance of any shiftgroup route (in the worst-case D is the network diameter of the FPGA topology), and L is the length of the critical path in the design netlist, equivalent to the maximum depth. That is, there must be enough cycles in each phase to route a signal across the diameter of the network as well as enough total phases to schedule the longest combinational path between circuit partitions. Additionally, we recognize that the total number of microcycles is also constrained by the maximum multiplexing performed at each FPGA: v P C =P f (4)

9 Page 9 where P C is the maximum circuit communication requirement, including partition pins and any additional pins required for through hops, and P f is the pin count of each FPGA 2. Combining these two observations and assuming that the number of microcycles per phase is constant across all phases, we get the following best-case speed result: Best-case microcycles: The cumulative microcycle count for all phases within a scheduled emulation clock period is bounded below by the following equation: Design Partition Design Partition Input Registers Output Registers VW FSM Hops v max( latency bound z } { L D bandwidth bound z } { P C =P f ): (5) where L is critical path length, D is network diameter, P C is the maximum circuit partition pin count including through hops, and P f is the FPGA pin count. In our practical experience, design emulation speed is determined predominately by the latency bound. 2.6 Improvements We proposed the preceding algorithms to demonstrate the feasibility of Virtual Wires and for ease of implementation of the synthesis structures described in the following section. These algorithms can be improved by scheduling at the granularity of a single microcycle and eliminating the phase barriers altogether. The advantages of such improvements [32] include: possible initiation of computation and subsequent routing as early as one microcycle after a signal arrives at a destination rather than waiting for the following phase, potential overlapping of computation with communication in different parts of the system rather than execution in exclusive time spans, support of different propagation delays for individual wires rather than observing a worst-case delay for all computation in a phase, flexible scheduling of wires at the microcycle granularity rather than scheduling of dedicating pipeline paths per phase. This scheduling also eliminates costly pipeline filling overhead at the beginning and end of each phase. We continue by describing the synthesis architectures designed at MIT. These architectures implement the virtualized routing network produced by the phase assignment and routing procedures. 2 Note that we have ignored pipeline startup overhead associated with each shiftgroup. FPGA Figure 14: Virtualized FPGA Composition 3 Synthesis Architectures Although it would be possible to design an FPGA with multiplexed pins, we implemented Virtual Wires without custom hardware support. That is, the virtualization process synthesizes the required multiplexing components directly into the FPGA netlists, to be downloaded with the original design partitions. Thus, any existing FPGA-based logic emulation system can take advantage of Virtual Wires. After discussing the synthesis algorithms, this section proceeds to describe three of many possible synthesis architectures based on shift registers in Xilinx FPGAs. 3.1 Synthesis Algorithm The Virtual Wires Synthesizer flowchart component in Figure 6, takes the following input: emulator topology, external design I/O pin assignment, routing schedule, design partition netlists, and produces: one virtualized netlist for each FPGA. As shown in Figure 14, these new netlists contain the original design partition along with all necessary Virtual Wires communication logic. Synthesized logic includes input and output shift registers, through hops, and the VW-FSM finite state machine control logic. After reading in the appropriate inputs from previous compilation stages, the Synthesizer algorithm ( Figure 16) proceeds to synthesize the VW-FSM control logic for each FPGA. For the most part, this logic is identical for each

10 Page 10 u C L K u E n a b l e V W F S M C o n t r o l S i g n a l s Figure 15: VW-FSM Control Logic Given: T : emulator topology E : external I/O pin assignment S : routing schedule D : set of design partition netlists Produce: X : set of virtualized FPGA netlists Procedure Synthesize N number of phases in S C cycles per phase in S for each FPGA f 2 T given (N,C) synthesize control logic vwfsm[f] virtuallogic[f] vwfsm[f] for each phase P 2 S for each shiftgroup G 2 P R inter-fpga routing path for G L number of logical wires 2 shiftgroup G for each FPGA f 2 R: if f is first FPGA in R then logic synthesize output shifter of length L assign each logical output in G to logic assign physical FPGA output pin to logic else if f is an intermediate FPGA in R then logic synthesize intermediate hop assign physical FPGA I/O pins to logic else /* if f is last FPGA in R */ logic synthesize input shifter of length L assign each logical input in G to logic assign physical FPGA input pin to logic endif assign control nets for vwfsm[f] to logic virtuallogic[f] virtuallogic[f] + logic for each external I/O e 2 E assign e to its specified FPGA physical pin for each FPGA f 2 T partition[f] design partition in D placed on f X[f] virtuallogic[f] + partition[f] end Procedure Figure 16: Synthesis Algorithm FPGA, determined solely by the number of phases and microcycles per phases in the schedule. The VW-FSM (Figure 15) logic takes as input the CLK and the Enable signals, distributed to each FPGA, and generates the appropriate control signals during each microcycle. As described in Section 2, the CLK is the free running pipeline clock, while the Enable clock is synchronized to the emulation clock and determines when to start the communication sequence for each emulation phase. The output control signals are responsible for strobing logical wires into the appropriate registers and controlling multiplexer selection. The algorithm iterates through the shiftgroups in each phase to construct the input, intermediate hop, and output architectures. Each shiftgroup data structure contains the logical wires assigned to that group as well as the group s phase and FPGA path. As the architectures are synthesized, they are connected between the partition logical wires and FPGA physical wires as well as to the appropriate control signals. Not shown in the algorithm, the Synthesizer also makes low level implementation decisions at this time to optimize the use of limited FPGA resources, including tristate busses and combinational logic blocks. In addition, we have added a simple pin permutation algorithm which minimizes the use of on-chip routing resources for hops. The Synthesizer lastly assigns any external connections to corresponding periphery FPGA pins. Some of these pins connect to external interface hardware for communication with a logic simulator or other control programs. Additional pins provide global clocks and sequencing signals. The remaining pins may be connected to external pods to support in-circuit emulation. The accumulated logic synthesized for each FPGA is then merged with the original design partition for that FPGA and a final virtualized netlist file is output in FPGA format (XNF for Xilinx). These files are input to the FPGA-specific placeand-route stage which creates the emulator bitstream. 3.2 Shift Register Architectures We now compare three shift register architectures synthesizable to Xilinx 4000 FPGAs. Full Shift Register The full shift register architecture was originally proposed as a proof-of-concept Virtual Wires implementation [7]. This architecture consists of identical input and output shift loops (Figure 17). In output mode, shift loops load emulated signal states at the beginning of each phase and shift these states out serially onto a routed physical connection at the microcycle rate. For connections requiring multiple hops, a one-bit shift register is placed in each intermediate FPGA (Figure 18), forming a shift register pipeline between source and destination. At the end of the pipeline, corresponding

11 Page 11 Phase Enable Logical Outputs Shift/Rotate Physical Input Load uclk si ld D so si D so si D so si D ld ld ld so Physical Output Logical Inputs Figure 17: Full Shift Register Architecture Phase Enable Unused Pad Unused Pad Physical Input D Physical Output Shift/Rotate Physical Input si so si so si so uclk uclk Shift/Hold Figure 18: Intermediate Hop Architecture Logical Inputs input mode shift loops de-multiplex and latch the emulated signals and drive them into the emulated logic. Note that the input shift loops must store their state so that all emulated logic inputs are available for subsequent evaluation. Output logic, however, can be reused for multiple groups of emulation signals in different phases. To support per-phase routing, each inter-fpga I/O pad is preceded by a multiplexer that selects the appropriate shift loop output during its active phase. Pads are bidirectional with the pad driver enable signal asserted during phases in which that pad is an output. To minimize associated pad logic, the Synthesizer groups inputs and outputs separately when possible. Gated Shift Register To reduce the Virtual Wires consumption of core FPGA resources, the Synthesizer can utilize architecture-specific FPGA features. In low-cost, low-pincount FPGA parts, many of the I/O pads are not connected to pins, and the Synthesizer can concatenate their registers to form Virtual Wires shift registers (Figure 19). Due to pad configuration constraints, these shift registers cannot be parallel-loaded, so they are not usable for output shiftgroups. However, the Synthesizer can place input shiftgroups and intermediate hop shift registers here. Since input shiftgroups must hold the emulated signal state after receiving it, and these I/O registers do not have clock enables, the Synthesizer generates and distributes a gated CLK. During the portions of the Virtual Wires cycle in which the emulated logic is being evaluated, this clock is frozen. In addition, the length of the input shiftgroups is adjusted to divide evenly into the number of Figure 19: Gated Shift Registers Using Pad Registers CLKs between evaluation periods so that the state in these registers can recirculate without change. In the 84-pin PLCC Xilinx 4005, this approach recovered 102 input and hop shift register bits. However, clock gating and the slower timing of the I/O pad registers reduced the achievable CLK rate. Addressable Shift Register A further variation is to replace the output shiftgroup shift registers with tristate multiplexers available in the Xilinx architecture (Figure 20). The Synthesizer creates an additional set of global control signals, labeled cycle enables, to enable each bit of the multiplexer during the appropriate micro-clock tick of each Virtual Wires phase. The Synthesizer also replaces the input shift registers with sets of individual register bits whose clock enables are controlled by the phase signal as before, but whose clocks are successive cycle enables. This architecture considerably reduces the cost of the Virtual Wires shift loops in terms of logic resources, but the additional control signals and the use of the tri-state multiplexers add routing overhead. This overhead is reduced somewhat by placing many of the additional signals on global clock nets. Also, strategic use of the I/O pad registers for pipelining recovers speed. Finally, this architecture can support the more flexible Virtual Wires scheduling methods described in [32].

12 Page 12 Phase Enable Logical Outputs Cycle Enables Output Shiftgroup Cycle Enables Physical Input Phase Enable Physical Output Input Shiftgroup Figure 20: Addressable Shift Loop Architecture D ena D ena D ena D ena Logical Inputs Resource Design Total Logic Logic (Virtual Overhead) Fig 17 Fig 19 Fig 20 Packed CLBs (Total) (32%) (33%) (17%) Lookup Tables (Combinational) (30%) (31%) (12%) Registers (Sequential) (48%) (27%) (36%) Xilinx PIPs (Routing) (42%) (48%) (37%) Average Resource Usage per FPGA CLK Speed 33MHz 24MHz 25MHz Emulation Speed 1.2MHz.89MHz.93MHz Maximum Clock Speed Comparisons Table 1: Architectural Comparison Table 1 compares each architecture in implementing the smallest benchmark circuit, Palindrome (see Section 5.1), on the 16-FPGA demonstration hardware presented in Section 5. Speed is measured in terms of the CLK speed. We calculated overhead as a percentage of consumed resources taken up by Virtual Wires. This Virtual Wire resource consumption is computed by subtracting the emulation logic resource consumption from total resource consumption. We measured emulation logic resource consumption by compiling un-virtualized partitions onto high pin count FPGAs. CLB refers to the basic Xilinx Combinational Logic Block, which includes both combinational lookup tables and sequential registers. We list the Programmable Interconnect Points (PIPs) as reported by the Xilinx router. Note that the reported numbers are for hardware emulation and do not include any additional speed and resource overheads that may be attributed to simulation acceleration. The full shift register implementation is relatively fast but has significant overhead. The gated shift register architecture using the I/O pad registers is somewhat slower due to the reduced speed of these registers. This version does use fewer of the core registers, but routing overhead is higher because of the greater wiring distances covered between the pad registers and the core logic. Finally, the addressable scheme, has generally lower logic and routing overhead while maintaining moderate speed. The results in the remainder of this paper are based on the basic full shift register scheme, although we believe the addressable scheme to be the best of the three schemes because it can support more sophisticated scheduling algorithms as described in [32]. 4 Demonstration Hardware System Our demonstration hardware building block is a scalable emulation board [36] which is inexpensive to manufacture and easy to build (Figure 21). One or more boards are interfaced to a host workstation. Each board contains sixteen Xilinx FPGAs [40] interconnected in a two-dimensional nearest-neighbor mesh. The board is six layers, uses only through-hole devices, and is ten inches square in size. System size is scaled by attaching additional boards on any of the four sides of the current system boards, without the need for crossbars or esoteric backplane technology. On board SRAM supports emulation of large design memories. At the present time we use a SparcStation 10 as the host interface although the emulation board may be reconfigured to interface to virtually any host computer. Any emulation board can communicate with this host workstation through either a serial or S-bus communication port. These interfaces are used to both observe in-circuit emulation status and to provide circuit inputs and outputs for simulation acceleration. The following sections further detail the important features of the demonstration system. 4.1 FPGA Array To emphasize the utility of Virtual Wires for inter-chip communication, we used no expensive crossbar chips and only low pin-count FPGAs (84-pin PLCCs). These FPGAs may be clocked at speeds approaching 40MHz thus resulting in small inter-chip delays and high emulation throughput. Figure 22 shows the board schematic. Each FPGA communicates with its four nearest neighbors (logically North, South, East, and West) through eight bidirectional I/O signals. To minimize multi-fpga routing resources, these I/O signals are distributed along the chip package in an alternating pattern (Figure 23). Thus I/O port signals are physically allocated so that adjacent I/O pins are assigned to signals from differing ports. With this permutation a signal passing through the chip need only be routed the length of several pins rather than across the body of the entire chip. Two

13 Page 13 Figure 21: Virtual Wires Emulation Board NSEWNSEW RS 232 Serial 68HC North Connector ( 3 2 ) Sbus Interface S N W ES N W E 84 Pin PLCC Package SNWESNWE N SE W NS E W Pin Group West ( 3 2 ) East ( 3 2 ) Figure 23: FPGA Pin Permutation South Connector ( 3 2 ) Figure 22: Emulation Board Schematic pin groups are located along each side of the package. This connection scheme is analyzed in detail in [18]. 4.2 Scaling to Multiple-Board Systems Each Virtual Wires prototype board can function as either a stand-alone system or as part of a larger group of boards. Multiple-board systems are constructed by connecting individual boards together to form a two-dimensional mesh (Figure 24). A clock driver chip and remote leads for clock cables allow one board to serve as a single global source for the other boards. CLK signals are fanned out to lo-

14 Page 14 Board 0 Board 1 words of data between the Sbus card [16] and the eight bit North port of the Xilinx chip in the upper right position of the array. 4.4 Application Board 2 Board 3 Figure 24: Multi-Board Emulation System cal logic at the destination boards using a clock distribution chip. Bi-directional FPGA I/O signals along the periphery of each board extend across connectors in each of the four directions. FPGA configuration information is transferred in a serial chain to all FPGAs in the system starting at the board connected to the download cable in the upper rightmost corner of the mesh. System size is currently constrained to a total of ten boards (160 chips) by the Xilinx-imposed limit on the configuration bitstream, although this limit may be overcome with multiple download cables. 4.3 Memory and Host Interfaces Each FPGA in the mesh has twenty-two dedicated I/O lines which interface to a 64K4 SRAM chip. These chips can be used to emulate sections of on-chip memory and are populated as needed. Virtual Wires software is used to multiplex address, data, and control signals for the SRAM so that a number of individual memory accesses to the same SRAM chip may take place during each emulation clock cycle. Twenty nanosecond SRAMs are used in the current prototype. SRAM interface signals have been allocated to dedicated FPGA pins to reduce capacitive loading on inter- FPGA signal lines and to simplify system software. A low bandwidth serial interface via an embedded microcontroller provides access to the array for data transfer, configuration, and FPGA state readback. Data signals from the controller interface directly to the North port of the FPGA in the upper left corner of the array. The embedded controller has the capability to download configuration information to the array thus eliminating the need for an additional download cable from Xilinx. To provide a higher bandwidth interface to the host, a seventeenth Xilinx chip serves as an intermediary between an Sbus interface card located in the host SparcStation and the Xilinx array. This chip is capable of transferring The prototype system allows the logical behavior of one circuit component to be emulated while the rest of the system is simulated. It contains a simulation interface to both the LSI Logic LSIM and Cadence Verilog simulators. At a given simulated clock edge, software drivers transfer data representing inputs to the host workstation which subsequently forwards this data to the emulation system via the serial or S-bus host interface. Once output results are generated, the drivers return them for display or further simulation. As an alternative to simulation acceleration, a target system may be interfaced directly to the emulator with a prototyping pod. This pod plugs into the chip socket in the target system. After FPGA configuration, the emulation system exchanges data with the target system at each emulation clock while performing internal evaluation at FPGA device speeds. In both modes of operation, simulation and emulation, the usability of the system is enhanced by our InnerView Hardware Debugger [17]. This tool consists of host software, embedded controller software, and FPGA circuitry which extracts the emulation state from all FPGAs and coordinates this state with the internal register names of the design under emulation. This tool takes advantage of the serial interface s capability to perform readback from FPGAs on a chip by chip basis. The controller can be programmed to trigger a readback bitstream from any FPGA in the system, and subsequently transfer the values back to the host workstation for evaluation. 5 Results We have successfully compiled designs up to 18K gates onto the demonstration system. In conjunction with the scheduling and synthesis algorithms described in this paper, we used the Synopsys Design Compiler [33] for translation and mapping, the InCA Concept Silicon partitioner [19] for partitioning, our simulated-annealing-based placer, and the standard Xilinx-provided tools for FPGA-specific place and route. Compile time is roughly 3 to 4 hours on a Sparc- Station 10, with 90 percent of this time consumed by the vendor specific FPGA compile. This compilation can thus be accelerated by processing independent FPGA compiles in parallel. The following sections describe our benchmarks and report simulation and emulation results.

Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators

Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators Page 1 Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators Jonathan Babb, Russell Tessier, and Anant Agarwal MIT Laboratory for Computer Science Cambridge, MA 02139 Abstract Existing

More information

TIERS: Topology IndependEnt Pipelined Routing and Scheduling for VirtualWire TM Compilation

TIERS: Topology IndependEnt Pipelined Routing and Scheduling for VirtualWire TM Compilation TIERS: Topology IndependEnt Pipelined Routing and Scheduling for VirtualWire TM Compilation Charles Selvidge, Anant Agarwal, Matt Dahl, Jonathan Babb Virtual Machine Works, Inc. 1 Kendall Sq. Building

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience H. Krupnova CMG/FMVG, ST Microelectronics Grenoble, France Helena.Krupnova@st.com Abstract Today, having a fast hardware

More information

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic ECE 42/52 Rapid Prototyping with FPGAs Dr. Charlie Wang Department of Electrical and Computer Engineering University of Colorado at Colorado Springs Evolution of Implementation Technologies Discrete devices:

More information

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) Bill Jason P. Tomas Dept. of Electrical and Computer Engineering University of Nevada Las Vegas FIELD PROGRAMMABLE ARRAYS Dominant digital design

More information

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Jin Hee Kim and Jason Anderson FPL 2015 London, UK September 3, 2015 2 Motivation for Synthesizable FPGA Trend towards ASIC design flow Design

More information

Multi-domain Communication Scheduling For FPGA-based Logic Emulation

Multi-domain Communication Scheduling For FPGA-based Logic Emulation Multi-domain Communication Scheduling For -based Logic Emulation Abstract Communication scheduling is a technique used by many parallel verification systems to pipeline data signals across shared physical

More information

A Time-Multiplexed FPGA

A Time-Multiplexed FPGA A Time-Multiplexed FPGA Steve Trimberger, Dean Carberry, Anders Johnson, Jennifer Wong Xilinx, nc. 2 100 Logic Drive San Jose, CA 95124 408-559-7778 steve.trimberger @ xilinx.com Abstract This paper describes

More information

Digital Design Methodology (Revisited) Design Methodology: Big Picture

Digital Design Methodology (Revisited) Design Methodology: Big Picture Digital Design Methodology (Revisited) Design Methodology Design Specification Verification Synthesis Technology Options Full Custom VLSI Standard Cell ASIC FPGA CS 150 Fall 2005 - Lec #25 Design Methodology

More information

Digital Design Methodology

Digital Design Methodology Digital Design Methodology Prof. Soo-Ik Chae Digital System Designs and Practices Using Verilog HDL and FPGAs @ 2008, John Wiley 1-1 Digital Design Methodology (Added) Design Methodology Design Specification

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function. FPGA Logic block of an FPGA can be configured in such a way that it can provide functionality as simple as that of transistor or as complex as that of a microprocessor. It can used to implement different

More information

An FPGA Project for use in a Digital Logic Course

An FPGA Project for use in a Digital Logic Course Session 3226 An FPGA Project for use in a Digital Logic Course Daniel C. Gray, Thomas D. Wagner United States Military Academy Abstract The Digital Computer Logic Course offered at the United States Military

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today.

Programmable Logic Devices FPGA Architectures II CMPE 415. Overview This set of notes introduces many of the features available in the FPGAs of today. Overview This set of notes introduces many of the features available in the FPGAs of today. The majority use SRAM based configuration cells, which allows fast reconfiguation. Allows new design ideas to

More information

This chapter provides the background knowledge about Multistage. multistage interconnection networks are explained. The need, objectives, research

This chapter provides the background knowledge about Multistage. multistage interconnection networks are explained. The need, objectives, research CHAPTER 1 Introduction This chapter provides the background knowledge about Multistage Interconnection Networks. Metrics used for measuring the performance of various multistage interconnection networks

More information

Static Scheduling of Multidomain Circuits for Fast Functional Verification

Static Scheduling of Multidomain Circuits for Fast Functional Verification IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 21, NO. 11, NOVEMBER 2002 1253 Static Scheduling of Multidomain Circuits for Fast Functional Verification Murali Kudlugi

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

INTERCONNECTION NETWORKS LECTURE 4

INTERCONNECTION NETWORKS LECTURE 4 INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source

More information

CprE 583 Reconfigurable Computing

CprE 583 Reconfigurable Computing CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #9 Logic Emulation Technology Recap FPGA-Based Router (FPX)

More information

Interconnect Technology and Computational Speed

Interconnect Technology and Computational Speed Interconnect Technology and Computational Speed From Chapter 1 of B. Wilkinson et al., PARAL- LEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers, augmented

More information

Built-In Self-Test for Programmable I/O Buffers in FPGAs and SoCs

Built-In Self-Test for Programmable I/O Buffers in FPGAs and SoCs Built-In Self-Test for Programmable I/O Buffers in FPGAs and SoCs Sudheer Vemula, Student Member, IEEE, and Charles Stroud, Fellow, IEEE Abstract The first Built-In Self-Test (BIST) approach for the programmable

More information

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments 8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments QII51017-9.0.0 Introduction The Quartus II incremental compilation feature allows you to partition a design, compile partitions

More information

Outline. EECS Components and Design Techniques for Digital Systems. Lec 11 Putting it all together Where are we now?

Outline. EECS Components and Design Techniques for Digital Systems. Lec 11 Putting it all together Where are we now? Outline EECS 5 - Components and Design Techniques for Digital Systems Lec Putting it all together -5-4 David Culler Electrical Engineering and Computer Sciences University of California Berkeley Top-to-bottom

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

A New Approach to Execution Time Estimations in a Hardware/Software Codesign Environment

A New Approach to Execution Time Estimations in a Hardware/Software Codesign Environment A New Approach to Execution Time Estimations in a Hardware/Software Codesign Environment JAVIER RESANO, ELENA PEREZ, DANIEL MOZOS, HORTENSIA MECHA, JULIO SEPTIÉN Departamento de Arquitectura de Computadores

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

COE 561 Digital System Design & Synthesis Introduction

COE 561 Digital System Design & Synthesis Introduction 1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design

More information

High-level Variable Selection for Partial-Scan Implementation

High-level Variable Selection for Partial-Scan Implementation High-level Variable Selection for Partial-Scan Implementation FrankF.Hsu JanakH.Patel Center for Reliable & High-Performance Computing University of Illinois, Urbana, IL Abstract In this paper, we propose

More information

Actel s SX Family of FPGAs: A New Architecture for High-Performance Designs

Actel s SX Family of FPGAs: A New Architecture for High-Performance Designs Actel s SX Family of FPGAs: A New Architecture for High-Performance Designs A Technology Backgrounder Actel Corporation 955 East Arques Avenue Sunnyvale, California 94086 April 20, 1998 Page 2 Actel Corporation

More information

MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT

MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT Item Type text; Proceedings Authors Grebe, David L. Publisher International Foundation for Telemetering Journal International Telemetering

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes. Topics! SRAM-based FPGA fabrics:! Xilinx.! Altera. SRAM-based FPGAs! Program logic functions, using SRAM.! Advantages:! Re-programmable;! dynamically reconfigurable;! uses standard processes.! isadvantages:!

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

On GPU Bus Power Reduction with 3D IC Technologies

On GPU Bus Power Reduction with 3D IC Technologies On GPU Bus Power Reduction with 3D Technologies Young-Joon Lee and Sung Kyu Lim School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA yjlee@gatech.edu, limsk@ece.gatech.edu Abstract The

More information

Introduction to reconfigurable systems

Introduction to reconfigurable systems Introduction to reconfigurable systems Reconfigurable system (RS)= any system whose sub-system configurations can be changed or modified after fabrication Reconfigurable computing (RC) is commonly used

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

What is Xilinx Design Language?

What is Xilinx Design Language? Bill Jason P. Tomas University of Nevada Las Vegas Dept. of Electrical and Computer Engineering What is Xilinx Design Language? XDL is a human readable ASCII format compatible with the more widely used

More information

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) 1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.

More information

An FPGA Implementation of Bene Permutation Networks. Richard N. Pedersen Anatole D. Ruslanov and Jeremy R. Johnson

An FPGA Implementation of Bene Permutation Networks. Richard N. Pedersen Anatole D. Ruslanov and Jeremy R. Johnson An FPGA Implementation of Bene Permutation Networks Richard N. Pedersen Anatole D. Ruslanov and Jeremy R. Johnson Technical Report DU-CS-03-06 Department of Computer Science Drexel University Philadelphia,

More information

TECHNOLOGY MAPPING FOR THE ATMEL FPGA CIRCUITS

TECHNOLOGY MAPPING FOR THE ATMEL FPGA CIRCUITS TECHNOLOGY MAPPING FOR THE ATMEL FPGA CIRCUITS Zoltan Baruch E-mail: Zoltan.Baruch@cs.utcluj.ro Octavian Creţ E-mail: Octavian.Cret@cs.utcluj.ro Kalman Pusztai E-mail: Kalman.Pusztai@cs.utcluj.ro Computer

More information

AMS Behavioral Modeling

AMS Behavioral Modeling CHAPTER 3 AMS Behavioral Modeling Ronald S. Vogelsong, Ph.D. Overview Analog designers have for many decades developed their design using a Bottom-Up design flow. First, they would gain the necessary understanding

More information

VHDL for Synthesis. Course Description. Course Duration. Goals

VHDL for Synthesis. Course Description. Course Duration. Goals VHDL for Synthesis Course Description This course provides all necessary theoretical and practical know how to write an efficient synthesizable HDL code through VHDL standard language. The course goes

More information

A Path Based Algorithm for Timing Driven. Logic Replication in FPGA

A Path Based Algorithm for Timing Driven. Logic Replication in FPGA A Path Based Algorithm for Timing Driven Logic Replication in FPGA By Giancarlo Beraudo B.S., Politecnico di Torino, Torino, 2001 THESIS Submitted as partial fulfillment of the requirements for the degree

More information

asoc: : A Scalable On-Chip Communication Architecture

asoc: : A Scalable On-Chip Communication Architecture asoc: : A Scalable On-Chip Communication Architecture Russell Tessier, Jian Liang,, Andrew Laffely,, and Wayne Burleson University of Massachusetts, Amherst Reconfigurable Computing Group Supported by

More information

Verifying the Correctness of the PA 7300LC Processor

Verifying the Correctness of the PA 7300LC Processor Verifying the Correctness of the PA 7300LC Processor Functional verification was divided into presilicon and postsilicon phases. Software models were used in the presilicon phase, and fabricated chips

More information

Design Tools for 100,000 Gate Programmable Logic Devices

Design Tools for 100,000 Gate Programmable Logic Devices esign Tools for 100,000 Gate Programmable Logic evices March 1996, ver. 1 Product Information Bulletin 22 Introduction The capacity of programmable logic devices (PLs) has risen dramatically to meet the

More information

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Bradley F. Dutton, Graduate Student Member, IEEE, and Charles E. Stroud, Fellow, IEEE Dept. of Electrical and Computer Engineering

More information

Employing Multi-FPGA Debug Techniques

Employing Multi-FPGA Debug Techniques Employing Multi-FPGA Debug Techniques White Paper Traditional FPGA Debugging Methods Debugging in FPGAs has been difficult since day one. Unlike simulation where designers can see any signal at any time,

More information

Structured Datapaths. Preclass 1. Throughput Yield. Preclass 1

Structured Datapaths. Preclass 1. Throughput Yield. Preclass 1 ESE534: Computer Organization Day 23: November 21, 2016 Time Multiplexing Tabula March 1, 2010 Announced new architecture We would say w=1, c=8 arch. March, 2015 Tabula closed doors 1 [src: www.tabula.com]

More information

COPYRIGHTED MATERIAL. Architecting Speed. Chapter 1. Sophisticated tool optimizations are often not good enough to meet most design

COPYRIGHTED MATERIAL. Architecting Speed. Chapter 1. Sophisticated tool optimizations are often not good enough to meet most design Chapter 1 Architecting Speed Sophisticated tool optimizations are often not good enough to meet most design constraints if an arbitrary coding style is used. This chapter discusses the first of three primary

More information

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

ESE534: Computer Organization. Tabula. Previously. Today. How often is reuse of the same operation applicable?

ESE534: Computer Organization. Tabula. Previously. Today. How often is reuse of the same operation applicable? ESE534: Computer Organization Day 22: April 9, 2012 Time Multiplexing Tabula March 1, 2010 Announced new architecture We would say w=1, c=8 arch. 1 [src: www.tabula.com] 2 Previously Today Saw how to pipeline

More information

Overview of Microcontroller and Embedded Systems

Overview of Microcontroller and Embedded Systems UNIT-III Overview of Microcontroller and Embedded Systems Embedded Hardware and Various Building Blocks: The basic hardware components of an embedded system shown in a block diagram in below figure. These

More information

Best Practices for Incremental Compilation Partitions and Floorplan Assignments

Best Practices for Incremental Compilation Partitions and Floorplan Assignments Best Practices for Incremental Compilation Partitions and Floorplan Assignments December 2007, ver. 1.0 Application Note 470 Introduction The Quartus II incremental compilation feature allows you to partition

More information

The Design and Implementation of a Low-Latency On-Chip Network

The Design and Implementation of a Low-Latency On-Chip Network The Design and Implementation of a Low-Latency On-Chip Network Robert Mullins 11 th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27 th, 2006, Yokohama, Japan. Introduction Current

More information

CHAPTER 2 LITERATURE REVIEW

CHAPTER 2 LITERATURE REVIEW CHAPTER 2 LITERATURE REVIEW As this music box project involved FPGA, Verilog HDL language, and Altera Education Kit (UP2 Board), information on the basic of the above mentioned has to be studied. 2.1 Introduction

More information

Writing Circuit Descriptions 8

Writing Circuit Descriptions 8 8 Writing Circuit Descriptions 8 You can write many logically equivalent descriptions in Verilog to describe a circuit design. However, some descriptions are more efficient than others in terms of the

More information

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture Robert S. French April 5, 1989 Abstract Computational origami is a parallel-processing concept in which

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Choosing an Intellectual Property Core

Choosing an Intellectual Property Core Choosing an Intellectual Property Core MIPS Technologies, Inc. June 2002 One of the most important product development decisions facing SOC designers today is choosing an intellectual property (IP) core.

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

Synthesis at different abstraction levels

Synthesis at different abstraction levels Synthesis at different abstraction levels System Level Synthesis Clustering. Communication synthesis. High-Level Synthesis Resource or time constrained scheduling Resource allocation. Binding Register-Transfer

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

Place and Route for FPGAs

Place and Route for FPGAs Place and Route for FPGAs 1 FPGA CAD Flow Circuit description (VHDL, schematic,...) Synthesize to logic blocks Place logic blocks in FPGA Physical design Route connections between logic blocks FPGA programming

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

Computer Architecture

Computer Architecture Computer Architecture Lecture 1: Digital logic circuits The digital computer is a digital system that performs various computational tasks. Digital computers use the binary number system, which has two

More information

Fault-Tolerant Routing Algorithm in Meshes with Solid Faults

Fault-Tolerant Routing Algorithm in Meshes with Solid Faults Fault-Tolerant Routing Algorithm in Meshes with Solid Faults Jong-Hoon Youn Bella Bose Seungjin Park Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science Oregon State University

More information

Section 3 - Backplane Architecture Backplane Designer s Guide

Section 3 - Backplane Architecture Backplane Designer s Guide Section 3 - Backplane Architecture Backplane Designer s Guide March 2002 Revised March 2002 The primary criteria for backplane design are low cost, high speed, and high reliability. To attain these often-conflicting

More information

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP 1 M.DEIVAKANI, 2 D.SHANTHI 1 Associate Professor, Department of Electronics and Communication Engineering PSNA College

More information

High-Level Synthesis

High-Level Synthesis High-Level Synthesis 1 High-Level Synthesis 1. Basic definition 2. A typical HLS process 3. Scheduling techniques 4. Allocation and binding techniques 5. Advanced issues High-Level Synthesis 2 Introduction

More information

AN AUTOMATED INTERCONNECT DESIGN SYSTEM

AN AUTOMATED INTERCONNECT DESIGN SYSTEM AN AUTOMATED INTERCONNECT DESIGN SYSTEM W. E. Pickrell Automation Systems, Incorporated INTRODUCTION This paper describes a system for automatically designing and producing artwork for interconnect surfaces.

More information

AN 567: Quartus II Design Separation Flow

AN 567: Quartus II Design Separation Flow AN 567: Quartus II Design Separation Flow June 2009 AN-567-1.0 Introduction This application note assumes familiarity with the Quartus II incremental compilation flow and floorplanning with the LogicLock

More information

Fault Grading FPGA Interconnect Test Configurations

Fault Grading FPGA Interconnect Test Configurations * Fault Grading FPGA Interconnect Test Configurations Mehdi Baradaran Tahoori Subhasish Mitra* Shahin Toutounchi Edward J. McCluskey Center for Reliable Computing Stanford University http://crc.stanford.edu

More information

Very Large Scale Integration (VLSI)

Very Large Scale Integration (VLSI) Very Large Scale Integration (VLSI) Lecture 6 Dr. Ahmed H. Madian Ah_madian@hotmail.com Dr. Ahmed H. Madian-VLSI 1 Contents FPGA Technology Programmable logic Cell (PLC) Mux-based cells Look up table PLA

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

The Memory Component

The Memory Component The Computer Memory Chapter 6 forms the first of a two chapter sequence on computer memory. Topics for this chapter include. 1. A functional description of primary computer memory, sometimes called by

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Advanced Design System DSP Synthesis

Advanced Design System DSP Synthesis Advanced Design System 2002 DSP Synthesis February 2002 Notice The information contained in this document is subject to change without notice. Agilent Technologies makes no warranty of any kind with regard

More information

CPE/EE 421/521 Fall 2004 Chapter 4 The CPU Hardware Model. Dr. Rhonda Kay Gaede UAH. The CPU Hardware Model - Overview

CPE/EE 421/521 Fall 2004 Chapter 4 The CPU Hardware Model. Dr. Rhonda Kay Gaede UAH. The CPU Hardware Model - Overview CPE/EE 421/521 Fall 2004 Chapter 4 The 68000 CPU Hardware Model Dr. Rhonda Kay Gaede UAH Fall 2004 1 The 68000 CPU Hardware Model - Overview 68000 interface Timing diagram Minimal configuration using the

More information

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER A Thesis by SUNGHO PARK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements

More information

Spiral 2-8. Cell Layout

Spiral 2-8. Cell Layout 2-8.1 Spiral 2-8 Cell Layout 2-8.2 Learning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as geometric

More information

The Challenges of Doing a PCI Design in FPGAs

The Challenges of Doing a PCI Design in FPGAs The Challenges of Doing a PCI Design in FPGAs Nupur Shah What are the challenges of doing PCI in FPGAs? This paper covers the issues Xilinx has discovered in doing PCI in FPGAs, how they have been surmounted

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION Rapid advances in integrated circuit technology have made it possible to fabricate digital circuits with large number of devices on a single chip. The advantages of integrated circuits

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information