The Use of the No-Instruction-Set Computer (NISC) for Acceleration in WISHBONE-Based Systems


Roko Grubišić, Vlado Sruk
Technical Report
Department of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, Zagreb, Croatia
{roko.grubisic, vlado.sruk}@fer.hr

Abstract: General-purpose processors are often unable to exploit the parallelism inherent in software code, which is why additional hardware accelerators are needed to meet performance goals. NISC (No-Instruction-Set Computer) is a new approach to hardware-software co-design based on automatic generation of special-purpose processors. It was designed to be self-sufficient and eliminates the need for other processors in the system. This work describes a method for expanding the application domain of the NISC processor to general-purpose processor systems with large amounts of processor-specific legacy code. This coprocessor-based approach accelerates applications by exploiting both instruction-level and task-level parallelism: performance-critical parts of an application are migrated to hardware without changing the rest of the program code. To demonstrate the concept, a NISC coprocessor WISHBONE interface was designed. It was implemented and tested in a WISHBONE system based on Altium's TSK3000A general-purpose RISC soft processor, and an analytical model was proposed to evaluate its efficiency in arbitrary systems.

Keywords: NISC, WISHBONE, coprocessor, bus, interface

1 Introduction

Embedded computer systems are no longer used only as simple control devices. Instead, today's embedded systems have to efficiently perform complex tasks and algorithms despite increasingly stringent design constraints and shrinking time-to-market. The traditional software-centered design approach offers high designer productivity, but low design quality in terms of performance and logic utilization. The key to increasing design quality is migrating the computationally intensive parts of the system's task to hardware. Since general-purpose processors are often unable to exploit the parallelism inherent in software code, hardware accelerators provide an effective way to meet the design's performance goals for certain classes of applications.

Hardware acceleration can be accomplished by adding application-specific coprocessors to the system or by customizing existing processors. Custom hardware is conventionally designed at the register transfer level (RTL), which entails manually coding the RTL model in a hardware description language (HDL) such as VHDL or Verilog. This process is often tedious and error-prone, and it limits designer productivity. To overcome this problem, it is necessary to raise the level of abstraction for hardware design and use special tools to synthesize the actual circuitry. Such design approaches include high-level synthesis (HLS), application-specific instruction set processors (ASIPs) and the No-Instruction-Set Computer (NISC).

HLS and ASIP techniques are commonly used to design custom hardware, but their effectiveness and scope of use are limited. HLS usually supports only a subset of a high-level programming language and is only effective for small problems. Also, synthesis results tend to be unpredictable, so the designer can only try to meet the design constraints by trial and error. ASIPs, on the other hand, rely on a limited number of custom instructions to increase the system's performance.
The number of custom instructions is limited by the instruction decoder, and some ASIPs even limit this number to a single custom instruction. Designing special datapath extensions is also required, and it faces the same productivity vs. performance problems as designing any custom hardware.

NISC [1][2] addresses these limitations by eliminating the instruction abstraction and compiling programming language code directly to datapath control words using a special cycle-accurate compiler [3]. In this way, the approach keeps the best of both the world of general-purpose processors and the world of custom hardware design. A NISC processor's architecture can be manually modeled in an Architecture Description Language (ADL) called Generic Netlist Representation (GNR) [4], automatically generated, or selected from a library of standard or previously designed architectures. The NISC Toolset [5] generates the RTL model of the processor and the control words for the desired application, to be implemented in the desired technology, whether FPGA or ASIC. The NISC approach utilizes instruction-level parallelism (ILP) to speed up the execution of an algorithm, and the generated processor is capable of executing the equivalent of several RISC instructions in a single clock cycle. The additional prebinding mechanism enables simple I/O and interrupts [6], which extends the scope of applications for which a NISC processor can be used to the control tasks of a classical microcontroller. The communication interface for the NISC processor [7] enables the use of multiple NISC processors in the system. It allows the system's available task-level parallelism to be exploited by executing the system's tasks in parallel on different processors, which further accelerates the application.

The NISC approach was designed to be self-sufficient and it eliminates the need for other processors in the system. However, that represents a problem when attempting to integrate a NISC processor into an existing system. When a large base of non-portable legacy code optimized for a particular general-purpose processor exists, we suggest a coprocessor-based approach to take advantage of the NISC technology and accelerate the desired applications. This approach exploits both instruction-level and task-level parallelism by migrating performance-critical parts of the application to hardware without significant changes to the rest of the program code.

One of the major design challenges of the coprocessor approach is the integration of the NISC processor with other IP cores in the system. When designing SoCs, standard bus architectures are usually used, but the NISC processor itself isn't compatible with any of the specifications. Standard bus architectures simplify the process of system integration, shorten time to market and support portability and reuse of IP cores. Renowned SoC bus architectures of the day include ARM Advanced Microcontroller Bus Architecture (AMBA), IBM CoreConnect, WISHBONE and Altera Avalon. The use of standard bus architectures also provides opportunities for using special tools that automatically generate bus interfaces and interconnections, further shortening time to market and reducing the possibility of human error.
A standard bus interface for the NISC processor enables its simple integration into a wide variety of systems with different processors and peripherals. In this paper we present the WISHBONE bus interface for the NISC processor [8] and propose the use of the NISC processor as a loosely-coupled application-specific coprocessor in WISHBONE-based systems. The NISC WISHBONE Interface was designed, implemented and tested in a WISHBONE system based on Altium's TSK3000A general-purpose RISC soft processor, using an Altium LiveDesign Evaluation Board with a Spartan-3 FPGA and a Xilinx ML506 board with a Virtex-5 FPGA.

2 WISHBONE bus architecture

The WISHBONE System-on-Chip (SoC) architecture for portable IP cores [8] is a flexible design methodology developed by Silicore Corp. and targeted at SoC integration and design reuse. This is accomplished by defining a standard interconnection scheme and data exchange protocols. The WISHBONE specification defines a single, simple, logical, fully synchronous MASTER/SLAVE bus and IP core interfaces that require very few logic gates. It supports different technology-independent interconnection topologies, ranging from simple point-to-point and shared bus interconnections to data flow interconnections and complex switch fabrics. It also supports a full range of standard data transfer protocols, including SINGLE READ/WRITE cycles, BLOCK READ/WRITE cycles and read-modify-write (RMW) cycles, with various data sizes and byte orderings. A handshake mechanism enables flow control and communication between cores of different speeds.

The WISHBONE bus architecture was chosen for the implementation of the NISC coprocessor interface mainly because of its flexibility and the fact that it imposes no licensing or application restrictions. The WISHBONE specification is currently maintained by OpenCores.org and today it represents a de facto standard for open-source hardware. It also offers CAD tool support and a large library of free processors and other IP cores. Besides free processor IP, many popular commercial soft processors (e.g. Xilinx MicroBlaze and Altera Nios) have a WISHBONE-compatible variant available.

3 NISC WISHBONE Interface

In this section we introduce our approach to enabling the use of the NISC processor in a WISHBONE-based system. To make the connection of the NISC processor to the WISHBONE bus possible, we designed and implemented a NISC WISHBONE Interface IP. This IP connects the NISC processor to the WISHBONE bus, but for the NISC processor to be used as a coprocessor which performs useful tasks in the system, some changes in the NISC processor's architecture are required. We describe these architectural additions together with the software necessary to enable the communication from both the side of the NISC processor and the side of the main processor. We also propose an appropriate communication scheme, designed to ease the migration of the application's functions between hardware and software and thus ease design space exploration. The NISC WISHBONE Interface is compatible with the existing NISC design flow and requires no changes in the NISC Toolset. The NISC WISHBONE Interface IP and the software communication scheme enable the control and data transfer necessary for the use of the NISC processor as a coprocessor in WISHBONE-based systems.
The main processor can transfer the data to be processed (the function's arguments) to the coprocessor, start the execution of the coprocessor function, detect when the data processing is completed (either by polling the coprocessor's status or by means of an interrupt) and retrieve the results. The designed NISC WISHBONE Interface is targeted at fully synchronous WISHBONE systems and supports word-size (32-bit), word-granularity SINGLE READ/WRITE WISHBONE classic bus cycles.

We describe two different methods for exchanging data between the general-purpose processor and the NISC coprocessor. One method is suitable for simple general-purpose data transfers and the other for custom special-purpose data transfers tailored to a specific application. To enable this kind of flexibility, the NISC WISHBONE Interface was designed in two distinct parts. The first part is the Basic WISHBONE Interface, which is responsible for control functions and special-purpose data transfers. The other part is the Data Memory WISHBONE Interface, which is responsible for general-purpose data transfers, i.e. transferring large amounts of data directly to and from the data memory of the NISC processor.

3.1 Basic WISHBONE Interface

The Basic WISHBONE Interface enables coprocessor control, i.e. starting or stopping data processing and detecting when the data processing is completed, and the transfer of smaller amounts of special-purpose data directly into the NISC processor's datapath. This is the main part of the NISC WISHBONE Interface and it is necessary for the communication, since it provides coprocessor control and interrupt capabilities. It is designed for seamless integration with the RTL model generated by the NISC Toolset. The advantage of using the Basic WISHBONE Interface for data transfers is fast communication (one-cycle transfers) and the ability to bring the data directly to the input of the desired functional unit, which speeds up the execution of the algorithm. The disadvantage of this part of the interface is that it is only suitable for transferring small amounts of data, since this approach doesn't scale well beyond a few arguments and results.

3.1.1 NISC side communication requirements

The accessible external signals of the NISC processor's RTL model are the clock signal, the reset signal, the halt signal and the configurable I/O ports. The Basic WISHBONE Interface uses only these external signals to connect the NISC processor to the WISHBONE bus. Data transfer from the main processor to the NISC processor is achieved using input ports, which can be read by the NISC application using the prebinding mechanism. Starting the NISC application is accomplished by resetting the NISC processor, i.e. by generating a falling edge on the reset pin. Completion is detected by reading the halt signal, which the NISC processor sets automatically upon completing its task. The results are retrieved through output ports, also using the prebinding mechanism. Upon completion, the coprocessor writes the results to its output port registers, to be read by the main processor through the WISHBONE interface.
The output ports have an internal register inside the NISC processor's datapath, but because the input ports function only as proxies for external registers, it was necessary to provide additional registers as a part of the WISHBONE interface. These input and output registers represent the shared resources that enable the transfer of data between the main processor and the NISC coprocessor. The designer's task is to add the external I/O ports to the desired datapath units as a part of the process of designing the NISC architecture using the GNR ADL. A simple NISC architecture with two input ports for the arguments and one output port for the result is shown in Figure 1.

This method of communication is convenient when the application requires intensive calculations with some of the arguments or when only simple argument exchange is required. In this case the arguments can be brought directly to the input of the desired functional unit and selected when needed, using input multiplexers. This eliminates the sequential transfer of the arguments to the register file for later use and thus eliminates unnecessary data transfers. One sequential transfer is already done when the main processor transfers the data to the interface's registers, so another transfer from the interface's registers to the NISC processor's register file or internal registers would double the communication expenses and thus limit the efficiency of the coprocessor. This approach also avoids register spilling for often-used values, which is especially important in area-limited systems with small register files. Output registers for the results are used in a similar manner and can also be connected to the datapath in such a way as to minimize communication expenses.

[Figure 1: Simple NISC architecture with argument inputs and a result output. Labeled blocks: RF, ALU, MUX (two), ExternalInput1, ExternalInput2, ExternalOutput, ExternalOutputReg.]

Datapath connections for external inputs and external outputs enable the NISC hardware to access the shared registers for communication, while the Basic WISHBONE Interface enables the same access for the main processor. To enable the NISC application to actually read or write these shared resources, appropriate prebound functions for reading or writing the shared registers are used. Listing 1 illustrates this concept for a simple application with two arguments and one result. First, the arguments are read from the external input ports into temporary variables and then they are used in a function to compute the result. The result is then written to the external output register, to be read by the main processor when the NISC application terminates and the halt signal is asserted.

    void NiscMain(){
        //...
        a = ExternalInput1_read();    //get the first argument
        b = ExternalInput2_read();    //get the second argument
        //...
        c = do_stuff(a,b);            //calculate the result
        //...
        ExternalOutputReg_write(c);   //return the result
    }

Listing 1: Sample NISC application

3.1.2 Basic WISHBONE Interface implementation

To enable the transfer of data and control flow, the Basic WISHBONE Interface provides registers for input arguments and special control registers. Control registers are used to set the state of the NISC processor's reset signal and to read the state of the halt signal. The result output ports are simply routed to the interface's output multiplexer to enable the main processor to read the results. A simplified block schematic of the interface is shown in Figure 2. The figure shows the internal registers of the interface and the simplified circuitry to illustrate the path of data to and from the NISC processor's top module, NiscSystem.

[Figure 2: Basic WISHBONE Interface. WISHBONE signals: CLK_I, RST_I, CYC_I, STB_I, WE_I, ACK_O, ADDR_I, SEL_I, DAT_I, DAT_O, INT_O. Interface registers: RESET REG, INPUT1 REG, INPUT2 REG, INT_EN REG. NiscSystem ports: clk, reset, halt, ExternalInput1, ExternalInput2, ExternalOutput.]

In a typical WISHBONE system, the most significant bits of the address are decoded in the interconnect module (the so-called WISHBONE Intercon), which enables the desired peripheral using the active cycle (CYC_I) and strobe (STB_I) signals, and only the required number of address bits is forwarded to the peripheral's ADDR_I inputs. The internal address decoding of the peripheral then determines which of its internal memory locations is addressed, and this is also the way the Basic WISHBONE Interface operates. Reading data from the interface is implemented using a multiplexer controlled by the WISHBONE address lines. The write operation is implemented using register enable signals controlled by the address decode logic and the write enable qualifier signal (WE_I). The WISHBONE clock signal CLK_I is routed to both the NISC processor and the interface to create a fully synchronous system. The WISHBONE handshake mechanism is implemented using only combinatorial logic, as there are no slow modules in the design, i.e. none of the modules requires more than one cycle for read or write operations, so no memory elements for delay are required. The WISHBONE reset signal RST_I resets the NISC processor and the interface's internal registers.
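The register decoding described above can be illustrated with a small behavioral model in C. This is only a sketch and not the actual hardware description: the type `basic_if_t` and the functions `basic_if_ack`, `basic_if_write` and `basic_if_read` are hypothetical names introduced for illustration, the byte offsets follow Table 1 for the two-argument case, the acknowledge is assumed to be a plain AND of CYC_I and STB_I, and the result register is treated as read-only for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the interface state (illustrative names). */
typedef struct {
    bool     reset;     /* RESET REG, drives the NISC reset pin      */
    bool     int_en;    /* INT_EN REG, interrupt enable flag         */
    uint32_t input[2];  /* INPUT1 REG and INPUT2 REG (arguments)     */
    bool     halt;      /* halt signal, driven by the NISC processor */
    uint32_t result;    /* ExternalOutput, driven by the NISC        */
} basic_if_t;

/* Handshake: assumed here to be a purely combinational CYC_I AND STB_I. */
bool basic_if_ack(bool cyc_i, bool stb_i)
{
    return cyc_i && stb_i;
}

/* Write cycle: address decode selects which register enable is active. */
void basic_if_write(basic_if_t *s, uint32_t addr, uint32_t dat_i)
{
    switch (addr) {
    case 0:  s->reset    = (dat_i & 1u) != 0; break;
    case 4:  s->int_en   = (dat_i & 1u) != 0; break;
    case 12: s->input[0] = dat_i;             break;
    case 16: s->input[1] = dat_i;             break;
    default: break; /* writes to other addresses are ignored in this model */
    }
}

/* Read cycle: output multiplexer controlled by the address lines. */
uint32_t basic_if_read(const basic_if_t *s, uint32_t addr)
{
    switch (addr) {
    case 0:  return s->halt ? 1u : 0u;   /* read side of location 0 */
    case 4:  return s->int_en ? 1u : 0u;
    case 8:  return s->result;
    case 12: return s->input[0];
    case 16: return s->input[1];
    default: return 0u;
    }
}
```

In this model, location 0 behaves as two different things depending on the direction of the access: a write changes the reset register, while a read returns the halt status.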

The interrupt mechanism is implemented in such a way that the interrupt signal INT_O is set only when both the interrupt enable (INT_EN) flag and the halt signal are set. INT_O is not a standard WISHBONE signal, but it can easily be connected directly to the main processor, and several commercial tools (e.g. Altium Designer) even provide interrupt connections as a part of the WISHBONE interconnect module.

3.1.3 Basic WISHBONE Interface programming model

Using the NISC coprocessor attached to the WISHBONE bus in this way is very straightforward. First, the value 1 is written to the reset register to put the NISC processor in the reset state and to ensure that only the main processor has access to the shared registers. Then, the arguments are written to the appropriate registers and the value 0 is written to the reset register to start the NISC application. After that, the main processor can poll the value of the halt signal until it becomes active and then retrieve the results by reading the appropriate registers. The reset register and the halt signal are mapped to the same memory location (location 0); the former is accessed when writing and the latter when reading that memory address.

The interface's full memory map is shown in Table 1. The first memory location, at address 0 (plus the base address displacement), is reserved for the already mentioned reset/halt pair, the control register. Because 32-bit words are used with byte addressing, the address difference between two adjacent memory locations is 4 bytes. So, the next location's address is 4, and this is the location of the interrupt enable register. After that follow the result and argument registers. The example register address layout shown in Table 1 enables simple implementations of traditional C functions with an arbitrary number of arguments and one return value. Of course, for different applications with additional result values and matching result registers, different address layouts are possible with a differently laid-out interface datapath and NISC I/O ports.
Table 1: Basic WISHBONE Interface memory map

Address   Name                         Width [bit]   Description
00        RESET / HALT / !(INT_FLAG)   1             Bit 0. Read: returns the status of the halt signal. Write: writes the reset register; '1' puts the NISC in the reset state, '0' starts the execution ('0' also serves as the interrupt acknowledge).
04        INT_EN                       1             Bit 0. Reads/writes the interrupt enable flag; '1' enables the interrupts, '0' disables them.
08        RESULT                       32            Read/Write result
12        ARG 1                        32            Read/Write 1st argument
16        ARG 2                        32            Read/Write 2nd argument
(N+2)*4   ARG N                        32            Read/Write Nth argument
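The address arithmetic of Table 1 can be written down as a few constants and one helper. The sketch below is illustrative only; the names `NISC_OFF_CTRL`, `NISC_OFF_INT_EN`, `NISC_OFF_RESULT` and `nisc_arg_offset` are hypothetical and are not part of the presented interface software. All values are byte offsets from the interface's base address.

```c
#include <stdint.h>

/* Byte offsets from the interface base address, following Table 1. */
enum {
    NISC_OFF_CTRL   = 0,  /* write: reset register, read: halt status */
    NISC_OFF_INT_EN = 4,  /* interrupt enable flag                     */
    NISC_OFF_RESULT = 8   /* result register                           */
};

/* Offset of the n-th argument register, counting from n = 1: (n + 2) * 4. */
static inline uint32_t nisc_arg_offset(uint32_t n)
{
    return (n + 2u) * 4u;
}
```

For the first two arguments the helper yields 12 and 16, which matches the ARG 1 and ARG 2 rows of the table.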

Listing 2 shows an example C function offloaded to the NISC coprocessor to illustrate the designed communication scheme. The function is declared inline to eliminate the need for copying the arguments to local variables. This example function is blocking, i.e. it doesn't return until the coprocessor's task is completed, so the coprocessor's task behaves as a regular C function. Because of that, the programmer can call it as though it were a software, rather than a hardware, implementation of the algorithm. With this communication scheme, only the body of the function needs to be replaced when migrating the function's implementation between hardware and software; the calling functions remain unchanged. In this way, it is very simple to evaluate design options experimentally and thus explore the design space.

To exploit task-level, data and function parallelism, more complex non-blocking versions of coprocessor function calls should be used. In this way, the main processor can send the data to the coprocessor, start the data processing and carry on with its own tasks. These tasks could include control functions, processing a part of the data (data parallelism), or performing a part of the processing on all the data (function parallelism).
    #include <stdint.h>

    // declarations (Base_NISC is the interface's base address in the main processor's memory map)
    volatile uint32_t * nisc_ctrl   = (volatile uint32_t *) Base_NISC;
    volatile uint32_t * nisc_ie     = (volatile uint32_t *) (Base_NISC + 4);
    volatile uint32_t * nisc_result = (volatile uint32_t *) (Base_NISC + 8);
    volatile uint32_t * nisc_arg1   = (volatile uint32_t *) (Base_NISC + 12);
    volatile uint32_t * nisc_arg2   = (volatile uint32_t *) (Base_NISC + 16);

    //hide the pointers behind #define
    #define NISC_CTRL   ( *nisc_ctrl )
    #define NISC_HALT   ( *nisc_ctrl )    //alias for NISC_CTRL
    #define NISC_IE     ( *nisc_ie )
    #define NISC_RESULT ( *nisc_result )
    #define NISC_ARG1   ( *nisc_arg1 )
    #define NISC_ARG2   ( *nisc_arg2 )

    //constants
    #define NISC_STOP 1
    #define NISC_GO   0

    // NISC function (blocking)
    static inline uint32_t nisc_function(uint32_t arg1, uint32_t arg2){
        NISC_CTRL = NISC_STOP;    //stop the NISC
        NISC_ARG1 = arg1;         //write arg1
        NISC_ARG2 = arg2;         //write arg2
        NISC_CTRL = NISC_GO;      //start the NISC
        while(!NISC_HALT);        //wait until NISC completes
        return NISC_RESULT;       //return the result
    }

Listing 2: Communication with the NISC processor
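A non-blocking variant of the call in Listing 2 can be obtained by splitting it into three steps. The following is a minimal sketch under stated assumptions: the function names `nisc_start`, `nisc_done` and `nisc_get_result` are hypothetical, and the register block is passed in as a pointer so that the same code can be exercised against an ordinary array on a host machine. On the target, that pointer would be derived from the interface's base address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Word indices into the register block, following Table 1 (byte offset / 4). */
enum { REG_CTRL = 0, REG_INT_EN = 1, REG_RESULT = 2, REG_ARG1 = 3, REG_ARG2 = 4 };

/* Send the arguments and start the coprocessor, then return immediately. */
static inline void nisc_start(volatile uint32_t *regs, uint32_t arg1, uint32_t arg2)
{
    regs[REG_CTRL] = 1;      /* hold the NISC in reset while arguments change */
    regs[REG_ARG1] = arg1;
    regs[REG_ARG2] = arg2;
    regs[REG_CTRL] = 0;      /* release reset: execution starts */
}

/* Reading the control location returns the halt status in bit 0. */
static inline bool nisc_done(volatile uint32_t *regs)
{
    return (regs[REG_CTRL] & 1u) != 0;
}

/* Meaningful only after nisc_done() has returned true. */
static inline uint32_t nisc_get_result(volatile uint32_t *regs)
{
    return regs[REG_RESULT];
}
```

Between the call to `nisc_start` and the moment `nisc_done` returns true, the main processor is free to run its own part of the task, which is where the task-level parallelism comes from.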

Interrupt-driven communication with the NISC coprocessor requires enabling the interrupts using the interrupt enable flag. When the arguments have been sent to the coprocessor registers and the execution has started, the main processor can carry on with its regular tasks and retrieve the results in the interrupt service routine (ISR). As a part of the ISR, the main processor must also acknowledge the interrupt by clearing the NISC WISHBONE Interface's interrupt flag, which is achieved by writing the value 0 to the control register. Other standard interrupt acknowledge procedures must be followed depending on the processor's interrupt system architecture and whether the interrupts are level or edge triggered. The next batch of data to be processed can be sent in the ISR and the next execution can also be initiated there.

3.2 Data Memory WISHBONE Interface

The Basic WISHBONE Interface was designed for NISC applications which need to exchange a small amount of special-purpose data with the main processor. The problem is that this approach doesn't scale well for applications that have a large number of arguments and/or results. Such applications would require a large number of registers and large multiplexers, which would take up a lot of chip area and would also limit the maximum frequency, especially considering the negative implications of using large multiplexers in FPGAs. A further drawback of the Basic WISHBONE Interface is the inability to perform indexed addressing on the arguments and results without transferring data between the shared registers and the NISC data memory. E.g. to perform indexed addressing on the arguments, the NISC processor must copy the data from the interface's argument registers to internal arrays in the data memory, so every piece of data has to be copied twice, which imposes a large communication overhead when transferring large amounts of data.
This is especially an issue with algorithms that involve matrix calculations and the like.

3.2.1 Data Memory Access Multiplexer/Arbiter

A simple solution for the problem of transferring large amounts of data to and from the NISC processor is the direct transfer of data to and from the NISC processor's data memory. To achieve this, a special module that enables external access to the NISC data memory was added to the NISC design. The role of this data memory access multiplexer is to provide access to the NISC processor's data memory from the WISHBONE bus and to arbitrate access to the memory between the NISC processor and the main processor. The data memory multiplexer is connected between the NISC processor's core (controller and datapath) and the data memory, and it provides an additional set of pins that are connected outside of the NISC processor itself, as shown in Figure 3.

[Figure 3: Multiplexing the access to the NISC data memory. Labeled blocks: App, Controller, cmem, Datapath, dmem mux, dmem. Ports: clk, reset, halt, ExternalInputs, ExternalOutputs, dmem external access.]

The data memory multiplexer's arbitration scheme is quite simple: if the NISC processor's reset or halt signal is active, the main processor has control over the data memory. Otherwise, control is left to the NISC processor; the main processor then has no means of writing to the data memory and reads all data bits as zeroes. The implementation of the data memory multiplexer is shown in Figure 4.

[Figure 4: Data memory multiplexer implementation. Labeled elements: dmem mux, RESET, HALT, OR, MUX (three), GND, EXTERNAL, Memory. Signals: DATA_IN, DATA_OUT, ADDR+CTRL.]
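The arbitration rule stated above is small enough to be captured in a few lines of C. This is a behavioral sketch only, with hypothetical function names (`dmem_external_has_access`, `dmem_external_read`, `dmem_external_write_enable`); it models the decision logic, not the actual multiplexer circuitry.

```c
#include <stdbool.h>
#include <stdint.h>

/* The main processor owns the data memory while reset or halt is active. */
static inline bool dmem_external_has_access(bool reset, bool halt)
{
    return reset || halt;
}

/* External read: memory data when access is granted, all zeroes otherwise. */
static inline uint32_t dmem_external_read(bool reset, bool halt, uint32_t mem_data)
{
    return dmem_external_has_access(reset, halt) ? mem_data : 0u;
}

/* External writes reach the memory only when access is granted. */
static inline bool dmem_external_write_enable(bool reset, bool halt, bool we)
{
    return dmem_external_has_access(reset, halt) && we;
}
```

One consequence visible in the model is that a main processor reading the data memory while the coprocessor is still running sees zeroes rather than stale or partial data, so completion should always be checked first.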

3.2.2 Data Memory WISHBONE Interface implementation

The Data Memory Access Multiplexer/Arbiter provides the means for external access to the data memory of the NISC processor, and the Data Memory WISHBONE Interface is connected to these external access pins. Its role is to handle the appropriate data conversion and handshaking to enable correct data memory access operations from the WISHBONE bus. The NISC data memory module encompasses FPGA block RAMs (BRAMs) and a memory controller module (namely, the ByteAddressableDMemLogic Verilog module). The data memory module's ports are actually the memory controller's ports, so the Data Memory WISHBONE Interface is designed to operate with the memory controller and the internal BRAMs.

Figure 5 shows the internal organization of the Data Memory WISHBONE Interface. The address bits are forwarded directly from the WISHBONE address lines to the data memory's address lines. Since the NISC memory controller doesn't require word-access addresses to be aligned on 4-byte boundaries and requires data narrower than 32 bits to be aligned to the lower data bits (unlike the WISHBONE mechanism with byte enables), special alignment logic was designed to handle the translation. Also, depending on the access type, the WISHBONE byte enable lines are translated to the appropriate type codes for the memory controller. Write and read enable signals are derived from the WISHBONE write enable qualifier WE_I. The handshaking mechanism is implemented with a one-cycle acknowledge delay to allow for the BRAM latency, since FPGA BRAMs are synchronous.

[Figure 5: Data Memory WISHBONE Interface. WISHBONE signals: CLK_I, RST_I, CYC_I, STB_I, ACK_O, WE_I, ADDR_I, SEL_I, DAT_O, DAT_I. Internal blocks: ACK DELAY REG, TYPE DECODE, ALIGN (two). dmem external access signals: TYPE, DATA IN, READ EN, WRITE EN, ADDR, DATA OUT.]

3.3 The complete NISC WISHBONE Interface and its programming model

The complete NISC WISHBONE Interface, as shown in Figure 6, consists of the Basic WISHBONE Interface and the Data Memory WISHBONE Interface.
This approach with two distinct interfaces eliminates the need for address translation inside the interface to differentiate between the shared register space and the data memory space, and thus simplifies the design. The simplest way to use the complete interface is to first put the NISC processor in the reset state using the Basic WISHBONE Interface and then send the data through the Data Memory WISHBONE Interface. After that, the NISC processor is taken out of the reset state and the main processor waits for the NISC application to complete and retrieves the results from the NISC data memory. Special-purpose data can also be sent or retrieved using the Basic WISHBONE Interface. It is also important to point out that the lack of a hardware or software reset of the NISC data memory is vital for this sort of communication, because the sent arguments would otherwise be lost when the NISC processor is reset.

To enable the main processor to read or write a particular piece of data in the NISC data memory, its address must be known; it is used as an offset from the base address of the Data Memory WISHBONE Interface in the main processor's memory map. The NISC Toolset's Storage Bindings outputs are used to determine the address of a particular global variable or global array in the NISC data memory. Storage Bindings provide this information in the form of a table, from which the required addresses can be looked up and used in the program code. The whole data memory of the NISC processor is mapped to a contiguous external memory block in the main processor's address space and is thus easily accessible through pointer manipulation. After the main processor writes the arguments to the global data structures in the NISC processor's data memory, the NISC processor operates on this data. The NISC processor then writes the results to designated global data structures, which are afterwards read by the main processor. This approach requires no changes in the NISC algorithm, provided that it uses global variables or global arrays. Global structures can also be passed to the functions as arguments, which eliminates the need for any changes in the code of any of the functions.
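The pointer manipulation mentioned above amounts to adding a variable's data memory address to the base address of the mapped block. The helper below is a hedged sketch: `nisc_dmem_word` is a hypothetical name, the offset is assumed to be a multiple of 4 so that the resulting pointer is word-aligned, and any concrete offset value would have to be taken from the Storage Bindings table rather than from this example.

```c
#include <stdint.h>

/* Pointer to a 32-bit variable located byte_offset bytes into the NISC data
 * memory, as seen from the main processor. byte_offset is assumed to be a
 * multiple of 4 (word-aligned). */
static inline volatile uint32_t *nisc_dmem_word(volatile void *dmem_base,
                                                uint32_t byte_offset)
{
    return (volatile uint32_t *)((volatile uint8_t *)dmem_base + byte_offset);
}
```

A global array in the data memory can then be indexed through the returned pointer like any ordinary C array, which is the property the Basic WISHBONE Interface registers lack.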
[Figure 6: NISC WISHBONE Interface architecture. Labeled blocks: Basic WISHBONE Interface, Data Memory WISHBONE Interface, App, Controller, cmem, Datapath, dmem mux, dmem. Signals: clk, reset, halt, ExternalInput1, ExternalInput2, ExternalOutput, dmem external access.]

Listing 3 shows an example implementation of the communication using the Data Memory WISHBONE Interface. After asserting the NISC reset signal to ensure access to the data memory, the arguments are copied to the NISC data memory using pointers and the data processing is started by deasserting the reset signal. After completion is detected, by polling in this example, the results can be retrieved for further processing. If there is no need to immediately start processing the next batch of data, the data can be used directly from the NISC data memory, which eliminates the need for copying. If, however, we want to utilize task-level parallelism and process another batch of data on the NISC in parallel with the task of the general-purpose processor, copying becomes necessary.

    volatile uint32_t *nisc_dmem_arg = (volatile uint32_t*) (nisc_dmem_base + arg_offset);
    volatile uint32_t *nisc_dmem_res = (volatile uint32_t*) (nisc_dmem_base + res_offset);

    // NISC function (blocking)
    static inline void nisc_function(uint32_t *arguments, uint32_t *results){
        NISC_CTRL = NISC_STOP;               // stop NISC
        //send the arguments
        for(int i = 0; i < N; i++){
            nisc_dmem_arg[i] = arguments[i];
        }
        NISC_CTRL = NISC_GO;                 // start NISC
        while(!NISC_HALT);                   //wait until NISC completes
        //return the results
        for(int i = 0; i < N; i++){
            results[i] = nisc_dmem_res[i];
        }
    }

Listing 3: Communication using the Data Memory WISHBONE Interface

3.4 TSK3000A WISHBONE System with a NISC coprocessor

The complete NISC WISHBONE Interface was implemented in VHDL and integrated with a generated NISC processor into a self-sufficient WISHBONE-compatible module which can be used as a coprocessor in systems based on the WISHBONE bus architecture. The generated NISC processor was based on a generic architecture with the necessary I/O datapath extensions and simple applications designed to test the communication capabilities. As a proof of concept for the presented coprocessor-based approach, we implemented and tested a WISHBONE system based on Altium's TSK3000A general-purpose 32-bit RISC soft processor.
TSK3000A is based on the MIPS/DLX instruction set and is FPGA-vendor independent (unlike the FPGA vendors' proprietary processors, such as Xilinx MicroBlaze and Altera Nios). It is designed for seamless integration with Altium's configurable Wishbone Interconnect module, which was also used in this proof-of-concept system. The main processor-side communication routines were developed as presented earlier in this chapter and then tested together with the hardware on the Altium LiveDesign Evaluation Board with a Spartan-3 FPGA and on the Xilinx ML506 board with a Virtex-5 FPGA. The schematic for the complete system is shown in Figure 7.
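The blocking routine in Listing 3 keeps the main processor waiting while NISC runs. The sketch below shows one way the same steps could be split into separate start, poll and collect calls, so that the main processor can do other work in the meantime, as the task-level parallelism case requires. It is not taken from the authors' code. The control register, halt flag and data memory are modeled by ordinary variables so that the sketch runs on a host, and the names `nisc_write_ctrl`, `nisc_start`, `nisc_done`, `nisc_collect` and `sim_nisc_finish`, the control encodings and the `x + 1` job are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-ins for the memory-mapped NISC resources. */
#define NISC_STOP  1u    /* assumed encoding: hold NISC in reset */
#define NISC_GO    0u    /* assumed encoding: release NISC reset */
#define DMEM_WORDS 16u   /* assumed size of each data memory window */

static volatile uint32_t nisc_ctrl = NISC_STOP;   /* control register */
static volatile uint32_t nisc_halt = 0;           /* completion flag  */
static volatile uint32_t nisc_dmem_arg[DMEM_WORDS];
static volatile uint32_t nisc_dmem_res[DMEM_WORDS];

/* Write the control register. In this simulation, reset also clears halt. */
static void nisc_write_ctrl(uint32_t value)
{
    nisc_ctrl = value;
    if (value == NISC_STOP) {
        nisc_halt = 0;
    }
}

/* Stop the coprocessor, copy the arguments and start it; returns at once. */
static void nisc_start(const uint32_t *arguments, size_t n)
{
    assert(n <= DMEM_WORDS);
    nisc_write_ctrl(NISC_STOP);
    for (size_t i = 0; i < n; i++) {
        nisc_dmem_arg[i] = arguments[i];
    }
    nisc_write_ctrl(NISC_GO);
}

/* Non-blocking completion check; the caller polls between other work. */
static bool nisc_done(void)
{
    return nisc_halt != 0;
}

/* Copy the results out so the data memory is free for the next batch. */
static void nisc_collect(uint32_t *results, size_t n)
{
    assert(n <= DMEM_WORDS);
    for (size_t i = 0; i < n; i++) {
        results[i] = nisc_dmem_res[i];
    }
}

/* Simulation only: models the coprocessor finishing a job (here x + 1). */
static void sim_nisc_finish(size_t n)
{
    if (nisc_ctrl != NISC_GO) {
        return;   /* still held in reset, nothing runs */
    }
    for (size_t i = 0; i < n && i < DMEM_WORDS; i++) {
        nisc_dmem_res[i] = nisc_dmem_arg[i] + 1u;
    }
    nisc_halt = 1;
}
```

On real hardware the variables would be replaced by the memory-mapped addresses of the interface and `sim_nisc_finish` would disappear. The caller would invoke `nisc_start`, continue with its own task, and call `nisc_collect` once `nisc_done` returns true.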

Figure 7: TSK3000A with a NISC coprocessor

4 Performance analysis

The designed WISHBONE Interface provides the means for using the NISC processor as a loosely coupled coprocessor in WISHBONE-based systems. To justify the use of the coprocessor in the system, the speedup of the algorithm must justify the cost of the extra hardware. Furthermore, the speedup of the NISC implementation of the algorithm in question must be large enough to cancel the negative effect on performance imposed by the communication overhead between the main processor and the coprocessor and still provide an overall system speedup. Of course, what is actually justified also depends on the achieved speedup and the design constraints. For example, in area-constrained systems a small speedup would not justify the use of a coprocessor, but it could be justified in systems where area is not an issue and every bit of performance improvement counts.

To ease the design space exploration and provide the means of estimating the system's performance and the margins of the speedup required from the NISC processor in the early stages of the project, an analytical model of the system's performance was devised. This model is intended to be used together with profiling information. It enables the designer to estimate the benefits of using a NISC coprocessor for a particular part of the system's task and to quickly explore the design space before the actual system is built. It also predicts the minimum performance limit for the NISC processor, which can be used to quickly eliminate unsuitable architectures and thus drive the hardware-software co-optimization process in the right direction with respect to the system-level requirements.

The WISHBONE Interface was designed to extract maximum performance in systems with a general-purpose processor that uses WISHBONE classic bus cycles, like the Altium TSK3000A. This means that an access to the interface's register takes one clock cycle and an access to the data memory takes two clock cycles (because of BRAM latency). WISHBONE classic bus cycles do not allow distinct address and data transfer phases or any sort of bus transaction pipelining to improve performance. Because of that, the data transfer cannot be accelerated any further under the WISHBONE classic specification, not even with additional circuitry.

The number of clock cycles required for the communication is defined by the number of cycles required for the transfer of arguments, the number of cycles required for the transfer of results and the number of cycles required for the control operations. The control overhead consists of only two cycles: one for stopping the NISC processor (i.e. putting it into the reset state) and one for starting it (i.e. taking it out of the reset state). To estimate the performance impact of the communication overhead, we use the worst-case scenario for the transfer of the arguments and results, i.e. the slower mode of data transfer, which is the transfer of data to and from the data memory of the NISC processor.

The communication time is defined by the cycle count for the communication and the frequency of the communication clock, i.e. the bus frequency, which is the same as the system frequency in fully synchronous systems. Fully synchronous systems offer the advantage of an easier design process and simpler verification. The drawback of using a single clock in the system is the negative impact on performance. The only way to extract the maximum performance is to use asynchronous clock domains for different parts of the system, each running at its maximum frequency, and to use synchronization logic and FIFOs for crossing the clock domains, which often leads to design problems and difficult-to-track timing issues. The fully synchronous approach eliminates timing problems inside the system, and it was our intent to analyze and determine the boundaries of this design approach when using NISC coprocessors.
The major speedup should be achieved by means of parallelism and architecture optimizations with cores sharing the same clock, and asynchronous design should only be considered when this approach fails to meet the design constraints.

Since the designed WISHBONE Interface is 32 bits wide, the model presumes 32-bit arguments and results, but the same reasoning is, of course, valid for arguments narrower than 32 bits. Transfers of data wider than 32 bits can be split into 32-bit transfers with some software support and can thus also be modeled. It is also important for the model to include additional software overheads for the communication, e.g. loop overhead. There are different approaches to accelerating this sort of communication, using both software and hardware. These include loop unrolling, custom loop instructions with zero control overhead (depending on the processor architecture and configurability) and additional DMA engines to drive the data transfer. To model these additional parameters, a scaling factor for the number of transfers was added, together with a constant that includes any additional control overheads (such as DMA setup), to better estimate the system's performance. The basic expression for estimating the communication time is given by equation (1).

$$T_{comm} = \frac{N_{comm}}{f_{BUS}} = \frac{(ArgCount + ResCount) \cdot (2 + Overhead) + CtrlOverhead}{f_{BUS}} \tag{1}$$

To analyze the benefits of the NISC implementation of a particular function, we must compare the execution time of the algorithm on a general-purpose processor with the execution time when using the NISC processor. The execution time for the general-purpose processor is defined by the processor architecture and clock frequency. The processor architecture defines the number of instructions for the algorithm and the number of cycles required to execute an instruction (cycles per instruction, CPI), and these define the total number of clock cycles for the algorithm, $N_{GPP}$. Equation (2) shows the expression for the execution time.

$$T_{GPP} = \frac{N_{GPP}}{f_{GPP}} \tag{2}$$

Similarly, the execution time for the NISC processor is defined by the number of cycles required for the particular application, $N_{NISC}$, and the NISC processor's clock frequency, as equation (3) shows. The number of cycles is determined by the application itself and the NISC processor's architecture.

$$T_{NISC} = \frac{N_{NISC}}{f_{NISC}} \tag{3}$$

The total execution time when using the NISC processor as a coprocessor is defined by the sum of the execution time for the application itself and the time required for communication:

$$T_{COP} = T_{NISC} + T_{comm} \tag{4}$$

All these expressions depend on the frequency of a particular module. Since the WISHBONE Interface is targeted at fully synchronous systems, the frequencies of all modules must be the same. Therefore, the maximum frequency of the system is determined by the minimum of the cores' maximum frequencies, as shown in equation (5). This expression can easily be extended to systems with multiple coprocessors.

$$f_{SYS,max} = \min\{f_{GPP,max},\ f_{NISC,max},\ f_{BUS,max}\} \tag{5}$$

The execution time for the general-purpose processor can be expressed using (2) and (5), as follows:

$$T_{GPP} = \frac{N_{GPP}}{f_{SYS,max}} = \frac{N_{GPP}}{\min\{f_{GPP,max},\ f_{NISC,max},\ f_{BUS,max}\}} \tag{6}$$

Similarly, the execution time for the NISC coprocessor can be expressed using (1), (3), (4) and (5), as follows:

$$\begin{aligned} T_{COP} &= T_{NISC} + T_{comm} \\ &= \frac{N_{NISC} + (ArgCount + ResCount) \cdot (2 + Overhead) + CtrlOverhead}{f_{SYS,max}} \\ &= \frac{N_{NISC} + (ArgCount + ResCount) \cdot (2 + Overhead) + CtrlOverhead}{\min\{f_{GPP,max},\ f_{NISC,max},\ f_{BUS,max}\}} \end{aligned} \tag{7}$$

To determine the borderline of the NISC coprocessor's effectiveness, we again turn to the worst-case scenario, which in this case is blocking communication. With blocking communication, the algorithm is not executed in parallel on the main processor and the coprocessor, so this represents the worst case for the processor-coprocessor communication. For the hardware implementation of a function to be efficient, the obvious and necessary condition is that the function's execution time on the coprocessor is shorter than its execution time on a general-purpose processor:

$$T_{COP} < T_{GPP} \tag{8}$$

If the NISC coprocessor's implementation doesn't limit the system's maximum frequency (i.e. $f_{NISC,max} > f_{GPP,max}$ holds true), then speeding up a part of an algorithm (taking into account communication and control overheads) certainly speeds up the execution of the whole program. From (6), (7) and (8) it follows that:

$$N_{NISC} < N_{GPP} - (ArgCount + ResCount) \cdot (2 + Overhead) - CtrlOverhead \tag{9}$$

Equation (9) defines the maximum number of clock cycles for the NISC coprocessor such that its implementation does not reduce the effectiveness of the whole system. Using this expression, it is possible to eliminate unacceptable designs in the early stages of the project.

For the coprocessor implementation to be truly effective, it must provide a speedup large enough to justify the additional hardware expense. To quantify this, we must analyze the overall speedup factor. To do this, the program must be divided into two parts. One part will always be executed on the main processor. The other part can migrate between hardware and software implementations, and it is for this part that we evaluate the effectiveness of the coprocessor implementation.
$$N_{PROGRAM} = N_{SW} + N_{HW/SW} \tag{10}$$
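Equations (1), (5), (7) and (9) involve only integer cycle counts and a minimum over frequencies, so they are easy to evaluate from profiling data. The following C sketch is one possible encoding of them, not the authors' tool. The type `comm_params` and the function names are hypothetical, and `Overhead` is treated as a whole number of cycles per transfer, which is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Parameters of the communication model, equation (1). */
typedef struct {
    uint64_t arg_count;      /* ArgCount: number of 32-bit argument transfers */
    uint64_t res_count;      /* ResCount: number of 32-bit result transfers   */
    uint64_t overhead;       /* Overhead: extra cycles per transfer (assumed integer) */
    uint64_t ctrl_overhead;  /* CtrlOverhead: constant control cycles         */
} comm_params;

/* N_comm: the numerator of equation (1), in bus clock cycles. */
uint64_t comm_cycles(const comm_params *p)
{
    return (p->arg_count + p->res_count) * (2u + p->overhead) + p->ctrl_overhead;
}

/* Equation (5): a fully synchronous system runs at the slowest core's limit. */
double sys_fmax(double f_gpp_max, double f_nisc_max, double f_bus_max)
{
    double f = f_gpp_max;
    if (f_nisc_max < f) f = f_nisc_max;
    if (f_bus_max < f) f = f_bus_max;
    return f;
}

/* Equation (7): T_COP in seconds for n_nisc coprocessor cycles at f_sys Hz. */
double t_cop(uint64_t n_nisc, const comm_params *p, double f_sys)
{
    assert(f_sys > 0.0);
    return (double)(n_nisc + comm_cycles(p)) / f_sys;
}

/* Equation (9): the coprocessor pays off only if N_NISC is strictly below
 * this bound. Returns 0 when communication alone already costs at least as
 * many cycles as the software version. */
uint64_t nisc_cycle_bound(uint64_t n_gpp, const comm_params *p)
{
    uint64_t n_comm = comm_cycles(p);
    return n_gpp > n_comm ? n_gpp - n_comm : 0;
}
```

As an illustration with made-up numbers: 8 arguments, 8 results, 3 overhead cycles per transfer and 2 control cycles give (8 + 8)(2 + 3) + 2 = 82 communication cycles, so a function that needs 1000 cycles on the general-purpose processor must finish in fewer than 918 cycles on the coprocessor.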

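The speedup is defined as the ratio of the execution times of the software and hardware implementations of the function in question. $T_{GPP}$ is the execution time of a purely software implementation and $T_{GPP+COP}$ is the execution time when a part of the system's task is implemented in hardware, using the NISC coprocessor.

$$\begin{aligned} Speedup = \frac{T_{GPP}}{T_{GPP+COP}} &= \frac{N_{SW} + N_{HW/SW,GPP}}{N_{SW} + N_{HW/SW,NISC} + N_{comm}} \\ &= \frac{N_{SW} + N_{HW/SW,GPP}}{N_{SW} + N_{HW/SW,NISC} + (ArgCount + ResCount) \cdot (2 + Overhead) + CtrlOverhead} \end{aligned} \tag{11}$$

If the NISC coprocessor's implementation limits the system's maximum frequency, the speedup should be analyzed in this context. The NISC processor's implementation is in this case only effective if the overall execution time is shorter than that of a purely software implementation despite the lower overall system frequency.

$$\begin{aligned} T_{GPP+COP} &< T_{GPP} \\ \frac{N_{SW} + N_{HW/SW,NISC} + N_{comm}}{f_{NISC,max}} &< \frac{N_{SW} + N_{HW/SW,GPP}}{f_{GPP,max}} \\ N_{HW/SW,NISC} &< \frac{f_{NISC,max}}{f_{GPP,max}} \left(N_{SW} + N_{HW/SW,GPP}\right) - N_{SW} - N_{comm} \\ N_{HW/SW,NISC} &< \frac{f_{NISC,max}}{f_{GPP,max}} \left(N_{SW} + N_{HW/SW,GPP}\right) - N_{SW} - (ArgCount + ResCount) \cdot (2 + Overhead) - CtrlOverhead \end{aligned} \tag{12}$$

Equation (12) defines the maximum number of clock cycles for the NISC coprocessor such that it does not degrade the system's performance and ensures a speedup, given the overall frequency of the system. If we are interested in the minimum frequency for the system that would ensure a speedup, (12) can be rewritten as follows:

$$\begin{aligned} \frac{N_{SW} + N_{HW/SW,NISC} + N_{comm}}{f_{NISC,max}} &< \frac{N_{SW} + N_{HW/SW,GPP}}{f_{GPP,max}} \\ f_{NISC,max} &> f_{GPP,max} \cdot \frac{N_{SW} + N_{HW/SW,NISC} + N_{comm}}{N_{SW} + N_{HW/SW,GPP}} \\ f_{NISC,max} &> f_{GPP,max} \cdot \frac{N_{SW} + N_{HW/SW,NISC} + (ArgCount + ResCount) \cdot (2 + Overhead) + CtrlOverhead}{N_{SW} + N_{HW/SW,GPP}} \end{aligned} \tag{17}$$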
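The speedup ratio and the frequency bound above can be checked numerically once the cycle counts are known from profiling. The sketch below is a minimal illustration under stated assumptions rather than part of the original work: the struct `profile` and the three function names are hypothetical, the communication cycle count is passed in precomputed, and the frequency-limited speedup is obtained by dividing the two sides of the second line of (12).

```c
#include <assert.h>
#include <stdint.h>

/* Cycle counts from profiling; field names follow the subscripts in the text. */
typedef struct {
    uint64_t n_sw;         /* N_SW: part that always runs on the main processor */
    uint64_t n_hwsw_gpp;   /* N_HW/SW,GPP: migratable part, software version    */
    uint64_t n_hwsw_nisc;  /* N_HW/SW,NISC: migratable part on the coprocessor  */
    uint64_t n_comm;       /* N_comm: communication cycles, from equation (1)   */
} profile;

/* Equation (11): speedup when both variants run at the same frequency. */
double speedup_same_clock(const profile *p)
{
    uint64_t cop_cycles = p->n_sw + p->n_hwsw_nisc + p->n_comm;
    assert(cop_cycles > 0);
    return (double)(p->n_sw + p->n_hwsw_gpp) / (double)cop_cycles;
}

/* Speedup when the coprocessor lowers the system clock from f_gpp_max to
 * f_nisc_max: the same-clock ratio scaled by the frequency ratio. */
double speedup_limited_clock(const profile *p, double f_gpp_max, double f_nisc_max)
{
    assert(f_gpp_max > 0.0 && f_nisc_max > 0.0);
    return speedup_same_clock(p) * (f_nisc_max / f_gpp_max);
}

/* Equation (17): f_NISC,max must exceed this value for any speedup. */
double min_nisc_fmax(const profile *p, double f_gpp_max)
{
    uint64_t sw_cycles = p->n_sw + p->n_hwsw_gpp;
    assert(sw_cycles > 0);
    return f_gpp_max * (double)(p->n_sw + p->n_hwsw_nisc + p->n_comm)
           / (double)sw_cycles;
}
```

With illustrative values of 4000 fixed software cycles, 1000 cycles for the migratable part in software, 200 cycles on the coprocessor and 82 communication cycles, the same-clock speedup is 5000 / 4282, roughly 1.17. If the software-only system can run at 50 MHz, the system with the coprocessor must run above 42.82 MHz to be faster at all.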

The system's speedup in the case when the NISC coprocessor limits the system's maximum frequency is as follows:

$$\begin{aligned} Speedup = \frac{T_{GPP}}{T_{GPP+COP}} &= \frac{f_{NISC,max}}{f_{GPP,max}} \cdot \frac{N_{SW} + N_{HW/SW,GPP}}{N_{SW} + N_{HW/SW,NISC} + N_{comm}} \\ &= \frac{f_{NISC,max}}{f_{GPP,max}} \cdot \frac{N_{SW} + N_{HW/SW,GPP}}{N_{SW} + N_{HW/SW,NISC} + (ArgCount + ResCount) \cdot (2 + Overhead) + CtrlOverhead} \end{aligned} \tag{18}$$

Similar reasoning applies to the case when the bus interconnect module limits the maximum frequency, e.g. because its complexity increases when the coprocessor is introduced into the system.

5 Conclusion

In this paper we presented our solution for using the No-Instruction-Set Computer (NISC) as an application-specific coprocessor in systems based on the WISHBONE open bus specification. This approach allows designers to easily use the NISC technology in existing systems with large amounts of unportable legacy code. We defined and explained the hardware and software extensions necessary for the integration of the NISC coprocessor in an arbitrary WISHBONE system. The flexibility of these extensions enables different ways of using the NISC coprocessor to accelerate the system's application. As a result, we enabled quick and easy integration of a hardware accelerator generated using the NISC Toolset into a given system to improve overall performance. With the previously designed interface and memory multiplexer hardware, the manual interface parameterization and integration process should take no more than an hour. This process is also a good candidate for automation using a simple software tool, which can further simplify and speed up the whole design process.

We also analyzed the performance implications of using such a coprocessor in a fully synchronous WISHBONE system. The effectiveness of using the NISC coprocessor and the borderline performance values required to achieve an overall system speedup were evaluated. We proposed an analytical performance model that could help with quick exploration of the design space, elimination of unacceptable designs and setting of design targets even in the early stages of the project.
In this way, the hardware-software co-development and co-optimization process could converge more quickly towards the desired goal. Since the model requires only simple arithmetic calculations, it can easily be used to build a tool to help automate design space exploration.

6 Acknowledgements

The authors wish to thank the Unity through Knowledge Fund (UKF), Končar and SYSTEMCOM for supporting the project. This work builds on many years of NISC processor research at the Center for Embedded Computer Systems (CECS), University of California, Irvine. We wish to thank CECS for allowing the download of the NISC Toolset and Mehrdad Reshadi for his help and advice on how to use the toolset.


Design Space Exploration Using Parameterized Cores RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS UNIVERSITY OF WINDSOR Design Space Exploration Using Parameterized Cores Ian D. L. Anderson M.A.Sc. Candidate March 31, 2006 Supervisor: Dr. M. Khalid 1 OUTLINE

More information

Synthesis of Combinational and Sequential Circuits with Verilog

Synthesis of Combinational and Sequential Circuits with Verilog Synthesis of Combinational and Sequential Circuits with Verilog What is Verilog? Hardware description language: Are used to describe digital system in text form Used for modeling, simulation, design Two

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

6. I/O Features in the Cyclone III Device Family

6. I/O Features in the Cyclone III Device Family July 2012 CIII51007-3.4 6. I/O Features in the Cyclone III Device Family CIII51007-3.4 This chapter describes the I/O eatures oered in the Cyclone III device amily (Cyclone III and Cyclone III LS devices).

More information

Yet Another Implementation of CoRAM Memory

Yet Another Implementation of CoRAM Memory Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS

More information

FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1

FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1 FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1 Anurag Dwivedi Digital Design : Bottom Up Approach Basic Block - Gates Digital Design : Bottom Up Approach Gates -> Flip Flops Digital

More information

Review: Chip Design Styles

Review: Chip Design Styles MPT-50 Introduction to omputer Design SFU, Harbour entre, Spring 007 Lecture 9: Feb. 6, 007 Programmable Logic Devices (PLDs) - Read Only Memory (ROM) - Programmable Array Logic (PAL) - Programmable Logic

More information

WB_UART8 Serial Communications Port

WB_UART8 Serial Communications Port Summary This document provides detailed reference information with respect to the UART peripheral device. Core Reference CR0157 (v3.1) August 01, 2008 Serial ports on embedded systems often provide a 2-wire

More information

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function. FPGA Logic block of an FPGA can be configured in such a way that it can provide functionality as simple as that of transistor or as complex as that of a microprocessor. It can used to implement different

More information

Definitions. Key Objectives

Definitions. Key Objectives CHAPTER 2 Definitions Key Objectives & Types of models & & Black box versus white box Definition of a test Functional verification requires that several elements are in place. It relies on the ability

More information

VHDL. Chapter 1 Introduction to VHDL. Course Objectives Affected. Outline

VHDL. Chapter 1 Introduction to VHDL. Course Objectives Affected. Outline Chapter 1 Introduction to VHDL VHDL VHDL - Flaxer Eli Ch 1-1 Course Objectives Affected Write functionally correct and well-documented VHDL code, intended for either simulation or synthesis, of any combinational

More information

TSEA44 - Design for FPGAs

TSEA44 - Design for FPGAs 2015-11-24 Now for something else... Adapting designs to FPGAs Why? Clock frequency Area Power Target FPGA architecture: Xilinx FPGAs with 4 input LUTs (such as Virtex-II) Determining the maximum frequency

More information

System Development Tools for Excalibur Devices

System Development Tools for Excalibur Devices System Development Tools or Excalibur Devices January 2003, ver. 1.0 Application Note 299 Introduction The Excalibur embedded processor devices achieve a new level o system integration rom the inclusion

More information

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can

More information

1. Cyclone IV FPGA Device Family Overview

1. Cyclone IV FPGA Device Family Overview May 2013 CYIV-51001-1.8 1. Cyclone IV FPGA Device Family Overview CYIV-51001-1.8 Altera s new Cyclone IV FPGA device amily extends the Cyclone FPGA series leadership in providing the market s lowest-cost,

More information

Logic Synthesis. EECS150 - Digital Design Lecture 6 - Synthesis

Logic Synthesis. EECS150 - Digital Design Lecture 6 - Synthesis Logic Synthesis Verilog and VHDL started out as simulation languages, but quickly people wrote programs to automatically convert Verilog code into low-level circuit descriptions (netlists). EECS150 - Digital

More information

Design of an Efficient FSM for an Implementation of AMBA AHB in SD Host Controller

Design of an Efficient FSM for an Implementation of AMBA AHB in SD Host Controller Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 11, November 2015,

More information

Understanding the basic building blocks of a microcontroller device in general. Knows the terminologies like embedded and external memory devices,

Understanding the basic building blocks of a microcontroller device in general. Knows the terminologies like embedded and external memory devices, Understanding the basic building blocks of a microcontroller device in general. Knows the terminologies like embedded and external memory devices, CISC and RISC processors etc. Knows the architecture and

More information

FPGA based embedded processor

FPGA based embedded processor MOTIVATION FPGA based embedded processor With rising gate densities of FPGA devices, many FPGA vendors now offer a processor that either exists in silicon as a hard IP or can be incorporated within the

More information

Recommended Design Techniques for ECE241 Project Franjo Plavec Department of Electrical and Computer Engineering University of Toronto

Recommended Design Techniques for ECE241 Project Franjo Plavec Department of Electrical and Computer Engineering University of Toronto Recommed Design Techniques for ECE241 Project Franjo Plavec Department of Electrical and Computer Engineering University of Toronto DISCLAIMER: The information contained in this document does NOT contain

More information

AN OPEN-SOURCE VHDL IP LIBRARY WITH PLUG&PLAY CONFIGURATION

AN OPEN-SOURCE VHDL IP LIBRARY WITH PLUG&PLAY CONFIGURATION AN OPEN-SOURCE VHDL IP LIBRARY WITH PLUG&PLAY CONFIGURATION Jiri Gaisler Gaisler Research, Första Långgatan 19, 413 27 Göteborg, Sweden Abstract: Key words: An open-source IP library based on the AMBA-2.0

More information

Digital Systems Design. System on a Programmable Chip

Digital Systems Design. System on a Programmable Chip Digital Systems Design Introduction to System on a Programmable Chip Dr. D. J. Jackson Lecture 11-1 System on a Programmable Chip Generally involves utilization of a large FPGA Large number of logic elements

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information

Keywords- AMBA, AHB, APB, AHB Master, SOC, Split transaction.

Keywords- AMBA, AHB, APB, AHB Master, SOC, Split transaction. Volume 4, Issue 3, March 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Design of an Efficient

More information

Control and Datapath 8

Control and Datapath 8 Control and Datapath 8 Engineering attempts to develop design methods that break a problem up into separate steps to simplify the design and increase the likelihood of a correct solution. Digital system

More information

A Classification System and Analysis for Aspect-Oriented Programs

A Classification System and Analysis for Aspect-Oriented Programs A Classiication System and Analysis or Aspect-Oriented Programs Martin Rinard, Alexandru Sălcianu, and Suhabe Bugrara Massachusetts Institute o Technology Cambridge, MA 02139 ABSTRACT We present a new

More information

The QR code here provides a shortcut to go to the course webpage.

The QR code here provides a shortcut to go to the course webpage. Welcome to this MSc Lab Experiment. All my teaching materials for this Lab-based module are also available on the webpage: www.ee.ic.ac.uk/pcheung/teaching/msc_experiment/ The QR code here provides a shortcut

More information

9.8 Graphing Rational Functions

9.8 Graphing Rational Functions 9. Graphing Rational Functions Lets begin with a deinition. Deinition: Rational Function A rational unction is a unction o the orm P where P and Q are polynomials. Q An eample o a simple rational unction

More information

Design an 8 bit Microprocessor with all the Operations of 8051 Microcontroller with the help of VHDL

Design an 8 bit Microprocessor with all the Operations of 8051 Microcontroller with the help of VHDL Design an 8 bit Microprocessor with all the Operations of 8051 Microcontroller with the help of VHDL Annu Verma 1, Ashwani Depuria 2, Bhavesh Vaidya 3, Snehlata Haldkar 4, Prem Ratan Agrawal 5 1 to 4 BE

More information

SD Card Controller IP Specification

SD Card Controller IP Specification SD Card Controller IP Specification Marek Czerski Friday 30 th August, 2013 1 List of Figures 1 SoC with SD Card IP core................................ 4 2 Wishbone SD Card Controller IP Core interface....................

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA 1 HESHAM ALOBAISI, 2 SAIM MOHAMMED, 3 MOHAMMAD AWEDH 1,2,3 Department of Electrical and Computer Engineering, King Abdulaziz University

More information

Intelligent knowledge-based system for the automated screwing process control

Intelligent knowledge-based system for the automated screwing process control Intelligent knowledge-based system or the automated screwing process control YULIYA LEBEDYNSKA yuliya.lebedynska@tu-cottbus.de ULRICH BERGER Chair o automation Brandenburg University o Technology Cottbus

More information

Spiral 2-8. Cell Layout

Spiral 2-8. Cell Layout 2-8.1 Spiral 2-8 Cell Layout 2-8.2 Learning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as geometric

More information

ISSN Vol.03,Issue.29 October-2014, Pages:

ISSN Vol.03,Issue.29 October-2014, Pages: ISSN 2319-8885 Vol.03,Issue.29 October-2014, Pages:5891-5895 www.ijsetr.com Development of Verification Environment for SPI using OVM K.NAGARJUNA 1, K. PADMAJA DEVI 2 1 PG Scholar, Dept of ECE, TKR College

More information

Automated modeling of custom processors for DCT algorithm

Automated modeling of custom processors for DCT algorithm Automated modeling of custom processors for DCT algorithm D. Ivošević and V. Sruk Faculty of Electrical Engineering and Computing / Department of Electronics, Microelectronics, Computer and Intelligent

More information

Applying the Benefits of Network on a Chip Architecture to FPGA System Design

Applying the Benefits of Network on a Chip Architecture to FPGA System Design white paper Intel FPGA Applying the Benefits of on a Chip Architecture to FPGA System Design Authors Kent Orthner Senior Manager, Software and IP Intel Corporation Table of Contents Abstract...1 Introduction...1

More information

A Large-scale Architecture for Restricted Boltzmann Machines

A Large-scale Architecture for Restricted Boltzmann Machines A Large-scale Architecture or Restricted Boltzmann Machines Sang Kyun Kim, Peter L. McMahon, Kunle Olukotun Department o Electrical Engineering Stanord University Stanord, Caliornia, United States {skkim38,pmcmahon,kunle}@stanord.edu

More information

EEL 4783: HDL in Digital System Design

EEL 4783: HDL in Digital System Design EEL 4783: HDL in Digital System Design Lecture 15: Logic Synthesis with Verilog Prof. Mingjie Lin 1 Verilog Synthesis Synthesis vs. Compilation Descriptions mapped to hardware Verilog design patterns for

More information

Best Practices for Incremental Compilation Partitions and Floorplan Assignments

Best Practices for Incremental Compilation Partitions and Floorplan Assignments Best Practices for Incremental Compilation Partitions and Floorplan Assignments December 2007, ver. 1.0 Application Note 470 Introduction The Quartus II incremental compilation feature allows you to partition

More information

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Joseph Yiu, ARM Ian Johnson, ARM January 2013 Abstract: While the majority of Cortex -M processor-based microcontrollers are

More information

Behavioral ARM Processor Model Using System C. Dallas Webster IEEE Student Member No Texas Tech University 12/05/05

Behavioral ARM Processor Model Using System C. Dallas Webster IEEE Student Member No Texas Tech University 12/05/05 Behavioral ARM Processor Model Using System C Dallas Webster IEEE Student Member No. 80296404 Texas Tech University 12/05/05 Table of Contents Abstract... 2 Introduction... 3 System C... 3 ARM Processors...

More information

Lecture 25: Busses. A Typical Computer Organization

Lecture 25: Busses. A Typical Computer Organization S 09 L25-1 18-447 Lecture 25: Busses James C. Hoe Dept of ECE, CMU April 27, 2009 Announcements: Project 4 due this week (no late check off) HW 4 due today Handouts: Practice Final Solutions A Typical

More information

Embedded Applications. COMP595EA Lecture03 Hardware Architecture

Embedded Applications. COMP595EA Lecture03 Hardware Architecture Embedded Applications COMP595EA Lecture03 Hardware Architecture Microcontroller vs Microprocessor Microprocessor is a term used to describe all programmed computational devices. Microcontroller is a term

More information