The Use of the No-Instruction-Set Computer (NISC) for Acceleration in WISHBONE-Based Systems

Size: px

Start display at page:

Download "The Use of the No-Instruction-Set Computer (NISC) for Acceleration in WISHBONE-Based Systems"

Leona Rice
5 years ago
Views:

1 The Use o the No-Instruction-Set Computer () or Acceleration in WISHBONE-Based Systems Roko Grubišić, Vlado Sruk Technical Report Department o Electronics, Microelectronics, Computer and Intelligent Systems Faculty o Electrical Engineering and Computing, Zagreb, Croatia {roko.grubisic, vlado.sruk}@er.hr Abstract: General-purpose processors are oten unable to exploit the parallelism inherent to the sotware code. This is why additional hardware accelerators are needed to enable meeting the perormance goals. (No-Instruction-Set Computer) is a new approach to hardware-sotware co-design based on automatic generation o special-purpose processors. It was designed to be sel-suicient and it eliminates the need or other processors in the system. This work describes a method or expanding the application domain o the processor to general-purpose processor systems with large amounts o processor-speciic legacy code. This coprocessor-based approach allows application acceleration by utilizing both instruction-level and task-level parallelism by migrating perormance-critical parts o an application to hardware without the need or changing the rest o the program code. For demonstration o this concept, a coprocessor WISHBONE interace was designed. It was implemented and tested in a WISHBONE system based on Altium s TSK3000A general-purpose RISC sot processor and an analytical model was proposed to provide the means to evaluate its eiciency in arbitrary systems. Keywords:, WISHBONE, coprocessor, bus, interace 1

2 1 Introduction Embedded computer systems are no longer used only as simple control devices. Instead, today s embedded systems have to eiciently perorm complex tasks and algorithms despite increasingly stringent design constraints and shrinking time-to-market. Traditional sotware-centered design approach oers high designer productivity, but low design quality in terms o perormance and logic utilization. The key or increasing the design quality is the migration o computationally-intensive parts o the system s task to hardware. Since general-purpose processors are oten unable to exploit the parallelism inherent in the sotware code, these hardware accelerators provide an eective way to meet the design s perormance goals or certain classes o applications. Hardware acceleration can be accomplished by adding application-speciic coprocessors to the system or by customizing existing processors. Custom hardware is conventionally designed at the register transer level (RTL), which entails manually coding the RTL model in a hardware description language (HDL), such as VHDL or Verilog. This process is oten tedious and error-prone and limits the designer productivity. To overcome this problem, it is necessary to raise the level o abstraction or the hardware design and use special tools to synthesize the actual circuitry. Such design approaches include High-level synthesis (HLS), Application-speciic instruction set processors (ASIPs) and the No Instruction Set Computer (). HLS and ASIP techniques are commonly used to design custom hardware, but their eectiveness and scope o use are limited. HLS usually supports only a subset o a high-level programming language and is only eective or small-size problems. Also, synthesis results tend to be unpredictable and so the designer can only try to meet the design constraints by trial-and-error methods. AISPs, on the other hand, rely on a limited number o custom instructions to increase the system s perormance. The number o custom instructions is limited by the instruction decoder and some ASIPs even limit this number to a single custom instruction. Design o special datapath extensions is also required and it aces the same productivity vs. perormance problems as designing any custom hardware. [1][2] addresses these limitations by eliminating the instruction abstraction and compiling programming language code directly to datapath control words using a special cycle-accurate compiler [3]. In this way, this approach keeps the best o both the world o general purpose processors and the world o custom hardware design. processor s architecture can be manually modeled in an Architecture Description Language (ADL) called Generic Netlist Representation (GNR) [4], automatically generated or selected rom a library o standard or previously designed architectures. Toolset [5] generates the RTL model o the processor and control words or the desired application to be implemented in the desired technology, whether FPGA or ASIC. approach utilizes instruction-level parallelism (ILP) to speed up the execution o an algorithm and the generated processor is capable o executing several equivalent RISC instructions in a single clock 2

3 cycle. The additional prebinding mechanism enables simple I/O and interrupts [6] and in this way the scope o applications or which a processor can be used is extended to control tasks o a classical microcontroller. The communication interace or the processor [7] enables the use o multiple processors in the system. It enables exploiting the system s available task-level parallelism by parallel execution o the system s tasks on dierent processors which urther accelerates the application. approach was designed to be sel-suicient and it eliminates the need or other processors in the system. However, that represents a problem when attempting to integrate a processor into an existing system. In the case when a large base o non-portable legacy code optimized or a particular general purpose processor exists, we suggest a coprocessor-based approach to take advantage o the technology and accelerate desired applications. This approach allows utilizing both instruction-level and task-level parallelism by migrating perormance-critical parts o the application to hardware without the need or signiicant changes to the rest o the program code. One o the major design challenges o the coprocessor approach is the integration o the processor with other IP cores in the system. When designing SoCs, standard bus architectures are usually used, but the processor itsel isn t compatible with any o the speciications. Standard bus architectures simpliy the process o system integration, shorten time to market and support portability and reuse o IP cores. Renowned SoC bus architectures o the day include ARM Advanced Microcontroller Bus Architecture (AMBA), IBM CoreConnect, WISHBONE and Altera Avalon. The use o standard bus architectures also provides the opportunities or using special tools that automatically generate bus interaces and interconnections, in this way urther shortening time to market and reducing the possibility o human error. A standard bus interace or the processor enables its simple integration in a wide variety o systems with dierent processors and peripherals. In this paper we present the WISHBONE bus interace or the processor [8] and propose the use o the processor as a loosely-coupled application-speciic coprocessor in WISHBONE-based systems. The WISHBONE Interace was designed, implemented and tested in a WISHBONE system based on Altium s TSK3000A general-purpose RISC sot processor using an Altium LiveDesign Evaluation Board with a Spartan3 FPGA and a Xilinx ML506 board with a Virtex-5 FPGA. 2 WISHBONE bus architecture The WISHBONE System-on-chip (SoC) architecture or portable IP cores [8] is a lexible design methodology developed by Silicore Corp. targeted at SoC integration and design reuse. This is accomplished by deining a standard interconnection scheme and data exchange protocols. WISHBONE speciication deines a single, simple, logical, ully synchronous MASTER/SLAVE bus and IP core interaces that require very ew logic gates. It supports dierent technology-independent interconnection topologies ranging rom simple point-to-point and shared bus interconnections to data low interconnections and complex switch abrics. It also supports a ull range o standard data transer protocols including SINGLE READ/WRITE cycles, BLOCK READ/WRITE cycles and read-modiy-write 3

4 (RMW) cycles with various data sizes and byte ordering. A handshake mechanism enables low control and communication between dierent-speed cores. WISHBONE bus architecture was chosen or the implementation o the coprocessor interace mainly because o its lexibility and the act it imposes no licensing and application restrictions. WISHBONE speciication is currently maintained by OpenCores.org and today it represents a de acto standard or open-source hardware. It is also oers CAD tool support and a large library o ree processors and other IP cores. Besides ree processor IP, many popular commercial sot processors (e.g. Xilinx MicroBlaze and Altera Nios) have a WISHBONE-compatible variant available. 3 WISHBONE Interace In this section we introduce our approach to enabling the use o the processor in a WISHBONEbased system. To make the connection o the processor to the WISHBONE bus possible, we designed and implemented a WISHBONE Interace IP. This IP enables the connection o the processor to the WISHBONE bus, but or the processor to be used as a coprocessor which perorms useul tasks in the system, some changes in the processor s architecture are required. We describe these architectural additions together with the sotware necessary to enable the communication rom both the side o the processor and the side o the main processor. We also propose an appropriate communication scheme, designed to ease the migration o application s unctions between hardware and sotware and thus ease the process o design space exploration. The WISHBONE Interace is compatible with the existing design low and requires no changes in the Toolset. The WISHBONE Interace IP and the sotware communication scheme enable control and data transer which are necessary or the use o the processor as a coprocessor in WISBONE-based systems. The main processor can transer the data to be processed (the unction s arguments) to the coprocessor, start the execution o the coprocessor unction, detect when the data processing is completed (either by pooling the coprocessor s status or by means o an interrupt) and retrieve the results. The designed WISHBONE Interace is targeted at ully synchronous WISHBONE systems and supports word size (32-bit), word granularity SINGLE READ/WRITE WISHBONE classic bus cycles. We describe two dierent methods or exchanging the data between the general-purpose processor and the coprocessor. One method is suitable or simple general-purpose data transer and the other or custom special-purpose data transers tailored or a speciic application. A design approach with two distinct parts o the WISHBONE interace was ollowed to enable this kind o lexibility. The irst part is Basic WISHBONE Interace which is responsible or control unctions and special-purpose data transers. The other part o the interace is the Data Memory WISHBONE Interace and this one is responsible or general purpose data transers, i.e. transerring large amounts o data directly to and rom the data memory o the processor. 4

5 3.1 Basic WISHBONE Interace The Basic WISHBONE Interace enables coprocessor control control, i.e. starting or stopping data processing and detecting when the data processing is completed, and transerring smaller amounts o special purpose data directly into the processor s datapath. This is the main part o the WISHBONE Interace and it is necessary or the communication since it enables coprocessor control and interrupt capabilities. It is designed or seamless integration with the RTL model generated by the Toolset. The advantage o using the Basic WISHBONE Interace or data transers is ast communication (one-cycle transers) and the ability to bring the data directly to the input o the desired unctional unit, in this way speeding up the execution o the algorithm. The disadvantage o this part o the interace is that it is only suitable or transerring small amounts o data, since this approach doesn t scale well beyond a ew arguments and results side communication requirements The accessible external signals o the processor s RTL model are the clock signal, the reset signal, the halt signal and the conigurable I/O ports. The Basic WISHBONE Interace uses only these external signals to connect the processor to the WISHBONE bus. Data transer rom the main processor to the processor is achieved using input ports which can be read by the s application using the prebinding mechanism. Starting the s application execution is accomplished by resetting the processor, i.e. by generating a alling edge on the reset pin. Completion detection is achieved by reading the halt signal which automatically sets upon completing its task. The results are retrieved using output ports, also using the prebinding mechanism. Ater the coprocessors completes, it writes the results to its output ports registers to be read by the main processor using the WISHBONE interace. The output ports have an internal register inside the processor s datapath, but because the input ports unction only as proxies or external registers, it was necessary to provide additional registers as a part o the WISHBONE interace. These input and output registers represent the shared resources that enable the transer o data between the main processor and the coprocessor. The designer s task is to add the external I/O ports to the desired datapath units as a part o the process o designing the architecture using the GNR ADL. A simple architecture with two input ports or the arguments and one output port or the result shown on Figure 1. This method o communication is convenient when the application requires intensive calculations with some o the arguments or when only simple argument exchange is required. In this case arguments can be brought directly to the input o the desired unctional unit and selected when needed, using input multiplexers. This eliminates the need or the sequential transer o the arguments to the register ile or later use and thus eliminates unnecessary transer o the data. One sequential transer is already done when the main processor transers the data to the interace s registers, so another transer rom the interace s registers to the processor s register ile or internal registers would double the communication expenses and thus limit the eiciency o the coprocessor. This approach also avoids register spilling or oten-used values, which is especially important in area-limited systems with small register iles. Output registers or the 5

6 results are used in a similar manner and can also be connected to the datapath in such a way to minimize communication expenses. RF ExternalInput1 ExternalInput2 MUX MUX ExternalOutput ExternalOutputReg ALU Figure 1: Simple architecture with argument inputs and a result output Datapath connections or external inputs and external outputs enable s hardware to access shared registers or communication, while the Basic WISHBONE Interace enables the same access or the main processor. To enable the s application to actually read these shared resources or write them, appropriate prebound unctions or reading or writing shared registers are used. Listing 1 illustrates this concept or a simple application with two arguments and one result. First, the arguments are read rom the external input ports into temporary variables and then they are used in a unction to compute the result. The result is then written to the external output register to be read by the main processor when the application terminates and the halt signal is asserted. void NiscMain(){ //... a ExternalInput1_read(); //get the irst argument b ExternalInput2_read(); //get the second argument //... c do_stu(a,b); //calculate the result //... ExternalOutputReg_write(c); //return the result } Listing 1: Sample application 6

7 3.1.2 Basic WISHBONE Interace implementation To enable the transer o the data and control low, the Basic WISHBONE Interace s provides registers or input arguments and special control registers. Control registers are used to set the state o the processor s reset signal and to read the state o the halt signal. The result output ports are simply routed to the interace s output multiplexer to enable the main processor to read the results. A simpliied block schematic o the interace is shown in Figure 2. The igure shows the internal registers o the interace and the simpliied circuitry to illustrate the path o data to and rom the processor s top module NiscSystem. Basic WISHBONE Interace CLK_I RST_I CYC_I STB_I WE_I & RESET REG clk reset halt NiscSystem WISHBONE ACK_O ADDR_I SEL_I DAT_I INPUT1 REG INPUT2 REG ExternalInput1 ExternalInput2 DAT_O ExternalOutput INT_O & INT_EN REG Figure 2: Basic WISHBONE Interace In a typical WISHBONE system, the most signiicant bytes o the address are decoded in the interconnect module (the so called WISHBONE Intercon), which enables the desired peripheral using the active cycle (CYC_I) and strobe (STB_I) signals, and only the required number o address bits is orwarded to the peripheral s ADDR_I inputs. The internal address decoding o the peripheral then determines which o its internal memory locations is addressed, and this is also the way the Basic WISHBONE Interace operates. Reading the data rom the interace is implemented using a multiplexer which is controlled by the WISHBONE address lines. The write operation is implemented using register enable signals controlled by the address decode logic and the write enable qualiier signal (WE_I). The WISHBONE clock signal CLK_I is routed to the processor and the interace to create a ully synchronous system. The WISHBONE handshake mechanism is implemented using only combinatorial logic, as there are no slow modules in the design. I.e. none o the modules requires more than one cycle or read or write operations so no memory elements or delay are required. WISHBONE reset signal RST_I resets the processor and the interace s internal registers. 7

8 Interrupt mechanism is implemented in such way that setting the interrupt signal INT_O only occurs when the interrupt enable (INT_EN) lag is set and the halt signal is set. INT_O is not a standard WISHBONE signal, but it can easily be connected directly to the main processor and several commercial tools (e.g. Altium Designer) even provide interrupt connections as a part o the WISHBONE interconnect module Basic WISHBONE Interace programming model Using the coprocessor attached to the WISHBONE bus in this way is very straightorward. First, the value 1 is written to the reset register to put the processor in reset state and to insure that only the main processor has access to the shared registers. Then, the arguments are written to appropriate registers and the value 0 is written to the reset register to start the s application execution. Ater that, the main processor can pool the value o the halt signal until it becomes active and then retrieve the results by reading appropriate registers. The reset register and the halt signal are mapped to the same memory location (location 0), and one is accessed when writing and the other when reading that memory address. The interace s ull memory map is shown on Table 1. The irst memory location, at address 0 (plus the base address displacement) is reserved or the already mentioned reset/halt pair, the control register. Because 32-bit addressing is used, the address dierence between two adjacent memory locations is 4 bytes. So, the next location s address is 4 and this is the location o the interrupt enable register. Ater that ollow the argument and result registers. The example register address layout shown in Table 1 enables simple implementations o traditional C unctions with arbitrary number o arguments and one return value. O course, or dierent applications with additional result values and matching result registers, dierent address layouts are possible with dierently laid-out interace datapath and I/O ports. Table 1: Basic WISHBONE Interace memory map Address Name Width [bit] Description 00 RESET HALT!(INT_FLAG) 1 Bit0: read: status o the halt signal is read write: reset register is written to '1' sets the to reset state, '0' initializes the execution ('0' also serves as the interrupt acknowledge) 04 INT_EN 1 Bit0: reads/writes the interrupt enable lag '1' enables the interrupts, '0' disables them 08 RESULT 32 Read/Write result 12 ARG 1 32 Read/Write 1 st argument 16 ARG 2 32 Read/Write 2 nd argument (N+2)*4 ARG N 32 Read/Write N th argument 8

9 Listing 2 shows an example C unction oloaded to the coprocessor to illustrate the designed communication scheme. The unction call is inlined to eliminate the need or copying the arguments to local variables. This example unction is blocking, i.e. it doesn t return until the coprocessor s task is completed and in that way the coprocessor s task behaves as a regular C unction. Because o that, the programmer can use it rom the calling unction as though it is a sotware, rather than hardware, implementation o the algorithm. When using this communication scheme, only the body o the unction needs to be replaced and the calling unctions remain unchanged when migrating the desired unction s implementation between hardware and sotware. In this way, it is very simple to experimentally evaluate design options and thus explore the design space. For exploiting the task-level, data and unction parallelism, more complex non-blocking versions o coprocessor unction calls should be used. In this way, the main processor can send the data to the coprocessor, initialize the data processing and carry on with its own tasks. These tasks could include control unctions, processing a part o the data (data parallelism), or perorming a part o the processing on all the data (unction parallelism). // declarations volatile uint32_t * nisc_ctrl (uint32_t *) Base_; volatile uint32_t * nisc_ie (uint32_t *) (Base_ + 4); volatile uint32_t * nisc_result (uint32_t *) (Base_ + 8); volatile uint32_t * nisc_arg1 (uint32_t *) (Base_ + 12); volatile uint32_t * nisc_arg2 (uint32_t *) (Base_ + 16); //hide the pointers behind #deine #deine _CTRL ( *nisc_ctrl ) #deine _HALT ( *nisc_ctrl ) //alias or _CTRL #deine _IE ( *nisc_ie ) #deine _RESULT ( *nisc_result ) #deine _ARG1 ( *nisc_arg1 ) #deine _ARG2 ( *nisc_arg2 ) //constants #deine _STOP 1 #deine _GO 0 // unction (blocking) inline uint32_t nisc_unction(uint32_t arg1, uint32_t arg2){ _CTRL _STOP; //stop the _ARG1 arg1; //write arg1 _ARG2 arg2; //write arg2 _CTRL _GO; //start the } while(!_halt); return _RESULT; //wait until completes //return the result Listing 2: Communication with the processor 9

10 Interrupt-driven communication with the coprocessor requires enabling the interrupts using the interrupt enable lag. When the arguments are sent to the coprocessor registers and the execution is started, the main processor can carry on with its regular tasks and retrieve the results in the interrupt service routine (ISR). As a part o the ISR, the main processor must also acknowledge the interrupt by clearing the WISHBONE Interace s interrupt lag which is achieved by writing the value 0 to the control register. Other standard interrupt acknowledge procedures must be ollowed depending on the processor s interrupt system architecture and whether the interrupts are level or edge triggered. The next batch o data to be processed can be sent in the ISR and the next execution can also be initiated. 3.2 Data Memory WISHBONE Interace The Basic WISHBONE Interace was designed or applications which need to exchange a small amount o special purpose data with the main processor. The problem is that this approach doesn t scale well or applications that have a large amount o arguments and/or results. This approach would then require a large number o registers and large multiplexers, which would require a lot o chip area and would also limit the maximum requency, especially when considering the negative implications o using large multiplexers in FPGAs. A urther drawback o the Basic WISHBONE Interace is the inability to perorm indexed addressing on the arguments and results without the need or data transer between shared registers and the data memory. E.g. to perorm indexed addressing on the arguments, processor must copy the data rom the interace s argument registers to internal arrays in the data memory and every piece o data thus has to be copied twice which imposes a large communication overhead when transerring large amounts o data. This especially represents an issue with algorithms that involve matrix calculations and the like Data Memory Access Multiplexer/Arbiter A simple solution or the problem o transerring large amounts o data to and rom the processor is the direct transer o data to and rom the processor s data memory. To achieve this, a special module that enables external access to s data memory was added to the design. The role o this data memory access multiplexer is to provide access to the processor s data memory rom the WISHBONE bus and to arbitrate access to the memory between the interace and the main processor. The data memory multiplexer is connected between the processor s core (controller and datapath) and the data memory and it provides an additional set o pins that are connected outside o the processor itsel, as shown in Figure 3. 10

11 App clk reset halt Controller ExternalInputs ExternalOutputs cmem Datapath dmem mux dmem dmem external access Figure 3: Multiplexing the access to the data memory The data memory multiplexer s arbitration scheme is quite simple: i the processor s reset or halt signals are active, then the main processor has control over the data memory. Otherwise, the control is let to the processor and the main processor has no means o writing to the data memory and reads the value o all data bits as zeroes. The implementation o the data memory multiplexer is shown in Figure 4. DATA_OUT dmem mux ADDR+CTRL DATA_IN RESET HALT OR DATA_OUT EXTERNAL DATA_OUT ADDR+CTRL DATA_IN MUX GND MUX MUX DATA_IN ADDR+CTRL Memory Figure 4: Data memory multiplexer implementation 11

12 3.2.2 Data Memory WISHBONE Interace implementation Data Memory Access Multiplexer/Arbiter provides the means or external access to the data memory o the processor and the Data Memory WISHBONE Interace is connected to these external access pins. Its role is to handle the appropriate data conversion and handshaking to enable correct data memory access operations rom the WISHBONE bus. s data memory module encompasses FPGA block RAM s (BRAMs) and a memory controller module (namely, ByteAddressableDMemLogic Verilog module). The data memory module s ports are actually the memory controller s ports so the Data Memory WISHBONE Interace is designed in such a way to operate with the memory controller and internal BRAMs. Figure 5 shows the internal organization o the Data Memory WISHBONE Interace. The address bits are orwarded directly rom the WISHBONE address lines to data memory s address lines. Since s memory controller doesn t require word access addresses aligned on 4 byte borders and requires the data with width less than 32-bits to be aligned to the lower data bits (unlike WISHBONE mechanism with byte-enables), special alignment logic was designed to handle the translation. Also, depending on the access type, WISHBONE byte enable lines are translated to appropriate type codes or the memory controller. Write and read enable signals are derived rom the WISHBONE write enable qualiier WE_I. The handshaking mechanism is implemented with 1 cycle acknowledge delay to allow or the BRAM latency, since FPGA BRAMs are synchronous. CLK_I Data Memory WISHBONE Interace WISHBONE RST_I CYC_I STB_I ACK_O WE_I ADDR_I SEL_I DAT_O DAT_I & & ACK DELAY REG TYPE DECODE ALIGN ALIGN TYPE DATA IN READ EN WRITE EN ADDR DATA OUT dmem external access Figure 5: Data Memory WISHBONE Interace 3.3 The complete WISHBONE Interace and its programming model The complete WISHBONE Interace, as shown o Figure 6, consists o the Basic WISHBONE Interace and the Data Memory WISHBONE Interace. This approach with two distinct interaces eliminates the need or address translations inside o the interace to dierentiate between the shared registers space and the data memory space and thus simpliies the design. The simplest way to use the 12

13 complete interace is to irst put the processor in the reset state using the Basic WISHBONE Interace and then send the data through the Data Memory WISHBONE Interace. Ater that, is taken out o reset state and the main processor waits or the application to complete and retrieves the results rom the data memory. Special-purpose data could be also sent or retrieved using the Basic WISHBONE Interace. It s also important to point out that the lack o hardware or sotware reset o s data memory is vital or this sort o communication because the sent arguments would otherwise be lost when the processor is reset. To enable the main processor to read or write a particular bit o data in the s data memory, its address must be known and it is used as an oset rom the base address o the Data Memory WISHBONE Interace in the main processor s memory map. Toolset s Storage Bindings outputs are used to determine the address o a particular global variable or a global array in the data memory. Storage Bindings provide this inormation in the orm o a table and the required inormation can be ound in this table and used in the program code. The whole data memory o the processor is mapped to a contiguous external memory block in the main processor s address space and is thus easily accessible through pointer manipulation. Ater the main processor writes the arguments to the global data structures in the processor s data memory, the processor operates on this data. The processor then writes the results to designated global data structures which are aterwards read by the main processor. This approach requires no changes in the s algorithm, provided that it uses global variables or global arrays. Global structures can also be passed to the unctions as arguments, which eliminates the need or any changes in the code o any o the unctions. App Basic WISHBONE Interace clk reset halt cmem ExternalInput1 ExternalInput2 ExternalOutput Controller Datapath dmem mux dmem Data Memory WISHBONE Interace dmem external access Figure 6: WISHBONE Interace architecture 13

14 Listing 3 shows an example implementation o the communication using the Data Memory WISHBONE Interace. Ater asserting s reset signal to ensure access to the data memory, arguments are copied to the data memory using pointers and the data processing is started by deasserting the reset signal. Ater completion detection, using pooling in this example, results can be retrieved or urther processing. I there is no need to immediately start processing the next batch o data, the data can be used directly rom s data memory, thus eliminating the need or copying. I, however, we want to utilize task-level parallelism and process another batch o data in parallel with the task o the general-purpose processor, copying becomes necessary. volatile uint32_t *nisc_dmem_arg (uint32_t*) (nisc_dmem_base + arg_oset); volatile uint32_t *nisc_dmem_res (uint32_t*) (nisc_dmem_base + res_oset); // unction (blocking) inline void nisc_unction(uint32_t *arguments, uint32_t *results){ _CTRL _STOP; // stop //send the arguments or(int i0; i < N; i++){ nisc_dmem_arg[i] arguments[i]; } _CTRL _GO; while(!_halt); // start //wait until completes } //return the results or(int i0; i < N; i++){ results[i] nisc_dmem_res[i]; } Listing 3: Communication using the Data Memory WISHBONE Interace 3.4 TSK3000A WISHBONE System with a coprocessor The complete WISHBONE Interace was implemented using VHDL and integrated with a generated processor in a sel-suicient WISHBONE compatible module which can be used as a coprocessor in systems based on the WISHBONE bus architecture. The generated processor was based on a generic architecture with necessary I/O datapath extensions and simple applications designed to test the communication capabilities. As a proo-o-concept or the presented coprocessor-based approach, we implemented and tested a WISHBONE system based on Altium s TSK3000A generalpurpose 32-bit RISC sot processor. TSK3000A is based on the MIPS/DLX instruction set and is FPGA vendor independent (unlike FPGA vendors proprietary processors, such as Xilinx MicroBlaze and Altera Nios). It is designed or seamless integration with Altium s conigurable Wishbone Interconnect module which was also used in this proo-o-concept system. Main processor-side communication routines were developed as presented earlier in this chapter and then tested in conjunction with the hardware using the Altium LiveDesign Evaluation Board with a Spartan3 FPGA and Xilinx ML506 with a Virtex-5 FPGA. The schematic or the complete system is shown in Figure 7. 14

Figure 7: TSK3000A with a coprocessor 4 Perormance analysis The designed WISHBONE Interace provides the means or using the processor as a looselycoupled coprocessor in WISHBONE-based systems.

15 Figure 7: TSK3000A with a coprocessor 4 Perormance analysis The designed WISHBONE Interace provides the means or using the processor as a looselycoupled coprocessor in WISHBONE-based systems. To justiy the use o the coprocessor in the system, the speedup o the algorithm must justiy the cost o the extra hardware. Furthermore, the speedup o the implementation o the algorithm in question must be such to cancel the negative eect on the perormance imposed by the main processor to coprocessor communication overhead and still provide overall system speedup. O course, what is actually justiied also depends on the achieved speedup rate and design constraints. E.g. or area-constrained systems a small achieved speedup would not justiy the use o a coprocessor, but it could be justiied in systems where area is not an issue and every bit o perormance improvement counts. To ease the design space exploration and provide the means o estimating the system s perormance and the margins o the speedup required by the processor in the early stages o the project, an analytical model o the system s perormance was devised. This model is intended to be used together with the proiling inormation and it enables the designer to estimate the beneits o using a coprocessor or a particular part o the system s task and quickly explore the design space beore the actual system is built. It also predicts the minimum perormance limit or the processor which can be used to quickly eliminate unsuitable architectures and thus drive the hardware-sotware cooptimization process in the right direction considering the system-level requirements. The WISHBONE Interace was designed in such a way as to extract maximum perormance in systems with a general-purpose processor which uses WISHBONE classic bus cycles, like the Altium TSK3000A. This means that an access to the interace s register takes one clock cycle and an access to 15

16 the data memory takes two clock cycles (because o BRAM latency). WISHBONE classic bus cycles don t allow or distinct address and data transer phases and any sort o bus transaction pipelining to improve perormance. Because o that, there are no means or urther acceleration o the data transer according to the WISHBONE classic speciication, not even with additional circuitry. The number o clock cycles required or the communication is deined by the number o cycles required or the transer o arguments, the number o cycles required or the transer o results and the number o cycles required or the control operations. The control overhead consists o only two cycles, one or stopping the processor (i.e. putting it into reset state) and one or starting it (i.e. taking it out o reset state). To estimate the perormance impact o the communication overhead, we use the worst-case scenario or the transer o the arguments and results, i.e. we use the slower mode o data transer, the transer o data to and rom the data memory o the processor. The communication time is deined by the cycle count or the communication and the requency o the communication clock, i.e. the bus requency, which is the same as the system requency in ully synchronous systems. Fully synchronous systems oer the advantage o easier design process and simpler veriication. The drawback o using a single clock in the system is the negative impact on perormance. The only way to extract the maximum perormance is to use asynchronous clock domains or dierent parts o the systems, each running at its maximum requency and use synchronization logic and FIFOs or crossing the clock domains, which oten leads to design problems and diicult-to-track timing issues. The ully synchronous approach eliminates timing problems inside o the system and it was our intent to analyze and determine the boundaries o this design approach when using coprocessors. The major speedup should be achieved by the means o parallelism and architecture optimizations with cores sharing the same clock and asynchronous design should only be considered when this approach ails to meet the design constrains. Since the designed WISHBONE Interace is 32 bits wide, the model presumes 32-bit arguments and results but this reasoning is, o course, valid or the arguments with widths o less than 32 bits. Transer o data wider than 32 bits can be split up in 32 bit data transers with some sotware support and can thus also be modeled. It is also important or the model to include additional sotware overheads or the communication, e.g. loop overhead. There are dierent approaches to accelerating this sort o communication, using both sotware and hardware. These approaches include loop unrolling, using custom loop instructions with zero control overhead (depending on the processor architecture and conigurability) or using additional DMA engines to drive the data transer. To model these additional parameters, a scaling actor or the number o transers was added together with a constant that includes any additional control overheads (such as DMA setup) to better estimate the system s perormance. The basic expression or estimating the communication time is given by the equation (1). 16

17 T comm BUS comm ( ArgCount + Re scount) (2 + Overhead ) + CtrlOverhead BUS (1) To analyze the beneits o the implementation o a particular unction, we must compare the execution time o the algorithm or a general-purpose processor and the execution time when using the processor. The execution time or the general-purpose processor is deined by the processor architecture and clock requency. The processor architecture deines the number o instructions or the algorithm and the number o cycles required to execute an instruction (Cycles Per Instruction CPI) and these deine the total number o clock cycles or an algorithm. Equation (2) shows the expression or the execution time. T (2) Similarly, the execution time or the processor is deined by the number o cycles required or the particular application and the processor s clock requency, as equation (3) shows. The number o cycles is determined by the application itsel and the processor s architecture. T (3) The total execution time when using the processor as a coprocessor is deined by the sum o the execution time or the application itsel and the time required or communication: T T + T (4) COP All these expressions depend on the requency o a particular module. Since the WISHBONE Interace is targeted at ully synchronous systems, the requencies o all modules must be the same. Thereore, the maximum requency o the system is determined by the minimum o all the core s requencies, as shown in equation (5). This expression can be easily extended to systems with multiple coprocessors. comm {, } min (5) SYS, max,max, BUS,max The execution time or the general-purpose processor can be expressed using (2) and (5), as ollows: 17

18 T min SYS, max,max, {, } BUS,max (6) Similarly, the execution time or the coprocessor can be expressed using (1), (3), (4) and (5), as ollows: T COP T + T comm + ( ArgCount + Re scount) (2 + Overhead ) + CtrlOverhead SYS + ( ArgCount + Re scount) (2 + Overhead ) + CtrlOverhead min {, }, max,max, BUS,max (7) To determine the borderline o the coprocessor s eectiveness, we again turn to the worst-case scenario, which in this case is the blocking communication. When using blocking communication, there is no parallel execution o the algorithm on the main processor and the coprocessor and because o that, this represents the worst case or the processor-coprocessor communication. For the hardware implementation o a unction to be eicient, the obvious and necessary condition is or the unction s execution time on the coprocessor to be shorter than the execution time on a general-purpose processor: T COP < T (8) I the coprocessor s implementation doesn t limit the system s maximum requency (i.e.,max >,max holds true) then the speedup o a part o an algorithm (taking into account communication and control overheads) certainly speeds up the execution o the whole program. From (6), (7) and (8), ollows: < ( ArgCount + Re scount) (2 + Overhead) CtrlOverhead (9) The equation (9) deines the maximum number o clock cycles or the coprocessor in order or that implementation not to reduce the eectiveness o the whole system. Using this expression, it is possible to eliminate unacceptable designs in the early stages o the project. For the coprocessor implementation to be truly eective, it must provide suitable speedup to justiy the additional hardware expense. To quantiy this, we must analyze the overall speedup actor. To do this, the program must be divided into two parts. One part will always be executed on the main processor. The other part can migrate between hardware and sotware implementations and it is or this part that we evaluate the eectiveness o the coprocessor implementation. + / (10) PROGRAM HW 18

19 The speedup is deined as the ratio o execution times when using sotware or hardware implementation o the unction in question. T is the execution time o a purely sotware implementation and T +COP is the execution time when a part o the system s task is implemented in hardware, using the coprocessor. T Speedup T COP + HW /, HW /, HW /, + comm + HW /, + ( ArgCount + Re scount) (2 + Overhead) + CtrlOverhead (11) I the coprocessor s implementation limits the system s maximum requency then the speedup should be analyzed in this context. The processor s implementation is in this case only eective i the overall execution time is shorter than that o a purely sotware implementation despite the lower overall system requency. T < T + COP + HW /, + comm + <,max,max HW /, HW /,,max < ( + HW /, ),max comm (12) HW /, <,max,max ( + HW /, ) ( ArgCount + Re scount) (2 + Overhead ) CtrlOverhead The equation (12) deines the maximum number o clock cycles or the coprocessor or it not to degrade the system s perormance and insure speedup, given the overall requency o the system. I we re interested in the minimum requency or the system that would insure speedup, (12) can be rewritten as ollows:,max > +,max HW /,,max ( + + ) + comm HW /, + < HW /, +,max comm HW /, (17),max >,max ( + + ( ArgCount+ ResCount) (2 + Overhead) + CtrlOverhead ) HW /, + HW /, 19

20 The system s speedup in the case when the coprocessor limits the system s maximum requency is as ollows: T Speedup T + COP + HW /,,max + + HW /,,max HW /, +,max,max comm HW /, + ( ArgCount+ ResCount) (2 + Overhead) + CtrlOverhead + (18) Similar reasoning applies to the case then the bus interconnect module limits the maximum requency, e.g. due to its complexity increasing when introducing the coprocessor in the system. 5 Conclusion In this paper we presented our solution or using the No-Instruction-Set Computer () as an application-speciic coprocessor in systems based on the WISHBONE open bus speciication. This approach allows designers to easily use the technology in existing systems with large amounts o unportable legacy code. We deined and explained the hardware and sotware extensions necessary or the integration o the coprocessor in an arbitrary WISHBONE system. The lexibility o these extensions enables dierent ways o using the coprocessor to accelerate the system s application. As a result, we enabled quick and easy integration o a hardware accelerator generated using the Toolset into a given system to improve overall perormance. With the previously designed interace and memory multiplexer hardware, the manual interace parameterization and integration process can t take more than an hour. This process is also a good candidate or automation using a simple sotware tool which can urther simpliy and speed-up the whole design process. We also analyzed the perormance implications o using such a coprocessor in a ully synchronous WISHBONE system. The eectiveness o using the coprocessor and the borderline perormance values required to achieve overall system speed-up were evaluated. We proposed an analytical perormance model that could help quick exploration o the design space, elimination o unacceptable designs and setting the design targets even in early stages o the project. In this way, the hardwaresotware co-development and co-optimization process could converge more quickly towards the desired goal. Since the model requires only simple arithmetic calculations, it can easily be used to build a tool to help automate design space exploration. 6 Acknowledgements The authors wish to thank the Unity through Knowledge Fund (UKF), Končar and SYSTEMCOM or supporting the project. This work builds on many years o processor research at the Center or 20

21 Embedded Computer Systems (CECS), University o Caliornia at Irvine. We wish to thank CECS or allowing the download o Toolset and Mehrdad Reshadi or his help and advice on how to use the toolset. 7 Reerences 1. D. Gajski, ": The Ultimate Reconigurable Component", Center or Embedded Computer Systems, TR 03-28, October M. Reshadi: Modeling and Compilation, dissertation, University o Caliornia, Irwine, M. Reshadi, D. Gajski, "A Cycle-Accurate Compilation Algorithm or Custom Pipelined Datapaths", International Symposium on Hardware/Sotware Codesign and System Synthesis (CODES+ISSS), pp 21-26, September B. Gorjiara, M. Reshadi, D. Gajski, "Generic Architecture Description or Retargetable Compilation and Synthesis o Application-Speciic Pipelined IPs", International Conerence on Computer Design (ICCD), October Technology & Toolset URL: ( ) 6. M. Reshadi, D. Gajski, "Interrupt and Low-level Programming Support or Expanding the Application Domain o Statically-scheduled Horizontally-microcoded Architectures in Embedded Systems", Design Automation and Test in Europe (DATE), April B. Gorjiara, M. Reshadi, D. Gajski, " Communication Interace", Center or Embedded Computer Systems, TR 06-05, March R. Grubišić, Design o the processor WISHBONE Interace, diploma thesis (in Croatian), University o Zagreb, Faculty o Electrical Engineering and Computing (FER), May Speciication or the: WISHBONE System-on-Chip (SoC) Interconnect Architecture or Portable IP Cores, Revision: B.3, Silicore/OpenCores.org,

10. SOPC Builder Component Development Walkthrough

10. SOPC Builder Component Development Walkthrough QII54007-9.0.0 Introduction This chapter describes the parts o a custom SOPC Builder component and guides you through the process o creating an example