Multiprocessor System in an FPGA

Wilson Maltez José

November

Abstract

As time goes by, new applications emerge that are more complex and demanding than ever, driving technology forward. In the embedded systems area, the MPSoC solution is increasingly adopted by industry, as it enables systems to meet their real-time deadlines while respecting the imposed area and power restrictions. In this respect, FPGAs emerge as a promising platform to develop these kinds of systems. FPGA-based multiprocessors are a cheaper and faster solution than ASIC-based multiprocessors. With this project we have shown that the FPGA is a practicable platform to design and implement multiprocessing systems. A homogeneous multiprocessing system based on an open source soft-core has been designed and tested in an FPGA. A matrix multiplication application has been parallelized and used to evaluate the system acceleration and efficiency. We have also analyzed two different communication architectures.

Index Terms: Communication architectures, Embedded systems, FPGA, MPSoC, Multi-processing

I. INTRODUCTION

For many years, hardware engineers have relied on increasing the system clock frequency as a means to get more performance. However, this approach ceased to be viable as problems like heat dissipation and heat removal became extremely difficult to overcome. In the search for simpler ways of getting more performance, multiprocessing systems appear as an increasingly popular solution. Multiprocessing systems are systems with more than one processing element, which can execute several processes simultaneously. As technology advanced, it became possible to integrate a complete multiprocessing system in a single chip. These systems are called MPSoC (Multi Processor Systems on Chip).
Nowadays, MPSoCs are an extremely attractive solution in the embedded systems area, as they allow embedded systems to meet their real-time deadlines while satisfying critical power consumption and area constraints [1]. In this respect, FPGAs (Field Programmable Gate Arrays) emerge as a new and promising platform to implement multiprocessing systems. FPGAs enable fast prototyping and research of new architectures without the problems related to ASICs (Application Specific Integrated Circuits). However, designing in HDL is very time consuming. An alternative to HDL design is the use of soft-core processors on an FPGA to build a multiprocessor system. Soft-core processors are configurable processors designed to fit well in an FPGA design. In today's FPGAs it is possible to integrate dozens of processors and therefore to provide a significant parallel computation capacity. This paper presents a multiprocessor architecture based on an open source soft-core processor and proposes two alternative communication architectures. In section 2, the multiprocessing systems discussion is particularized for FPGAs. First, current FPGAs are shown to be viable platforms for implementing multiprocessing systems. Then, alternative soft-core architectures are evaluated and the one best suited to this project is chosen. Some FPGA-based multiprocessing system implementations are presented together with a brief architectural background discussion. Finally, communication architectures are introduced and alternative implementations are discussed. Section 3 describes the architecture of the MB-LITE processor. Section 4 presents the multiprocessing system architecture design and details the two communication architectures developed. Section 5 describes the test application for the system and the test methodology adopted. It also presents the synthesis and simulation results and discusses area/performance related issues.
In section 6, the work done is reviewed and some suggestions for further improvement are presented.

II. FPGA-BASED MULTIPROCESSING SYSTEMS

A. FPGA-based Multiprocessors viability

FPGA-based multiprocessing systems are normally used to build prototypes before the definitive implementation in ASIC. However, final products based on FPGAs already exist. ASIC-based systems are generally faster and better optimized, but FPGAs present several advantages that may give them the edge for certain kinds of applications:

1) Flexibility and reconfiguration: The number of soft-cores included is limited only by the FPGA capacity.
2) Faster time to market: The project process does not include IC (Integrated Circuit) manufacturing, which substantially reduces the project duration.
3) Cost-effectiveness: The process is cheaper. It is possible to implement an FPGA project using a small team. Moreover, an error in the system is not final, since the device can be reprogrammed.
4) Scalability: FPGA-based multiprocessor systems can include an increasing number of microprocessors or peripherals if resources are available [1].

B. Processing Element

Choosing the processing element of a multiprocessor system is a crucial step in the project, as it is rather likely that it will limit the system in some way. Several soft-cores were compared, including open source and commercial ones. Table 1 presents the main attributes of the soft-cores studied as potential options. In the end we chose the soft-core MB-LITE because its performance and resource consumption are close to those of commercial soft-cores, such as the MicroBlaze, and it provides the same flexibility as its open source counterparts.

C. Architectural Background

Normally the target application of the FPGA-based multiprocessing system determines the architecture. Three main types of architectures can be defined [1]:

(1) Master-Slave, where one or more processors act as the master processor, controlling the behavior of the slave processors.
(2) Pipeline, which is useful for streaming applications, for example. The architecture is composed of a chain of processors and each processor acts as a pipeline stage.
(3) Net, which refers to multiprocessor systems where there is no hierarchy between processors, all processors being able to communicate with each other when necessary.

Another important issue is the way communications are physically established between the system elements [1]:

(1) Point to point. Processors are directly connected. Since the links are dedicated, the advantage is a large bandwidth. However, it is not area efficient when the system grows.
(2) Shared bus. This traditional approach derives from single processor systems. It is the best known mechanism to connect cores, but it is not effective in terms of performance because the bus can only be used by one processor at a time.
(3) NoC (Network-on-Chip). The basis of this method of interconnecting cores is to apply networking concepts to on-chip systems. When there are many on-chip cores, it is the solution that best combines area and performance. The idea is to use small routers inside the chip to enable communication between all cores of the system with low latencies.
Finally, we can define three methods to exchange information between cores:

(1) Shared memory is the most frequently used method. One of the main reasons for this is the limited amount of on-chip memory in FPGAs.
(2) Message passing is mainly used in distributed memory systems and consists of sending messages with information between cores. Message passing requires a protocol.
(3) Streaming consists of unidirectional communication from the sender to the receiver. It has the advantage of using extremely simple protocols and of avoiding the overhead resulting from the slave answer.

D. Related Work

Multiprocessing systems are generally divided into two categories: heterogeneous systems and homogeneous systems. The first type involves systems where different processors or accelerators are part of the same system. In homogeneous systems all cores are identical.

TABLE I
SOFT-CORE COMPARISON
CPU; Fclk (MHz); Area (LE); Flexibility; Pipeline (Stages); Interface
MicroBlaze; 200 [4]; 1324 [3]; MMU, FPU and MUL [3]; 3, 5; FSL
Nios II/f; 200 [4]; 1800 [3]; MMU*1, FPU*1 and MUL [3]; 6; Avalon
Leon; ; ; MMU, FPU and MUL; 7; AMBA 2.0 AHB, UART
AEMB; ; ; MUL; 3; Wishbone, FSL
OpenFire; ; ; MUL; 3; FSL
MB-Lite; ; ; MUL; 5; Wishbone
Plasma; ; ; MUL; 2, 3 or 4; UART
OpenRISC; ; ; MMU, MUL; 5; Wishbone

Heterogeneous systems are by nature associated with application specific systems. The flexibility to have different cores in one system enables them to adapt better to the target application. These kinds of systems are used across several areas, such as bioinformatics, controllers, communication networks and multimedia. An MPEG-4 encoder is implemented in [5]. The system has a master-slave architecture with support for message passing and shared SDRAM to interconnect NIOS processors. It uses a shared bus to connect the shared instruction memory and the Heterogeneous IP Block Interconnection (HIBI) to connect the shared data memory in a plug and play fashion. It is an easy-to-scale computational system.
Scalability is obtained through spatial parallelization: every image is divided into horizontal slices, and every slice is processed by 4 soft-cores in a master-slave configuration. In [6] a master-slave shared-bus/shared-memory architecture is used for industrial applications. It uses Nios II soft-core processors and an Avalon bus. The authors discuss the advantages of using FPGA-based multiprocessor systems in industrial applications. Industrial production machines have to be highly flexible in order to accommodate changes deriving from the demand for new products.

Although most of today's MPSoC systems are heterogeneous, in order to meet the requirements of the targeted application, homogeneous multiprocessor systems may become a viable alternative in the near future, bringing other benefits such as run-time load balancing and task migration [16]. The homogeneous architectural style is generally used for data-parallel systems. Wireless base stations, in which the same algorithm is applied to several independent data streams, are one example; motion estimation, in which different parts of the image can be treated separately, is another. Normally, homogeneous multiprocessor systems are general purpose.

In [7] a shared memory multiprocessing architecture is presented based on the commercial soft-core MicroBlaze. The authors provide a hardware layer for task synchronization support and a software mechanism to ensure data consistency. It is concluded that cache coherency algorithms are difficult to implement and verify in hardware. Furthermore, the commercial tools available hinder the organization of multiprocessor systems because these tools are primarily oriented to uniprocessor system design. The experimental results show that the designed multiprocessing architecture can reach a performance equivalent to that of a more powerful hard-core processor.

A multiprocessor based on a point to point architecture with message passing communication is presented in [8]. MicroBlaze is used as the system core processor. The authors study the viability of the FSL link for communication between processors and present a comparison between different topologies. It is concluded that the FSL link is good enough to be used in parallel systems and that the star topology offers the most scalability, so it is the one used. In the end it is stated that communication delay limits system performance and that the only factor limiting the number of cores in an MPSoC is the amount of BRAM.

E. Communication Architectures

The communication architecture is the connecting infrastructure between cores that is responsible for enabling data exchange. The simplest way of connecting devices is to establish point to point connections between all of them. However, this approach scales poorly because it needs one dedicated connection from each core to every other core. This is the main reason why communication is normally done through shared connections. The way connections are established defines the system topology, which identifies which paths are available for a packet to reach its destination. For architectures connecting more than one element, three basic functions must be provided by the network: routing, arbitration/flow control and switching.
Routing is defined by the operations needed to compute a valid path from the origin to the destination of a packet. Arbitration, together with flow control, handles the important question of when a path is available for a packet to take. Switching provides or allocates a path for the packet to advance.

Nowadays, communication architectures have to support an increasing number of connected processors. It is therefore very important that an architecture scales well, providing a good performance/area ratio and the flexibility to integrate different cores with ease. In this respect, NoC architectures appear as a promising solution. NoCs are highly modular and potentially offer a higher bandwidth, depending on the adopted topology [9]. In a NoC, a module can be a processor, a DSP (Digital Signal Processor), a memory, an input/output device, etc. There are several articles exploring the implementation of NoC architectures [10, 11]. Extremely configurable and flexible systems are presented in [12] and [13]. For example, Heracles allows the memory architecture, the topology and the routing algorithm to be chosen, among other parameters. These systems are important because they not only run applications, but can also be used as a platform for the research of new architectures and algorithms.

Some articles compare NoCs with other architectures. In [14], the authors compare NoC based architectures with a mesh topology against shared bus/memory architectures. It is stated that, for parallel applications, an efficient NoC implementation on an FPGA can increase communication speed by up to seven times relative to the shared bus architecture, although these results depend on the data dimension and on the number of processors connected to the network. Finally, in [15], a methodology to design NoCs is presented. Clearly, a NoC based design will not always be the preferred solution for all kinds of applications.
It is expected that NoC based designs will provide good solutions for flexible products that should be reconfigurable and programmable; for designs which are the basis for several product variants; for applications with a heterogeneous task mix; for applications with stringent time to market requirements; and for products where reuse at the block level and at the function and feature level is considered valuable [15].

III. PROCESSOR MB-LITE

The MB-LITE is a 32 bit RISC soft-core processor based on the MIPS organization, with the same classic five-stage pipeline architecture, IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory) and WB (Write-back), and compatible with the MicroBlaze ISA (Instruction Set Architecture). The IF stage feeds the pipeline with the required instruction and stores the current PC (Program Counter) [16]. The ID stage decodes the instructions into control signals which travel along with the instruction. The EX stage determines the ALU operands and the operation which needs to be executed. In addition to all the basic functions like shifts, additions and logical operations, the ALU can optionally include a multiplier and a barrel shifter. The MEM stage controls the interaction with data memory, and the register bank is written in the WB stage. The instructions in MB-LITE have the same latency specified for the MicroBlaze architecture. Most instructions, except for branches, have a latency of one cycle. The architecture also provides a single interrupt, like the MicroBlaze implementation.

MB-LITE implements distributed control in order to eliminate the need for a centralized and complex pipeline controller. All dependencies, like stalls, hazards and forwards, are solved locally. These dependencies can generate conflicts which are resolved using different techniques. Data conflicts are resolved by forwarding so that stalls are reduced to a minimum.
The structural hazard which occurs when the same register is read and written concurrently is also solved using operand forwarding. When the result of a load instruction is immediately used, it is possible to forward the memory result either using stalls or additional logic. This is done by configuring the processor through VHDL generics. Finally, control hazards are solved using a pipeline flush. Figure 1 shows the MB-LITE structure with the inputs and outputs in each stage.

To simplify connecting components like memories, coprocessors, bridges and adapters, the MB-LITE authors provide an easily modifiable address decoder, which is used in our project design. The address decoder decodes the incoming addresses based on a generic memory map and forwards the control, address and data signals to the proper output.

IV. MULTIPROCESSING ARCHITECTURE

The multiprocessing system designed is a homogeneous system with distributed memory and communication through streaming. Various configurations of the same system were designed to study the best architectural options and to evaluate how the system scales with an increasing number of processing elements. The system is basically formed by two or more cores and a communication infrastructure which interconnects the cores. As connecting two processing elements is very simple, we opted for a point to point connection between the cores in the two processor architecture. For the larger multiprocessors, two main communication architectures have been designed. The first is a shared-media based communication architecture with a crossbar topology. The second is a NoC based communication architecture with a 2D-mesh topology. As a base for the system, we designed a block which provides the processor with the means to communicate with other processors.

A. Base Block

The base block is composed of the following elements: the MB-LITE processor, the Data Instruction Memory (DIMEM), the Mailbox, the address decoder and, finally, the received data controller. Figure 2 shows how these elements interconnect. The MB-LITE is the core processor and therefore the central module of the base block. All the components are memory mapped, which means they can be accessed by the core through load and store instructions. This communication relies on the master/slave paradigm. The MB-LITE is the master and the other elements are the slaves.
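The memory mapped access described for the base block can be illustrated with a small behavioural model of an address decoder. This is only a sketch: the slave names follow the base block elements, but the address ranges, the `MEMORY_MAP` table and the `decode` function are hypothetical and are not taken from the MB-LITE sources.

```python
# Behavioural sketch of a memory-mapped address decoder.
# The address ranges below are illustrative assumptions, not the real map.
MEMORY_MAP = [
    # (slave name, first byte address, one past the last byte address)
    ("DIMEM", 0x0000, 0x1000),         # instruction and data memory
    ("MAILBOX", 0x1000, 0x1200),       # data received from other cores
    ("RX_COUNTER", 0x1200, 0x1204),    # received data controller
    ("NETWORK_PORT", 0x1204, 0x1208),  # port used to send data to the network
]


def decode(address):
    """Return (slave name, offset inside that slave) for a byte address."""
    for name, base, limit in MEMORY_MAP:
        if base <= address < limit:
            return name, address - base
    raise ValueError(f"address {address:#x} is not mapped to any slave")


if __name__ == "__main__":
    for addr in (0x0004, 0x1010, 0x1200, 0x1204):
        print(hex(addr), decode(addr))
```

In this model a load or store issued by the master selects exactly one slave, which mirrors the master/slave relation between the MB-LITE and the other elements of the block.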
The address decoder is responsible for routing data and control signals between the MB-LITE and the other components. The DIMEM module represented in figure 2 is an instruction and data memory which can only be accessed by the MB-LITE processor. It is a unified memory, as in a Von Neumann architecture. The Mailbox is a small memory where the data received from the other cores is stored. It can be read by the processor and written by the network. The received data controller is responsible for alerting the processor of data arrival. This module is basically a counter which is incremented each time a block of memory is written in the Mailbox. This module can be reset by the processor. Finally, a port is mapped in memory to enable the processor to send data to the network.

Fig. 1. MB-LITE structure and interface signals [4]

The first decision made was to design a distributed memory system. This way, the traffic imposed on the network is reduced, as each processor communicates directly with its own memory. The communication is done by data streaming. This method avoids, on the one hand, the memory conflicts originating from a shared memory approach and, on the other hand, the latency associated with a message passing approach (e.g. blocking the master processor while it waits for the slave response). Each processor has a Mailbox where it receives data from other processors. A processor can send data through the network to another processor by specifying the destination and the Mailbox address where the data is to be written. A processor can verify whether data has been completely received by checking the associated counter. As said before, two communication architectures have been designed.

B. Switching Crossbar

The crossbar switch network was chosen as the first architecture to design because it is simple to implement and enables simultaneous communication between the cores. Furthermore, for a low number of processors, the area overhead is small, which favors FPGA mapping.
As can be seen in figure 3, the network is composed of two types of blocks: a router and a network adapter (NA). The router is responsible for routing the packets to their destination. The network adapter controls the Mailbox access and warns the processor when data is ready. As the network is composed of only one node, the path taken by a packet is fully determined by its destination. In the event of two packets requesting the same destination, the router chooses one based on an arbitration algorithm. We chose the round robin algorithm as it is very simple and easy to implement, which means smaller and faster components. The round robin algorithm follows an established order, giving access alternately to a packet from a different source.

Fig. 2. Base Block Structure

To allocate a path for the packets, a circuit switching approach was adopted, mainly because the target applications for the designed multiprocessor system involve matrix computations, which benefit from this approach. Sending data from one processor to another can be divided in two steps. In the first step a header is sent. This packet has the necessary information to reserve a path in the router for the following packets. It also carries the base address and the data block size needed to write all the following data in the Mailbox. Then, the data which belongs to that header is sent in sequence and stored in that order in the Mailbox.

Broadcast is also supported in this architecture. To implement broadcast, an extra bit was added to the packets to identify a broadcast request. This request is then identified in the arbiter and has priority over all the other requests in all arbiters.

C. NoC

The NoC architecture is composed of several nodes interconnected according to a certain topology. For the NoC created, a mesh topology was chosen because it is simpler to implement and routing in a 2D architecture is easier, resulting in potentially smaller and faster routers. Figure 4 shows the NoC with eight cores (eight cores + standard output controller). Each node in the network is formed by a network adapter and a router. The network adapter provides an interface between the processor and the network. It is responsible for forming the packets and for retrieving the correct signals from the arriving packets to write in the Mailbox. The router is responsible for relaying the packets through the network until they reach their destination. The path a packet takes to reach its destination is defined at the source of the packet, more precisely in the network adapter (source routing).
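Source routing can be illustrated by building, for one source node, a table that lists the output port to take at every hop. The sketch below assumes a 3 × 3 mesh with nodes numbered row by row and dimension-ordered (XY) routing; the paper does not state which routing algorithm fills its tables, so XY routing, the port names and the functions `xy_route` and `build_routing_table` are illustrative assumptions.

```python
# Source-routing sketch for a 2D mesh. Assumptions (not from the paper):
# 3 x 3 mesh, nodes numbered row by row from the top-left corner,
# XY routing (move along X first, then along Y).
WIDTH, HEIGHT = 3, 3


def xy_route(src, dst):
    """Return the list of output ports a packet takes from src to dst.

    Ports are 'E', 'W', 'N', 'S' for the mesh links and 'L' for the final
    delivery to the local network adapter.
    """
    sx, sy = src % WIDTH, src // WIDTH
    dx, dy = dst % WIDTH, dst // WIDTH
    hops = ["E" if dx > sx else "W"] * abs(dx - sx)
    hops += ["S" if dy > sy else "N"] * abs(dy - sy)
    return hops + ["L"]


def build_routing_table(src):
    """Routing table held by the network adapter of node src."""
    return {dst: xy_route(src, dst)
            for dst in range(WIDTH * HEIGHT) if dst != src}


def worst_case_links(src):
    """Largest number of mesh links crossed from src to any other node."""
    return max(len(route) - 1 for route in build_routing_table(src).values())


if __name__ == "__main__":
    print(build_routing_table(4))
    print(worst_case_links(4), worst_case_links(0))
```

Under these assumptions the central node (node 4) reaches every other node in at most two links, while a corner node needs up to four.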
The network adapter contains a routing table with the path defined for each possible destination. Packet arbitration is again based on the round robin algorithm. As for the switching function, we adopted the circuit switching approach again, for the same reasons explained before. However, now there are more nodes, so circuit switching is not done in the same way. The communication across the network is divided in three steps. First, the header is sent to reserve the resources necessary for the following packets to arrive at the correct destination. Then, data is sent sequentially and written in that order in the Mailbox. The last packet has a different type, which is identified by the circuit and frees the reserved resources. The path resources across the crossbar, when reserved, stay blocked for packets from any source other than that of the header which reserved the path.

Fig. 3. Basic Crossbar Switch based multiprocessor architecture.

V. RESULTS

A. Application

The application chosen to test the multiprocessing system was matrix multiplication. Matrix operations, like matrix multiplication, are very common and appear in almost all scientific research areas, such as graph theory, numeric algorithms, signal processing and digital control [17]. For that reason, matrix multiplication is general enough to be a good test for the designed system. Given two matrices A and B, with dimensions n × m and m × l respectively, the product matrix C with dimension n × l can be defined in the following way:

$$C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}, \quad 1 \le i \le n, \; 1 \le j \le l \tag{1}$$

The matrix multiplication requires m · n · l multiplication operations and n · l · (m-1) addition operations. The simplest algorithm was selected to test the system, as the more sophisticated ones proved too complex and time consuming to adopt. The following assumptions were made in the algorithm development. Matrices A and B are always square with dimension n × n, in which n is always a power of 2.
The number of processors, p, is also always a power of 2. Finally, only the root processor (processor 0) has access to matrices A and B. Let n = q · p, where q ≥ 1 is an integer. Matrix A is partitioned into p regions, each containing q rows, and each region is assigned to one processor. Matrix B is available to all the processors. The root processor acts as the host processor, responsible for distributing all the necessary data to the other processors and waiting for their results. The distribution and collection of data depend on the implemented architecture. The generalization of the algorithm for p processors and matrices of dimension n is simple. The original matrices A and B are divided into sub-matrices of dimension p × n and n × p, respectively. The steps previously described for p = n are applied to these sub-matrices, and a p × p block is obtained which is part of matrix C. This process is repeated n/p times until the full matrix C is obtained.

Fig. 4. NoC based multiprocessor architecture.

B. Test Methodology

The systems compared were all synthesized using the XST tool (Xilinx ISE 13.1) targeting a Virtex-6 device (XC6VLX130t-3ff1156). The MB-LITE cores were used without any optional functional unit or interrupt support. Each processor has 4 Kbytes of local memory and 512 bytes of Mailbox. To verify that the system works properly, a C program was written and compiled with MB-GCC, the MicroBlaze compiler (EDK 10.1). The program multiplies two 8 × 8 integer matrices. The program is loaded into the data instruction memory and a test bench is run, using ISim (Xilinx ISE 13.1), to observe the results. The input data is read from a RAM memory to which only the root processor has access. In the NoC architecture the root processor is located in the central node, as can be seen in figure 4. The central node has a globally shorter distance to all the other nodes, which implies faster communication. The results were verified using an I/O interface, which reads data from a bus and writes them to the ISim console. The results were checked by inspection.

Performance was measured by comparing execution times and by analyzing the speedup and efficiency reached by the system. Amdahl's law relates the sequential and parallel parts of an algorithm and gives the maximum speedup possible for a target application. Equation (2) gives the total execution time. The speedup and efficiency can be computed by equations (3) and (4), respectively. N refers to the degree of parallelism (number of processors). S is the time taken to execute the sequential part of the code and P the time taken by a single processor system to execute the parallel part of the code.

$$T(N) = S + \frac{P}{N} \tag{2}$$

$$Speedup(N) = \frac{T(1)}{T(N)} = \frac{S + P}{S + P/N} \tag{3}$$

$$Efficiency(N) = \frac{Speedup(N)}{N} \tag{4}$$

These metrics allow evaluating the gain from parallelizing a certain application and running it on a multiprocessing system.

C. Crossbar Switch Scalability

Looking at table 2, it can be seen that the area occupied by the systems grows almost linearly with the number of cores. However, the router of the eight core system is more than three times the size of the router of the four core system (see the number of LUTs in table 3). The main reason for this discrepancy is the crossbar, whose occupied area grows quadratically with the number of ports. The increase in the number of processors does not significantly affect the maximum frequency achievable by the systems, except for the eight core system. This is because the processor is, in general, the limiting factor of the system, while the network allows higher clock frequencies. However, in the eight core system, the maximum system frequency suffers a significant drop. Looking at table 3, it can be seen that this is due to the router having a lower maximum frequency. The router suffers from the increasing complexity of the arbiter, due to the larger number of possible destinations to choose from. Finally, the network adapter is independent of the number of processors in the system, as can be seen in table 3.

D. Application Acceleration

The system speedup achieved and the theoretical ideal speedup for the target algorithm can be observed in the graph of figure 5. The theoretical ideal speedup represents the maximum speedup the system can achieve given the parallel algorithm used.
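The theoretical ideal speedup and the corresponding efficiency can be computed directly from S, P and N. The sketch below assumes the usual Amdahl form, in which the parallel part P is divided evenly among the N processors and communication costs are ignored; the function names and the values of S and P in the example are hypothetical and are not measurements from this work.

```python
# Amdahl-style ideal speedup and efficiency.
# Assumption: T(N) = S + P / N, with no communication overhead.

def execution_time(s, p, n):
    """Total time with n processors: sequential part plus shared parallel part."""
    return s + p / n


def speedup(s, p, n):
    """Ratio between the single processor time and the n processor time."""
    return execution_time(s, p, 1) / execution_time(s, p, n)


def efficiency(s, p, n):
    """Speedup per processor; 1.0 means no processor is ever idle."""
    return speedup(s, p, n) / n


if __name__ == "__main__":
    s, p = 10.0, 90.0  # hypothetical times, in arbitrary units
    for n in (1, 2, 4, 8):
        print(n, round(speedup(s, p, n), 3), round(efficiency(s, p, n), 3))
```

With a non-zero sequential part the efficiency falls as N grows, even before any communication delay is taken into account.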
The broadcast optimization plays a big role in improving the speedup results, as can be seen in figure 5. The system without broadcast shows small improvements as the number of processors increases, and the speedup even drops when the eight core system is used. The reason is that more processors require more matrices to be sent, and therefore the communication delay grows relative to the computation delay. With broadcast, matrix B can be sent to all the processors simultaneously, which saves much communication time. Although the speedup increases with the number of cores, as expected, even with broadcast support the system is far from the ideal speedup, except for the two core system (which always has one latency cycle to send data to the other core). These results are due to the communication overhead. The communication latency is not uniform. Ideally, a packet would take one cycle to cross the network, from being read from the FIFO until the data it carries is stored in the Mailbox (we are only taking into account the data packets, not the header ones). However, this depends on two main factors: whether there is more than one element in the FIFO, and the arbiter state. If more than one element is in the FIFO, a packet is read immediately; otherwise it takes two clock cycles (this is related to the FWFT function mode of the FIFO [18]). Depending on the arbiter state, a packet can be chosen immediately or be forced to wait up to N-2 clock cycles, N being the number of processors in the system. The efficiency, represented in figure 6, indicates whether the processors of the system spend too much time idle. A small efficiency, like the one observed for the eight core system, implies there is barely any gain from increasing the number of processors.

TABLE 2
SYNTHESIS RESULTS FOR SYSTEM WITH ONE, TWO, FOUR AND EIGHT CORES
Columns: 1, 2, 4 and 8 cores, each with a % column. Rows: LUTS, BRAMS, F Max (MHz), T Min (ns).

TABLE 3
SYNTHESIS RESULTS FOR FOUR CORE ROUTER AND EIGHT CORE ROUTER
Columns: 4CoresB (Router, NA), 8CoresB (Router, NA). Rows: LUTS, BRAMS, F Max (MHz).
LUTS: (136/74/144) under 4CoresB, (544/170/288) under 8CoresB.

E. Crossbar versus NoC

In this section we use the Crossbar Switch based system without broadcast, so that a fair comparison can be made against the NoC based system. As both systems have eight cores, the main difference lies in the communication architecture. Table 4 presents the synthesis results for both systems. Table 5 shows the individual results for the router modules. Table 4 shows that the area used by the NoC is much larger than that of the Crossbar Switch. This is mainly because the NoC uses nine router modules, with five entries each, compared to only one router module, with eight entries, for the Crossbar Switch system.
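The difference in growth between the two interconnects can be illustrated with a very rough proxy that counts only the crosspoints of the switching fabric: a single crossbar router with k ports needs on the order of k · k crosspoints, while a mesh uses one fixed five-port router per node. The functions below are hypothetical helpers, and the proxy deliberately ignores arbiters, FIFOs, network adapters and links, so it is not a substitute for the synthesis results.

```python
# Crude area proxy: number of crosspoints in the switching fabric only.
# Assumption: cost of a router ~ (number of ports) squared.

def crossbar_crosspoints(ports):
    """One central router connecting every port to every other port."""
    return ports * ports


def mesh_crosspoints(nodes, ports_per_router=5):
    """One fixed-size router per node, as in a 2D mesh."""
    return nodes * ports_per_router * ports_per_router


if __name__ == "__main__":
    # Eight-entry crossbar versus nine five-entry mesh routers
    print(crossbar_crosspoints(8), mesh_crosspoints(9))
    # Doubling the system size
    print(crossbar_crosspoints(16), mesh_crosspoints(18))
```

Under this proxy the mesh is the larger fabric at eight cores, but it grows linearly with the number of nodes, whereas the crossbar quadruples every time the number of ports doubles.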
Considering how both systems grow with the number of cores, however, we conclude that the NoC scales linearly (because each router always has the same size, which means the size of an individual node is fixed), while the Crossbar Switch grows faster, as can be seen in table 2 (because, as table 3 shows, the router grows almost quadratically with the number of cores). This is the main reason why the NoC scales better than the Crossbar Switch. The higher maximum clock frequency of the NoC, which stands out mainly when comparing the maximum frequencies of the two routers, is due to its less complex arbiter (which shortens the critical path).

Fig. 5. Speedup for an 8×8 integer matrix multiplication parallel program (x-axis: number of processors; series: Without Broadcast, With Broadcast, Theoretical Ideal).

Fig. 6. Efficiency for an 8×8 integer matrix multiplication parallel program (x-axis: number of processors; series: Without Broadcast, With Broadcast, Theoretical Ideal).

The performance of the two systems was compared by analyzing the execution time of the parallel portion of the program, because this corresponds to the parallelized matrix multiplication. The two systems present equivalent performance, with a slight advantage for the NoC system. This is probably due to the higher clock frequency of the NoC, even though the difference is almost negligible (see table 6).

F. FPGA Implementation
The crossbar switch based multiprocessing system with four cores was implemented on a Spartan-3E device (XC3S500E-5FG320). To verify the results, the system sends the data to the PC through the RS232 interface available on the development board. The circuit can run at a maximum clock frequency of 57 MHz. The implementation used a 50 MHz clock. The system consumes 7717 LUTs and 20 BRAMs, a device occupation of 82% and 100%, respectively. The BRAMs are clearly the most critical resource, with some RAM even being mapped to LUTs due to the BRAM shortage.
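The occupation figures above can be checked with a short calculation. This is a minimal sketch that assumes the XC3S500E offers 9,312 LUTs (4,656 slices with two LUTs each) and 20 block RAMs; these capacities come from the Spartan-3E family datasheet as recalled here, not from this paper, and should be verified against the datasheet. The helper names are hypothetical.

```python
XC3S500E_LUTS = 9312   # assumed device capacity, not stated in the paper
XC3S500E_BRAMS = 20    # assumed device capacity, not stated in the paper


def occupation_percent(used: int, available: int) -> float:
    """Share of a device resource in use, as a percentage."""
    return 100.0 * used / available


def period_ns(frequency_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


lut_pct = occupation_percent(7717, XC3S500E_LUTS)
bram_pct = occupation_percent(20, XC3S500E_BRAMS)
print(f"LUTs: {lut_pct:.1f}%  BRAMs: {bram_pct:.1f}%")
print(f"period at 57 MHz: {period_ns(57):.2f} ns, at 50 MHz: {period_ns(50):.2f} ns")
```

With these assumed capacities the LUT occupation comes out just under 83%, in line with the 82% reported, and the BRAMs are fully used. The 50 MHz clock corresponds to a 20 ns period, leaving some margin over the roughly 17.5 ns minimum period implied by 57 MHz.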
The system was tested with an 8×8 matrix multiplication application, and the results shown in HyperTerminal on the PC matched the expected ones, confirming that the system works correctly. The system could probably be optimized for resource utilization, but this was not our main concern when designing it. To multiply bigger matrices, an external memory would probably be necessary.
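To make the test application more concrete, the following sketch emulates one possible way of parallelizing the 8×8 multiplication. It assumes a row-block partition, in which each core receives its own rows of matrix A while the whole of matrix B is the broadcast operand. This partition is an assumption made for illustration and is not necessarily the scheme used in the actual system. The cores are emulated sequentially and all function names are hypothetical.

```python
def matmul(a, b):
    """Reference product of an (r x n) matrix a and an (n x c) matrix b."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_rows(n_rows, n_cores):
    """Contiguous row ranges, one per core (n_rows assumed divisible by n_cores)."""
    block = n_rows // n_cores
    return [(core * block, (core + 1) * block) for core in range(n_cores)]


def parallel_matmul(a, b, n_cores):
    """Emulate the cores one after another and gather their partial results."""
    result = []
    for start, stop in split_rows(len(a), n_cores):
        a_block = a[start:stop]                # rows sent to a single core
        result.extend(matmul(a_block, b))      # b is the operand shared by all cores
    return result


# Integer test matrices, chosen arbitrarily for this example.
A = [[i + j for j in range(8)] for i in range(8)]
B = [[i * j + 1 for j in range(8)] for i in range(8)]

expected = matmul(A, B)
for cores in (1, 2, 4, 8):
    assert parallel_matmul(A, B, cores) == expected
print("partitioned results match the sequential product")
```

Comparing each partitioned result against the sequential product mirrors the check performed on the PC. In this scheme only the rows of A differ between cores, which is why a broadcast of B removes most of the repeated transfers.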

TABLE 4 SYNTHESIS RESULTS FOR CROSSBAR SWITCH AND NOC ARCHITECTURES
Columns: 8Core and NoC, each with device occupation (%). Rows: LUTs, BRAMs, F Max (MHz), T Min (ns).

TABLE 5 SYNTHESIS RESULTS FOR ARCHITECTURES ROUTERS
               8Cores Router         NoC Router
LUTs           1007 (528/168/288)    433 (170/60/180)
BRAMs          4                     3
F Max (MHz)

VI. CONCLUSION

A. Conclusion
The main objective of this work was the design and implementation of a multiprocessor system in an FPGA. The main contributions of this work can be summarized as:
- Design of a homogeneous multiprocessor system with distributed memory and streaming communication.
- Development and comparison of two different communication architectures: a crossbar switch based architecture and a NoC based architecture with mesh topology.
- Evaluation of scalability and performance of the Crossbar Switch based architecture.
- Acceleration of a matrix multiplication test application through mapping on a multiprocessor system with an increasing number of cores.
- Implementation of a four-core multiprocessing demonstration system in a Spartan-3E device.

We concluded that being able to broadcast data is extremely important in parallel applications, because it significantly reduces the communication delay. The evaluation of the two communication architectures concluded that the NoC based architecture scales better as the number of processors in the system increases. This is because its router always has the same size, in contrast to the crossbar switch router, whose complexity increases quadratically with the number of cores supported. The same holds for performance, as the crossbar switch architecture has an arbiter whose complexity increases with the number of processors, generating a greater delay across the network (the arbiter is part of the critical path). The crossbar switch based system has shown a speedup far from the ideal, even with broadcast. This is explained by the dynamic nature of the communication delay.
The two main factors are FIFO read latency and round-robin related latency. Finally, it was concluded that software and hardware must be developed together to achieve the best possible results (in both performance and area). From a software point of view it is also important to better explore the applications' parallelism and conceive new models.

TABLE 6 TIMING RESULTS FOR CROSSBAR SWITCH AND NOC ARCHITECTURES
Columns: 8Core, NoC. Rows: T Min (ns), Nº cycles, Execution Time (us).

B. Future Work
For the future we intend to improve the NoC architecture, which has been shown to be the most promising one. Broadcast support is the first requirement to be fulfilled, as we have seen that it is crucial for better performance. A better algorithm must be developed to parallelize the matrix multiplication as well as other similar applications. It is also very important to make the NoC more generic so that the architecture scales more easily. To improve usability, the system must be revised to facilitate the integration of different modules (possibly through the Wishbone interface, which would make our system compatible with the several IP components that already support it). It would also be interesting to study how our system scales with bigger input matrices.

REFERENCES
[1] GT. Dorta, J. Jiménez, Martín. J, Bidarte, U and Astarloa. A, "Overview of FPGA-Based Multiprocessor Systems," International Conference on Reconfigurable Computing and FPGAs.
[2] Tamar Kranenburg, "Design of a Portable and Customizable Microprocessor for Rapid System Prototyping," Master Thesis, Delft University of Technology.
[3] Soft CPU Cores for FPGA [online]. Available: core.com/library/digital/soft-cpu-cores.
[4] J. G. Tong, I. D. L. Anderson and M. A. S. Khalid, "Soft-Core Processor for Embedded Systems," in Proceedings of the International Conference on Microelectronics (ICM '06), December 2006.
[5] O. Lehtoranta, E. Salminen, A. Kulmala, M. Hännikäinen, and T. D. Hämäläinen, "A parallel MPEG-4 encoder for FPGA based multiprocessor SOC," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '05), August.
[6] Ralf Joost, Ralf Salomon, "Advantages of FPGA-Based Multiprocessor Systems in Industrial Applications," Industrial Electronics Society, IECON, Annual Conference of IEEE.
[7] Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto, "A Design Kit for a Fully Working Shared Memory Multiprocessor on FPGA," Application-Specific Systems, Architectures and Processors, ASAP International Conference on, July.
[8] P. Huerta, J. Castillo, J. I. Mártinez, and V. López, "A microblaze based multiprocessor SoC," WSEAS Transactions on Circuits and Systems, vol. 4, no. 5.
[9] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," Computer, vol. 35, no. 1, January.
[10] Z. Wang and O. Hammami, "External DDR2-constrained NOC-based 24-processors MPSOC design and implementation on single FPGA," in Proceedings of the 3rd International Design and Test Workshop (IDT '08), December.
[11] P. Huerta, J. Castillo, J. I. Martínez, and C. Pedraza, "Exploring FPGA capabilities for building symmetric multiprocessor systems," in Proceedings of the 3rd Southern Conference on Programmable Logic (SPL '07), February.
[12] M. A. Kinsy, M. Pellauer, S. Devadas, "Heracles: Fully Synthesizable Parameterized MIPS-Based Multicore System," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL 2001), September.
[13] E. Salminen, A. Kulmala, and T. D. Hämäläinen, "HIBI-based multiprocessor SoC on FPGA," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), vol. 4, May.
[14] H. C. Freitas, D. M. Colombo, F. L. Kastensmidt, P. O. A. Navaux, "Evaluating Network-on-Chip for Homogeneous Embedded Multiprocessors in FPGAs," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS 2007), May.
[15] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, "A Network on Chip Architecture and Design Methodology," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (2002), April.
[16] T. Kranenburg and R. van Leuken, "MB-LITE: A robust, light-weight soft-core Implementation of the MicroBlaze architecture," in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE '10), April.
[17] A. Ziad, A. Musbah and E. E. Ibrahiem, "Performance Analysis and Evaluation of Parallel Matrix Multiplication Algorithms," World Applied Sciences Journal.
[18] FIFO Generator v4.3 documentation.


More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Session: Configurable Systems. Tailored SoC building using reconfigurable IP blocks

Session: Configurable Systems. Tailored SoC building using reconfigurable IP blocks IP 08 Session: Configurable Systems Tailored SoC building using reconfigurable IP blocks Lodewijk T. Smit, Gerard K. Rauwerda, Jochem H. Rutgers, Maciej Portalski and Reinier Kuipers Recore Systems www.recoresystems.com

More information

A Novel Energy Efficient Source Routing for Mesh NoCs

A Novel Energy Efficient Source Routing for Mesh NoCs 2014 Fourth International Conference on Advances in Computing and Communications A ovel Energy Efficient Source Routing for Mesh ocs Meril Rani John, Reenu James, John Jose, Elizabeth Isaac, Jobin K. Antony

More information

Design of Synchronous NoC Router for System-on-Chip Communication and Implement in FPGA using VHDL

Design of Synchronous NoC Router for System-on-Chip Communication and Implement in FPGA using VHDL Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Design of Embedded Hardware and Firmware

Design of Embedded Hardware and Firmware Design of Embedded Hardware and Firmware Introduction on "System On Programmable Chip" NIOS II Avalon Bus - DMA Andres Upegui Laboratoire de Systèmes Numériques hepia/hes-so Geneva, Switzerland Embedded

More information

ISSN Vol.03, Issue.08, October-2015, Pages:

ISSN Vol.03, Issue.08, October-2015, Pages: ISSN 2322-0929 Vol.03, Issue.08, October-2015, Pages:1284-1288 www.ijvdcs.org An Overview of Advance Microcontroller Bus Architecture Relate on AHB Bridge K. VAMSI KRISHNA 1, K.AMARENDRA PRASAD 2 1 Research

More information

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing Linköping University Post Print epuma: a novel embedded parallel DSP platform for predictable computing Jian Wang, Joar Sohl, Olof Kraigher and Dake Liu N.B.: When citing this work, cite the original article.

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Design of an Efficient FSM for an Implementation of AMBA AHB in SD Host Controller

Design of an Efficient FSM for an Implementation of AMBA AHB in SD Host Controller Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 11, November 2015,

More information

OASIS Network-on-Chip Prototyping on FPGA

OASIS Network-on-Chip Prototyping on FPGA Master thesis of the University of Aizu, Feb. 20, 2012 OASIS Network-on-Chip Prototyping on FPGA m5141120, Kenichi Mori Supervised by Prof. Ben Abdallah Abderazek Adaptive Systems Laboratory, Master of

More information

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Nandini Sultanpure M.Tech (VLSI Design and Embedded System), Dept of Electronics and Communication Engineering, Lingaraj

More information

FPGA Based FIR Filter using Parallel Pipelined Structure

FPGA Based FIR Filter using Parallel Pipelined Structure FPGA Based FIR Filter using Parallel Pipelined Structure Rajesh Mehra, SBL Sachan Electronics & Communication Engineering Department National Institute of Technical Teachers Training & Research Chandigarh,

More information

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing

More information

Politecnico di Milano

Politecnico di Milano Politecnico di Milano Prototyping Pipelined Applications on a Heterogeneous FPGA Multiprocessor Virtual Platform Antonino Tumeo, Marco Branca, Lorenzo Camerini, Marco Ceriani, Gianluca Palermo, Fabrizio

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Universität Dortmund. ARM Architecture

Universität Dortmund. ARM Architecture ARM Architecture The RISC Philosophy Original RISC design (e.g. MIPS) aims for high performance through o reduced number of instruction classes o large general-purpose register set o load-store architecture

More information

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA 1 HESHAM ALOBAISI, 2 SAIM MOHAMMED, 3 MOHAMMAD AWEDH 1,2,3 Department of Electrical and Computer Engineering, King Abdulaziz University

More information

SoC Design Lecture 11: SoC Bus Architectures. Shaahin Hessabi Department of Computer Engineering Sharif University of Technology

SoC Design Lecture 11: SoC Bus Architectures. Shaahin Hessabi Department of Computer Engineering Sharif University of Technology SoC Design Lecture 11: SoC Bus Architectures Shaahin Hessabi Department of Computer Engineering Sharif University of Technology On-Chip bus topologies Shared bus: Several masters and slaves connected to

More information

Embedded Systems. "System On Programmable Chip" NIOS II Avalon Bus. René Beuchat. Laboratoire d'architecture des Processeurs.

Embedded Systems. System On Programmable Chip NIOS II Avalon Bus. René Beuchat. Laboratoire d'architecture des Processeurs. Embedded Systems "System On Programmable Chip" NIOS II Avalon Bus René Beuchat Laboratoire d'architecture des Processeurs rene.beuchat@epfl.ch 3 Embedded system on Altera FPGA Goal : To understand the

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi. Lecture - 10 System on Chip (SOC)

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi. Lecture - 10 System on Chip (SOC) Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 10 System on Chip (SOC) In the last class, we had discussed digital signal processors.

More information

Chapter 2 The AMBA SOC Platform

Chapter 2 The AMBA SOC Platform Chapter 2 The AMBA SOC Platform SoCs contain numerous IPs that provide varying functionalities. The interconnection of IPs is non-trivial because different SoCs may contain the same set of IPs but have

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Bradley F. Dutton, Graduate Student Member, IEEE, and Charles E. Stroud, Fellow, IEEE Dept. of Electrical and Computer Engineering

More information

Controller IP for a Low Cost FPGA Based USB Device Core

Controller IP for a Low Cost FPGA Based USB Device Core National Conference on Emerging Trends in VLSI, Embedded and Communication Systems-2013 17 Controller IP for a Low Cost FPGA Based USB Device Core N.V. Indrasena and Anitta Thomas Abstract--- In this paper

More information

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on on-chip Donghyun Kim, Kangmin Lee, Se-joong Lee and Hoi-Jun Yoo Semiconductor System Laboratory, Dept. of EECS, Korea Advanced

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses Today Comments about assignment 3-43 Comments about assignment 3 ASICs and Programmable logic Others courses octor Per should show up in the end of the lecture Mealy machines can not be coded in a single

More information

2. System Interconnect Fabric for Memory-Mapped Interfaces

2. System Interconnect Fabric for Memory-Mapped Interfaces 2. System Interconnect Fabric for Memory-Mapped Interfaces QII54003-8.1.0 Introduction The system interconnect fabric for memory-mapped interfaces is a high-bandwidth interconnect structure for connecting

More information

Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections

Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections A.SAI KUMAR MLR Group of Institutions Dundigal,INDIA B.S.PRIYANKA KUMARI CMR IT Medchal,INDIA Abstract Multiple

More information

DESIGN AND IMPLEMENTATION ARCHITECTURE FOR RELIABLE ROUTER RKT SWITCH IN NOC

DESIGN AND IMPLEMENTATION ARCHITECTURE FOR RELIABLE ROUTER RKT SWITCH IN NOC International Journal of Engineering and Manufacturing Science. ISSN 2249-3115 Volume 8, Number 1 (2018) pp. 65-76 Research India Publications http://www.ripublication.com DESIGN AND IMPLEMENTATION ARCHITECTURE

More information