Comparing Interconnection Models in an On-Chip Reconfigurable Multiprocessor

Rodrigo Soares, Sérgio Queiroz de Medeiros, Ivan Saraiva Silva, David Déharbe
Universidade Federal do Rio Grande do Norte
Departamento de Informática e Matemática Aplicada
[rodrigo, sergio]@consiste.dimap.ufrn.br, [ivan, david]@dimap.ufrn.br

Abstract

The increasing complexity of present SoCs demands new, scalable, reusable, parallel interconnection models for their cores. This paper presents a comparison study of interconnection models for an on-chip reconfigurable multiprocessor, the X4CP32. Three models were proposed: a bus system, a NoC using FIFO buffering, and a NoC using SAFC buffering. All the models were described in SystemC and simulated. Results show a great difference between the performance of the NoCs and that of the bus system.

1. Introduction

Reconfigurable Architectures (RAs) have grown in importance as a research subject in the past few years. Currently, most microelectronics conferences have a special session for RAs, which is evidence of their popularity. RAs are composed of two major elements: reconfigurable units and interconnection. Many studies have focused on the hierarchy of RAs and on many aspects of the reconfigurable units, but little has been done to improve the other, equally vital, element: the interconnection. Even the best RA model will not perform satisfactorily unless it solves the interconnection bottleneck.

In some RAs the interconnection is a single bus linking all reconfigurable units. That is a simple and economical solution, but not a good one. A single bus, despite its low silicon cost, allows only one communication at a time; it is not scalable, which heavily limits the number of reconfigurable units; its limited bandwidth is shared among all the units attached to it, which caps the RA's performance; and it depends on sequential communication.
To overcome some of the problems of the single bus, many RAs use a hierarchical bus architecture, such as ARM AMBA [1], but the bandwidth is still limited. As an alternative to the traditional bus system, we propose the use of a Network-on-Chip (NoC) to interconnect an RA. The NoC has many known advantages, such as scalability, reusability, parallel communication, asynchronous communication, pipelined communication channels and a high degree of customization. Most, if not all, of these advantages are desirable in an interconnection system, especially in an RA. The major problem with NoCs is still their high area cost, which cannot be afforded in some cases.

This paper is organized as follows: Section 2 presents the state of the art in interconnection; Section 3 presents the X4CP32 architecture; Section 4 discusses different interconnection schemes for the X4CP32; Section 5 gives the results of the SystemC simulations; and Section 6 presents the conclusion.

2. State of the Art

This section presents some of the interconnection systems currently found in the literature.

2.1. Communication in Reconfigurable Architectures

FPGAs are still the best-known example of RAs, to the point that some people think they are the only kind of RA in existence. These architectures mostly use global and local interconnections. A perfect example is Altera's FLEX 10K family [2]. The communication system in FPGAs is based on crossbar structures that form a static connection until the next reconfiguration. Most fine-grained RAs make use of point-to-point connections between neighboring PEs (Processing Elements) [3, 4, 5].

Current coarse-grained architectures have communication demands as high as those of a multiprocessor environment. An example of a coarse-grained RA is the PACT XPP. It uses vertical and horizontal buses for transmitting data packets. For the transmission of control packets, the XPP has a single shared bus for each PAC (a coarse structure containing a grid of RUs).

2.2. Networks-on-Chip

In the near future a single chip is expected to contain billions of transistors [6], which probably means hundreds of cores in a single chip. However, as far as we know, none of the current RAs use NoCs or any similar device for communication.

The Network-on-Chip [7, 8, 9] is a network communication structure composed of many interconnected devices called routers. In direct-topology networks, each router is associated with a core, or with an RU in this case, and forwards packets sent from that core to other routers until the packet reaches its destination. The router also delivers packets addressed to its associated core.

There are 3 major reasons why NoCs should be used in a multi-core chip. First, performance. Although a dedicated channel structure has better performance, it is virtually unviable in a system with hundreds of cores. The high performance of a NoC is due to its parallel and pipelined communication. Second, scalability. Unlike other communication structures, a NoC can be easily adapted to any number of cores.
In addition, its performance is not affected. Third, reuse. The same NoC can be used in a wide range of architectures without adaptation, as long as the cores and the NoC share the same communication protocol. That saves development and testing time, and allows faster design space exploration.

However, the use of NoCs is not without drawbacks. In a direct topology, where every core is connected to a router, the area overhead of a NoC interconnection system is the overhead of a single router multiplied by the number of cores. The pipelined communication of NoCs adds latency to messages, which may be too high depending on the distance between the communicating cores. Also, a poor choice of routing, buffering and switching mechanisms can cause anything from performance loss to a total stall due to deadlock.

In spite of these drawbacks, NoCs emerge as the only feasible communication model for large systems. A model of full interconnection between a core and its neighbors not only adds a large port overhead to the cores, but is also inefficient when messages are sent to distant cores. A bus provides latency-free communication, but its scalability is limited to only a few dozen cores [10]. A bus system needs bridges between buses, which add area overhead, latency and performance loss to communication between distant cores.

3. An On-Chip Reconfigurable Multiprocessor Architecture

The Reconfigurable Architecture chosen for the combination with a Network-on-Chip is the X4CP32 [11, 12, 13, 14]. The X4CP32 architecture is composed of two basic elements: the cell and the RPU (Reconfigurable and Programming Unit). Each RPU contains 4 cells, which gives it the autonomy to keep its own program flow. The processor is a grid of RPUs, distributed in rows and columns, as seen in Figure 1. Every RPU is connected to all other RPUs in the same row and column through a bus. This bus also connects the RPUs to the external memory.
Figure 1 RPU connections
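As a small illustration of the row/column bus topology, the following Python sketch lists the RPUs that a given RPU reaches directly over its row bus and its column bus. The (row, column) coordinates, the function name bus_peers and the default 3x3 grid size are illustrative assumptions and not part of the X4CP32 specification.

```python
def bus_peers(row, col, rows=3, cols=3):
    """Return the RPUs that share a row bus or a column bus with RPU (row, col).

    Coordinates and grid size are hypothetical; the source only states that
    every RPU is bus-connected to the RPUs in its own row and column.
    """
    same_row = [(row, c) for c in range(cols) if c != col]
    same_col = [(r, col) for r in range(rows) if r != row]
    return same_row + same_col


if __name__ == "__main__":
    # Peers of the central RPU in a 3x3 grid.
    print(bus_peers(1, 1))
```

In an R x C grid each RPU therefore has (C - 1) + (R - 1) direct bus peers, and all of them compete for the same two buses.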

3.1. The RPU

The RPU is the highest-level entity in the X4CP32. It is responsible for its own reconfiguration and for parallel processing. The RPU consists of 4 cells, an internal memory (G-MEM), an Instruction Memory (I-MEM), a Communication Buffer, an internal bus, a Bus Arbiter and control logic. The I-MEM is located inside the top-left Cell. Each RPU is connected to its neighbors through its cells' neighbor connections. It is also connected to every other RPU in the same row and column through the Row/Column buses. These connections can be seen in Figure 1. Figure 2 shows the RPU architecture.

The G-MEM is the RPU memory block, with 64k positions of 32 bits each. All components of the RPU can read data from and write data to the G-MEM. The Bus Arbiter is the access controller of the Internal Bus: it manages the Internal Bus access protocol and priorities. The Communication Buffer is the interface for inter-RPU communication (communication with other RPUs and the main memory).

Figure 2 RPU internal view

3.2. Execution Modes

The Execution Mode is the way the X4CP32 achieves its hybridism. The Execution Mode defines the behavior of the RPU. There are two Execution Modes: the Programming Execution Mode and the Reconfigurable Execution Mode.

In Programming Execution Mode, the RPU acts as a parallel processor. The top-left Cell assumes the Processor Operation Mode. The other Cells assume the Dynamic ALU Operation Mode, to execute the instructions sent from the top-left Cell. Thus, the RPU has 3 parallel processing units controlled by the fourth. A compiler can implement a methodology to find the best way to distribute the instructions among the Dynamic ALU Cells, exploiting the architecture to its fullest.

In the Reconfigurable Execution Mode, the RPU sets all of its Cells to Static ALU Operation Mode and configures each Cell's inputs, operations, outputs and routing, building a systolic data path, just like the usual reconfigurable architectures. When the inputs are ready, the Cell operates on them and writes the result to the output port, so a neighbor Cell can read it.

3.3. The Cell

The cell is a microprocessor with an instruction set common to most general-purpose microprocessors, plus a few configuration-specific instructions. It is connected to 5 other cells through half-duplex buses. These connections are called ports. The cell also has an ALU, capable of performing basic logic and arithmetic operations on fixed and floating-point numbers, a 32-position register bank (C-MEM) and a 64-position stack, the C-LIFO. Figure 3 shows an example of the cell data path. The synthesis results of the cell are presented in Table 1.

Figure 3 Cell data path

Table 1 Synthesis Results of the Cell

Unit                 Clock (MHz)   Area (# LC)
ALU                  70            2,307
Cells                50            3,739
PC Cell              50            4,256
PC Cell Controller   59            1,404
Cells Controller     …             …
RPU                  49            15,…

3.4. Operation Modes

To implement the hybrid runtime-reconfigurable/parallel paradigm, the Cell was designed to match the specifications of both paradigms. It can operate in three different modes, which are the key to the desired hybridism.

In Processor Operation Mode, the Cell is the program flow control unit. The Cell fetches instructions from the I-MEM and checks which Cell each instruction is for. If the instruction is for another Cell, it sends the instruction through a predefined port to the destination Cell. Otherwise, it executes the instruction itself.

In Dynamic ALU Operation Mode, the Cell receives instructions through a predefined port, which depends on its position in the RPU. It executes the instruction and waits for the next one.

In Static ALU Operation Mode, the Cell is set to work as a data-driven ALU. The operation, inputs (ports, ACC or a special memory position) and output (a port) are set, and the Cell operates whenever valid data is available, until it receives another reconfiguration instruction.

4. X4CP32's Interconnection

4.1. Bus System (R2R)

The RPUs in the same row and in the same column are all connected through buses. They communicate using a simple, peer-to-peer oriented protocol, the R2R (RPU-to-RPU). A SystemC [15] implementation of the protocol and the results obtained can be seen in [16].

The fundamental principle of the R2R is to ensure that an RPU receives data in the same order in which they were originally generated. The need is easy to understand: since this is a parallel/reconfigurable environment, every time an RPU needs a datum, it has to make sure that the data it receives is exactly the data it needs for correct execution. That could be achieved in a number of ways, such as labeling every datum, but that would take more bits to store and transmit. The R2R Protocol eliminates this necessity.

Running the R2R Protocol does not require extra processing from the RPU. The Communication Buffer runs the whole protocol and works independently of the RPU control. However, with the increasing popularity of SoCs (Systems-on-Chip) [17, 18], and possibly Reconfigurable SoCs, it is necessary to evaluate the use of NoCs in coarse-grained or polymorphous reconfigurable architectures.

4.2. NoCX4

The NoCX4 [19] is a Network-on-Chip especially designed to be integrated with the X4CP32. Like several NoC models, it consists of a grid of routers. Each router has 5 ports: north, south, east and west, which link the router to four neighboring routers; and the node port, which connects the router to a module, in this case the RPU. The NoCX4 uses XY routing, for its simplicity and deadlock prevention, despite its drawbacks. The packets are composed of a word-size (32-bit) header and from 1 to 8 word-size data items, as seen in Figure 4.

Figure 4 NoCX4 Packet

The NoCX4's Router is simple, with only a few internal modules. It has 5 special FIFOs, which perform all the flow control and routing functions. Each FIFO has a … bit positions buffer and a 5-state control logic. The Crossbar allows 5 different data items to be routed at the same time, because of its internal connections. Every output port has an associated Arbiter. The Arbiters use a simple Round-Robin scheduling policy, with no priority. An arbiter evaluates how many available positions the FIFO has and the data size of all requesting FIFOs before it grants the datapath. The Router architecture is shown in Figure 5. The Arbiters are not shown in the figure for lack of space, but there is one associated with every multiplexer. Each FIFO is connected to every output port except its own, because a packet never needs to return to where it came from.

Figure 5 NoCX4 Router Architecture

5. Simulations and Results

To compare the different communication models in the X4CP32, a set of simulations was run in SystemC. The simulations were performed on a 3x3 RPU grid. Given the scope of this paper, the RPU was replaced by a Communication Buffer encapsulated in a wrapper, which issued the communication instructions read from the simulation files. The input simulation files were randomly generated according to some required parameters, shown in Figure 6. Each square represents a Wrapper. The Send field gives the number of peers that received data from that Wrapper; the Receive field gives the number of Wrappers that generated data for that Wrapper; and the Route field gives how many routing paths crossed the router linked to that Wrapper. Notice that data generated by the router's own Wrapper, or received by it, are not counted in the Route field. The number in parentheses represents the percentage of sent, received and routed data over the total simulation dataflow.

Figure 6 Simulations Communication Parameters

Two versions of NoCX4 were simulated. Both have the same characteristics, except for their buffering. The first version uses a single 32-position deep FIFO in each input port. In the second version, each FIFO was replaced by four 8-position deep SAFC (Statically-Allocated Fully Connected) buffers.
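The XY routing used by NoCX4, and the way the Route field of Figure 6 counts paths crossing a router, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: routers are addressed by hypothetical (x, y) grid coordinates, and the function names xy_route and transit_routers are illustrative; the actual NoCX4 header encoding is not given in the text.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst, X dimension first.

    src and dst are (x, y) tuples; this addressing is an assumption
    made for illustration only.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first travel along the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def transit_routers(src, dst):
    """Routers crossed in transit, excluding the source and destination,
    which mirrors how the Route field ignores a router's own traffic."""
    return xy_route(src, dst)[1:-1]


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(transit_routers((0, 0), (2, 1)))
```

Because every packet completes its X movement before any Y movement, a packet never turns from the Y dimension back into the X dimension. This restriction on turns is what makes dimension-order routing deadlock-free on a mesh, at the cost of offering a single path per source/destination pair.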
Since both buffering models have the same number of positions, their areas are equivalent, as can be seen in Table 2, which gives the logic cell cost and operating frequency of the FIFO and SAFC models. The implementation used the VHDL [20] language and FPGAs, and was compiled with Altera's Quartus II software. Despite the equivalent chip area, the results show that the introduction of virtual channels has a major impact on the NoCX4 performance. Table 3 presents the synthesis results of the router, giving the number of logic cells needed by each unit and its frequency. The last row of the table gives the total cost of a router with five SAFC buffers and all the other elements in the table.

Table 2 Area cost and Frequency of Buffering Models

Buffer   Number of LCs   Frequency (MHz)
FIFO     …               …,92
SAFC     …               …,6
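Why two buffering schemes of equal size can behave so differently is illustrated by the following Python sketch of a single router cycle. It assumes the usual reading of SAFC buffering, in which an input port keeps one queue per output port and each queue has its own path to the crossbar; the function names, packet labels and port names are hypothetical, and the NoCX4 arbitration details (free positions, packet sizes, Round-Robin order) are deliberately left out.

```python
from collections import deque


def fifo_cycle(fifo, busy_outputs):
    """Single FIFO: only the head packet may leave during this cycle."""
    if fifo and fifo[0][1] not in busy_outputs:
        return [fifo.popleft()[0]]
    return []                          # head is blocked, so everything waits


def safc_cycle(queues, busy_outputs):
    """Per-output queues: every queue whose output is free can be served."""
    sent = []
    for output, queue in queues.items():
        if queue and output not in busy_outputs:
            sent.append(queue.popleft())
    return sent


def demo():
    # Three packets arrive in this order at one input port (illustrative data).
    arrivals = [("p0", "east"), ("p1", "south"), ("p2", "south")]
    fifo = deque(arrivals)
    queues = {"east": deque(), "south": deque()}
    for packet_id, output in arrivals:
        queues[output].append(packet_id)
    busy = {"east"}                    # the east output is currently taken
    return fifo_cycle(fifo, busy), safc_cycle(queues, busy)


if __name__ == "__main__":
    print(demo())
```

With the east output busy, the single FIFO forwards nothing because its head packet is blocked, while the per-output queues still forward the first packet bound for the south output. This head-of-line blocking is one plausible explanation for the later saturation point of the SAFC version reported in the results.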

Table 3 Area cost and Frequency of the Router

Unit               Number of LCs   Frequency (MHz)
Buffering (SAFC)   …               …
Keying             …               …
Routing            …               …
Flux Control       1,…             …
Handshake          10              -
Router (Total)     5,…             …

A total of 15 simulations were performed for each communication model. The traffic rate varies from 5% up to 25%, and 3 different packet sizes were simulated: 2, 4 and 8. The traffic rate is a percentage of 10,000 instructions for each Wrapper. This means that in the simulation with a 25% traffic rate and packet size 8 there are 22,500 packets (2,500 for each of the 9 Wrappers). In terms of payload, each 5% of traffic rate corresponds to 18 kBytes per unit of packet size.

The throughput results, in bytes/cycle, for the 3 communication models can be seen in Figure 7. These results consider only the payload of the packets. The R2R results, represented as BUS, never reach a rate of 1 byte/cycle. The R2R maintains an almost constant throughput in all cases, with a slight decrease as the packet size increases. The FIFO version of NoCX4 shows a linear increase up to a saturation point, which varies with the packet size. From that point on, the throughput decreases. The SAFC version of NoCX4 shows results similar to the FIFO, with one important difference: its saturation point usually comes after the FIFO's. As a result, the SAFC reaches the highest throughput rates of the 3 models. At a 20% traffic rate the FIFO PCK 8 and the SAFC PCK 4 curves cross. Despite having twice as much header overhead and packets half as big, the SAFC version benefits from virtual channels, which allow much higher throughput rates.

A better comparison between NoCX4 and R2R is given in Figure 8, which shows the ending cycles. Ending cycles is the number of cycles necessary to send all the data, thus ending the simulation. Only one version of NoCX4, the FIFO, appears here, because showing the SAFC version's results would hamper the legibility of the graph. The results are consistent with those seen in Figure 7.
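The traffic figures quoted above can be checked with a few lines of Python. The sketch assumes that the 22,500 packets are the total over the nine Wrappers of the 3x3 grid, that a word is 4 bytes (32-bit words), and that each packet carries one header word; the constant and function names are illustrative.

```python
WRAPPERS = 9            # 3x3 grid, one Wrapper per router
INSTRUCTIONS = 10_000   # instructions per Wrapper at a 100% traffic rate
WORD_BYTES = 4          # 32-bit words
HEADER_WORDS = 1        # one word-size header per packet


def packets(rate_percent):
    """Total number of packets sent by all Wrappers at a given traffic rate."""
    return WRAPPERS * INSTRUCTIONS * rate_percent // 100


def payload_bytes(rate_percent, packet_size):
    """Total payload in bytes; packet_size is the number of data words."""
    return packets(rate_percent) * packet_size * WORD_BYTES


def header_overhead(packet_size):
    """Header words per payload word."""
    return HEADER_WORDS / packet_size


if __name__ == "__main__":
    print(packets(25))                 # packets at a 25% traffic rate
    print(payload_bytes(5, 1))         # bytes per 5% per unit of packet size
    print(header_overhead(4), header_overhead(8))
```

Under these assumptions the 25% simulation sends 22,500 packets, each 5% step adds 18,000 payload bytes per unit of packet size, and a size-4 packet carries twice the relative header overhead of a size-8 packet, which matches the numbers in the text.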
The instructions were distributed over 50,000 cycles, so a result above that amount means saturation. The saturation point for the FIFO is the same as in the throughput results, but only here are the R2R results clearly visible. Unlike NoCX4, the R2R saturates almost at the 5% traffic rate, presenting a linear increase in ending cycles, proportional to the amount of data in the simulation. At the higher traffic rates, the R2R takes 7 times longer than the FIFO to finish the data transmission, and almost 9 times longer than the SAFC.

Figure 7 Throughput (bytes/cycle) at traffic rates from 5% to 25%, for BUS, FIFO and SAFC with packet sizes 2, 4 and 8

Figure 8 Ending Cycles at traffic rates from 5% to 25%, for FIFO and BUS with packet sizes 2, 4 and 8

Finally, Figure 9 compares the two versions of NoCX4. The results for average latency are measured in cycles. Since buses have a constant latency, the R2R results were not included. With a 5% traffic load, the results are exactly the same, because there was no saturation, so the time to send the packets is the same. At 10% traffic load and a packet size of 8, the FIFO reaches its saturation point. From that point on, the average latency grows larger. For the SAFC, however, even when it reaches its saturation point there is still a good flow of data, because of the virtual channels. That gives the SAFC version a controlled latency, even at a 25% traffic rate, and has a major impact on performance. Since both versions of NoCX4 use the same routing mechanism, scheduling policy and so on, the better performance of the SAFC version is due to its reduced latency compared with the FIFO version.

Figure 9 Average Latency (cycles) at traffic rates from 5% to 25%, for FIFO and SAFC with packet sizes 2, 4 and 8

6. Conclusion and Future Work

This paper presented a study of different communication models for an on-chip reconfigurable multiprocessor, the X4CP32. A bus system, the R2R, and a Network-on-Chip, NoCX4, were developed for this work. The NoCX4 was implemented in two versions, which had the same characteristics except for their buffering. SystemC simulations of the 3 communication models were performed, and the results (throughput, average latency and ending cycles) were presented and discussed. In all results, the R2R proved to be the worst alternative. Changing only the buffering model in NoCX4 had a major impact on the results, showing that the use of virtual channels decreases the average latency, thus improving the final throughput of the NoC. As future work, a new hybrid model using both bus and NoC will be designed, and a cache hierarchy will be developed for the X4CP32.

7. References

[1] ARM AMBA.
[2] Altera.
[3] E. Mirsky, A. DeHon: MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources; Proc. IEEE FCCM '96, Napa, CA, USA, April 17-19.
[4] C. Ebeling et al.: RaPiD: Reconfigurable Pipelined Datapath. In: Sixth International Workshop on Field Programmable Logic and Compilers, Lecture Notes in Computer Science, Springer-Verlag, September.
[5] R. Kress et al.: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; ASP-DAC '95, Chiba, Japan, Aug. 29 - Sept. 1.
[6] International Technology Roadmap for Semiconductors.
[7] L. Benini and G. De Micheli. Networks on Chips: a New SoC Paradigm, IEEE Computer, Jan. 2002.
[8] S. Kumar et al. A network on chip architecture and design methodology. VLSI, Proceedings. IEEE Computer Society Annual Symposium on.
[9] A. Jantsch and H. Tenhunen (Editors). Networks on Chip. Kluwer Academic Publishers.
[10] Kai Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc.
[11] A. Azevedo, R. Soares and I. S. Silva. A New Hybrid Parallel/Reconfigurable Architecture: The X4CP32. In Proceedings of the 16th Symposium on Integrated Circuits and Systems Design. SBCCI '03. ACM Press.
[12] R. Soares, A. Pereira, I. Saraiva. X4CP32: a Programmable Multi-level Reconfigurable Microprocessor. Proceedings of Students Forum on Microelectronics '02, SBC, Porto Alegre.
[13] R. Soares, A. Pereira, I. Saraiva. X4CP32: A Coarse Grain General Purpose Reconfigurable Microprocessor. RAW '03, Nice, France.
[14] A. Pereira, R. Soares, I. Saraiva. Implementação da DCT 2D em arquiteturas reconfiguráveis utilizando a X4CP32. Proceedings of Iberchip '03, Havana, Cuba.
[15] Open SystemC Initiative.
[16] R. Soares, A. Pereira, I. Saraiva. A Case-Study of Communication in a Reconfigurable Architecture: The X4CP32's Communication Buffer. Proceedings of Iberchip '04, Cartagena de Indias, Colombia.
[17] R. A. Bergamaschi and J. Cohn. The A to Z of SoCs. In Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design. ICCAD '02. ACM Press.
[18] L. Benini et al. MPARM: Exploring the Multi-Processor SoC Design Space with SystemC. The Journal of VLSI Signal Processing, 41(2), September 2005.
[19] R. Soares, I. S. Silva and A. Azevedo. When reconfigurable architecture meets network-on-chip. In Proceedings of the 17th Symposium on Integrated Circuits and System Design. SBCCI '04. ACM Press.
[20] D. Pellerin and D. Taylor. VHDL Made Easy!, Prentice-Hall, Inc., 1997.
