The Impact of Parallel and Multithread Mechanism on Network Processor Performance


Chunqing Wu, Xiangquan Shi, Xuejun Yang, Jinshu Su
Computer School, National University of Defense Technology, Changsha, Hunan, China 410073
np4gs3@163.com

Abstract

Network processors are becoming a predominant feature in the field of network hardware due to their high performance and flexibility. The performance of a network processor depends mainly on its architecture. This paper studies the parallel architecture and the multithread mechanism in network processors. We discuss the causes of thread stalls and the principle of hiding the latencies they introduce by means of multithreading. Finally, we present test results and analyze the relationship between the number of active threads and network processor performance.

1. Introduction

The ceaseless rise of network link rates demands that network devices process packets in remarkably short times. For example, the arrival interval of 40-byte packets is about 35 ns on 10 Gbps links, and about 8 ns on 40 Gbps links. It is quite difficult to complete QoS processing and route-table lookup at line rate in so short a time. Network processors are favored by network device manufacturers for their flexibility and for processing performance close to that of an ASIC.

The architecture of a network processor is crucial to its processing power, while its memory capacity and computing capability are the basic factors. As a SoC, it is impractical to run it at frequencies as high as those of a general-purpose CPU. For example, the packet arrival interval is about 160 ns on OC-48 links; that is, a network processor with a clock frequency of 133 MHz and a clock cycle of 7.5 ns must complete the processing of a 40-byte packet within 21.3 clock cycles to avoid dropping packets. Network processors usually employ DRAM as external memory instead of SRAM because of the latter's high price, although the access delay of SRAM can reach 10 ns.
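The per-packet cycle budget quoted above follows directly from the clock frequency and the packet arrival interval. A minimal sketch, using the example figures from the text (133 MHz clock, OC-48 arrival interval of roughly 160 ns):

```python
# Cycle budget per packet: how many clock cycles a 133 MHz network processor
# has between packet arrivals on an OC-48 (2.5 Gbps) link.

clock_hz = 133e6                    # example clock frequency from the text
cycle_ns = 1e9 / clock_hz           # ~7.5 ns per cycle
arrival_ns = 160.0                  # per-packet arrival interval on OC-48

budget_cycles = arrival_ns / cycle_ns
print(f"{budget_cycles:.1f} cycles per packet")  # ~21.3
```

Any processing step that cannot fit inside this budget, such as an external memory access, forces either packet buffering or packet loss.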
Since the DRAM access delay is about 55-70 ns (DDR, RDRAM), an external memory access takes about 10 clock cycles. This shows that it is impossible to complete packet processing within 21 clock cycles using a single thread.

[1] and [2] evaluate the required network processing capability using two algorithms: route-table lookup (RTR) and IP fragmentation (FRAG). The computational complexity of RTR is 2.1 instructions per byte, so for a 2.5 Gbps link the required compute capability is 2.5 Gbps / 8 * 2.1 = 656 MIPS. The computational complexity of FRAG is 7.7 instructions per byte, requiring 2.5 Gbps / 8 * 7.7 = 2406 MIPS. The processing capability of a single state-of-the-art processing element (PE) in a network processor falls far short of this, so the architecture of the network processor must be studied carefully. We can meet the requirements of high-speed networks by an appropriate design or arrangement of processing elements (PEs) to overcome the limited processing capability of a single PE. It has been amply verified that parallel and multithread mechanisms are effective approaches to raising the performance of computer systems.

This paper focuses on the impact of parallel and multithread mechanisms on network processor performance. Section 2 describes related work. Thread stalls and the introduction of multithreading are presented in Section 3. Section 4 discusses the relationship between the parallel mechanism and network processor performance. Finally, we give implementation and testing results.

2. Related work

Usually, the core of a network processor consists of multiple PEs. Every PE is a simple microprocessor on which several threads run, and each packet is assigned to one thread for processing. The essential idea is to exploit the parallelism of packet processing by assigning uncorrelated packets to different threads.
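The required-MIPS figures for RTR and FRAG follow from the link rate and the per-byte instruction counts. A short sketch of the arithmetic:

```python
def required_mips(link_bps: float, instructions_per_byte: float) -> float:
    """Compute capability (MIPS) needed to process a link at line rate."""
    bytes_per_second = link_bps / 8
    return bytes_per_second * instructions_per_byte / 1e6

# Per-byte complexities quoted in the text (from the CommBench-style analysis)
rtr  = required_mips(2.5e9, 2.1)   # route-table lookup
frag = required_mips(2.5e9, 7.7)   # IP fragmentation
print(rtr, frag)  # 656.25 2406.25
```

The gap between these figures and what a single PE can deliver is what motivates the parallel, multithreaded organization discussed below.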
This work is supported by the National Natural Science Foundation of China (NSFC), under agreement no. 9646.
Proceedings of the Fifth International Conference on Grid and Cooperative Computing (GCC'06), 0-7695-2694-2/06 $20.00 © 2006 IEEE

The hardware architecture of a network processor is shown in Fig. 1: a group of PEs, multiple co-processors (CoP), and multiple hardware logic blocks.

[Fig. 1. The hardware architecture of a network processor]

Network processors can be partitioned into two classes according to their architecture [3]:

Pipelined: each processor is designed for a particular packet processing task and communicates in a pipelined fashion. Examples of this architectural style include Cisco's PXF, Motorola's C-5 DCP, and Xelerated Packet Devices.

Parallel: each PE performs similar functionality. This approach is commonly coupled with numerous co-processors that accelerate specific types of computation. Since these co-processors are shared across many PEs, an arbitration unit is often required. The Agere PayloadPlus, Intel IXP1200, IBM PowerNP, and Lexra NetVortex are examples of this macro-architecture. The parallelism of this type of architecture includes parallelism across PEs [4], parallelism among the threads within a PE [5], task/packet-level parallelism, and data/instruction-level parallelism within a PE.

Tilman Wolf and his colleagues [6] studied how PE organization affects system performance. They analyzed network processor performance under four kinds of organization: parallel, serial, pipelined, and mixed. Their simulation results indicate that PE organization has an important impact on system performance: deepening the pipeline can raise system throughput, but contention for memory access limits how much widening the pipeline contributes to throughput. Access to off-chip memory reduces system throughput and increases packet queuing delay; this can be alleviated by introducing a latency-hiding mechanism. The cost of communication and synchronization has a greater impact on system throughput than processing time itself.

Venkatachalam and his colleagues [7] studied how to use a configurable micro-engine architecture and programming model to develop serial pipeline and parallel architectures on the Intel IXP2400.
They evaluated the efficiency of the IXP2400 network processor on two applications: ATM AAL2 and ATM flow management. Research results to date indicate that a pipeline architecture can raise processing performance by increasing the number of pipeline stages, but it is difficult to develop software that avoids system bottlenecks and efficiently drives the processing engines. Comparatively, software development is easier in a parallel architecture, although problems such as memory access conflicts between processors and resource sharing remain.

Packets processed in different PEs are not always independent; sometimes packets depend on each other. This dependence shows up in two aspects: service order and resource operation. The synchronization problem caused by service order can be solved by maintaining packet-processing status, while a lock mechanism must be introduced to avoid conflicting resource operations. Although some packets depend on each other, independent packets form the majority of Internet traffic, so processing packets in parallel is very effective in raising system throughput. Many studies have indicated that packet processing in a parallel architecture with multiple PEs outperforms a pure pipeline architecture. Additionally, various kinds of thread stall waste time while threads run inside a PE, which degrades network processor performance. Next we focus on the impact on network processor performance of thread stalls, the introduction of multiple threads, and parallel PEs.

3. Thread stalls and the introduction of multithreading

We have mentioned that thread stalls can reduce the system's processing capability. This section discusses the types of thread stall, their impact on network processor performance, and the stall-hiding mechanism that utilizes multiple threads.

3.1. Types of thread stall

Resource sharing and exclusive access cause running threads to stall. These stalls have a great impact on the performance of multithreaded network processing. The main stall types are:

(1) Coprocessor stall: occurs when the thread is stalled waiting for a coprocessor to finish executing. Examples:
- a synchronous coprocessor command is issued and the thread is stalled until the coprocessor finishes executing;
- an asynchronous coprocessor command is issued and the coprocessor is already in use;
- a wait instruction is executed while a coprocessor is still executing a previous asynchronous coprocessor command.

(2) Data stall: occurs when an instruction must wait for a specific general-purpose register (GPR) to receive data being loaded across the data bus.

(3) Instruction stall: occurs when the thread is stalled waiting for an instruction fetch to complete, for example when a branch instruction is executed.

(4) Bus stall: occurs when the thread is stalled waiting to access a data bus. Contention for a data bus can arise because:
- another instruction, executed by this thread or another thread, is already using the bus during the cycle in which the CLP requests it for this instruction;
- another coprocessor is already using the bus on the cycle in which the CLP requests it.

(5) GPR stall: occurs when two operations attempt to copy data into any GPR on the same cycle.
3.2. Impact of thread stalls on performance

Using NPProfile, a network processor performance analysis toolkit, we analyzed the trace message log files produced by the layer-3 and layer-2 packet forwarding picocode. Figure 2 and Figure 3 present the resulting thread-stall statistics.

In Figure 2, the first column means that threads were stalled about 1,200 times, each stall lasting 1-10 cycles (about 100 coprocessor stalls, 350 data stalls, 120 instruction stalls, and 600 bus stalls). The second column indicates that there were about 100 stalls lasting 10-20 cycles (all coprocessor stalls). Among the various thread stalls, bus stalls are dominant in number, but it is the coprocessor stalls that occupy the most cycles.

[Figure 2. Thread stall frequency vs. number of cycles stalled, tested with the layer-3 forwarding picocode]

[Figure 3. Thread stall frequency vs. number of cycles stalled, tested with the layer-2 forwarding picocode]

Using NPProfile, we further analyzed the trace message log file produced by the layer-3 forwarding picocode. Figure 4 presents the distribution of the various stalls and of the running cycles of a single thread. It shows that the various stalls occupy 59% of the thread's running period (CLP execution 41%; stalls 21%, 13%, 13%, and 12%). It is therefore necessary to hide these stalls by adopting a parallel multithread mechanism to achieve higher processor utilization.

[Fig. 4. Distribution of the various stalls and running cycles of a single thread]
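A back-of-envelope consequence of the Figure 4 profile: if a single thread executes during only about 41% of its cycles (stalls occupy the other 59%), then, assuming the stalls of one thread can be fully overlapped with the execution of others and switching is free, roughly 1/0.41 ≈ 2.4 hardware threads are needed to keep one PE busy:

```python
import math

exec_fraction = 0.41   # CLP execution share of a single thread's cycles (Figure 4)

# Idealized thread count needed to hide the remaining 59% stall time,
# assuming one thread's stalls always overlap with another thread's execution.
threads_needed = math.ceil(1 / exec_fraction)
print(threads_needed)  # 3
```

Real workloads need more threads than this ideal figure, since stalls cluster and several threads can be stalled on the same resource at once.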

3.3. The principle of hiding stalls using multithreading

By introducing a multithread model, we can avoid the waiting caused by the various stalls of a single thread in a micro-engine. Figure 5 shows stall hiding in the multithread model.

[Figure 5. Stall hiding in the multithread model: threads 0-2 alternate run and stall phases while processing packets 1-3 over a thread lifetime t]

When thread0 accesses the tree search engine, it can hand the PE over to thread1. When thread1 accesses external memory, it can hand the PE over to thread2. Thus, by adopting the multithread model and thread-switching technology, the delays caused by the various stalls are effectively hidden. As a result, the PE does not waste processing cycles waiting for stalls to end.

4. The relationship between the parallel mechanism and network processor performance

Suppose m is the number of PEs and n is the number of packets. For a network processor with a single PE, the time to process a single packet of length L is given by expression (1):

    St = Ct + f(m,n) + Pt                      (1)

where Ct is the time spent keeping packets in order, f(m,n) is the stall time (here m = 1, n = 1), and Pt is the time to process a single packet in the PE without any stall. Pt depends on the packet length L: the larger L is, the larger Pt. Pt is approximately a linear function of L, as in expression (2):

    Pt = ß * L                                 (2)

For a network processor with a single PE, the time to process n packets of length L is given by expression (3):

    NSt = Ct + f(m,n) + n * ß * L              (3)

where m = 1. For a network processor with m PEs, the time to process n packets of length L is given by expression (4):

    NTt = g(m) * Ct + f(m,n) + n * ß * L / m   (4)

In a network processor with m parallel PEs, packet ordering is performed by special hardware, so g(m) is approximately a linear function of m and is independent of the packet count; we can therefore treat the g(m) * Ct term as roughly constant. For a network processor with m parallel PEs, f(m,n) depends on the number m of active PEs: the larger m is, the larger f(m,n).
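The stall-hiding behaviour described in Section 3.3 (Figure 5) can be illustrated with a toy cycle-level sketch. The run/stall lengths below are hypothetical and thread switching is assumed free; the point is only that one thread leaves the PE idle during its stalls, while a few interleaved threads keep it fully busy:

```python
def pe_utilization(n_threads: int, bursts: int, run_len: int, stall_len: int) -> float:
    """Fraction of cycles the PE executes when interleaving n_threads threads.

    Each thread alternates `run_len` busy cycles with a `stall_len`-cycle stall
    (e.g. an external memory access), `bursts` times; switching costs nothing.
    """
    remaining = [bursts] * n_threads   # bursts left per thread
    ready_at = [0] * n_threads         # cycle at which each thread is runnable again
    t = busy = 0
    while any(r > 0 for r in remaining):
        runnable = [i for i in range(n_threads)
                    if remaining[i] > 0 and ready_at[i] <= t]
        if not runnable:               # every unfinished thread is stalled: PE idles
            t = min(ready_at[i] for i in range(n_threads) if remaining[i] > 0)
            continue
        i = runnable[0]
        t += run_len                   # thread i executes one burst
        busy += run_len
        remaining[i] -= 1
        ready_at[i] = t + stall_len    # thread i now stalls off the PE
    return busy / t

print(pe_utilization(1, 10, 4, 6))   # ~0.43: one thread, PE idle during stalls
print(pe_utilization(3, 10, 4, 6))   # 1.0: three threads hide all stalls
```

With run_len=4 and stall_len=6, a single thread uses the PE for roughly 40% of cycles, matching the shape of the Figure 4 profile, while three interleaved threads keep it fully utilized.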
According to the principle of hiding stalls with multithreading, f(m,n) increases only very slowly with m. We know that f(1,n) < f(m,n), but n * Pt >> n * Pt / m: with m = 32, the processing term shrinks by a factor of 32, which far outweighs the growth of f(m,n).

The packet length L also affects the number of PEs needed. The smaller L is, the shorter the packet arrival interval, so St must be smaller. For short packets, increasing the number of PEs effectively relieves the pressure of packet processing. It is difficult to greatly shorten Pt, and thus reduce n * Pt / m, merely by raising the frequency of a single PE, which is limited by chip technology. Comparatively, it is easier to increase the number of PEs within a certain range. We note that as m increases, f(m,n) increases too, which impairs the performance of a multithreaded network processor; and m is likewise limited by chip technology. It is therefore important to trade off f(m,n) against Pt.

5. Implementation and testing results

We have implemented a high-performance core router using a network processor with multiple PEs. The test was performed on a 2.5 Gbps network interface of this core router. The relationship between throughput and the number of active threads needed is shown in Figure 6. When the packet length is 1024 bytes, line-rate forwarding needs only 2-3 active threads; when the packet length is 64 bytes, line-rate forwarding needs about 20 active threads.

[Figure 6. Relationship between throughput (packets sent, millions) and the number of active threads, for several packet lengths]
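The cost model of expressions (1)-(4) can be sketched numerically. The parameter values below (ß, Ct, and the exact form of f(m,n)) are hypothetical placeholders, chosen only so that f grows slowly with m as the text argues, to show why NTt shrinks roughly by a factor of m:

```python
def f(m: int, n: int) -> float:
    """Hypothetical stall-time model: grows slowly with the number of PEs m."""
    return 100.0 * (1 + 0.1 * (m - 1))

def NTt(m: int, n: int, L: int, beta: float = 0.5, Ct: float = 50.0) -> float:
    """Expression (4): time for m PEs to process n packets of length L.
    Packet ordering is done in hardware, so g(m)*Ct is treated as constant."""
    return Ct + f(m, n) + n * beta * L / m

one_pe     = NTt(1, 10**6, 64)    # expression (3) is the m = 1 case
thirty_two = NTt(32, 10**6, 64)
print(one_pe / thirty_two)        # speedup close to 32
```

Because the n * ß * L / m term dominates, the speedup stays near m as long as f(m,n) grows slowly; a faster-growing f(m,n) would flatten the curve, which is exactly the non-linearity observed in the measurements below.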

[Figure 7. The relationship between maximum bandwidth rate and the number of active threads without packet loss, for several packet lengths]

[Figure 8. Performance of our core router using a network processor with multiple PEs: throughput in Gbit/s and Mpps versus packet size]

The test results indicate that performance rises greatly as the number of active threads increases in a network processor with multiple PEs. Short packets require more parallelism because of their higher arrival rate. But performance is not a linear function of the number of active threads, because thread stalls increase in a multithreaded environment.

6. Summary and conclusion

Network processors are an emerging technology in the network industry. The performance of a network processor is closely associated with its architecture. In our studies we focused on the relationship between the number of PEs and network processor performance. Experiments and simulations show that the parallel and multithread mechanisms play a very important role in the network processor. On the other hand, stalls caused by resource competition have a great impact on network processor performance. Besides hardware measures, we can increase the system's parallelism in software, for example by parallel searching of multiple route tables.

7. References

[1] Z.-X. Tan, C. Lin, "Analysis and Research on Network Processors", Journal of Software, Vol. 14, No. 2, 2003.
[2] T. Wolf, M. A. Franklin, "CommBench - a telecommunications benchmark for network processors", in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, 2000, pp. 154-162.
[3] N. Shah, "Understanding Network Processors", Technical Report, University of California, Berkeley, September 2001.
[4] N. Weng, T. Wolf, "Pipelining vs. multiprocessors - choosing the right network processor system topology", in Proceedings of the Advanced Networking and Communications Hardware Workshop (ANCHOR 2004), Munich, Germany, June 2004.
[5] P. Crowley, M. E. Fiuczynski, J.-L. Baer, "On the Performance of Multithreaded Architectures for Network Processors", Technical Report 2-1-, University of Washington.
[6] L. Kencl, J.-Y. Le Boudec, T. Wolf, et al., "Adaptive Load Sharing for Network Processors", in IEEE INFOCOM 2002, New York, 2002.
[7] M. Venkatachalam, P. Chandra, R. Yavatkar, "A highly flexible, distributed multiprocessor architecture for network processing", Computer Networks, Vol. 41, 2003, pp. 563-586.