A High Level Development, Modeling and Simulation Methodology for Complex Multicore Network Processors


Gianni Antichi, Christian Callegari, Andrea Di Pietro, Domenico Ficara, Stefano Giordano, Fabio Vitucci
Dept. of Information Engineering, University of Pisa, ITALY

Massimiliano Meneghin, Massimo Torquati, Marco Vanneschi
Computer Science Dept., University of Pisa, ITALY

Massimo Coppola
ISTI-CNR, Pisa, ITALY

Abstract: Network Processors (NPs) are attractive and powerful platforms for the fast development of high-performance network applications. However, despite their greater flexibility and limited cost with respect to specialized hardware designs, NPs still present developers with significant difficulties. As they target complex and high-performance applications, programmers are often forced to write assembly code in order to better exploit the hardware. In this paper we propose an approach to NP programming which is based on a three-phase development methodology, and we apply it to Intel NPs of the IXP2XXX family. By exploiting a composition of software tools, a high-level definition of the application is turned first into a distributed program, then into an NP prototype, and finally into an efficient NP executable. The methodology we describe takes advantage of the ASSIST technology, which allows for the porting, testing, modeling and profiling of parallel applications on a cluster of standard PCs. We developed a C library that acts as a communication layer, hiding the hardware details of NP programming and allowing for high-performance code development. The ultimate goal of this approach is to let programmers write C code, exploiting ASSIST-provided hints to (1) perform functional debugging and performance analysis and (2) experiment with different parallel structures. The resulting code can then be directly compiled for the NP without modifications, largely reducing the overall coding effort.

I.
INTRODUCTION

In recent years, Network Processors (NPs) have been attracting the interest of manufacturers and vendors of network devices. They appear as the most promising solution to realize high-performance, upgradable network devices that perform important tasks such as packet classification, intrusion detection and traffic policing. Indeed, NP platforms offer very high packet processing capabilities, suited to multi-gigabit networks, and combine the programmability of general-purpose processors with the high performance typical of hardware-based solutions. Usually, an NP includes a set of programmable processors that run in parallel, allowing for fast packet processing. NPs provide processor instruction sets optimized for packet processing, often supporting hardware multithreading, as well as memories hierarchically distributed among the processors. As of today, an undisputed standard for NPs does not yet exist, and each vendor presents its own different solution. This architectural heterogeneity, which strongly reduces code portability, is one of the main problems slowing down the adoption of NPs, as is the lack of a friendly framework to simplify the programming of such devices. Many works ([1][2][3][4][5][6]) have addressed these issues, but none of them provides a final solution, especially in terms of ease and speed of programming. Moreover, a great design challenge for programmers is the definition of the best parallel paradigm for a piece of software to be run on NPs. Such a decision can vastly influence the performance and extensibility of the application, and must be made with great care. In order to face these programmability issues, some NP vendors such as EZ-chip [7] limit the flexibility of their processors by constraining the applications to be a composition of ready-made and working pieces of software from a vendor-provided set. While this approach is certainly effective, it also limits the programmer's possibilities.
On the other hand, NP makers that stuck to the paradigm of total flexibility, such as Intel [8], do not provide significant solutions to ease the NP programmers' lives. In both cases, no aid is given in solving the design issue of finding efficient parallel schemes. In this paper, we focus on Intel NPs because of their widespread diffusion in both academia and industry. The technique we propose is nevertheless generic, and can be applied to many other NPs and reconfigurable processing units. Intel calls microengines (µes) the atomic processor cores in its IXP family of NPs. Two programming languages are provided for the µes: the µ-assembly assembly language and a sort of C language called µ-c. While the presence of a C compiler can be an attraction and an advantage, a well-established choice in the literature is to code in µ-assembly. The reason is simple: µ-c is a proprietary C with so many differences (both additions and restrictions) with respect to standard C that programmers usually decide to switch to a new low-level language, which has the undoubted advantage of allowing for very optimized and fast code. The optimization (and hence the speed) of a piece of software on a Network Processor is actually an important factor that determines the real value of the overall application. In addition, the speed of an application also depends on the parallelism level of the code, whose optimal value is not easily determined before actual implementation and testing. For these reasons, in this paper we propose a simplifying approach to NP programming by means of a small and programmer-friendly library, providing basic functionalities for the parallel execution of code and for processor management on an NP platform. The library also allows for testing and analyzing the application code on a cluster of standard PCs, thanks to the ASSIST [9] environment. This approach results in a multi-phase development process and provides the capability to automatically determine the best code placement over the set of cores of the Network Processor and the optimal level of parallelism needed. The approach is extensible to other NP platforms, also because both the developed library and the ASSIST environment are easily portable. In order to prove the effectiveness of our methodology, we adopt as a case study a processing block which classifies packets according to a deterministic algorithm processing five header fields. As mentioned above, the greater ease of programming given by our approach is counterbalanced by a performance loss, which we evaluate by comparing the µ-c version of the packet classifier code with its µ-assembly version. We will see that the performance loss is mainly due to the use of a non-optimized µ-c compiler, and can thus be reduced both by improving the compiler and by hand-optimizing key application blocks. The paper is organized as follows. Sect. II describes previous work in the area of NP programming. Sect. III illustrates the Intel IXP2XXX Network Processor platform, focusing on the IXP2400, on which we performed our experiments. Sect.
IV surveys the ASSIST programming environment, which we use as a parallel framework. In Sect. V our methodology is described, and Sect. VI explains in detail the development phases of a parallel application for complex multicores, as well as the µal library, which provides compatibility between µ-c and C code. Finally, Sect. VII presents the case study application and shows the first results.

II. RELATED WORK

Programming NPs involves dealing with multiple general-purpose and specialized processor cores, as well as multiple memory and interconnection technologies both on- and off-chip. There is extreme architectural heterogeneity across vendors and products, and several studies ([1][2]) have already observed a critical lack of tools supporting an integrated, portable approach to partitioning and scheduling code blocks and managing data transfer within NP platforms. In one sentence, NPs tend to share programming complexity and a lack of programmer-friendly development tools. Intel's MicroACE [3] targets products of the Intel Internet Exchange Architecture (IXA) family, starting from the IXP1200 model. In MicroACE, proxy-like software elements on the IXP1200's StrongARM control processor (ACEs) are linked to blocks of code on the microengines, so that loading an ACE element causes the corresponding microblock to be transparently loaded onto a µe. Microblocks implementing the fast path for data can choose to bounce packets to their associated ACE blocks on the slow path to deal with special cases. MicroACE provides a useful abstraction, but it is limited to IXA products. Microblocks can be combined only according to topologies chosen by their authors, and dynamic software reconfiguration is not possible. The Intel Architecture Tool, which is also part of the Intel SDK, provides an Eclipse plugin compatible with the Developer Workbench, simplifying the creation and composition of building blocks into applications.
However, block programming is still bound to µ-c or µ-assembly, and the product only supports a subset of the IXP2XXX family, e.g. not including the 1200 and 2400 models. Teja NP [4] targets the IXP1200 and IBM PowerNP series. It focuses on the provision of an integrated tool chain and development environment, but provides minimal architectural abstraction and therefore minimal design portability. NetBind [5] provides the abstraction of a set of packet-processing components that can be bound into a data path. It separates the raw functionality of a microblock from the way it is composed with others, and also allows microblock compositions to be dynamically reconfigured. However, it does not provide any abstraction over the NPs' interconnects or over different sorts of processors. Other approaches have been proposed to accommodate architectural complexity and heterogeneity, and to support design portability, high performance and runtime reconfigurability by providing more effective abstractions. NP-Click [6] is also a component-based programming model for NPs, employing a richer component model than NetBind and providing greater flexibility in linking microblocks, but it still falls short of providing a generic approach to NP programming. OpenCOM [2] is a low-overhead generic programming model for NPs, based on a run-time software component model. The approach promotes design portability across a wide variety of processors, while simultaneously exploiting hardware assists that are specific to individual NPs. It features a distributed runtime with a low memory footprint, employs formally specified interfaces, supports components written in different programming languages, and uniformly abstracts over different processor types and different inter-processor communication mechanisms.
Although very promising, OpenCOM requires reimplementing the component runtime from scratch on any new architecture, and only experiments on the IXP1200 and IXP2400 NPs are reported in the literature. The PacLang [10] approach tries to provide a framework for programming Network Processors that mimics already-known programming tools. The PacLang compiler takes two files as input: 1) a high-level PacLang program; 2) an Architecture Mapping Script (AMS). The former does not contain any architecture-specific details, and it is written in a C-like syntax extended with a special type system for packets. The latter tells the compiler how to transform the code into one that runs efficiently on the target architecture. The clean separation improves both code readability and application portability, as changing the target only requires modifying the AMS file. The PacLang approach achieves very high performance on the IXP2400, close to that obtained by using µ-assembly. Our approach is similar to PacLang in easing the programming of NPs, but we introduce different fundamental features. Our library allows programming NPs by means of a subset of the C syntax, and by using ASSIST in the early development we allow programmers to start from off-the-shelf code, and to test and evaluate different parallel solutions long before beginning to work on the NP board.

III. EXPERIMENTAL PLATFORM

Intel IXP2XXX is a family of Network Processors designed to perform a wide range of functionalities, including multi-service switches, routers, broadband access devices and wireless infrastructure systems. The differences among models are related to the number of processing units, or to the availability of specific functionalities (for instance, units which support encryption algorithms). The IXP2XXX are fully programmable and implement a high-performance parallel processing architecture on a single chip, suitable for processing complex algorithms, deep packet inspection and traffic management at wire speed. They have been granted to Universities through the Intel IXA University Program, along with the development environment, which includes the IXA Portability Framework.

Figure 1. Scheme of the IXP2400.

A. Hardware Platform

Fig. 1 shows a scheme of the functional units and internal connections of the IXP2400, the NP used in this work. It contains 9 programmable processors: an Intel XScale and 8 units called µes, which are divided into 2 clusters of 4 (ME 0:0... ME 1:3).
The figure also shows the different available memories (SRAM, DRAM, scratchpad), which have different sizes and operational features, and the presence of shared functional units with specific purposes, e.g. I/O interfaces and hardware-accelerated units computing hash functions. The XScale processor supports real-time operating systems for embedded systems, such as VxWorks or Linux, and can therefore take advantage of the C/C++ compilers available in those environments. The µes have a specific instruction set for processing packets, comprising 50 different instructions, including ALU (Arithmetic Logic Unit) opcodes which work on bits, bytes and longwords, and can introduce shifts or rotations in a single operation. There is no support for integer division or floating-point operations. In the IXP2400, both the XScale core and the µes run at 600 MHz. Each µe is a 6-stage pipelined processor, and fetches one instruction per cycle from its instruction store, which contains up to 4K 40-bit instructions. The XScale is in charge of loading the code into the instruction stores at startup. Jumps or context swaps can result in a performance reduction, due to pipeline refilling. To reduce the branch and memory access overhead, each µe can run up to 8 threads with hardware support for zero-overhead context switching.

1) Memory banks: IXP2XXX Network Processors can access 4 different types of memory, listed in Table I, from the fastest to the slowest: local memory, scratchpad, external SRAM, and DRAM. The local memory consists of 32-bit words and can be accessed only by the single microengine that contains it, while the other memories are shared. The scratchpad is an on-chip SRAM memory that is mainly adopted for the management of FIFO circular queues (called rings), which are the main communication channel among µes.

2) NP evaluation board: The target platform of this work is the Radisys ENP-2611 board [11].
It is a PCI-X card, equipped with the IXP2400 NP, which can sustain a nominal line rate of 2.5 Gbps. The card provides 8 Mbytes of SRAM and 256 Mbytes of DRAM.

Table I. RADISYS ENP-2611 MEMORIES.

Memory         Logical Width (bytes)   Size (bytes)   Latency (clock cycles)
Local Memory   -                       -              -
Scratchpad     4                       16k            60
SRAM           4                       8M             90
DRAM           8                       256M           120

B. Software Tools

The Intel programming model allows for and encourages the development of modular software. The two hierarchical layers of the NP require two separate programming environments and two different languages: µ-c or µ-assembly at the µe level and standard C or C++ at the XScale level. The development of the µe code is done through the Developer Workbench [12], an Integrated Development Environment (IDE) supporting both µ-c and µ-assembly programming, an ARM-targeting compiler for the XScale core, and a µe simulator. The modular structure of the software, which enables code portability, is based on two types of building blocks: core components at the XScale level and microblocks at the microengine level. Each building block provides a packet processing functionality (e.g. NAT, forwarding, Ethernet bridging, etc.). Programmers can use these elements, build new ones and combine them to create an application. Some blocks are called driver-blocks and perform those operations most dependent on the underlying hardware architecture, such as packet reception, transmission or queue handling. When custom µe blocks are produced, the compilation process of a piece of µ-assembly requires several passes (preprocessor, assembler and an optimizing allocator) in order to produce an executable image for a µe. The IDE provided with the Intel SDK, the Developer's Workbench, allows for writing and debugging µ-assembly or µ-c code in a visual environment in Microsoft Windows. A clock-cycle-accurate simulator of the µes is also available for debugging purposes. It precisely recreates the system behavior and is a valuable tool for testing application prototypes, saving on the need to port the code onto the NP hardware just for the sake of performing accurate measurements.

IV. HIGH LEVEL TOOLS FOR PARALLEL PROGRAMMING

ASSIST (A Software development System based on Integrated Skeleton Technology) is a parallel programming environment based on skeleton and coordination language technology [9]. Structured parallel programming, based on the skeleton model, is an approach to deal with the complexity of designing high-performance applications. The validity of this approach has been proved, in recent years, for homogeneous parallel machines (MPP, clusters) and Grid platforms ([13][14]). In the ASSIST environment, applications are developed by means of a coordination language (ASSIST-CL), in which sequential as well as parallel computation units can be expressed. The ASSIST environment provides programmers with an integrated set of compiling tools (astcc, loader) and a portable and very efficient run-time support.
In ASSIST a parallel program can be structured as a generic graph, whose nodes are parallel or sequential modules, and whose arcs are the communication channels where streams of typed values are transmitted between modules. Sequential modules are basically procedure-like wrappings of sequential code written in C, C++ or Fortran. Parallel activities are designed by following regular parallelism exploitation patterns (i.e. skeletons), and are expressed in the ASSIST-CL syntax by the parmod construct. The parmod is a sort of generic skeleton that can emulate the classical parallel skeletons without significant performance degradation. Parallel activities in a parmod are decomposed into sequential indivisible units, which are then assigned to abstract executors, called virtual processors. The ASSIST run-time support allows them to communicate in a controlled and consistent way to access the shared computation state, which is partitioned among the virtual processors. Besides being skeleton-oriented, the coordination model is based on the concept of stream, an ordered sequence of typed values. Streams connect modules in generic graphs, and drive the activation of their computations. Each module can have any number of different input and output streams. By default, the input streams of a module are used in a data-flow way (all inputs are required before module activation), but parmod modules allow selecting inputs in a CSP-like fashion, the language syntax explicitly defining 1) stream associations, 2) prioritised and program-driven input choice, 3) non-deterministic, guard-controlled behaviour. The programming model of ASSIST also includes the concept of external objects, as a means of inter-module interaction, and of interaction with resources outside the ASSIST run-time (such as parallel file-systems, user interfaces, databases).
A software shared memory library is integrated in the environment as an external object, based on the Ad-HOC data service [15] and extending the language with a set of operations to effectively express allocation strategies, access and synchronization patterns. Pointers to shared memory data structures (called references) are an ASSIST basic data type and can be sent over the streams. The two typical stream parallelism exploitation patterns we employed in this work are the pipeline and the task farm (also referred to as master-worker). In a pipeline, each task of an input stream is processed by a sequence of stages. Each stage computes a function of the result provided by the previous stage and delivers its result to the immediately following one. In a task farm, a set, or a stream, of independent tasks is computed to obtain a set of results. A single program or function is used to compute all the results out of the input tasks, selecting for each task a new executor from the pool of available ones.

V. METHODOLOGY

The ASSIST environment simplifies the definition of a pipeline graph whose stages are task-farms or sequential modules. Since the stream paradigm perfectly fits packet processing, it appears natural to employ ASSIST as a programming framework for applications on Network Processors. In the ASSIST environment, the programmer has a greater range of possibilities and more flexibility with respect to straight µ-c or µ-assembly programming. This flexibility shows its true value especially when experimenting with different ways to express the parallel program, transforming a sequential module into a task-farm one or adding/removing modules from the program graph. Furthermore, the ASSIST environment simplifies changing the parallelism degree of a task-farm module over different runs of the same parallel program.
When developing software for the IXP NPs, the parallel program is structured as a graph of modules (in our case it is typically a pipeline of sequential or task-farm modules) where the links among stages are streams of network packet references, as depicted in fig. 5. The sequential functions in each module are coded using the µ-c language and a specifically designed library we named µal (from µe ASSIST Layer). µal provides all its functionalities by exploiting the IXP2XXX hardware, and is thus able to hide all the low-level architecture-dependent details from the application programmer.
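The rings that implement these inter-stage links can be pictured as fixed-size circular FIFO queues of packet references. The following is a minimal sketch of the concept only, with invented names; the actual µal ring API and the hardware-assisted scratchpad rings differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u               /* fixed capacity, power of two */

typedef struct {
    uint32_t slots[RING_SIZE];      /* packet references (e.g. memory handles) */
    unsigned head, tail;            /* head: next get, tail: next put */
} ring_t;

static void ring_init(ring_t *r) { r->head = r->tail = 0; }

/* Producer side: returns false when the ring is full. */
static bool ring_put(ring_t *r, uint32_t ref) {
    if (r->tail - r->head == RING_SIZE) return false;
    r->slots[r->tail % RING_SIZE] = ref;
    r->tail++;
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_get(ring_t *r, uint32_t *ref) {
    if (r->tail == r->head) return false;
    *ref = r->slots[r->head % RING_SIZE];
    r->head++;
    return true;
}
```

On the NP the put/get operations are backed by the scratchpad hardware shared among µes; on the PC cluster the same interface can sit on an ordinary in-memory queue, which is what lets the module code move between the two environments unchanged.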

Figure 2. The three-phase development model.

ASSIST already provides support for the C, C++, and F77 sequential languages; for the purposes of this work, we extended the astcc compiler to also support µ-c sequential code. The ASSIST compiler just performs simple code transformations in order to let the µ-c code be compiled by a standard C compiler when compiling for an ordinary CPU. As a result of the compilation process, each module of the ASSIST pipeline has an input and an output ring. The connection between the ASSIST streams and the internal rings is transparently managed by the ASSIST compiler and the run-time support. All the threads inside each module can see only the input and the output rings, while it is up to the run-time support to fetch a task from the input stream and put it into the input ring (and vice versa for the output ring). The threads access the data in the rings concurrently; the programmer can control the scheduling of the threads by accessing the rings through specific functions provided by the µal library.

VI. DEVELOPMENT PHASES

Our proposal for the development of parallel applications for complex multicore architectures (like the Intel IXP2XXX Network Processors) is shown schematically in fig. 2. The application design life-cycle proceeds through an iterative cycle involving three main phases: high-level ASSIST prototyping, clock-cycle-accurate simulation, and fine-grained optimization.

a) Phase 1: The programmer uses the ASSIST environment for rapid prototyping of the parallel application by writing the sequential code of each module in µ-c, with the code of the 8 threads in each module expressed using the well-known SPMD (Single Program Multiple Data) model. The threads access the memory hierarchy and all NP hardware by using the functions provided by the µal library.
For this first stage, the code is compiled to run on a cluster of PCs, employing a C++ compiler and a specific version of the µal library, in order to produce ordinary ASSIST parallel executables. Besides simplifying functional debugging of the program, a first run under this approach gathers a number of valuable measurements on the execution, such as the number of accesses to each memory level, the threads' idle time, and the input and output throughput per stage. These first results can, in turn, help evaluate the choices made when parallelizing the application, and allow for evolving its structure much more quickly than with direct µ-c programming. It should be noticed that (a) the programmer is also free to employ off-the-shelf C and C++ code within the modules, enjoying a smooth transition when porting a conventional code base to the NP platform, and (b) the development is not yet tied to a specific version of the IXP NP. As soon as the application is completely coded in µ-c and the test results are good, the programmer can move on to the next phase.

b) Phase 2: Once the prototype has been tested and debugged on a cluster of PCs, the code of each module, produced by the ASSIST compiler and provided with all the necessary NP management functionalities from µal, is imported without any modification into the Developer's Workbench simulator. The aim of the simulation, which evaluates the latency in clock cycles of each module, is to allow for finer performance tuning, ensuring that the service time of all modules satisfies the requirements of the application. According to the results provided by the simulation phase, the programmer may still choose to split (or merge) stages and/or transform a sequential stage into a task-farm module, possibly reverting back to the first development phase to restructure the code.
c) Phase 3: Only at the end of the second phase, and only if it is strictly necessary in order to meet the application requirements, is the code manually optimized by rewriting critical code blocks, or whole modules, in µ-assembly. Even when this choice is mandated, the developer still retains several advantages from the approach: µ-assembly is used only when really needed, after inspection of the compiled µ-c code detects inefficiencies, and according to a sound performance analysis; there is reference µ-c code with correct functional behavior to help debug the µ-assembly; and all functionalities of µal are also directly available from µ-assembly, providing efficient and correct implementations of code block cooperation and simplifying the optimization phase.

A. The µal library

As previously said, the µal library has been specifically designed to hide all the low-level details of the underlying NP architecture. The library interface is generic enough to be easily ported to other architectures, and the functionalities provided can be divided into the following categories:

Memory access functions: there are distinct functions for each level of the memory hierarchy: SRAM, DRAM and scratchpad. All the functions read or write contiguous blocks of data of any size, either synchronously or asynchronously. The local memory of each µe is left to the management of the µ-c compiler.

Synchronization primitives among threads: the library provides the user with intra-µe and inter-µe synchronization functions. Lock/unlock and test-and-set methods have been implemented both in scratchpad and SRAM memory.

Ring access functions: the library implements initialization and get/put methods.

Signals management functions:

the Intel IXP2XXX processor family uses non-preemptive scheduling, where signals are adopted to determine when memory accesses have completed or, more generally, to deal with context switches.

Thread scheduling control functions: these functions allow for a correct scheduling order among the 8 threads in a µe (for example, in a round-robin fashion). The order of the threads practically reflects packet ordering.

Various utility functions: these functions support the rest of the interface.

The µal library allows for the use of C-like code that can be compiled by the ASSIST tools (see Fig. 4). This is obtained through a series of functions that hide the use of signals and mask the effect of the different type qualifiers specifying the memory placement of variables. This is an essential advantage on the IXP2XXX platform, where proper SRAM and DRAM synchronization is otherwise left to the µ-assembly programmer.

/**************************************
 * Function that reads and analyzes
 * the first node of the classifier
 **************************************/
int find_sa1(int field) {
    lm_t result;                /* local memory location */
    lm_t mask = 0xFFFFFF;
    sram_r_t reg_read;          /* SRAM read transfer register */
    sram_address_t address;     /* SRAM address */

    copy_buffer_from_sram(&reg_read, class_base[0], 1);
    result = reg_read >> 24;
    if (result != 255) {
        result = result << 24;
        go_up[0] = (result | mask) & go_up[0];
    }
    address = class_base[0] + field;
    copy_buffer_from_sram(&reg_read, address, 1);
    result = reg_read >> 24;
    return result;
}

Figure 4. Example of C code using the µal library.

The library interface has been implemented in two different versions: the first one in plain C++, as part of the ASSIST run-time environment, and the second one in µ-c, as an integration of the Intel SDK environment. As sketched in fig. 3, the ASSIST compiler compiles the code of each stage and produces a set of source and include files in µ-c, with calls to some functions of the µal library.

Figure 3. Schematic view of the software development phases.
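The two versions of the library can be imagined as one interface with two backends selected at build time. The sketch below is purely illustrative: the function names mimic Fig. 4, but the emulation strategy and the TARGET_IXP macro are our assumptions, not the actual µal sources. The cluster-side backend stands in for SRAM with an ordinary array and counts accesses, the kind of measurement exploited in phase 1.

```c
#include <stddef.h>
#include <stdint.h>

/* One interface, two backends (hypothetical sketch of the idea). */
void copy_buffer_from_sram(uint32_t *dst, uint32_t addr, size_t nwords);
void copy_buffer_to_sram(uint32_t addr, const uint32_t *src, size_t nwords);

#ifdef TARGET_IXP
/* Phases 2/3: mu-C bodies issuing real SRAM transfers and waiting
 * on completion signals would go here. */
#else
/* Phase 1: standard C on the PC cluster. */
static uint32_t emulated_sram[1024];   /* stand-in for the external SRAM */
static unsigned sram_access_count;     /* profiling counter for phase 1 */

void copy_buffer_from_sram(uint32_t *dst, uint32_t addr, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        dst[i] = emulated_sram[addr + i];
    sram_access_count++;
}

void copy_buffer_to_sram(uint32_t addr, const uint32_t *src, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        emulated_sram[addr + i] = src[i];
    sram_access_count++;
}
#endif
```

The application code, such as Fig. 4's find_sa1, is written once against the common declarations and recompiled unchanged for either target.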
The files are compiled with the standard C compiler and linked against the µal assist library in order to build the executable program of each stage of the pipeline. At the same time, the set of files produced by the compiler can be imported into a Developer's Workbench project and compiled with the µ-c compiler, linking the µal ixp.c file which implements the µal functions in µ-c. Each stage of the ASSIST pipeline is mapped on a single µe according to the pipeline implementation template described in the next subsection.

Figure 5. Implementation templates of the ASSIST graph for the Intel IXP2XXX processor family.

B. Implementing the pipeline and farm on the Intel IXP NP

The implementation template of the ASSIST pipeline for the Intel IXP2XXX NP family is shown in fig. 5. Each stage of the pipeline is mapped on a single µe, and the communication channel between two stages is implemented with a ring. A task-farm module comprises a pool of µes, each one executing the same worker code. Each worker has its own input and output rings. Packets are distributed to the workers, according to a scheduling function, by the stage preceding the task-farm module in the pipeline, which we call the scheduler stage. The threads in the scheduler stage call a scheduling function to choose the ring in which to put the output packet. It is possible to implement very complex scheduling functions, but for the sake of simplicity we have chosen fast and simple round-robin scheduling. The same policy is also applied when collecting packets from the output rings of each worker, in the module that follows the task-farm one (the collector stage). A more advanced implementation template for the task-farm

module exploiting a more responsive scheduling function is also possible, but to allow packet reordering after the farm, an additional ring between the scheduler stage and the collector stage is needed, where the indices of the scheduled workers are put.

Figure 6. Schematic view of the Classifier test application.

VII. APPLICATION AND FIRST RESULTS

In this section, we show the adoption of our approach in a real case study: the development of a packet classifier.

A. Packet Classifier description

Basically, our packet classifier is an application that labels traffic flows according to the value of a 5-tuple of elements: IP Source Address, IP Destination Address, Layer 4 Protocol, Source Port and Destination Port. Packet classification plays an important role in computer networks and is ubiquitous in many scenarios (such as traffic filtering, access control, billing, etc.). Therefore classification speed is a critical issue for any implementation, and a well-performing classifier requires a considerable effort of optimized development. Both the challenging nature of the task and the sensitivity of its performance to code optimization motivate our choice of a packet classifier as the reference application. Specifically, the reference classifier we adopt in these tests is based on a multi-dimensional multi-bit trie algorithm, as described in [16]. The code has been written both in µ-assembly and in µ-c with the µal library. The assembly version of the classifier has been extensively tested and evaluated in previous works [17], [18], confirming its high performance and its capability to sustain up to 3 million packets per second.
Since in the worst case 11 memory accesses to the classification table have to be performed per packet, in order to reduce the processing burden the classification table is stored in the external SRAM (i.e., the fastest among the external memories), which is large enough to hold a large number of classification rules (on the order of hundreds of thousands). The classifier presented in [17] is reconfigurable at runtime in its set of rules and provides a configurable cache in order to reduce the classification cost in terms of memory accesses. However, since in this paper we only aim at providing a proof of concept to highlight the advantages of the proposed methodology, we focus on a simplified version of the classifier that does not include the advanced reconfiguration features.

B. Implementation and Results

In order to evaluate the overhead introduced by the proposed methodology (especially by µ-C and the µal library), we have developed a classifier prototype by coding the application in ASSIST (adopting the µal library) and then testing the code produced by the compiler on the IXP platform. We have compared the performance of the application on a single µe with a native implementation written in µ-assembly. As shown in fig. 6, the application is a pipeline whose first and last stages read/write packets from/to the network interfaces (or from/to files in the cluster-based testing). The intermediate stage applies the classification algorithm, retrieving the needed data from the memory hierarchy through the µal library functions. The first pipeline stage stores the entire packet data in DRAM and the header and metadata in SRAM, while a reference to the packet is put into the output ring. Conversely, the last stage receives packet references from the input ring, retrieves the packet data from the shared memory and sends them out on the network.
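The reference-passing scheme between stages can be mimicked in portable C with single-producer/single-consumer rings of packet handles, as in the minimal sketch below. All names here are our own assumptions for illustration: this is not the µal API, and on the IXP the rings are hardware-assisted and the packet data lives in DRAM/SRAM rather than in process memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS  256   /* capacity of each ring (illustrative) */
#define NUM_WORKERS 4     /* farm workers, one µe each in the template */

/* Single-producer/single-consumer ring of packet references.
 * head and tail are monotonically increasing counters; the slot
 * index is taken modulo RING_SLOTS. */
struct ring {
    uint32_t buf[RING_SLOTS];  /* packet handles (e.g. DRAM offsets) */
    size_t head;               /* next slot to read  (consumer) */
    size_t tail;               /* next slot to write (producer) */
};

static int ring_put(struct ring *r, uint32_t handle)
{
    if (r->tail - r->head == RING_SLOTS)
        return -1;             /* ring full: drop or apply back-pressure */
    r->buf[r->tail % RING_SLOTS] = handle;
    r->tail++;
    return 0;
}

static int ring_get(struct ring *r, uint32_t *handle)
{
    if (r->head == r->tail)
        return -1;             /* ring empty */
    *handle = r->buf[r->head % RING_SLOTS];
    r->head++;
    return 0;
}

/* Scheduler stage: distribute packet handles round-robin over the
 * input rings of the farm workers, as in the template of Sec. VI-B. */
static void schedule(struct ring workers[NUM_WORKERS], uint32_t handle)
{
    static unsigned next;      /* round-robin cursor */
    (void)ring_put(&workers[next], handle);
    next = (next + 1) % NUM_WORKERS;
}
```

The collector stage applies the same round-robin policy in reverse, calling ring_get on each worker's output ring in turn.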
In ASSIST, the input and output rings of each module are implemented around the stream concept, and the accesses are completely masked by the µal library interface. The code has first been tested and debugged on a cluster of PCs; subsequently, we have taken the files produced by the ASSIST compiler for the intermediate stage and recompiled them in the Intel SDK environment, linking the IXP version of the µal library. On the IXP platform, the RX and TX codes, which read and send packets from/to the interfaces, are standard driver microblocks provided by Radisys [11]. The resulting values for performance and code metrics are compared in tab. II and III. Specifically, the first table shows a comparison between the µ-assembly and the µ-C code (optimized for speed or size when specified) in terms of three code metrics:

- Instruction Store Words: the actual code size within the instruction store, measured in 40-bit machine instructions;
- Minimum Available Registers: the minimum number of available registers within any section of the code; this field reports the number of General Purpose (G), SRAM Transfer (S) and DRAM Transfer (D) registers;
- Available Local Memory: the number of available 32-bit words in the µe after compilation.

A first general comment on the values of tab. II is that, as intuition suggests, the µ-assembly code is much more compact than its µ-C counterpart. Indeed, the µ-C classifier can be from 2 (code optimized for size) to 3 times (unoptimized code) larger than the µ-assembly version. This code inflation can introduce some difficulties when developing larger software running on the same µe. The adoption of the µ-C compiler and its runtime support library, which use Local Memory as temporary storage for small amounts of data, explains the different register usage reported, as well as part of the performance loss.
A slight reduction (<6%) of the amount of Local Memory available for other code corresponds to more general-purpose registers and fewer SRAM registers being available, in spite of the fact that µal exploits global registers. Performance results corresponding to the code metrics discussed are shown in table III. The table reports results in terms of maximum packet rate (in thousands of packets per second: Kpps) and processing delay (in clock cycles: cc) for the same versions of the classifier reported in table II. The results confirm the intuition that µ-C code is not as fast as its µ-assembly counterpart. The non-optimal machine code generated by the µ-C compiler also implies a larger processing delay for the µ-C code. The speed-optimized µ-C classifier is fast enough to compete with the µ-assembly version and be useful in practice (about 10% slower in terms of Kpps), but clearly improvements in the µ-C compiler and runtime would make a stronger case for our approach.

VIII. CONCLUSIONS AND FUTURE WORK

Network Processors are powerful and flexible architectures for packet processing at high speed. However, their complex programming (often associated with the challenge of defining an efficient parallel design) discourages their widespread adoption in network appliances. In this paper we have presented a novel approach providing an easier way to program Intel IXP NPs. This approach, which can easily be extended to other multicore systems, takes advantage of the ASSIST environment as an aid in the choice of the best parallel paradigm for an application and as a reference for designing a parallel run-time library for NP platforms. In adding support for the Intel IXP programming languages to the ASSIST environment, we provided tools to simulate the behavior of an NP application on a cluster of standard PCs. Thus, execution statistics can be gathered and rough performance figures can be extrapolated before the actual porting of the code to the NP. The actual coding is performed in standard C for all or most of the development and is supported by a library of C functions which simplifies the programming phase. It is easy to test and evaluate different parallel decompositions of the application before reaching the hardware-dependent development phase.
The consequent cost reductions in application development, thanks to shorter times and reduced complexity, ease the economic exploitation of NP platforms. Experimental results on a multi-dimensional multi-bit trie packet classifier have shown limited performance differences with respect to a carefully hand-optimized version of the code, which has already been employed in several projects. Even better results can be obtained by integrating into our approach high-level languages that are well optimized for the target NP architecture, thus avoiding low-level coding altogether. Indeed, our approach does not rule out integration with languages specifically designed for this purpose (such as PacLang), as our run-time support has been designed to be small and general, in order to allow its porting to other NP architectures and tool-chains with low effort and good performance.

Table II
COMPILATION RESULTS FOR THE µ-C AND µ-ASSEMBLY VERSIONS OF THE PACKET CLASSIFIER CODE

  Code         Opt.    Instr. Store Words   Min. Avail. Regs   Avail. LM (words)
  µ-assembly   No      486                  G:9  S:8 D:8       640
  µ-assembly   Yes     448                  G:9  S:8 D:8       640
  µ-C          No      1225                 G:11 S:0 D:8       603
  µ-C          Speed   1154                 G:15 S:0 D:8       603
  µ-C          Size    913                  G:15 S:0 D:8       603

Table III
PERFORMANCE RESULTS OF THE µ-C AND µ-ASSEMBLY VERSIONS OF THE PACKET CLASSIFIER CODE

  Code         Opt.    Max Pkt rate (Kpps)   Proc. delay (cc)
  µ-assembly   No
  µ-assembly   Yes
  µ-C          No
  µ-C          Speed
  µ-C          Size

ACKNOWLEDGEMENTS

This work has been financed by Fondazione della Cassa di Risparmio di Pisa as part of the project FRINP ("Reconfigurable Firewall on Network Processor platforms").

REFERENCES

[1] C. Kulkarni, M. Gries, C. Sauer, and K. Keutzer, "Programming challenges in network processor deployment," in CASES '03. New York, NY, USA: ACM, 2003.
[2] K. Lee and G. Coulson, "Supporting runtime reconfiguration on network processors," in 20th Int'l Conf. on Advanced Information Networking and Applications. IEEE Computer Society, 2006.
[3] Intel IXP1200 Processor-based MicroACE Test Framework. [Online].
[4] Teja, "Teja NP: The first software platform for multiprocessor system-on-chip architectures."
[5] A. Campbell, S. Chou, M. Kounavis, V. Stachtos, and J. Vicente, "NetBind: a binding tool for constructing data paths in network processor-based routers," in OpenArch. IEEE, 2002.
[6] N. Shah, W. Plishker, K. Ravindran, and K. Keutzer, "NP-Click: A productive software development approach for network processors," IEEE Micro, vol. 24, no. 5.
[7] EZ-Chip Network Processors.
[8] Intel Network Processors, intel.com/design/network/products/npfamily.
[9] M. Vanneschi, "The programming model of ASSIST, an environment for parallel and distributed portable applications," Parallel Comput., vol. 28, no. 12.
[10] R. J. Ennals, R. W. Sharp, and A. Mycroft, "Task partitioning for multicore network processors," in Compiler Construction: 14th International Conference, ser. LNCS. Springer, April 2005.
[11] Radisys ENP. [Online].
[12] Intel, "IXP2400/2800 Developer's Tool User Guide."
[13] M. Aldinucci, M. Coppola, M. Danelutto, N. Tonellotto, M. Vanneschi, and C. Zoccolo, "High level grid programming with ASSIST," Computational Methods in Science and Technology, vol. 12, no. 1.
[14] M. Aldinucci, A. Petrocelli, E. Pistoletti, M. Torquati, M. Vanneschi, L. Veraldi, and C. Zoccolo, "Dynamic reconfiguration of grid-aware applications in ASSIST," in Proc. of 11th Intl. Euro-Par 2005 Parallel Processing, ser. LNCS, J. C. Cunha and P. D. Medeiros, Eds.
[15] M. Aldinucci, M. Danelutto, G. Giaccherini, M. Torquati, and M. Vanneschi, "Towards a distributed scalable data service for the grid," in PARCO 2005, Malaga, ser. NIC, vol. 33. John von Neumann Institute for Computing, Dec. 2005.
[16] S. Giordano, G. Procissi, F. Rossi, and F. Vitucci, "Design of a multi-dimensional packet classifier for network processors," in ICC '06, vol. 2, June 2006.
[17] D. Ficara, S. Giordano, F. Rossi, and F. Vitucci, "ReFine: The reconfigurable packet filtering on network processor," International Journal of Communication Systems, vol. 21, no. 11.
[18] D. Ficara, S. Giordano, F. Oppedisano, G. Procissi, and F. Vitucci, "A cooperative PC/network-processor architecture for multi gigabit traffic analysis," in IT-NEWS, Feb. 2008.


More information

Euro-Par Pisa - Italy

Euro-Par Pisa - Italy Euro-Par 2004 - Pisa - Italy Accelerating farms through ad- distributed scalable object repository Marco Aldinucci, ISTI-CNR, Pisa, Italy Massimo Torquati, CS dept. Uni. Pisa, Italy Outline (Herd of Object

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/

More information

CS 101, Mock Computer Architecture

CS 101, Mock Computer Architecture CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

A Process Model suitable for defining and programming MpSoCs

A Process Model suitable for defining and programming MpSoCs A Process Model suitable for defining and programming MpSoCs MpSoC-Workshop at Rheinfels, 29-30.6.2010 F. Mayer-Lindenberg, TU Hamburg-Harburg 1. Motivation 2. The Process Model 3. Mapping to MpSoC 4.

More information

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Distributed transactions (quick refresh) Layers of an information system

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Distributed transactions (quick refresh) Layers of an information system Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Distributed Information Systems Architecture Chapter Outline

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm

Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm Alessandro Biondi and Marco Di Natale Scuola Superiore Sant Anna, Pisa, Italy Introduction The introduction of

More information

Network Processors Outline

Network Processors Outline High-Performance Networking The University of Kansas EECS 881 James P.G. Sterbenz Department of Electrical Engineering & Computer Science Information Technology & Telecommunications Research Center The

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 28 19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 29. <?xml version="1.0"?

19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 28 19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 29. <?xml version=1.0? Overall picture Core technologies Deployment, Heterogeneity and Dynamic Adaptation ALDL Application description GEA based deployment Support of Heterogeneity Support for Dynamic Adaptive behaviour Reconfiguration

More information

1. Introduction to the Common Language Infrastructure

1. Introduction to the Common Language Infrastructure Miller-CHP1.fm Page 1 Wednesday, September 24, 2003 1:50 PM to the Common Language Infrastructure The Common Language Infrastructure (CLI) is an International Standard that is the basis for creating execution

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory Shawn Koch Mark Doughty ELEC 525 4/23/02 A Simulation: Improving Throughput and Reducing PCI Bus Traffic by Caching Server Requests using a Network Processor with Memory 1 Motivation and Concept The goal

More information

AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz

AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz Results obtained by researchers in the aspect-oriented programming are promoting the aim to export these ideas to whole software development

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava Hardware Acceleration in Computer Networks Outline Motivation for hardware acceleration Longest prefix matching using FPGA Hardware acceleration of time critical operations Framework and applications Contracted

More information

10 Steps to Virtualization

10 Steps to Virtualization AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where

More information

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER International Journal of Computer Engineering and Applications, Volume VI, Issue II, May 14 www.ijcea.com ISSN 2321 3469 DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER Kiran K C 1, Sunil T D

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors Govindan Ravindran Newbridge Networks Corporation Kanata, ON K2K 2E6, Canada gravindr@newbridge.com Michael

More information

Teaching Network Systems Design With Network Processors:

Teaching Network Systems Design With Network Processors: Teaching Network Systems Design With Network Processors: Challenges And Fun With Networking Douglas Comer Computer Science Department Purdue University 250 N. University Street West Lafayette, IN 47907-2066

More information