A High Level Development, Modeling and Simulation Methodology for Complex Multicore Network Processors


Gianni Antichi, Christian Callegari, Andrea Di Pietro, Domenico Ficara, Stefano Giordano, Fabio Vitucci
Dept. of Information Engineering, University of Pisa, ITALY

Massimiliano Meneghin, Massimo Torquati, Marco Vanneschi
Computer Science Dept., University of Pisa, ITALY

Massimo Coppola
ISTI-CNR, Pisa, ITALY

Abstract: Network Processors (NPs) are attractive and powerful platforms for the fast development of high-performance network applications. However, despite their greater flexibility and limited cost with respect to specialized hardware designs, NPs still present developers with significant difficulties. As they target complex and high-performance applications, programmers are often forced to write assembly code in order to better exploit the hardware. In this paper we propose an approach to NP programming which is based on a three-phase development methodology, and we apply it to Intel NPs of the IXP2XXX family. By exploiting a composition of software tools, a high-level definition of the application is turned first into a distributed program, then into an NP prototype, and finally into an efficient NP executable. The methodology we describe takes advantage of the ASSIST technology, which allows for the porting, testing, modeling and profiling of parallel applications on a cluster of standard PCs. We developed a C library that acts as a communication layer, hiding the hardware details of NP programming and allowing for high-performance code development. The ultimate goal of this approach is to let programmers write C code, exploiting ASSIST-provided hints to (1) perform functional debugging and performance analysis and (2) experiment with different parallel structures. The resulting code can then be directly compiled for the NP without modifications, largely reducing the overall coding effort.

I.
INTRODUCTION

In recent years, Network Processors (NPs) have been attracting the interest of manufacturers and vendors of network devices. They appear as the most promising solution to realize high-performance, upgradable network devices that perform important tasks such as packet classification, intrusion detection and traffic policing. Indeed, NP platforms offer very high packet processing capabilities, suited to multi-gigabit networks, and combine the programmability of general-purpose processors with the high performance typical of hardware-based solutions. Usually, an NP includes a set of programmable processors that run in parallel, allowing for fast packet processing. NPs provide processor instruction sets optimized for packet processing, often supporting hardware multithreading, as well as memories hierarchically distributed among the processors. As of today, an undisputed standard for NPs does not yet exist, and each vendor presents its own different solution. This architectural heterogeneity, which strongly reduces code portability, is one of the main problems slowing down the adoption of NPs, as is the lack of a friendly framework to simplify the programming of such devices. Many works ([1][2][3][4][5][6]) have addressed these issues, but none of them provides a final solution, especially in terms of ease and speed of programming. Moreover, a great design challenge for programmers is the definition of the best parallel paradigm for a piece of software to be run on NPs. Such a decision can vastly influence the performance and extensibility of the application, and must be made with great care. In order to face these programmability issues, some NP vendors such as EZ-chip [7] limit the flexibility of their processors by constraining the applications to be a composition of ready-made and working pieces of software from a vendor-provided set. While this approach is certainly effective, it also limits the programmer's possibilities.
On the other hand, NP makers that stuck to the paradigm of total flexibility, such as Intel [8], do not provide significant solutions to ease the NP programmers' lives. In both cases, no aid is given in solving the design issue of finding efficient parallel schemes. In this paper, we focus on Intel NPs because of their widespread diffusion in both academia and industry. The technique we propose is nevertheless generic, and can be applied to many other NPs and reconfigurable processing units. Intel calls microengines (µes) the atomic processor cores in its IXP family of NPs. Two programming languages are provided for the µes: the µ-assembly assembly language and a sort of C language called µ-c. While the presence of a C compiler can be an attraction and an advantage, a well-established choice in the literature is to code in µ-assembly. The reason is simple: µ-c is a proprietary C with so many differences (both additions and restrictions) with respect to standard C that programmers usually decide to switch to a new low-level language, which has the undoubted advantage of allowing for very optimized and fast code. The optimization (and hence the speed) of a piece of software on a Network Processor is actually an important factor that determines the real value of the overall application. In addition, the speed of an application also depends on the parallelism level of the code, whose optimal value is not easily determined before actual implementation and testing. For these reasons, in this paper we propose a simplifying approach to NP programming by means of a small and programmer-friendly library, providing basic functionalities for the parallel execution of code and for processor management on an NP platform. The library also allows for testing and analyzing the application code on a cluster of standard PCs, thanks to the ASSIST [9] environment. This approach results in a multi-phase development process and provides the capability to automatically determine the best code placement over the set of cores of the Network Processor and the optimal level of parallelism needed. The approach is extensible to other NP platforms, also because both the developed library and the ASSIST environment are easily portable. In order to prove the effectiveness of our methodology, we adopt as a case study a processing block which classifies packets according to a deterministic algorithm processing five header fields. As mentioned above, the greater ease of programming given by our approach is counterbalanced by a performance loss, which we evaluate by comparing the µ-c version of the packet classifier code with its µ-assembly version. We will see that the performance loss is mainly due to the use of a non-optimized µ-c compiler, and can thus be reduced both by improving the compiler and by hand-optimizing key application blocks. The paper is organized as follows. Sect. II describes previous work in the area of NP programming. Sect. III illustrates the Intel IXP2XXX Network Processor platform, focusing on the IXP2400, on which we performed our experiments. Sect.
IV surveys the ASSIST programming environment, which we use as a parallel framework. In Sect. V our methodology is described, and Sect. VI explains in detail the development phases of a parallel application for complex multicores, as well as the µal library, which provides compatibility between µ-c and C code. Finally, Sect. VII presents the case study application and shows the first results.

II. RELATED WORK

Programming NPs involves dealing with multiple general-purpose and specialized processor cores, as well as multiple memory and interconnection technologies both on- and off-chip. There is extreme architectural heterogeneity across vendors and products, and several studies ([1][2]) have already observed a critical lack of tools supporting an integrated, portable approach to partitioning and scheduling code blocks and managing data transfer within NP platforms. In one sentence, NPs tend to share programming complexity and a lack of programmer-friendly development tools. Intel's MicroACE [3] targets products of the Intel Internet Exchange Architecture (IXA) family, starting from the IXP1200 model. In MicroACE, proxy-like software elements on the IXP1200's StrongARM control processor (ACEs) are linked to blocks of code on the microengines, so that loading an ACE element causes the corresponding microblock to be transparently loaded onto a µe. Microblocks implementing the fast path for data can choose to bounce packets to their associated ACE blocks on the slow path to deal with special cases. MicroACE provides a useful abstraction, but it is limited to IXA products. Microblocks can be combined only according to topologies chosen by their authors, and dynamic software reconfiguration is not possible. The Intel Architecture Tool, which is also part of the Intel SDK, provides an Eclipse plugin compatible with the Developer Workbench, simplifying the creation and composition of building blocks into applications.
However, block programming is still bound to µ-c or µ-assembly, and the product only supports a subset of the IXP2XXX family, e.g. not including the 1200 and 2400 models. Teja NP [4] targets the IXP1200 and IBM PowerNP series. It focuses on the provision of an integrated tool chain and development environment, but provides minimal architectural abstraction and therefore minimal design portability. NetBind [5] provides the abstraction of a set of packet-processing components that can be bound into a data path. It separates the raw functionality of a microblock from the way it is composed with others, and also allows microblock compositions to be dynamically reconfigured. However, it does not provide any abstraction over the NPs' interconnects or over different sorts of processors. Other approaches have been proposed to accommodate architectural complexity and heterogeneity, and to support design portability, high performance and runtime reconfigurability by providing more effective abstractions. NP-Click [6] is also a component-based programming model for NPs, employing a richer component model than NetBind and providing greater flexibility in linking microblocks, but it still falls short of providing a generic approach to NP programming. OpenCOM [2] is a low-overhead generic programming model for NPs, based on a run-time software component model. The approach promotes design portability across a wide variety of processors, while simultaneously exploiting hardware assists that are specific to individual NPs. It features a distributed runtime with a low memory footprint, employs formally specified interfaces, supports components written in different programming languages, and uniformly abstracts over different processor types and different inter-processor communication mechanisms.
Although very promising, OpenCOM requires reimplementing the component runtime from scratch on any new architecture, and only experiments on the IXP1200 and IXP2400 NPs are reported in the literature. The PacLang [10] approach tries to provide a framework for programming Network Processors that mimics already-known programming tools. The PacLang compiler takes two files as input: 1) a high-level PacLang program; 2) an Architecture Mapping Script (AMS). The former does not contain any architecture-specific details, and it is written in a C-like syntax extended with a special type system for packets. The latter tells the compiler how to transform the code into one that runs efficiently on the target architecture. The clean separation improves both code readability and application portability, as changing the target only requires modifying the AMS file. The PacLang approach achieves very high performance on the IXP2400, close to that obtained by using µ-assembly. Our approach is similar to PacLang in easing the programming of NPs, but we introduce different fundamental features. Our library allows programming NPs by means of a subset of the C syntax, and by using ASSIST in the early development we allow programmers to start from off-the-shelf code, and to test and evaluate different parallel solutions long before beginning to work on the NP board.

III. EXPERIMENTAL PLATFORM

Intel IXP2XXX is a family of Network Processors designed to perform a wide range of functionalities, including multi-service switches, routers, broadband access devices and wireless infrastructure systems. The differences among models are related to the number of processing units, or to the availability of specific functionalities (for instance, units which support encryption algorithms). The IXP2XXX are fully programmable and implement a high-performance parallel processing architecture on a single chip, suitable for processing complex algorithms, deep packet inspection and traffic management at wire speed. They have been granted to Universities through the Intel IXA University Program, along with the development environment, which includes the IXA Portability Framework.

Figure 1. Scheme of the IXP2400.

A. Hardware Platform

Fig. 1 shows a scheme of the functional units and internal connections of the IXP2400, the NP used in this work. It contains 9 programmable processors: an Intel XScale and 8 units called µes, which are divided into 2 clusters of 4 (ME 0:0... ME 1:3).
The figure also shows the different available memories (SRAM, DRAM, scratchpad), which have different sizes and operational features, and the presence of shared functional units with specific purposes, e.g. I/O interfaces and hardware-accelerated units computing hash functions. The XScale processor supports real-time operating systems for embedded systems, such as VxWorks or Linux, and can therefore take advantage of the C/C++ compilers available in those environments. The µes have a specific instruction set for processing packets, comprising 50 different instructions, including ALU (Arithmetic Logic Unit) opcodes which work on bits, bytes and longwords, and can introduce shifts or rotations in a single operation. There is no support for integer division or floating-point operations. In the IXP2400, both the XScale core and the µes run at 600 MHz. Each µe is a 6-stage pipelined processor, and fetches one instruction per cycle from its instruction store, which contains up to 4K 40-bit instructions. The XScale is in charge of loading the code into the instruction stores at startup. Jumps or context swaps can result in a performance reduction, due to pipeline refilling. To reduce the branch and memory access overhead, each µe can run up to 8 threads with hardware support for zero-overhead context switching.

1) Memory banks: IXP2XXX Network Processors can access 4 different types of memory, listed in Table I, from the fastest to the slowest: local memory, scratchpad, external SRAM, and DRAM. The local memory consists of 32-bit words and can be accessed only by the single microengine that contains it, while the other memories are shared. The scratchpad is an on-chip SRAM memory that is mainly adopted for the management of FIFO circular queues (called rings), which are the main communication channel among µes.

2) NP evaluation board: The target platform of this work is the Radisys ENP-2611 board [11].
It is a PCI-X card, equipped with the IXP2400 NP, which can sustain a nominal line rate of 2.5 Gbps. The card provides 8 Mbytes of SRAM and 256 Mbytes of DRAM.

Table I. RADISYS ENP-2611 MEMORIES.

Memory         Logical Width (bytes)   Size (bytes)   Latency (clock cycles)
Local Memory   -                       -              -
Scratchpad     4                       16k            60
SRAM           4                       8M             90
DRAM           8                       256M           120

B. Software Tools

The Intel programming model allows for and encourages the development of modular software. The two hierarchical layers of the NP require two separate programming environments and two different languages: µ-c or µ-assembly at the µe level and standard C or C++ at the XScale level. The development of the µe code is done through the Developer Workbench [12], an Integrated Development Environment (IDE) supporting both µ-c and µ-assembly programming, an ARM-targeting compiler for the XScale core, and a µe simulator. The modular structure of the software, which enables code portability, is based on two types of building blocks: core components at the XScale level and microblocks at the microengine level. Each building block provides a packet processing functionality (e.g. NAT, forwarding, Ethernet bridging, etc.). Programmers can use these elements, build new ones and combine them to create an application. Some blocks are called driver-blocks and perform those operations most dependent on the underlying hardware architecture, such as packet reception, transmission or queue handling. When custom µe blocks are produced, the compilation process of a piece of µ-assembly requires several passes (preprocessor, assembler and an optimizing allocator) in order to produce an executable image for a µe. The IDE provided with the Intel SDK, the Developer's Workbench, allows for writing and debugging µ-assembly or µ-c code in a visual environment in Microsoft Windows. A clock-cycle-accurate simulator of the µes is also available for debugging purposes. It precisely recreates the system behavior and is a valuable tool for testing application prototypes, saving on the need to port the code onto the NP hardware just for the sake of performing accurate measurements.

IV. HIGH LEVEL TOOLS FOR PARALLEL PROGRAMMING

ASSIST (A Software development System based on Integrated Skeleton Technology) is a parallel programming environment based on skeleton and coordination language technology [9]. Structured parallel programming, based on the skeleton model, is an approach to deal with the complexity of designing high-performance applications. The validity of this approach has been proved, in recent years, for homogeneous parallel machines (MPP, clusters) and Grid platforms ([13][14]). In the ASSIST environment, applications are developed by means of a coordination language (ASSIST-CL), in which sequential as well as parallel computation units can be expressed. The ASSIST environment provides programmers with an integrated set of compiling tools (astcc, loader) and a portable and very efficient run-time support.
In ASSIST a parallel program can be structured as a generic graph, whose nodes are parallel or sequential modules, and whose arcs are the communication channels where streams of typed values are transmitted between modules. Sequential modules are basically procedure-like wrappings of sequential code written in C, C++ or Fortran. Parallel activities are designed by following regular parallelism exploitation patterns (i.e. skeletons), and are expressed in the ASSIST-CL syntax by the parmod construct. The parmod is a sort of generic skeleton that can emulate the classical parallel skeletons without significant performance degradation. Parallel activities in a parmod are decomposed into sequential indivisible units, which are then assigned to abstract executors, called virtual processors. The ASSIST run-time support allows them to communicate in a controlled and consistent way to access the shared computation state, which is partitioned among the virtual processors. Besides being skeleton-oriented, the coordination model is based on the concept of stream, an ordered sequence of typed values. Streams connect modules in generic graphs, and drive the activation of their computations. Each module can have any number of different input and output streams. By default, the input streams of a module are used in a data-flow way (all inputs are required before module activation), but parmod modules allow selecting inputs in a CSP-like fashion, the language syntax explicitly defining 1) stream associations, 2) prioritised and program-driven input choice, 3) non-deterministic, guard-controlled behaviour. The programming model of ASSIST also includes the concept of external objects, as a means of inter-module interaction, and of interaction with resources outside the ASSIST run-time (such as parallel file-systems, user interfaces, databases).
A software shared memory library is integrated in the environment as an external object, based on the Ad-HOC data service [15] and extending the language with a set of operations to effectively express allocation strategies, access and synchronization patterns. Pointers to shared memory data structures (called references) are an ASSIST basic data type and can be sent over the streams. The two typical stream parallelism exploitation patterns we employed in this work are the pipeline and the task farm (also referred to as master-worker). In a pipeline, each task of an input stream is processed by a sequence of stages. Each stage computes a function of the result provided by the previous stage and delivers its result to the immediately following one. In a task farm, a set, or a stream, of independent tasks is computed to obtain a set of results. A single program or function is used to compute all the results out of the input tasks, selecting for each task a new executor from the pool of available ones.

V. METHODOLOGY

The ASSIST environment simplifies the definition of a pipeline graph whose stages are task-farms or sequential modules. Since the stream paradigm perfectly fits packet processing, it appears natural to employ ASSIST as a programming framework for applications on Network Processors. In the ASSIST environment, the programmer has a greater range of possibilities and more flexibility with respect to straight µ-c or µ-assembly programming. This flexibility shows its true value especially when experimenting with different ways to express the parallel program, transforming a sequential module into a task-farm one or adding/removing modules from the program graph. Furthermore, the ASSIST environment simplifies changing the parallelism degree of a task-farm module over different runs of the same parallel program.
When developing software for the IXP NPs, the parallel program is structured as a graph of modules (in our case it is typically a pipeline of sequential or task-farm modules) where the links among stages are streams of network packet references, as depicted in fig. 5. The sequential functions in each module are coded using the µ-c language and a specifically designed library we named µal (from µe ASSIST Layer). µal provides all its functionalities by exploiting the IXP2XXX hardware, and is thus able to hide all the low-level architecture-dependent details from the application programmer.
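The rings that implement these inter-stage links can be pictured as fixed-size circular FIFO queues of packet references. The following is a minimal sketch of the concept only, with invented names; the actual µal ring API and the hardware-assisted scratchpad rings differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u               /* fixed capacity, power of two */

typedef struct {
    uint32_t slots[RING_SIZE];      /* packet references (e.g. memory handles) */
    unsigned head, tail;            /* head: next get, tail: next put */
} ring_t;

static void ring_init(ring_t *r) { r->head = r->tail = 0; }

/* Producer side: returns false when the ring is full. */
static bool ring_put(ring_t *r, uint32_t ref) {
    if (r->tail - r->head == RING_SIZE) return false;
    r->slots[r->tail % RING_SIZE] = ref;
    r->tail++;
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_get(ring_t *r, uint32_t *ref) {
    if (r->tail == r->head) return false;
    *ref = r->slots[r->head % RING_SIZE];
    r->head++;
    return true;
}
```

On the NP the put/get operations are backed by the scratchpad hardware shared among µes; on the PC cluster the same interface can sit on an ordinary in-memory queue, which is what lets the module code move between the two environments unchanged.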

Figure 2. The three-phase development model.

ASSIST already provides support for the C, C++, and F77 sequential languages; for the purposes of this work, we extended the astcc compiler to also support µ-c sequential code. The ASSIST compiler just performs simple code transformations in order to let the µ-c code be compiled by a standard C compiler when compiling for an ordinary CPU. As a result of the compilation process, each module of the ASSIST pipeline has an input and an output ring. The connection between the ASSIST streams and the internal rings is transparently managed by the ASSIST compiler and the run-time support. All the threads inside each module can see only the input and the output rings, while it is up to the run-time support to fetch a task from the input stream and put it into the input ring (and vice versa for the output ring). The threads access the data in the rings concurrently; the programmer can control the scheduling of the threads by accessing the rings through specific functions provided by the µal library.

VI. DEVELOPMENT PHASES

Our proposal for the development of parallel applications for complex multicore architectures (like the Intel IXP2XXX Network Processors) is shown schematically in fig. 2. The application design life-cycle proceeds through an iterative cycle involving three main phases: high-level ASSIST prototyping, clock-cycle-accurate simulation, and fine-grained optimization.

a) Phase 1: The programmer uses the ASSIST environment for rapid prototyping of the parallel application by writing the sequential code of each module in µ-c, with the code of the 8 threads in each module expressed using the well-known SPMD (Single Program Multiple Data) model. The threads access the memory hierarchy and all NP hardware by using the functions provided by the µal library.
For this first stage, the code is compiled to run on a cluster of PCs, employing a C++ compiler and a specific version of the µal library, in order to produce ordinary ASSIST parallel executables. Besides simplifying functional debugging of the program, a first run under this approach gathers a number of valuable measurements on the execution, such as the number of accesses to each memory level, the threads' idle time, and the input and output throughput per stage. These first results can, in turn, help evaluate the choices made when parallelizing the application, and allow for evolving its structure much more quickly than with direct µ-c programming. It should be noticed that (a) the programmer is also free to employ off-the-shelf C and C++ code within the modules, enjoying a smooth transition when porting a conventional code base to the NP platform, and (b) the development is not yet tied to a specific version of the IXP NP. As soon as the application is completely coded in µ-c and the test results are good, the programmer can move on to the next phase.

b) Phase 2: Once the prototype has been tested and debugged on a cluster of PCs, the code of each module, produced by the ASSIST compiler and provided with all the necessary NP management functionalities from µal, is imported without any modification into the Developer's Workbench simulator. The aim of the simulation, which evaluates the latency in clock cycles of each module, is to allow for finer performance tuning, ensuring that the service time of all modules satisfies the requirements of the application. According to the results provided by the simulation phase, the programmer may still choose to split (or merge) stages and/or transform a sequential stage into a task-farm module, possibly reverting back to the first development phase to restructure the code.
c) Phase 3: Only at the end of the second phase, and only if it is strictly necessary in order to meet the application requirements, is the code manually optimized by rewriting critical code blocks, or whole modules, in µ-assembly. Even when this choice is mandated, the developer still retains several advantages from the approach: µ-assembly is used only when really needed, after inspection of the compiled µ-c code detects inefficiencies, and according to a sound performance analysis; there is reference µ-c code with correct functional behavior to help debug the µ-assembly; and all functionalities of µal are also directly available from µ-assembly, providing efficient and correct implementations of code block cooperation and simplifying the optimization phase.

A. The µal library

As previously said, the µal library has been specifically designed to hide all the low-level details of the underlying NP architecture. The library interface is generic enough to be easily ported to other architectures, and the functionalities provided can be divided into the following categories:

Memory access functions: there are distinct functions for each level of the memory hierarchy: SRAM, DRAM and scratchpad. All the functions read or write contiguous blocks of data of any size, either synchronously or asynchronously. The local memory of each µe is left to the management of the µ-c compiler.

Synchronization primitives among threads: the library provides the user with intra-µe and inter-µe synchronization functions. Lock/unlock and test-and-set methods have been implemented both in scratchpad and SRAM memory.

Ring access functions: the library implements initialization and get/put methods.

Signals management functions:

the Intel IXP2XXX processor family uses non-preemptive scheduling, where signals are adopted to determine when memory accesses have completed or, more generally, to deal with context switches.

Thread scheduling control functions: these functions allow for a correct scheduling order among the 8 threads in a µe (for example, in a round-robin fashion). The order of the threads practically reflects packet ordering.

Various utility functions: these functions support the rest of the interface.

The µal library allows for the use of C-like code that can be compiled by the ASSIST tools (see Fig. 4). This is obtained through a series of functions that hide the use of signals and mask the effect of the different type qualifiers specifying the memory placement of variables. This is an essential advantage on the IXP2XXX platform, where proper SRAM and DRAM synchronization is otherwise left to the µ-assembly programmer.

/**************************************
 * Function that reads and analyzes
 * the first node of the classifier
 **************************************/
int find_sa1(int field) {
    lm_t result;                /* local memory location */
    lm_t mask = 0xFFFFFF;
    sram_r_t reg_read;          /* SRAM read transfer register */
    sram_address_t address;     /* SRAM address */

    copy_buffer_from_sram(&reg_read, class_base[0], 1);
    result = reg_read >> 24;
    if (result != 255) {
        result = result << 24;
        go_up[0] = (result | mask) & go_up[0];
    }
    address = class_base[0] + field;
    copy_buffer_from_sram(&reg_read, address, 1);
    result = reg_read >> 24;
    return result;
}

Figure 4. Example of C code using the µal library.

The library interface has been implemented in two different versions: the first one in plain C++, as part of the ASSIST run-time environment, and the second one in µ-c, as an integration of the Intel SDK environment. As sketched in fig. 3, the ASSIST compiler compiles the code of each stage and produces a set of source and include files in µ-c, with calls to some functions of the µal library.

Figure 3. Schematic view of the software development phases.
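The two versions of the library can be imagined as one interface with two backends selected at build time. The sketch below is purely illustrative: the function names mimic Fig. 4, but the emulation strategy and the TARGET_IXP macro are our assumptions, not the actual µal sources. The cluster-side backend stands in for SRAM with an ordinary array and counts accesses, the kind of measurement exploited in phase 1.

```c
#include <stddef.h>
#include <stdint.h>

/* One interface, two backends (hypothetical sketch of the idea). */
void copy_buffer_from_sram(uint32_t *dst, uint32_t addr, size_t nwords);
void copy_buffer_to_sram(uint32_t addr, const uint32_t *src, size_t nwords);

#ifdef TARGET_IXP
/* Phases 2/3: mu-C bodies issuing real SRAM transfers and waiting
 * on completion signals would go here. */
#else
/* Phase 1: standard C on the PC cluster. */
static uint32_t emulated_sram[1024];   /* stand-in for the external SRAM */
static unsigned sram_access_count;     /* profiling counter for phase 1 */

void copy_buffer_from_sram(uint32_t *dst, uint32_t addr, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        dst[i] = emulated_sram[addr + i];
    sram_access_count++;
}

void copy_buffer_to_sram(uint32_t addr, const uint32_t *src, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        emulated_sram[addr + i] = src[i];
    sram_access_count++;
}
#endif
```

The application code, such as Fig. 4's find_sa1, is written once against the common declarations and recompiled unchanged for either target.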
The files are compiled with the standard C compiler and linked against the µal assist library in order to build the executable program of each stage of the pipeline. At the same time, the set of files produced by the compiler can be imported into a Developer's Workbench project and compiled with the µ-c compiler, linking the µal ixp.c file which implements the µal functions in µ-c. Each stage of the ASSIST pipeline is mapped on a single µe according to the pipeline implementation template described in the next subsection.

Figure 5. Implementation templates of the ASSIST graph for the Intel IXP2XXX processor family.

B. Implementing the pipeline and farm on the Intel IXP NP

The implementation template of the ASSIST pipeline for the Intel IXP2XXX NP family is shown in fig. 5. Each stage of the pipeline is mapped on a single µe, and the communication channel between two stages is implemented with a ring. A task-farm module comprises a pool of µes, each one executing the same worker code. Each worker has its own input and output rings. Packets are distributed to the workers, according to a scheduling function, by the stage preceding the task-farm module in the pipeline, which we call the scheduler stage. The threads in the scheduler stage call a scheduling function to choose the ring in which to put the output packet. It is possible to implement very complex scheduling functions, but for the sake of simplicity we have chosen fast and simple round-robin scheduling. The same policy is also applied when collecting packets from the output rings of each worker, in the module that follows the task-farm one (the collector stage). A more advanced implementation template for the task-farm

module exploiting a more responsive scheduling function is also possible, but to allow packet reordering after the farm, an additional ring between the scheduler stage and the collector stage is needed, where the indices of the scheduled workers are put.

Figure 6. Schematic view of the Classifier test application.

VII. APPLICATION AND FIRST RESULTS

In this section, we show the adoption of our approach in a real case study: the development of a packet classifier.

A. Packet Classifier description

Basically, our packet classifier is an application that labels traffic flows according to the value of a 5-tuple of elements: IP Source Address, IP Destination Address, Layer 4 Protocol, Source Port and Destination Port. Packet classification plays an important role in computer networks and is ubiquitous in many scenarios (such as traffic filtering, access control, billing, etc.). Therefore classification speed is a critical issue for any implementation, and a well-performing classifier requires a considerable effort of optimized development. Both the challenging nature of the task and the sensitivity of its performance to code optimization motivate our choice of a packet classifier as the reference application. Specifically, the reference classifier we adopt in these tests is based on a multi-dimensional multi-bit trie algorithm, as described in [16]. The code has been written both in µ-assembly and in µ-c with the µal library. The assembly version of the classifier has been extensively tested and evaluated in previous works [17], [18], confirming its high performance and its capability to sustain up to 3 million packets per second.
Since in the worst case 11 memory accesses to the classification table have to be performed per packet, in order to reduce the processing burden the classification table is stored in the external SRAM (i.e., the fastest among the external memories), which is large enough to hold a large number of classification rules (on the order of hundreds of thousands). The classifier presented in [17] is reconfigurable at runtime in its set of rules and provides a configurable cache in order to reduce the classification cost in terms of memory accesses. However, since in this paper we only aim at providing a proof of concept to highlight the advantages of the proposed methodology, we focus on a simplified version of the classifier that does not include the advanced reconfiguration features.

B. Implementation and Results

In order to evaluate the overhead introduced by the proposed methodology (especially by µ-C and the µal library), we have developed a classifier prototype by coding the application in ASSIST (adopting the µal library) and then testing the code produced by the compiler on the IXP platform. We have compared the performance of the application on a single µe with a native implementation written in µ-assembly. As shown in fig. 6, the application is a pipeline whose first and last stages read/write packets from/to the network interfaces (or from/to files in the cluster-based testing). The intermediate stage applies the classification algorithm, retrieving the needed data from the memory hierarchy through the µal library functions. The first pipeline stage stores the entire packet data in DRAM and the header and metadata in SRAM, while a reference to the packet is put into the output ring. Conversely, the last stage receives packet references from the input ring, retrieves the packet data from the shared memory and sends them out on the network.
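The reference-passing scheme between stages can be mimicked in portable C with single-producer/single-consumer rings of packet handles, as in the minimal sketch below. All names here are our own assumptions for illustration: this is not the µal API, and on the IXP the rings are hardware-assisted and the packet data lives in DRAM/SRAM rather than in process memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS  256   /* capacity of each ring (illustrative) */
#define NUM_WORKERS 4     /* farm workers, one µe each in the template */

/* Single-producer/single-consumer ring of packet references.
 * head and tail are monotonically increasing counters; the slot
 * index is taken modulo RING_SLOTS. */
struct ring {
    uint32_t buf[RING_SLOTS];  /* packet handles (e.g. DRAM offsets) */
    size_t head;               /* next slot to read  (consumer) */
    size_t tail;               /* next slot to write (producer) */
};

static int ring_put(struct ring *r, uint32_t handle)
{
    if (r->tail - r->head == RING_SLOTS)
        return -1;             /* ring full: drop or apply back-pressure */
    r->buf[r->tail % RING_SLOTS] = handle;
    r->tail++;
    return 0;
}

static int ring_get(struct ring *r, uint32_t *handle)
{
    if (r->head == r->tail)
        return -1;             /* ring empty */
    *handle = r->buf[r->head % RING_SLOTS];
    r->head++;
    return 0;
}

/* Scheduler stage: distribute packet handles round-robin over the
 * input rings of the farm workers, as in the template of Sec. VI-B. */
static void schedule(struct ring workers[NUM_WORKERS], uint32_t handle)
{
    static unsigned next;      /* round-robin cursor */
    (void)ring_put(&workers[next], handle);
    next = (next + 1) % NUM_WORKERS;
}
```

The collector stage applies the same round-robin policy in reverse, calling ring_get on each worker's output ring in turn.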
In ASSIST, the input and output rings of each module are implemented around the stream concept, and the accesses are completely masked by the µal library interface. The code has first been tested and debugged on a cluster of PCs; subsequently, we have taken the files produced by the ASSIST compiler for the intermediate stage and recompiled them in the Intel SDK environment, linking the IXP version of the µal library. On the IXP platform, the RX and TX codes, which read and send packets from/to the interfaces, are standard driver microblocks provided by Radisys [11]. The resulting values for performance and code metrics are compared in tab. II and III. Specifically, the first table shows a comparison between the µ-assembly and the µ-C code (optimized for speed or size when specified) in terms of three code metrics:

- Instruction Store Words: the actual code size within the instruction store, measured in 40-bit machine instructions;
- Minimum Available Registers: the minimum number of available registers within any section of the code; this field reports the number of General Purpose (G), SRAM Transfer (S) and DRAM Transfer (D) registers;
- Available Local Memory: the number of available 32-bit words in the µe after compilation.

A first general comment on the values of tab. II is that, as intuition suggests, the µ-assembly code is much more compact than its µ-C counterpart. Indeed, the µ-C classifier can be from 2 (code optimized for size) to 3 times (unoptimized code) larger than the µ-assembly version. This code inflation can introduce some difficulties when developing larger software running on the same µe. The adoption of the µ-C compiler and its runtime support library, which use Local Memory as temporary storage for small amounts of data, explains the different register usage reported, as well as part of the performance loss.
A slight reduction (<6%) of the amount of Local Memory available for other code corresponds to more general-purpose registers and fewer SRAM registers being available, in spite of the fact that µal exploits global registers. Performance results corresponding to the code metrics discussed are shown in table III. The table reports results in terms of maximum packet rate (in thousands of packets per second: Kpps) and processing delay (in clock cycles: cc) for the same versions of the classifier reported in table II. The results confirm the intuition that µ-C code is not as fast as its µ-assembly counterpart. The non-optimal machine code generated by the µ-C compiler also implies a larger processing delay for the µ-C code. The speed-optimized µ-C classifier is fast enough to compete with the µ-assembly version and be useful in practice (about 10% slower in terms of Kpps), but clearly improvements in the µ-C compiler and runtime would make a stronger case for our approach.

VIII. CONCLUSIONS AND FUTURE WORK

Network Processors are powerful and flexible architectures for packet processing at high speed. However, their complex programming (often associated with the challenge of defining an efficient parallel design) discourages their widespread adoption in network appliances. In this paper we have presented a novel approach providing an easier way to program Intel IXP NPs. This approach, which can easily be extended to other multicore systems, takes advantage of the ASSIST environment as an aid in the choice of the best parallel paradigm for an application and as a reference for designing a parallel run-time library for NP platforms. In adding support for the Intel IXP programming languages to the ASSIST environment, we provided tools to simulate the behavior of an NP application on a cluster of standard PCs. Thus, execution statistics can be gathered and rough performance figures can be extrapolated before the actual porting of the code to the NP. The actual coding is performed in standard C for all or most of the development and is supported by a library of C functions which simplifies the programming phase. It is easy to test and evaluate different parallel decompositions of the application before reaching the hardware-dependent development phase.
The consequent cost reductions in application development, thanks to shorter times and reduced complexity, ease the economic exploitation of NP platforms. Experimental results on a multi-dimensional multi-bit trie packet classifier have shown limited performance differences with respect to a carefully hand-optimized version of the code, which has already been employed in several projects. Even better results can be obtained by integrating into our approach high-level languages that are well optimized for the target NP architecture, thus avoiding low-level coding altogether. Indeed, our approach does not rule out integration with languages specifically designed for this purpose (such as PacLang), as our run-time support has been designed to be small and general, in order to allow its porting to other NP architectures and tool-chains with low effort and good performance.

Table II
COMPILATION RESULTS FOR THE µ-C AND µ-ASSEMBLY VERSIONS OF THE PACKET CLASSIFIER CODE

  Code         Opt.    Instr. Store Words   Min. Avail. Regs   Avail. LM (words)
  µ-assembly   No      486                  G:9  S:8 D:8       640
  µ-assembly   Yes     448                  G:9  S:8 D:8       640
  µ-C          No      1225                 G:11 S:0 D:8       603
  µ-C          Speed   1154                 G:15 S:0 D:8       603
  µ-C          Size    913                  G:15 S:0 D:8       603

Table III
PERFORMANCE RESULTS OF THE µ-C AND µ-ASSEMBLY VERSIONS OF THE PACKET CLASSIFIER CODE

  Code         Opt.    Max Pkt rate (Kpps)   Proc. delay (cc)
  µ-assembly   No
  µ-assembly   Yes
  µ-C          No
  µ-C          Speed
  µ-C          Size

ACKNOWLEDGEMENTS

This work has been financed by Fondazione della Cassa di Risparmio di Pisa as part of the project FRINP ("Reconfigurable Firewall on Network Processor platforms").

REFERENCES

[1] C. Kulkarni, M. Gries, C. Sauer, and K. Keutzer, "Programming challenges in network processor deployment," in CASES '03. New York, NY, USA: ACM, 2003.
[2] K. Lee and G. Coulson, "Supporting runtime reconfiguration on network processors," in 20th Int'l Conf. on Advanced Information Networking and Applications. IEEE Computer Society, 2006.
[3] Intel IXP1200 Processor-based MicroACE Test Framework. [Online].
[4] Teja, "Teja NP: The first software platform for multiprocessor system-on-chip architectures."
[5] A. Campbell, S. Chou, M. Kounavis, V. Stachtos, and J. Vicente, "NetBind: a binding tool for constructing data paths in network processor-based routers," in OpenArch. IEEE, 2002.
[6] N. Shah, W. Plishker, K. Ravindran, and K. Keutzer, "NP-Click: A productive software development approach for network processors," IEEE Micro, vol. 24, no. 5.
[7] EZ-Chip Network Processors.
[8] Intel Network Processors, intel.com/design/network/products/npfamily.
[9] M. Vanneschi, "The programming model of ASSIST, an environment for parallel and distributed portable applications," Parallel Comput., vol. 28, no. 12.
[10] R. J. Ennals, R. W. Sharp, and A. Mycroft, "Task partitioning for multicore network processors," in Compiler Construction: 14th International Conference, ser. LNCS. Springer, April 2005.
[11] Radisys ENP. [Online].
[12] Intel, "IXP2400/2800 Developer's Tool User Guide."
[13] M. Aldinucci, M. Coppola, M. Danelutto, N. Tonellotto, M. Vanneschi, and C. Zoccolo, "High level grid programming with ASSIST," Computational Methods in Science and Technology, vol. 12, no. 1.
[14] M. Aldinucci, A. Petrocelli, E. Pistoletti, M. Torquati, M. Vanneschi, L. Veraldi, and C. Zoccolo, "Dynamic reconfiguration of grid-aware applications in ASSIST," in Proc. of 11th Intl. Euro-Par 2005 Parallel Processing, ser. LNCS, J. C. Cunha and P. D. Medeiros, Eds.
[15] M. Aldinucci, M. Danelutto, G. Giaccherini, M. Torquati, and M. Vanneschi, "Towards a distributed scalable data service for the grid," in PARCO 2005, Malaga, ser. NIC, vol. 33. John von Neumann Institute for Computing, Dec. 2005.
[16] S. Giordano, G. Procissi, F. Rossi, and F. Vitucci, "Design of a multi-dimensional packet classifier for network processors," in ICC '06, vol. 2, June 2006.
[17] D. Ficara, S. Giordano, F. Rossi, and F. Vitucci, "ReFine: The reconfigurable packet filtering on network processor," International Journal of Communication Systems, vol. 21, no. 11.
[18] D. Ficara, S. Giordano, F. Oppedisano, G. Procissi, and F. Vitucci, "A cooperative PC/network-processor architecture for multi gigabit traffic analysis," in IT-NEWS, Feb. 2008.


More information

Euro-Par Pisa - Italy

Euro-Par Pisa - Italy Euro-Par 2004 - Pisa - Italy Accelerating farms through ad- distributed scalable object repository Marco Aldinucci, ISTI-CNR, Pisa, Italy Massimo Torquati, CS dept. Uni. Pisa, Italy Outline (Herd of Object

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/

More information

CS 101, Mock Computer Architecture

CS 101, Mock Computer Architecture CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

A Process Model suitable for defining and programming MpSoCs

A Process Model suitable for defining and programming MpSoCs A Process Model suitable for defining and programming MpSoCs MpSoC-Workshop at Rheinfels, 29-30.6.2010 F. Mayer-Lindenberg, TU Hamburg-Harburg 1. Motivation 2. The Process Model 3. Mapping to MpSoC 4.

More information

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Distributed transactions (quick refresh) Layers of an information system

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Distributed transactions (quick refresh) Layers of an information system Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Distributed Information Systems Architecture Chapter Outline

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm

Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm Achieving Predictable Multicore Execution of Automotive Applications Using the LET Paradigm Alessandro Biondi and Marco Di Natale Scuola Superiore Sant Anna, Pisa, Italy Introduction The introduction of

More information

Network Processors Outline

Network Processors Outline High-Performance Networking The University of Kansas EECS 881 James P.G. Sterbenz Department of Electrical Engineering & Computer Science Information Technology & Telecommunications Research Center The

More information

Network protocols and. network systems INTRODUCTION CHAPTER

Network protocols and. network systems INTRODUCTION CHAPTER CHAPTER Network protocols and 2 network systems INTRODUCTION The technical area of telecommunications and networking is a mature area of engineering that has experienced significant contributions for more

More information

19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 28 19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 29. <?xml version="1.0"?

19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 28 19/05/2010 SPD 09/10 - M. Coppola - The ASSIST Environment 29. <?xml version=1.0? Overall picture Core technologies Deployment, Heterogeneity and Dynamic Adaptation ALDL Application description GEA based deployment Support of Heterogeneity Support for Dynamic Adaptive behaviour Reconfiguration

More information

1. Introduction to the Common Language Infrastructure

1. Introduction to the Common Language Infrastructure Miller-CHP1.fm Page 1 Wednesday, September 24, 2003 1:50 PM to the Common Language Infrastructure The Common Language Infrastructure (CLI) is an International Standard that is the basis for creating execution

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory

A Simulation: Improving Throughput and Reducing PCI Bus Traffic by. Caching Server Requests using a Network Processor with Memory Shawn Koch Mark Doughty ELEC 525 4/23/02 A Simulation: Improving Throughput and Reducing PCI Bus Traffic by Caching Server Requests using a Network Processor with Memory 1 Motivation and Concept The goal

More information

AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz

AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz Results obtained by researchers in the aspect-oriented programming are promoting the aim to export these ideas to whole software development

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf Instituto de Investigación en Informática LIDI (III-LIDI),

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava Hardware Acceleration in Computer Networks Outline Motivation for hardware acceleration Longest prefix matching using FPGA Hardware acceleration of time critical operations Framework and applications Contracted

More information

10 Steps to Virtualization

10 Steps to Virtualization AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where

More information

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER

DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER International Journal of Computer Engineering and Applications, Volume VI, Issue II, May 14 www.ijcea.com ISSN 2321 3469 DESIGN AND IMPLEMENTATION OF OPTIMIZED PACKET CLASSIFIER Kiran K C 1, Sunil T D

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors Govindan Ravindran Newbridge Networks Corporation Kanata, ON K2K 2E6, Canada gravindr@newbridge.com Michael

More information

Teaching Network Systems Design With Network Processors:

Teaching Network Systems Design With Network Processors: Teaching Network Systems Design With Network Processors: Challenges And Fun With Networking Douglas Comer Computer Science Department Purdue University 250 N. University Street West Lafayette, IN 47907-2066

More information