A Register Allocation Framework for Banked Register Files with Access Constraints


Feng Zhou 1,2, Junchao Zhang 1, Chengyong Wu 1, and Zhaoqing Zhang 1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{zhoufeng, jczhang, cwu, zqzhang}@ict.ac.cn
2 Graduate School of the Chinese Academy of Sciences, Beijing, China

Abstract. Banked register files have been proposed to reduce die area, power consumption, and access time. Some embedded processors, e.g. Intel's IXP network processors, adopt this organization. However, they expose access constraints in the ISA, which complicates the design of register allocation. In this paper, we present a register allocation framework for banked register files with access constraints, targeting the IXP network processors. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as of splitting it across multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits. Compared to previous work, our framework better balances the register pressure among multiple banks and improves the performance of typical network applications.

1 Introduction

Network processors have been widely adopted as a flexible and cost-efficient solution for building today's network processing systems. To meet the challenging functionality and performance requirements of network applications, network processors incorporate unconventional, irregular architectural features, e.g. multiple heterogeneous processing cores with hardware multithreading, an exposed memory hierarchy, banked register files, etc. [1]. These features present new challenges for optimizing compilers.

Instead of building a monolithic register file, banked register files have been proposed to provide the same bandwidth with fewer read/write ports [2]. The reduction in the number of ports lowers the complexity of the register file, which in turn reduces die area, power consumption, and access time [3][4]. With banked register files, however, conflicts arise when the number of simultaneous accesses to a single bank exceeds the number of ports of that bank. While superscalar designs typically resolve bank conflicts with additional logic, embedded processors mostly leave the problem to the programmer or compiler by exposing access constraints in the ISA, thereby simplifying the hardware design. For the compiler, this complicates register allocation: in addition to the interferences between two virtual registers, there may be bank conflicts between their uses, which further limit the registers that can be allocated to them.

In this paper, we present a register allocation framework for banked register files with access constraints, targeting the IXP network processors [5][6]. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as of splitting it across multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits, which helps to balance the register pressure among the banks. When splitting a live range, we use copy instructions instead of loads/stores and force the split live ranges into different banks. Though this may introduce additional copies, it can significantly reduce the number of memory accesses.

The rest of this paper is organized as follows. Section 2 introduces the relevant architectural features of the IXP network processor, especially its banked register file organization and the access constraints. Section 3 describes the compilation flow and the proposed register allocation framework, and provides further details on cost and benefit estimation and live range splitting. Section 4 presents the experimental results. Section 5 describes related work, and Section 6 concludes the paper.

2 IXP Architecture and Register File Organization

The IXP network processor family [5] was designed as the core processing component for a wide range of network equipment, including multi-service switches, routers, etc. It is a heterogeneous chip multiprocessor consisting of an XScale processor core and an array of MicroEngines (MEs). The XScale core is used mainly for control-path processing, while the MEs handle data-path processing. The IXP has a multi-level, exposed memory hierarchy consisting of local memory, scratchpad memory, SRAM, and DRAM. Each ME has its own local memory, while scratchpad memory, SRAM, and DRAM are shared by all MEs. The stack is implemented using both local memory and SRAM, starting in local memory and growing into SRAM. The MEs have hardware support for multithreading to hide the latency of memory accesses.

To handle the large amount of packet data and to service the multiple threads, the IXP provides several large register files. Figure 1 is a diagram of an ME's register files. Each ME has four register files: a general purpose register file (GPR) used mainly by ALU operations, two transfer register files for exchanging data with memory and other I/O devices, and a next neighbor register file for efficient communication between MEs. These register files, except next neighbor, are all partitioned into banks. The GPR is divided into two banks, GPR A and GPR B, each with 128 registers. The transfer register files are partitioned into four banks: SRAM transfer in, SRAM transfer out, DRAM transfer in, and DRAM transfer out.

Fig. 1. IXP 2400 Register File Organization

ME instructions can have two register source operands, which we refer to as the A and B operands. There are some restrictions, called the "two source-operand selection rule" in [7], on where the two operands of an instruction can come from. They are summarized as follows [7]:

- An instruction cannot use the same register bank for both the A and B operands.
- An instruction cannot use any of the SRAM transfer, DRAM transfer, and next neighbor registers as both the A and B operands.
- An instruction cannot use an immediate as both the A and B operands.

If an instruction does not conform to the constraints listed above, we say the instruction has a bank conflict and call it a conflict instruction. This poses a new challenge for register allocation, since the allocator now has to deal with the bank assignment of each virtual register. In this paper, we focus on the GPR bank conflict problem, but the proposed technique applies to the other register classes as well.
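To make the selection rule concrete, the following is a minimal sketch of how a compiler pass might classify an instruction as a conflict instruction. The RegClass names and the reading of the second rule (no two source operands both drawn from the restricted classes) are illustrative assumptions, not the ISA encoding.

```python
from enum import Enum, auto

class RegClass(Enum):
    # Illustrative operand classes; names are assumptions, not the ISA's.
    GPR_A = auto()
    GPR_B = auto()
    SRAM_XFER = auto()
    DRAM_XFER = auto()
    NEXT_NEIGHBOR = auto()
    IMMEDIATE = auto()

# Classes that, per the second rule, may feed at most one source operand.
RESTRICTED = {RegClass.SRAM_XFER, RegClass.DRAM_XFER, RegClass.NEXT_NEIGHBOR}

def is_conflict_instruction(op_a: RegClass, op_b: RegClass) -> bool:
    """Return True if the A/B source operand pair violates the
    two source-operand selection rule summarized above."""
    if op_a == op_b:
        return True                       # same bank twice, or two immediates
    if op_a in RESTRICTED and op_b in RESTRICTED:
        return True                       # two restricted-class operands
    return False

# Both operands drawn from GPR bank A is a bank conflict; A plus B is fine.
assert is_conflict_instruction(RegClass.GPR_A, RegClass.GPR_A)
assert not is_conflict_instruction(RegClass.GPR_A, RegClass.GPR_B)
```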

3 A Register Allocation Framework Solving Bank Conflicts

We designed our register allocation framework as part of the Shangri-La infrastructure, a programming environment for the IXP [8]. Shangri-La encompasses a domain-specific programming language named Baker for packet processing applications, a compilation system that automatically restructures and optimizes applications to keep the IXP running at line speed, and a runtime system that performs resource management and runtime adaptation. The compilation system consists of three components: the profiler, the pipeline compiler, and the aggregate compiler. The work presented here is part of the aggregate compiler, which takes aggregate definitions and memory mappings from the pipeline compiler and generates optimized code for each of the target processing cores (e.g. MicroEngines and XScale). It also performs machine-dependent and machine-independent optimizations, as well as domain-specific transformations, to maximize the throughput of the aggregates. Figure 2 illustrates the compilation flow of Shangri-La and highlights the phases related to register allocation.

Fig. 2. Shangri-La Compilation Flow

Our register allocation framework is based on the priority-based coloring approach [6]: the virtual registers are processed in priority order, with the priorities computed in the same way as described in [6]. As illustrated in Fig. 2, we first perform instruction selection. Then, in the register class identification phase, we analyze each instruction to identify the register classes/files in which each symbolic register could reside. We then build the live ranges and the interference graph. This information is needed for both bank conflict resolving and register allocation.

To resolve bank conflicts, we first build a register conflict graph (RCG) [9]. The RCG is an undirected graph whose nodes represent the virtual registers; an edge indicates that two virtual registers cannot be assigned to the same register bank. Based on the RCG, we assign a register bank to each virtual register (the algorithm is shown in Fig. 3). For each virtual register, we estimate the costs and benefits of assigning it to a specific bank. We also estimate the cost of splitting it across multiple banks via copy instructions, which is the total cost of the generated sub-live-ranges plus the cost of the inserted inter-bank copy operations. The cost of a sub-live-range is the minimum of the costs of assigning it to bank A or bank B and can be estimated by invoking the EstimateCost function recursively; we limit the depth of this recursion with a small threshold. We then determine the bank assignment for the virtual register based on an analysis of these costs. The AssignRegBank function assigns the current node to the given bank, marks it as spilled if needed, and updates the CountOfSpills variable.

When all virtual registers have been assigned banks in this way, some instructions may still have both operands assigned to the same register bank, causing bank conflicts. We resolve these conflicts by inserting inter-bank copy instructions before each conflicting instruction. For example, for an instruction r = op(a, b) in which the source operands a and b have been assigned to the same bank, we first introduce a new virtual register c and insert c = a before the instruction, then assign c to the other bank than the one assigned to b, and rename a in r = op(a, b) to c. This guarantees that all conflicts are resolved.
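A minimal sketch of this fix-up pass is shown below, assuming a simple three-address instruction form. The Instr/VirtReg classes and the other_bank helper are illustrative, not the Shangri-La implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VirtReg:
    name: str
    bank: str            # "A" or "B", as decided by bank assignment

@dataclass
class Instr:
    opcode: str
    dst: VirtReg
    srcs: List[VirtReg]

def other_bank(bank: str) -> str:
    return "B" if bank == "A" else "A"

def resolve_remaining_conflicts(code: List[Instr]) -> List[Instr]:
    """Insert an inter-bank copy before every instruction whose two
    source operands ended up in the same GPR bank, as described above."""
    fixed, tmp_id = [], 0
    for ins in code:
        if len(ins.srcs) == 2 and ins.srcs[0].bank == ins.srcs[1].bank:
            a, b = ins.srcs
            tmp_id += 1
            c = VirtReg(f"c{tmp_id}", other_bank(b.bank))  # c goes to the other bank
            fixed.append(Instr("copy", c, [a]))            # insert c = a
            ins = Instr(ins.opcode, ins.dst, [c, b])       # rename a -> c
        fixed.append(ins)
    return fixed
```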

1: procedure BANKCONFRESOLVING(RegisterConflictGraph)
2:   CountOfSpills = 0   // the number of nodes that have been marked spill so far
3:   for all nodes v of RegisterConflictGraph do
4:     CostA = ESTIMATECOST(v, BANK A)   // the cost of assigning v to GPR A
5:     CostB = ESTIMATECOST(v, BANK B)
6:     SplitCost = SPLITCOST(v)
7:     if SplitCost is the minimum of these three costs then
8:       SPLITNODE(v)
9:     else if CostA >= CostB then
10:      ASSIGNREGBANK(v, BANK B)
11:    else
12:      ASSIGNREGBANK(v, BANK A)
13:    end if
14:  end for

Fig. 3. Resolving Bank Conflicts

The traditional register allocation phase then follows, allocating registers within each bank for all virtual registers.

3.1 Cost and Benefit Analysis

Bank assignment decisions are based on an analysis of the costs and benefits of assigning a virtual register to a specific register bank. The compiler can then use these results to trade off between the two banks. Below we describe how the cost-benefit estimation function calculates the impact of the following factors:

Conflict-Resolving Cost: Bank conflicts between two virtual registers must be resolved by inserting copies before conflict instructions. A copy operation costs one cycle, so the total number of inserted copies represents the conflict-resolving cost. The ConflictResolvingCost function shown in Fig. 4 computes the number of instructions that use both the given virtual register and a virtual register that conflicts with it in the RCG.

Spill Cost: Though each ME has 256 GPRs, each thread has only 32 GPRs (in two banks) when the ME runs in 8-thread mode. Assigning too many virtual registers to a single bank may cause spills in that bank while leaving the other bank underutilized, so balancing the register pressure between the two banks is an important consideration in our framework. The SpillCost function estimates this cost for a virtual register. We first count the live ranges that have higher priority and have been assigned the same register bank. If this number is larger than the number of allocatable registers, we treat the virtual register as likely to be spilled and compute the corresponding spill/reload cost; otherwise, the spill cost is zero.

Coalescing Benefit: The source and result operands of a copy instruction can reside in any bank if they are both of GPR type. If they reside in the same bank, later phases (e.g. register coalescing [10]) may have an opportunity to remove the copy instruction. To indicate this possibility, we add a preference set to each RCG node: when we see a copy instruction like a = b, we add a to b's preference set and vice versa. The CoalescingBenefit function calculates the preference benefit, which is essentially the product of a given weight and the number of elements in this set that have already been assigned the given register bank.
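Since CoalescingBenefit is only described in prose, the following small sketch shows one way the preference bookkeeping and benefit calculation could look. The data structures, the default weight, and the reading of "elements assigned a register bank" as "copy-related registers already placed in the candidate bank" are assumptions on our part.

```python
from collections import defaultdict

# preference[v] is the set of virtual registers copy-related to v;
# bank_of[v] records the bank already chosen for v ("A", "B", or absent).
preference = defaultdict(set)
bank_of = {}

def record_copy(dst: str, src: str) -> None:
    """For a copy instruction dst = src, record the mutual preference."""
    preference[dst].add(src)
    preference[src].add(dst)

def coalescing_benefit(vr: str, regbank: str, weight: float = 1.0) -> float:
    """Weight times the number of copy-related registers already placed
    in the candidate bank."""
    return weight * sum(1 for other in preference[vr]
                        if bank_of.get(other) == regbank)

# Example: after a = b, assigning a to b's bank earns a coalescing benefit.
record_copy("a", "b")
bank_of["b"] = "A"
assert coalescing_benefit("a", "A") == 1.0
assert coalescing_benefit("a", "B") == 0.0
```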

1: procedure ESTIMATECOST(vr, regbank)
2:   SpillCost = SPILLCOST(vr, regbank)
3:   ConfResolvCost = CONFLICTRESOLVINGCOST(vr, regbank)
4:   CoalesceBenefit = COALESCINGBENEFIT(vr, regbank)
5:   return SpillCost + ConfResolvCost - CoalesceBenefit

7: procedure CONFLICTRESOLVINGCOST(vr, regbank)
8:   cost = 0
9:   for all edges incident to vr in the RCG do
10:    vr1 = the other end vertex of the edge
11:    if vr1's register bank is regbank then
12:      cost += the number of instructions referring to both vr and vr1 as source operands
13:    end if
14:  end for
15:  return cost

16: procedure SPILLCOST(vr, regbank)
17:  NumOfInterferences = the number of vr's interfering live ranges whose register bank is regbank, whose priority is higher than vr's, and which have not been marked spill
18:  if NumOfInterferences >= NumOfAllocatableRegisters then
19:    if CountOfSpills > local memory spill threshold then
20:      return vr's spill/reload count * SRAM latency
21:    else return vr's spill/reload count * local memory latency
22:    end if
23:  else return 0
24:  end if

Fig. 4. Cost Estimation

3.2 Live Range Splitting

Live range splitting is traditionally performed when a live range fails to get a register, by inserting stores and reloads. In our framework, however, we prefer to split at an earlier stage, when we find that assigning the live range to any bank incurs a high cost. Instead of loads/stores, we use copy instructions to implement the splitting and force the partitioned live ranges into different banks [11]. Compared to traditional splitting, this may result in additional copy instructions; however, it can further balance the register pressure between the two banks and reduce the number of loads/stores, which are much more expensive than copies.

To split the live range of a virtual register, we first build an induced graph over the region of the control flow graph (CFG) in which the virtual register is live. We check each connected component [12] of this subgraph to see whether it can be allocated a register, by comparing the number of available registers with the number of live ranges that interfere with it. If a component does not appear able to get a register, we compute a min-cut for it using the method described in [13]. We add the cut edges to CutEdgeSet and later insert compensation copies on these edges. This process iterates until all components become allocatable. After that, we rename each component with a new symbolic register, assign it to a register bank, and insert the corresponding copy operations according to the CutEdgeSet. The algorithm is shown in Fig. 5.
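A compact sketch of this splitting loop, complementing the pseudocode in Fig. 5, is given below. It assumes the networkx library for graph utilities and a caller-supplied allocatable predicate, and it omits the final renaming and bank-assignment steps; it is a sketch under those assumptions, not the paper's implementation.

```python
import networkx as nx  # assumed graph library; not used by the paper itself

def split_live_range(induced: nx.Graph, allocatable) -> list:
    """Repeatedly min-cut the components of the induced graph that cannot
    get a register, collecting the cut edges on which compensation copies
    will later be inserted. Nodes are basic blocks; edge 'weight' attributes
    hold the profiled execution frequencies; allocatable(nodes) compares the
    free registers against the live ranges interfering with the component."""
    cut_edge_set = []
    work = induced.copy()
    while True:
        # Single-block components cannot be split any further.
        bad = [c for c in nx.connected_components(work)
               if len(c) > 1 and not allocatable(c)]
        if not bad:
            break
        for comp in bad:
            sub = work.subgraph(comp)
            _, (part_a, _) = nx.stoer_wagner(sub, weight="weight")  # min-cut [13]
            side_a = set(part_a)
            crossing = [(u, v) for u, v in sub.edges()
                        if (u in side_a) != (v in side_a)]
            cut_edge_set.extend(crossing)     # compensation copies go here
            work.remove_edges_from(crossing)  # detach the two sub-ranges
    return cut_edge_set
```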

1: procedure SPLITTING(vr)
2:   build the induced graph for vr
3:   CutEdgeSet = NULL
4:   UPDATECOMPONENTINFO
5:   while not all components allocatable do
6:     for those components that are not allocatable do
7:       perform a min-cut operation on this component
8:       add the cut edges to CutEdgeSet
9:       delete the cut edges from the induced graph
10:    end for
11:    UPDATECOMPONENTINFO
12:  end while
13:  Assign each component a register bank based on the cost-benefit analysis
14:  Insert copy operations according to CutEdgeSet

Fig. 5. Live Range Splitting

Fig. 6. An Example of Live Range Splitting: (a) Original CFG, (b) the induced graph, (c) splitting result

Figure 6 shows an example of live range splitting. Fig. 6(a) shows the region of the control flow graph in which VR1 is live. Fig. 6(b) shows the induced graph for VR1; the numbers on its edges are the frequencies with which control flows through the edges, obtained through profiling. The induced subgraph is connected, so we get the following partition:

P1: {BB1, BB2},  P2: {BB3, BB4, BB5, BB6, BB7}

The CutEdgeSet contains BB2 -> BB4. Then we apply the while loop again to these two partitions. Partition P1 is allocatable, while partition P2 is not, so we further cut it into two partitions:

P3: {BB3, BB4, BB5, BB6},  P4: {BB7}

and the CutEdgeSet changes to {BB2 -> BB4, BB6 -> BB7}. The third iteration of the while loop finds that all partitions are allocatable. The result is shown in Fig. 6(c).

4 Experimental Results

We evaluated our approach using three typical network applications written in Baker [8]:

L3-Switch: performs L2 bridging or L3 forwarding of IP packets, depending on whether the source and destination of the packet are located in the same virtual LAN.

Multi-Protocol Label Switching (MPLS): routes packets on labels instead of destination IPs. This simplifies the processing of the packets and facilitates high-level traffic management. MPLS shares a large portion of code with L3-Switch.

Firewall: performs ordered rule-based classification to filter out unwanted packets. This application first assigns flow IDs to packets according to user-specified rules and then drops packets with specified flow IDs. The flow IDs are stored in a hash table.

Table 1 shows some statistics of the benchmark applications. The data are gathered with a complete set of scalar optimizations and domain-specific optimizations turned on [8]. The second column shows the lines of code of the Baker implementation of these applications, while the third column shows the number of instructions before bank conflict resolving; only the instructions on the hot path that will be executed on an ME are counted. Column 4 gives the total number of GPR-type virtual registers, while column 5 shows the number of bank conflicts.

Table 1. Benchmark Application Statistics

Application   LOC   # of Instrs   # of VRs   # of bank conflicts
L3-Switch
MPLS
Firewall

We compared our approach with Zhuang's pre-RA bank conflict resolving method [9]. Table 2 shows the number of copy instructions and spills generated for the three benchmarks. As can be seen, Zhuang's method performs better in bank conflict resolving.

Table 2. Copy and Spill Statistics

              # of copy instrs     # of spill operations
              Pre-RA     Our       Pre-RA     Our
L3-Switch
MPLS
Firewall

After checking the RCGs, we find that most of them have only one or two nodes, and those RCGs with more than two nodes are essentially trees without any cycles. On the other hand, our method outperforms Zhuang's method in that we generate fewer spills, which can be much slower than copy instructions.

Table 3 shows the detailed distribution of the difference in register pressure between the two banks. The register pressure of a basic block is measured by the number of live ranges that are live across that basic block. The data show the percentage of basic blocks at each register pressure difference, and the last row shows the weighted mean of the register pressure difference between GPR bank A and GPR bank B. As can be seen, our approach better balances the register pressure between the two banks.

Table 3. Distribution of Register Pressure Difference

Register Pressure        L3-Switch            MPLS               Firewall
Difference             Pre-RA     Our      Pre-RA     Our      Pre-RA     Our
                                  5.80%     8.13%    16.25%     6.67%    11.67%
                                 21.73%     5.00%    32.50%     3.33%    35.00%
                                 34.78%     6.25%    15.00%     5.00%    28.33%
                                 16.67%    30.63%    25.00%    10.00%    16.67%
                                 10.14%    13.13%     6.88%     3.33%     8.33%
                                  6.52%     7.50%     2.50%     6.67%     0.00%
                                  2.90%    10.00%     0.63%     3.33%     0.00%
>                                 1.45%    19.38%     1.25%    61.67%     0.00%
Weighted Mean
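The register-pressure metric behind Table 3 can be made concrete with a small sketch. The data structures are illustrative, and this is our reading of the metric rather than the paper's measurement code.

```python
def pressure_difference(blocks, live_ranges, bank_of):
    """For each basic block, the register pressure of a bank is the number of
    live ranges assigned to that bank that are live across the block; the
    reported value is |pressure(A) - pressure(B)| per block.
    live_ranges maps a virtual register to the set of blocks it is live in."""
    diff = {}
    for bb in blocks:
        pressure = {"A": 0, "B": 0}
        for vr, live_in in live_ranges.items():
            if bb in live_in:
                pressure[bank_of[vr]] += 1
        diff[bb] = abs(pressure["A"] - pressure["B"])
    return diff

# Example: two live ranges in bank A and one in bank B across block "bb1".
lr = {"v1": {"bb1"}, "v2": {"bb1"}, "v3": {"bb1"}}
banks = {"v1": "A", "v2": "A", "v3": "B"}
assert pressure_difference(["bb1"], lr, banks) == {"bb1": 1}
```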

5 Related Work

Banked register architectures have been used in VLIW processors to reduce the cycle time. [14][15] studied the bank assignment problem on such architectures based on the register component graph, a graph whose nodes are symbolic registers and whose arcs are annotated with the affinity that two registers have for being placed in the same register bank. After the register component graph is built, the problem becomes finding a min-cut of the graph so that the cost of the inter-bank copies is minimized. The bank constraints in these architectures differ from those of the IXP in that 1) they do not have the two source-operand selection rule, and 2) the inter-bank register copy instruction in these architectures is very expensive.

[16][17] discussed the memory bank conflict problem on DSP processors. Many DSP processors, such as the Analog Devices ADSP2100, DSP Group PineDSPCore, Motorola DSP5600, and NEC uPD77016, adopt a banked memory organization. Such memory systems can fetch multiple data items in a single cycle, provided the data are located in different banks. Though the compiler can optimize the placement of variables to avoid the delay caused by accessing the same bank in a single instruction, doing so is not mandatory.

Intel's MicroEngine C [18] is a C-like programming language designed for programming the IXP network processors at a relatively low level. It adds some extensions to C; one related to register bank assignment is the declspec directive, which can be used to specify the placement of variables in the memory hierarchy. By default (without any declspec qualifier), all variables are put in GPRs, but this increases the GPR register pressure and in turn causes spills, which can be very expensive since the MicroEngine C compiler places spilled values in SRAM. Programmers can perform memory allocation manually using declspec; however, this puts too much burden on the programmer and is error-prone.

L. George et al. [19] designed a new programming language named Nova for the IXP 1200 network processor and used integer linear programming to solve the bank conflict problem on the IXP. While this method provides an upper bound on the performance benefit, its time complexity is too high to be practical.

X. T. Zhuang et al. [9] discussed the register bank assignment problem for the IXP 1200 network processor. They proposed three approaches: performing bank assignment before register allocation, after register allocation, or at the same time in a combined way. They first build a register conflict graph (RCG) to represent the bank conflicts between symbolic registers, and show that determining whether the virtual registers can be assigned banks without introducing copy instructions is equivalent to determining whether the RCG is bipartite (a minimal sketch of such a check appears at the end of this section). They proved that making the RCG bipartite with minimal cost is NP-complete, by reducing the maximal bipartite subgraph problem to it, and suggested heuristic methods to solve the problem.

In [20], J. Park et al. presented a register allocation method for a banked register file in which only one register bank can be active at a time and registers are addressed using the register number in conjunction with a bank number. No instruction except the inter-bank copy instruction can simultaneously access two banks. To solve this problem, they first divide the program into several allocation regions and then perform local register allocation using the secondary bank on those regions deemed beneficial. Finally, global register allocation is performed on the primary bank, and inter-bank copy operations are inserted on the allocation region boundaries.
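As a side note to Zhuang et al.'s observation above, the following is a minimal sketch (not from [9]) of testing whether an RCG is two-colorable, i.e. whether all virtual registers could be placed in the two GPR banks without any copy insertion. The adjacency-map representation is an assumption.

```python
from collections import deque

def rcg_is_bipartite(rcg: dict) -> bool:
    """rcg maps each virtual register to the set of registers it conflicts with.
    A BFS two-coloring succeeds exactly when a conflict-free A/B bank
    assignment exists without inserting inter-bank copies."""
    bank = {}
    for start in rcg:
        if start in bank:
            continue
        bank[start] = "A"
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in rcg.get(v, ()):
                if w not in bank:
                    bank[w] = "B" if bank[v] == "A" else "A"
                    queue.append(w)
                elif bank[w] == bank[v]:
                    return False            # odd cycle: no two-bank assignment
    return True

# A chain of conflicts fits in two banks; a triangle of conflicts does not.
assert rcg_is_bipartite({"x": {"y"}, "y": {"x", "z"}, "z": {"y"}})
assert not rcg_is_bipartite({"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}})
```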

6 Conclusions

In this paper, we have presented a register allocation framework for banked register files with access constraints, targeting the IXP network processors. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as of splitting it across multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits, which helps to balance the register pressure among the banks. When splitting a live range, we use copy instructions instead of loads/stores and force the split live ranges into different banks. Though this may introduce additional copies, it can significantly reduce the number of memory accesses. Preliminary experiments show that, compared with previous work, our framework better balances the register pressure and reduces the number of spills, which in turn results in improved performance.

References

1. Huang, J.H.: Network processor design. In: Proceedings of the 5th International Conference on ASIC (2003)
2. Tseng, J.H., Asanović, K.: Banked multiported register files for high-frequency superscalar microprocessors. In: Proceedings of the 30th Annual International Symposium on Computer Architecture (2003)
3. Cruz, J.L., González, A., Valero, M., Topham, N.P.: Multiple-banked register file architectures. In: Proceedings of the 27th Annual International Symposium on Computer Architecture (2000)
4. Balasubramonian, R., Dwarkadas, S., Albonesi, D.H.: Reducing the complexity of the register file in dynamic superscalar processors. In: 34th International Symposium on Microarchitecture (MICRO-34) (2001)
5. Intel: Intel IXP2400 Network Processor Hardware Reference Manual. Intel Corporation (2003)
6. Chow, F.C., Hennessy, J.L.: The priority-based coloring approach to register allocation. ACM Transactions on Programming Languages and Systems (TOPLAS) 12 (1990)
7. Intel: Intel IXP2400/IXP2800 Network Processor Programmer's Reference Manual. Intel Corporation (2003)
8. Chen, M.K., Li, X.F., Lian, R., Lin, J.H., Liu, L., Liu, T., Ju, R.: Shangri-La: achieving high performance from compiled network applications while enabling ease of programming. In: PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, ACM Press (2005)
9. Zhuang, X., Pande, S.: Resolving register bank conflicts for a network processor. In: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT '03) (2003)
10. George, L., Appel, A.W.: Iterated register coalescing. In: POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, New York, NY, USA, ACM Press (1996)
11. Farkas, K.I.: Memory-System Design Considerations for Dynamically-Scheduled Microprocessors. PhD thesis, University of Toronto (1997)

12. Sedgewick, R.: Algorithms in C++, Part 5: Graph Algorithms. Addison-Wesley/Pearson (2001)
13. Stoer, M., Wagner, F.: A simple min-cut algorithm. Journal of the ACM (JACM) (1997)
14. Hiser, J., Carr, S.: Global register partitioning. In: International Conference on Parallel Architectures and Compilation Techniques (2000)
15. Jang, S., Carr, S., Sweany, P., Kuras, D.: A code generation framework for VLIW architectures with partitioned register banks. In: Third International Conference on Massively Parallel Computing Systems (1998)
16. Cho, J., Paek, Y., Whalley, D.: Efficient register and memory assignment for non-orthogonal architectures via graph coloring and MST algorithms. In: LCTES-SCOPES (2002)
17. Keyngnaert, P., Demoen, B., De Sutter, B., De Bus, B., et al.: Conflict graph based allocation of static objects to memory banks. In: Semantics, Program Analysis, and Computing Environments for Memory Management (2001)
18. Johnson, E.J., Kunze, A.R.: IXP2400/2800 Programming: The Complete Microengine Coding Guide. Intel Press (2003)
19. George, L., Blume, M.: Taming the IXP network processor. In: PLDI (2003)
20. Park, J., Lee, J.H., Moon, S.M.: Register allocation for banked register file. In: Languages, Compilers, and Tools for Embedded Systems (LCTES) (2001) 39-47


More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400

Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400 The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009 289 Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400 Kandasamy Anusuya, Karupagouder

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

New Memory Organizations For 3D DRAM and PCMs

New Memory Organizations For 3D DRAM and PCMs New Memory Organizations For 3D DRAM and PCMs Ademola Fawibe 1, Jared Sherman 1, Krishna Kavi 1 Mike Ignatowski 2, and David Mayhew 2 1 University of North Texas, AdemolaFawibe@my.unt.edu, JaredSherman@my.unt.edu,

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Instruction Set Architecture. "Speaking with the computer"

Instruction Set Architecture. Speaking with the computer Instruction Set Architecture "Speaking with the computer" The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture Digital Design

More information

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing Linköping University Post Print epuma: a novel embedded parallel DSP platform for predictable computing Jian Wang, Joar Sohl, Olof Kraigher and Dake Liu N.B.: When citing this work, cite the original article.

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.

More information

Implementation of Boundary Cutting Algorithm Using Packet Classification

Implementation of Boundary Cutting Algorithm Using Packet Classification Implementation of Boundary Cutting Algorithm Using Packet Classification Dasari Mallesh M.Tech Student Department of CSE Vignana Bharathi Institute of Technology, Hyderabad. ABSTRACT: Decision-tree-based

More information

378: Machine Organization and Assembly Language

378: Machine Organization and Assembly Language 378: Machine Organization and Assembly Language Spring 2010 Luis Ceze Slides adapted from: UIUC, Luis Ceze, Larry Snyder, Hal Perkins 1 What is computer architecture about? Computer architecture is the

More information

A Report on Coloring with Live Ranges Split

A Report on Coloring with Live Ranges Split A Report on Coloring with Live Ranges Split Xidong Wang Li Yang wxd@cs.wisc.edu yangli@cs.wisc.edu Computer Science Department University of Wisconsin Madison December 17, 2001 1 Introduction One idea

More information