A Register Allocation Framework for Banked Register Files with Access Constraints


Feng Zhou 1,2, Junchao Zhang 1, Chengyong Wu 1, and Zhaoqing Zhang 1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{zhoufeng, jczhang, cwu, zqzhang}@ict.ac.cn
2 Graduate School of the Chinese Academy of Sciences, Beijing, China

Abstract. Banked register files have been proposed to reduce die area, power consumption, and access time. Some embedded processors, e.g. Intel's IXP network processors, adopt this organization. However, they expose access constraints in the ISA, which complicates the design of register allocation. In this paper, we present a register allocation framework for banked register files with access constraints, targeting the IXP network processors. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as of splitting it across multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits. Compared to previous work, our framework better balances the register pressure among multiple banks and improves the performance of typical network applications.

1 Introduction

Network processors have been widely adopted as a flexible and cost-efficient solution for building today's network processing systems. To meet the challenging functionality and performance requirements of network applications, network processors incorporate unconventional, irregular architectural features, e.g. multiple heterogeneous processing cores with hardware multithreading, an exposed memory hierarchy, banked register files, etc. [1]. These features present new challenges for optimizing compilers.

Instead of building a monolithic register file, banked register files have been proposed to provide the same bandwidth with fewer read/write ports [2]. The reduction in the number of ports lowers the complexity of the register file, which in turn reduces die area, power consumption, and access time [3][4]. With banked register files, however, conflicts arise when the number of simultaneous accesses to a single bank exceeds the number of ports of that bank. While superscalar designs typically resolve bank conflicts with additional logic, embedded processors mostly leave the problem to the programmer or compiler by exposing access constraints in the ISA, thereby simplifying the hardware design. For the compiler, this complicates register allocation: in addition to the interferences between two virtual registers, there may be bank conflicts between their uses, which further limit the registers that can be allocated to them.

In this paper, we present a register allocation framework for banked register files with access constraints, targeting the IXP network processors [5][6]. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as of splitting it across multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits, which helps to balance the register pressure among the banks. When splitting a live range, we use copy instructions instead of loads/stores and force the split live ranges into different banks. Though this may introduce additional copies, it can significantly reduce the number of memory accesses.

The rest of this paper is organized as follows. Section 2 introduces the relevant architectural features of the IXP network processor, especially its banked register file organization and the access constraints. Section 3 describes the compilation flow and the proposed register allocation framework, and provides further details on cost and benefit estimation and live range splitting. Section 4 presents the experimental results. Section 5 describes related work, and Section 6 concludes the paper.

2 IXP Architecture and Register File Organization

The IXP network processor family [5] was designed as the core processing component for a wide range of network equipment, including multi-service switches, routers, etc. It is a heterogeneous chip multiprocessor consisting of an XScale processor core and an array of MicroEngines (MEs). The XScale core is used mainly for control-path processing, while the MEs handle data-path processing. The IXP has a multi-level, exposed memory hierarchy consisting of local memory, scratchpad memory, SRAM, and DRAM. Each ME has its own local memory, while scratchpad memory, SRAM, and DRAM are shared by all MEs. The stack is implemented using both local memory and SRAM, starting in local memory and growing into SRAM. The MEs have hardware support for multithreading to hide the latency of memory accesses.

To handle the large amount of packet data and to service the multiple threads, the IXP provides several large register files. Figure 1 is a diagram of an ME's register files. Each ME has four register files: a general purpose register file (GPR) used mainly by ALU operations, two transfer register files for exchanging data with memory and other I/O devices, and a next neighbor register file for efficient communication between MEs. These register files, except next neighbor, are all partitioned into banks. The GPR is divided into two banks, GPR A and GPR B, each with 128 registers. The transfer register files are partitioned into four banks: SRAM transfer in, SRAM transfer out, DRAM transfer in, and DRAM transfer out.

Fig. 1. IXP 2400 Register File Organization

ME instructions can have two register source operands, which we refer to as the A and B operands. There are some restrictions, called the "two source-operand selection rule" in [7], on where the two operands of an instruction can come from. They are summarized as follows [7]:

- An instruction cannot use the same register bank for both the A and B operands.
- An instruction cannot use any of the SRAM transfer, DRAM transfer, and next neighbor registers as both the A and B operands.
- An instruction cannot use an immediate as both the A and B operands.

If an instruction does not conform to the constraints listed above, we say the instruction has a bank conflict and call it a conflict instruction. This poses a new challenge for register allocation, since the allocator now has to deal with the bank assignment of each virtual register. In this paper, we focus on the GPR bank conflict problem, but the proposed technique applies to the other register classes as well.
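To make the selection rule concrete, the following is a minimal sketch of how a compiler pass might classify an instruction as a conflict instruction. The RegClass names and the reading of the second rule (no two source operands both drawn from the restricted classes) are illustrative assumptions, not the ISA encoding.

```python
from enum import Enum, auto

class RegClass(Enum):
    # Illustrative operand classes; names are assumptions, not the ISA's.
    GPR_A = auto()
    GPR_B = auto()
    SRAM_XFER = auto()
    DRAM_XFER = auto()
    NEXT_NEIGHBOR = auto()
    IMMEDIATE = auto()

# Classes that, per the second rule, may feed at most one source operand.
RESTRICTED = {RegClass.SRAM_XFER, RegClass.DRAM_XFER, RegClass.NEXT_NEIGHBOR}

def is_conflict_instruction(op_a: RegClass, op_b: RegClass) -> bool:
    """Return True if the A/B source operand pair violates the
    two source-operand selection rule summarized above."""
    if op_a == op_b:
        return True                       # same bank twice, or two immediates
    if op_a in RESTRICTED and op_b in RESTRICTED:
        return True                       # two restricted-class operands
    return False

# Both operands drawn from GPR bank A is a bank conflict; A plus B is fine.
assert is_conflict_instruction(RegClass.GPR_A, RegClass.GPR_A)
assert not is_conflict_instruction(RegClass.GPR_A, RegClass.GPR_B)
```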

3 A Register Allocation Framework Solving Bank Conflicts

We designed our register allocation framework as part of the Shangri-La infrastructure, a programming environment for the IXP [8]. Shangri-La encompasses a domain-specific programming language named Baker for packet processing applications, a compilation system that automatically restructures and optimizes applications to keep the IXP running at line speed, and a runtime system that performs resource management and runtime adaptation. The compilation system consists of three components: the profiler, the pipeline compiler, and the aggregate compiler. The work presented here is part of the aggregate compiler, which takes aggregate definitions and memory mappings from the pipeline compiler and generates optimized code for each of the target processing cores (e.g. MicroEngines and XScale). It also performs machine-dependent and machine-independent optimizations, as well as domain-specific transformations, to maximize the throughput of the aggregates. Figure 2 illustrates the compilation flow of Shangri-La and highlights the phases related to register allocation.

Fig. 2. Shangri-La Compilation Flow

Our register allocation framework is based on the priority-based coloring approach [6]: the virtual registers are processed in priority order, with the priorities computed in the same way as described in [6]. As illustrated in Fig. 2, we first perform instruction selection. Then, in the register class identification phase, we analyze each instruction to identify the register classes/files in which each symbolic register could reside. We then build the live ranges and the interference graph. This information is needed for both bank conflict resolving and register allocation.

To resolve bank conflicts, we first build a register conflict graph (RCG) [9]. The RCG is an undirected graph whose nodes represent the virtual registers; an edge indicates that two virtual registers cannot be assigned to the same register bank. Based on the RCG, we assign a register bank to each virtual register (the algorithm is shown in Fig. 3). For each virtual register, we estimate the costs and benefits of assigning it to a specific bank. We also estimate the cost of splitting it across multiple banks via copy instructions, which is the total cost of the generated sub-live-ranges plus the cost of the inserted inter-bank copy operations. The cost of a sub-live-range is the minimum of the costs of assigning it to bank A or bank B and can be estimated by invoking the EstimateCost function recursively; we limit the depth of this recursion with a small threshold. We then determine the bank assignment for the virtual register based on an analysis of these costs. The AssignRegBank function assigns the current node to the given bank, marks it as spilled if needed, and updates the CountOfSpills variable.

When all virtual registers have been assigned banks in this way, some instructions may still have both operands assigned to the same register bank, causing bank conflicts. We resolve these conflicts by inserting inter-bank copy instructions before each conflicting instruction. For example, for an instruction r = op(a, b) in which the source operands a and b have been assigned to the same bank, we first introduce a new virtual register c and insert c = a before the instruction, then assign c to the other bank than the one assigned to b, and rename a in r = op(a, b) to c. This guarantees that all conflicts are resolved.
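A minimal sketch of this fix-up pass is shown below, assuming a simple three-address instruction form. The Instr/VirtReg classes and the other_bank helper are illustrative, not the Shangri-La implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VirtReg:
    name: str
    bank: str            # "A" or "B", as decided by bank assignment

@dataclass
class Instr:
    opcode: str
    dst: VirtReg
    srcs: List[VirtReg]

def other_bank(bank: str) -> str:
    return "B" if bank == "A" else "A"

def resolve_remaining_conflicts(code: List[Instr]) -> List[Instr]:
    """Insert an inter-bank copy before every instruction whose two
    source operands ended up in the same GPR bank, as described above."""
    fixed, tmp_id = [], 0
    for ins in code:
        if len(ins.srcs) == 2 and ins.srcs[0].bank == ins.srcs[1].bank:
            a, b = ins.srcs
            tmp_id += 1
            c = VirtReg(f"c{tmp_id}", other_bank(b.bank))  # c goes to the other bank
            fixed.append(Instr("copy", c, [a]))            # insert c = a
            ins = Instr(ins.opcode, ins.dst, [c, b])       # rename a -> c
        fixed.append(ins)
    return fixed
```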

1: procedure BANKCONFRESOLVING(RegisterConflictGraph)
2:   CountOfSpills = 0   // the number of nodes that have been marked spill so far
3:   for all nodes v of RegisterConflictGraph do
4:     CostA = ESTIMATECOST(v, BANK A)   // the cost of assigning v to GPR A
5:     CostB = ESTIMATECOST(v, BANK B)
6:     SplitCost = SPLITCOST(v)
7:     if SplitCost is the minimum of these three costs then
8:       SPLITNODE(v)
9:     else if CostA >= CostB then
10:      ASSIGNREGBANK(v, BANK B)
11:    else
12:      ASSIGNREGBANK(v, BANK A)
13:    end if
14:  end for

Fig. 3. Resolving Bank Conflicts

The traditional register allocation phase then follows, allocating registers within each bank for all virtual registers.

3.1 Cost and Benefit Analysis

Bank assignment decisions are based on an analysis of the costs and benefits of assigning a virtual register to a specific register bank. The compiler can then use these results to trade off between the two banks. Below we describe how the cost-benefit estimation function calculates the impact of the following factors:

Conflict-Resolving Cost: Bank conflicts between two virtual registers must be resolved by inserting copies before conflict instructions. A copy operation costs one cycle, so the total number of inserted copies represents the conflict-resolving cost. The ConflictResolvingCost function shown in Fig. 4 computes the number of instructions that use both the given virtual register and a virtual register that conflicts with it in the RCG.

Spill Cost: Though each ME has 256 GPRs, each thread has only 32 GPRs (in two banks) when the ME runs in 8-thread mode. Assigning too many virtual registers to a single bank may cause spills in that bank while leaving the other bank underutilized, so balancing the register pressure between the two banks is an important consideration in our framework. The SpillCost function estimates this cost for a virtual register. We first count the live ranges that have higher priority and have been assigned the same register bank. If this number is larger than the number of allocatable registers, we treat the virtual register as likely to be spilled and compute the corresponding spill/reload cost; otherwise, the spill cost is zero.

Coalescing Benefit: The source and result operands of a copy instruction can reside in any bank if they are both of GPR type. If they reside in the same bank, later phases (e.g. register coalescing [10]) may have an opportunity to remove the copy instruction. To indicate this possibility, we add a preference set to each RCG node: when we see a copy instruction like a = b, we add a to b's preference set and vice versa. The CoalescingBenefit function calculates the preference benefit, which is essentially the product of a given weight and the number of elements in this set that have already been assigned the given register bank.
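Since CoalescingBenefit is only described in prose, the following small sketch shows one way the preference bookkeeping and benefit calculation could look. The data structures, the default weight, and the reading of "elements assigned a register bank" as "copy-related registers already placed in the candidate bank" are assumptions on our part.

```python
from collections import defaultdict

# preference[v] is the set of virtual registers copy-related to v;
# bank_of[v] records the bank already chosen for v ("A", "B", or absent).
preference = defaultdict(set)
bank_of = {}

def record_copy(dst: str, src: str) -> None:
    """For a copy instruction dst = src, record the mutual preference."""
    preference[dst].add(src)
    preference[src].add(dst)

def coalescing_benefit(vr: str, regbank: str, weight: float = 1.0) -> float:
    """Weight times the number of copy-related registers already placed
    in the candidate bank."""
    return weight * sum(1 for other in preference[vr]
                        if bank_of.get(other) == regbank)

# Example: after a = b, assigning a to b's bank earns a coalescing benefit.
record_copy("a", "b")
bank_of["b"] = "A"
assert coalescing_benefit("a", "A") == 1.0
assert coalescing_benefit("a", "B") == 0.0
```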

1: procedure ESTIMATECOST(vr, regbank)
2:   SpillCost = SPILLCOST(vr, regbank)
3:   ConfResolvCost = CONFLICTRESOLVINGCOST(vr, regbank)
4:   CoalesceBenefit = COALESCINGBENEFIT(vr, regbank)
5:   return SpillCost + ConfResolvCost - CoalesceBenefit

7: procedure CONFLICTRESOLVINGCOST(vr, regbank)
8:   cost = 0
9:   for all edges incident to vr in the RCG do
10:    vr1 = the other end vertex of the edge
11:    if vr1's register bank is regbank then
12:      cost += the number of instructions referring to both vr and vr1 as source operands
13:    end if
14:  end for
15:  return cost

16: procedure SPILLCOST(vr, regbank)
17:  NumOfInterferences = the number of vr's interfering live ranges whose register bank is regbank, whose priority is higher than vr's, and which have not been marked spill
18:  if NumOfInterferences >= NumOfAllocatableRegisters then
19:    if CountOfSpills > local memory spill threshold then
20:      return vr's spill/reload count * SRAM latency
21:    else return vr's spill/reload count * local memory latency
22:    end if
23:  else return 0
24:  end if

Fig. 4. Cost Estimation

3.2 Live Range Splitting

Live range splitting is traditionally performed when a live range fails to get a register, by inserting stores and reloads. In our framework, however, we prefer to split at an earlier stage, when we find that assigning the live range to any bank incurs a high cost. Instead of loads/stores, we use copy instructions to implement the splitting and force the partitioned live ranges into different banks [11]. Compared to traditional splitting, this may result in additional copy instructions; however, it can further balance the register pressure between the two banks and reduce the number of loads/stores, which are much more expensive than copies.

To split the live range of a virtual register, we first build an induced graph over the region of the control flow graph (CFG) in which the virtual register is live. We check each connected component [12] of this subgraph to see whether it can be allocated a register, by comparing the number of available registers with the number of live ranges that interfere with it. If a component does not appear able to get a register, we compute a min-cut for it using the method described in [13]. We add the cut edges to CutEdgeSet and later insert compensation copies on these edges. This process iterates until all components become allocatable. After that, we rename each component with a new symbolic register, assign it to a register bank, and insert the corresponding copy operations according to the CutEdgeSet. The algorithm is shown in Fig. 5.
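A compact sketch of this splitting loop, complementing the pseudocode in Fig. 5, is given below. It assumes the networkx library for graph utilities and a caller-supplied allocatable predicate, and it omits the final renaming and bank-assignment steps; it is a sketch under those assumptions, not the paper's implementation.

```python
import networkx as nx  # assumed graph library; not used by the paper itself

def split_live_range(induced: nx.Graph, allocatable) -> list:
    """Repeatedly min-cut the components of the induced graph that cannot
    get a register, collecting the cut edges on which compensation copies
    will later be inserted. Nodes are basic blocks; edge 'weight' attributes
    hold the profiled execution frequencies; allocatable(nodes) compares the
    free registers against the live ranges interfering with the component."""
    cut_edge_set = []
    work = induced.copy()
    while True:
        # Single-block components cannot be split any further.
        bad = [c for c in nx.connected_components(work)
               if len(c) > 1 and not allocatable(c)]
        if not bad:
            break
        for comp in bad:
            sub = work.subgraph(comp)
            _, (part_a, _) = nx.stoer_wagner(sub, weight="weight")  # min-cut [13]
            side_a = set(part_a)
            crossing = [(u, v) for u, v in sub.edges()
                        if (u in side_a) != (v in side_a)]
            cut_edge_set.extend(crossing)     # compensation copies go here
            work.remove_edges_from(crossing)  # detach the two sub-ranges
    return cut_edge_set
```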

1: procedure SPLITTING(vr)
2:   build the induced graph for vr
3:   CutEdgeSet = NULL
4:   UPDATECOMPONENTINFO
5:   while not all components allocatable do
6:     for those components that are not allocatable do
7:       perform a min-cut operation on this component
8:       add the cut edges to CutEdgeSet
9:       delete the cut edges from the induced graph
10:    end for
11:    UPDATECOMPONENTINFO
12:  end while
13:  Assign each component a register bank based on the cost-benefit analysis
14:  Insert copy operations according to CutEdgeSet

Fig. 5. Live Range Splitting

Fig. 6. An Example of Live Range Splitting: (a) Original CFG, (b) the induced graph, (c) splitting result

Figure 6 shows an example of live range splitting. Fig. 6(a) shows the region of the control flow graph in which VR1 is live. Fig. 6(b) shows the induced graph for VR1; the numbers on its edges are the frequencies with which control flows through the edges, obtained through profiling. The induced subgraph is connected, so we get the following partition:

P1: {BB1, BB2},  P2: {BB3, BB4, BB5, BB6, BB7}

The CutEdgeSet contains BB2 -> BB4. Then we apply the while loop again to these two partitions. Partition P1 is allocatable, while partition P2 is not, so we further cut it into two partitions:

P3: {BB3, BB4, BB5, BB6},  P4: {BB7}

and the CutEdgeSet changes to {BB2 -> BB4, BB6 -> BB7}. The third iteration of the while loop finds that all partitions are allocatable. The result is shown in Fig. 6(c).

4 Experimental Results

We evaluated our approach using three typical network applications written in Baker [8]:

L3-Switch: performs L2 bridging or L3 forwarding of IP packets, depending on whether the source and destination of the packet are located in the same virtual LAN.

Multi-Protocol Label Switching (MPLS): routes packets on labels instead of destination IPs. This simplifies the processing of the packets and facilitates high-level traffic management. MPLS shares a large portion of code with L3-Switch.

Firewall: performs ordered rule-based classification to filter out unwanted packets. This application first assigns flow IDs to packets according to user-specified rules and then drops packets with specified flow IDs. The flow IDs are stored in a hash table.

Table 1 shows some statistics of the benchmark applications. The data are gathered with a complete set of scalar optimizations and domain-specific optimizations turned on [8]. The second column shows the lines of code of the Baker implementation of these applications, while the third column shows the number of instructions before bank conflict resolving; only the instructions on the hot path that will be executed on an ME are counted. Column 4 gives the total number of GPR-type virtual registers, while column 5 shows the number of bank conflicts.

Table 1. Benchmark Application Statistics

Application   LOC   # of Instrs   # of VRs   # of bank conflicts
L3-Switch
MPLS
Firewall

We compared our approach with Zhuang's pre-RA bank conflict resolving method [9]. Table 2 shows the number of copy instructions and spills generated for the three benchmarks. As can be seen, Zhuang's method performs better in bank conflict resolving.

Table 2. Copy and Spill Statistics

              # of copy instrs     # of spill operations
              Pre-RA     Our       Pre-RA     Our
L3-Switch
MPLS
Firewall

After checking the RCGs, we find that most of them have only one or two nodes, and those RCGs with more than two nodes are essentially trees without any cycles. On the other hand, our method outperforms Zhuang's method in that we generate fewer spills, which can be much slower than copy instructions.

Table 3 shows the detailed distribution of the difference in register pressure between the two banks. The register pressure of a basic block is measured by the number of live ranges that are live across that basic block. The data show the percentage of basic blocks at each register pressure difference, and the last row shows the weighted mean of the register pressure difference between GPR bank A and GPR bank B. As can be seen, our approach better balances the register pressure between the two banks.

Table 3. Distribution of Register Pressure Difference

Register Pressure        L3-Switch            MPLS               Firewall
Difference             Pre-RA     Our      Pre-RA     Our      Pre-RA     Our
                                  5.80%     8.13%    16.25%     6.67%    11.67%
                                 21.73%     5.00%    32.50%     3.33%    35.00%
                                 34.78%     6.25%    15.00%     5.00%    28.33%
                                 16.67%    30.63%    25.00%    10.00%    16.67%
                                 10.14%    13.13%     6.88%     3.33%     8.33%
                                  6.52%     7.50%     2.50%     6.67%     0.00%
                                  2.90%    10.00%     0.63%     3.33%     0.00%
>                                 1.45%    19.38%     1.25%    61.67%     0.00%
Weighted Mean
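The register-pressure metric behind Table 3 can be made concrete with a small sketch. The data structures are illustrative, and this is our reading of the metric rather than the paper's measurement code.

```python
def pressure_difference(blocks, live_ranges, bank_of):
    """For each basic block, the register pressure of a bank is the number of
    live ranges assigned to that bank that are live across the block; the
    reported value is |pressure(A) - pressure(B)| per block.
    live_ranges maps a virtual register to the set of blocks it is live in."""
    diff = {}
    for bb in blocks:
        pressure = {"A": 0, "B": 0}
        for vr, live_in in live_ranges.items():
            if bb in live_in:
                pressure[bank_of[vr]] += 1
        diff[bb] = abs(pressure["A"] - pressure["B"])
    return diff

# Example: two live ranges in bank A and one in bank B across block "bb1".
lr = {"v1": {"bb1"}, "v2": {"bb1"}, "v3": {"bb1"}}
banks = {"v1": "A", "v2": "A", "v3": "B"}
assert pressure_difference(["bb1"], lr, banks) == {"bb1": 1}
```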

5 Related Work

Banked register architectures have been used in VLIW processors to reduce the cycle time. [14][15] studied the bank assignment problem on such architectures based on the register component graph, a graph whose nodes are symbolic registers and whose arcs are annotated with the affinity that two registers have for being placed in the same register bank. After the register component graph is built, the problem becomes finding a min-cut of the graph so that the cost of the inter-bank copies is minimized. The bank constraints in these architectures differ from those of the IXP in that 1) they do not have the two source-operand selection rule, and 2) the inter-bank register copy instruction in these architectures is very expensive.

[16][17] discussed the memory bank conflict problem on DSP processors. Many DSP processors, such as the Analog Devices ADSP2100, DSP Group PineDSPCore, Motorola DSP5600, and NEC uPD77016, adopt a banked memory organization. Such memory systems can fetch multiple data items in a single cycle, provided the data are located in different banks. Though the compiler can optimize the placement of variables to avoid the delay caused by accessing the same bank in a single instruction, doing so is not mandatory.

Intel's MicroEngine C [18] is a C-like programming language designed for programming the IXP network processors at a relatively low level. It adds some extensions to C; one related to register bank assignment is the declspec directive, which can be used to specify the placement of variables in the memory hierarchy. By default (without any declspec qualifier), all variables are put in GPRs, but this increases the GPR register pressure and in turn causes spills, which can be very expensive since the MicroEngine C compiler places spilled values in SRAM. Programmers can perform memory allocation manually using declspec; however, this puts too much burden on the programmer and is error-prone.

L. George et al. [19] designed a new programming language named Nova for the IXP 1200 network processor and used integer linear programming to solve the bank conflict problem on the IXP. While this method provides an upper bound on the performance benefit, its time complexity is too high to be practical.

X. T. Zhuang et al. [9] discussed the register bank assignment problem for the IXP 1200 network processor. They proposed three approaches: performing bank assignment before register allocation, after register allocation, or at the same time in a combined way. They first build a register conflict graph (RCG) to represent the bank conflicts between symbolic registers, and show that determining whether the virtual registers can be assigned banks without introducing copy instructions is equivalent to determining whether the RCG is bipartite (a minimal sketch of such a check appears at the end of this section). They proved that making the RCG bipartite with minimal cost is NP-complete, by reducing the maximal bipartite subgraph problem to it, and suggested heuristic methods to solve the problem.

In [20], J. Park et al. presented a register allocation method for a banked register file in which only one register bank can be active at a time and registers are addressed using the register number in conjunction with a bank number. No instruction except the inter-bank copy instruction can simultaneously access two banks. To solve this problem, they first divide the program into several allocation regions and then perform local register allocation using the secondary bank on those regions deemed beneficial. Finally, global register allocation is performed on the primary bank, and inter-bank copy operations are inserted on the allocation region boundaries.
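As a side note to Zhuang et al.'s observation above, the following is a minimal sketch (not from [9]) of testing whether an RCG is two-colorable, i.e. whether all virtual registers could be placed in the two GPR banks without any copy insertion. The adjacency-map representation is an assumption.

```python
from collections import deque

def rcg_is_bipartite(rcg: dict) -> bool:
    """rcg maps each virtual register to the set of registers it conflicts with.
    A BFS two-coloring succeeds exactly when a conflict-free A/B bank
    assignment exists without inserting inter-bank copies."""
    bank = {}
    for start in rcg:
        if start in bank:
            continue
        bank[start] = "A"
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in rcg.get(v, ()):
                if w not in bank:
                    bank[w] = "B" if bank[v] == "A" else "A"
                    queue.append(w)
                elif bank[w] == bank[v]:
                    return False            # odd cycle: no two-bank assignment
    return True

# A chain of conflicts fits in two banks; a triangle of conflicts does not.
assert rcg_is_bipartite({"x": {"y"}, "y": {"x", "z"}, "z": {"y"}})
assert not rcg_is_bipartite({"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}})
```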

6 Conclusions

In this paper, we have presented a register allocation framework for banked register files with access constraints, targeting the IXP network processors. Our approach relies on estimating the costs and benefits of assigning a virtual register to a specific bank, as well as of splitting it across multiple banks via copy instructions. We decide between bank assignment and live range splitting based on an analysis of these costs and benefits, which helps to balance the register pressure among the banks. When splitting a live range, we use copy instructions instead of loads/stores and force the split live ranges into different banks. Though this may introduce additional copies, it can significantly reduce the number of memory accesses. Preliminary experiments show that, compared with previous work, our framework better balances the register pressure and reduces the number of spills, which in turn results in improved performance.

References

1. Huang, J.H.: Network processor design. In: Proceedings of the 5th International Conference on ASIC (2003)
2. Tseng, J.H., Asanović, K.: Banked multiported register files for high-frequency superscalar microprocessors. In: Proceedings of the 30th Annual International Symposium on Computer Architecture (2003)
3. Cruz, J.L., González, A., Valero, M., Topham, N.P.: Multiple-banked register file architectures. In: Proceedings of the 27th Annual International Symposium on Computer Architecture (2000)
4. Balasubramonian, R., Dwarkadas, S., Albonesi, D.H.: Reducing the complexity of the register file in dynamic superscalar processors. In: 34th International Symposium on Microarchitecture (MICRO-34) (2001)
5. Intel: Intel IXP2400 Network Processor Hardware Reference Manual. Intel Corporation (2003)
6. Chow, F.C., Hennessy, J.L.: The priority-based coloring approach to register allocation. ACM Transactions on Programming Languages and Systems (TOPLAS) 12 (1990)
7. Intel: Intel IXP2400/IXP2800 Network Processor Programmer's Reference Manual. Intel Corporation (2003)
8. Chen, M.K., Li, X.F., Lian, R., Lin, J.H., Liu, L., Liu, T., Ju, R.: Shangri-La: achieving high performance from compiled network applications while enabling ease of programming. In: PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, ACM Press (2005)
9. Zhuang, X., Pande, S.: Resolving register bank conflicts for a network processor. In: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT '03) (2003)
10. George, L., Appel, A.W.: Iterated register coalescing. In: POPL '96: Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, New York, NY, USA, ACM Press (1996)
11. Farkas, K.I.: Memory-System Design Considerations for Dynamically-Scheduled Microprocessors. PhD thesis, University of Toronto (1997)

12. Sedgewick, R.: Algorithms in C++, Part 5: Graph Algorithms. Addison-Wesley/Pearson (2001)
13. Stoer, M., Wagner, F.: A simple min-cut algorithm. Journal of the ACM (JACM) (1997)
14. Hiser, J., Carr, S.: Global register partitioning. In: International Conference on Parallel Architectures and Compilation Techniques (2000)
15. Jang, S., Carr, S., Sweany, P., Kuras, D.: A code generation framework for VLIW architectures with partitioned register banks. In: Third International Conference on Massively Parallel Computing Systems (1998)
16. Cho, J., Paek, Y., Whalley, D.: Efficient register and memory assignment for non-orthogonal architectures via graph coloring and MST algorithms. In: LCTES-SCOPES (2002)
17. Keyngnaert, P., Demoen, B., De Sutter, B., De Bus, B., et al.: Conflict graph based allocation of static objects to memory banks. In: Semantics, Program Analysis, and Computing Environments for Memory Management (2001)
18. Johnson, E.J., Kunze, A.R.: IXP2400/2800 Programming: The Complete Microengine Coding Guide. Intel Press (2003)
19. George, L., Blume, M.: Taming the IXP network processor. In: PLDI (2003)
20. Park, J., Lee, J.H., Moon, S.M.: Register allocation for banked register file. In: Languages, Compilers, and Tools for Embedded Systems (LCTES) (2001) 39-47


More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400

Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400 The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009 289 Implementation of Adaptive Buffer in Video Receivers Using Network Processor IXP 2400 Kandasamy Anusuya, Karupagouder

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

New Memory Organizations For 3D DRAM and PCMs

New Memory Organizations For 3D DRAM and PCMs New Memory Organizations For 3D DRAM and PCMs Ademola Fawibe 1, Jared Sherman 1, Krishna Kavi 1 Mike Ignatowski 2, and David Mayhew 2 1 University of North Texas, AdemolaFawibe@my.unt.edu, JaredSherman@my.unt.edu,

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Instruction Set Architecture. "Speaking with the computer"

Instruction Set Architecture. Speaking with the computer Instruction Set Architecture "Speaking with the computer" The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture Digital Design

More information

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing Linköping University Post Print epuma: a novel embedded parallel DSP platform for predictable computing Jian Wang, Joar Sohl, Olof Kraigher and Dake Liu N.B.: When citing this work, cite the original article.

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.

More information

Implementation of Boundary Cutting Algorithm Using Packet Classification

Implementation of Boundary Cutting Algorithm Using Packet Classification Implementation of Boundary Cutting Algorithm Using Packet Classification Dasari Mallesh M.Tech Student Department of CSE Vignana Bharathi Institute of Technology, Hyderabad. ABSTRACT: Decision-tree-based

More information

378: Machine Organization and Assembly Language

378: Machine Organization and Assembly Language 378: Machine Organization and Assembly Language Spring 2010 Luis Ceze Slides adapted from: UIUC, Luis Ceze, Larry Snyder, Hal Perkins 1 What is computer architecture about? Computer architecture is the

More information

A Report on Coloring with Live Ranges Split

A Report on Coloring with Live Ranges Split A Report on Coloring with Live Ranges Split Xidong Wang Li Yang wxd@cs.wisc.edu yangli@cs.wisc.edu Computer Science Department University of Wisconsin Madison December 17, 2001 1 Introduction One idea

More information