A Report on Coloring with Live Ranges Split

Xidong Wang (wxd@cs.wisc.edu) and Li Yang (yangli@cs.wisc.edu)
Computer Science Department, University of Wisconsin-Madison
December 17, 2001

1 Introduction

One idea for improving the Chaitin-style coloring algorithm [2] for register allocation is to split live ranges. The Chaitin-style algorithm we implemented in assignment 2 simply gives up on allocating a register to a live range if that live range has too many neighbors. Intuitively, splitting live ranges can increase the chance of allocating a register to a live range: the smaller a live range is, the fewer neighbors it interferes with, and hence the better its chance of receiving a register.

The priority-based algorithm [1] addresses this issue, but its results are usually not practically acceptable, because its live range split strategy aims only to color as many live ranges as possible, without considering whether a split yields a net gain or a net loss.

Current work on live range splitting falls into three categories, depending on where the split happens: 1) pre-allocation split applies live range splitting before the register allocation phase; 2) post-allocation split applies it after the register allocation phase; and 3) inter-allocation split applies it only when a live range cannot be allocated a register and would otherwise be spilled. In this project we design and implement three live range split algorithms, one for each category, and compare their optimization effects.

The remaining sections are arranged as follows. Section 2 describes our algorithms in detail. Section 3 presents the experiments and evaluates the performance of the three algorithms.
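For concreteness, the following is a minimal sketch of the Chaitin-style simplify/spill decision described above, written in Python against a plain adjacency dictionary rather than our assignment 2 data structures; the names, the spill-cost table, and the example are illustrative only. The branch taken when every remaining node has K or more neighbors is where the allocator gives up and spills, and it is exactly the point where a live range split could intervene instead.

```python
# Minimal sketch of Chaitin-style simplification on a plain adjacency dict.
# `adj` maps each live range to the set of live ranges it interferes with;
# `spill_cost` gives an estimated cost of spilling each live range.
# A node with fewer than K neighbors can always be colored later, so it is
# pushed on a stack; otherwise the allocator gives up and spills one node,
# which is the decision point that live range splitting targets.

def simplify(adj, spill_cost, K):
    adj = {n: set(neighbors) for n, neighbors in adj.items()}  # local copy
    stack, spilled = [], []
    while adj:
        node = next((n for n in adj if len(adj[n]) < K), None)
        if node is None:
            # Every remaining live range has >= K neighbors: spill the
            # cheapest one.  A splitting allocator would instead try to
            # shrink a live range here rather than spill it outright.
            node = min(adj, key=lambda n: spill_cost[n])
            spilled.append(node)
        else:
            stack.append(node)
        for m in adj.pop(node):
            adj[m].discard(node)
    return stack, spilled


# Example: three mutually interfering live ranges with K = 2 registers.
if __name__ == "__main__":
    adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    print(simplify(adj, {"a": 3, "b": 1, "c": 2}, K=2))
```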

2 Algorithm Description

We design and implement three different algorithms for live range splitting. They are named the post-algorithm, the pre-algorithm, and the inter-algorithm, corresponding to the three categories of live range split strategies listed in the previous section. The algorithms rest on two facts that we believe hold in most programs:

1. A live range usually contains holes. By hole, we mean that the basic blocks of a live range are not contiguous.

2. The variable corresponding to a live range is not used evenly within the range; in some basic blocks it is used more frequently than in others. One important special case is a loop body: we expect to save considerable load/store overhead if a loop body can be allocated a register.

The three algorithms use similar techniques. When a live range is to be split, compensation code is inserted at the boundary where the pending split will happen, to save and restore the value of the variable belonging to the live range. This compensation code is referred to as a portal in this report. Given a live range, we use an evaluation model to decide whether or not to split it: its boundary is recognized first, and the benefit in register utilization is weighed in advance against the cost of the portal compensation code the pending split would introduce. We call this evaluation model the benefit-cost model. Unlike the priority-based approach, we commit a pending split only when it is beneficial to do so.

2.1 Post-algorithm

A live range may have been allocated a register even though the register is used only infrequently in some of its blocks. If other variables in those blocks have not been assigned a register but are used often, switching the register to those frequently used variables would be valuable. The post-algorithm finds this kind of live range. In the extreme case, some registers are simply left unassigned in some blocks by the coloring register allocator, and exploiting such free registers should also help performance. The detailed algorithm is as follows.

1. Recognize the holes in a long live range. In the implementation, the maximum variable usage over the blocks of the range is computed, and any block in which the variable's usage is less than 10% of that maximum is considered a hole.

2. Merge adjacent holes into a continuous hole range, to decrease the number of portals and maximize the split benefit.

3. For each block in the hole range, find the unassigned variable whose usage in that block is largest, and assign the freed register to it.

4. Compute the benefit-cost of the split. The benefit is the reduction in memory accesses for the variables that receive the freed register; the cost is the save/restore code in the portals plus the memory accesses of the split victim within the hole range.
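The hole recognition of step 1 and the benefit-cost test of step 4 can be summarized in a short sketch. This is only an illustration of the rules stated above, not the project's implementation: `usage` is a hypothetical per-variable, per-block access-count table, PORTAL_COST charges one save plus one restore per portal, and the hole merging of step 2 is omitted.

```python
# Sketch of the post-algorithm's hole recognition and benefit-cost test.
# usage[var][block] is assumed to give how often `var` is referenced in
# `block`; PORTAL_COST models one save plus one restore per portal.

PORTAL_COST = 2          # extra memory accesses per portal (save + restore)
HOLE_THRESHOLD = 0.10    # a block is a hole if usage < 10% of the peak usage

def find_holes(blocks, var, usage):
    """Blocks of a live range in which `var` is (almost) unused."""
    peak = max(usage[var][b] for b in blocks)
    return [b for b in blocks if usage[var][b] < HOLE_THRESHOLD * peak]

def split_is_beneficial(hole_blocks, victim, candidate, usage, n_portals):
    """Benefit-cost test for lending the victim's register inside its holes.

    Benefit: accesses saved by giving the freed register to `candidate` in
    the hole range.  Cost: portal save/restore code plus the accesses the
    split victim must now make from memory in those blocks."""
    benefit = sum(usage[candidate][b] for b in hole_blocks)
    cost = n_portals * PORTAL_COST + sum(usage[victim][b] for b in hole_blocks)
    return benefit > cost


# Tiny example: `x` owns a register but is barely used in blocks B2-B3,
# where `y` is used heavily.
if __name__ == "__main__":
    usage = {"x": {"B1": 20, "B2": 1, "B3": 0, "B4": 15},
             "y": {"B1": 0, "B2": 9, "B3": 7, "B4": 1}}
    holes = find_holes(["B1", "B2", "B3", "B4"], "x", usage)
    print(holes, split_is_beneficial(holes, "x", "y", usage, n_portals=2))
```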

2.2 Pre-algorithm

The idea of the pre-algorithm was inspired by the breakdown of the post-algorithm in most cases. We had expected that the post-algorithm would let us reuse registers stolen from the holes of live ranges in other basic blocks and gain some profit, but the experimental results fell short of that expectation. On reflection, we recognized that most basic blocks in programs contain only three to five instructions, so what we gain often cannot offset the price we pay to save the contents of the freed registers on behalf of the original live ranges. We therefore decided to try an algorithm that performs the live range split before register allocation.

The pre-algorithm's key idea is based on an observation: in a program, most memory accesses happen in a small portion of its blocks, referred to as the core, and these blocks usually belong to a loop structure. If registers are carefully scheduled and allocated within the core, the register allocation of the whole program improves. Conversely, if a live range spans a core and some other non-core blocks, the non-core blocks interfere with the register allocation of the core blocks and hurt overall allocation quality. Based on this discussion, the pre-algorithm splits live ranges aggressively at core boundaries before the register allocation phase. Our usage model decides how to split a live range, and it usually splits at a loop boundary. After the initial interference graph is constructed, the pre-algorithm proceeds as follows.

1. For each live range in the interference graph, mark each basic block in the range as either core or non-core, similarly to the hole recognition in the post-algorithm.

2. Split the live range into one or more cores and a non-core set. A core is usually delimited by a loop boundary, or it is whatever was recognized as core in step 1; the remaining blocks go into the non-core set. The cores and the non-core set are assigned registers independently in the allocation phase.

3. Rebuild the interference graph. After all live ranges have been visited, proceed to the register allocation and code generation phases.
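Under the same simplifying assumptions as the previous sketch, the pre-allocation split can be outlined as follows. The `Block` record, its fields, and the 10% core threshold are illustrative stand-ins (the report does not fix the usage model at this level of detail); the sketch only shows how a live range is partitioned into per-loop cores and a non-core set before the interference graph is rebuilt.

```python
# Sketch of the pre-algorithm's core/non-core split, applied before
# allocation.  A block counts as core if it lies inside a loop or carries a
# large share of the variable's accesses (mirroring the post-algorithm's
# hole test).  Field names and the threshold are illustrative.

from collections import namedtuple

Block = namedtuple("Block", "name loop_id uses")   # loop_id is None outside loops

CORE_THRESHOLD = 0.10

def pre_split(blocks):
    """Split one live range into per-loop cores plus one non-core set."""
    peak = max(b.uses for b in blocks)
    cores, non_core = {}, []
    for b in blocks:
        if b.loop_id is not None or b.uses >= CORE_THRESHOLD * peak:
            cores.setdefault(b.loop_id, []).append(b)   # split at loop boundary
        else:
            non_core.append(b)
    # Each core and the non-core set become separate live ranges; the
    # interference graph is rebuilt before allocation (step 3 above).
    return list(cores.values()), non_core


# Example: a live range spanning a loop (L1) and some straight-line code.
if __name__ == "__main__":
    rng = [Block("B1", None, 1), Block("B2", "L1", 12),
           Block("B3", "L1", 9), Block("B4", None, 0)]
    print(pre_split(rng))
```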

2.3 Inter-algorithm

Neither the post-algorithm nor the pre-algorithm achieves optimal results. In the pre-algorithm, the problem stems from splitting live ranges too aggressively before register allocation: at that point we lack the information needed to decide which live ranges should be split and which should not. The aggressive approach splits almost every live range that contains a loop, even though some of them could in fact have been assigned a color. And in some cases a live range still cannot be assigned a color even after it has been split, because other live ranges have much higher cost. All of these circumstances lead to unbalanced overhead from live range splitting. We therefore looked for a compromise between the post-algorithm and the pre-algorithm and placed the live range split inside the register allocation itself.

The idea of the inter-algorithm is that, if a live range cannot be assigned a register by the graph coloring algorithm, and if it can be split into core and non-core sets, we split that live range and redo register allocation. Because the split live range is shorter, we hope it can then be assigned a register in the allocation phase. After the initial interference graph is constructed, the inter-algorithm proceeds as follows.

1. Perform register allocation with the graph coloring algorithm.

2. Whenever a live range would have to be spilled and contains a core, split it into cores and a non-core set, as in the pre-algorithm.

3. Repeat step 2 until all live ranges that cannot be assigned a color have been evaluated.

4. Redo the graph coloring register allocation.
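The inter-algorithm is essentially a retry loop around the allocator. The sketch below reuses `Block` and `pre_split` from the previous sketch and replaces the real graph coloring allocator with a toy greedy `allocate` so the control flow runs end to end; `interferes`, `allocate`, and `has_core` are simplified stand-ins, not the project's routines.

```python
# Control-flow sketch of the inter-algorithm: color, split the uncolorable
# live ranges that contain a core, then color once more.
# Reuses Block and pre_split from the pre-algorithm sketch above.

def interferes(a, b):
    """Two live ranges interfere if they share a basic block (simplified)."""
    return bool({blk.name for blk in a} & {blk.name for blk in b})

def allocate(ranges, K):
    """Toy greedy coloring: returns ({range index: color}, [spilled indices])."""
    coloring, spilled = {}, []
    for i, r in enumerate(ranges):
        taken = {coloring[j] for j in coloring if interferes(ranges[j], r)}
        free = [c for c in range(K) if c not in taken]
        if free:
            coloring[i] = free[0]
        else:
            spilled.append(i)
    return coloring, spilled

def has_core(r):
    return any(blk.loop_id is not None for blk in r)

def inter_allocate(ranges, K):
    coloring, spilled = allocate(ranges, K)              # step 1
    if not spilled:
        return coloring, spilled
    new_ranges = []
    for i, r in enumerate(ranges):                       # steps 2-3
        if i in spilled and has_core(r):
            cores, non_core = pre_split(r)               # split as in pre-algorithm
            new_ranges.extend(cores)
            if non_core:
                new_ranges.append(non_core)
        else:
            new_ranges.append(r)
    return allocate(new_ranges, K)                       # step 4: redo coloring


# Example with one register: range `b` is spilled at first; after splitting,
# its non-core part (B3) gets the register while its loop core still spills.
if __name__ == "__main__":
    a = [Block("B1", None, 4), Block("B2", "L1", 6)]
    b = [Block("B2", "L1", 8), Block("B3", None, 0)]
    print(inter_allocate([a, b], K=1))
```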

3 Experiment Evaluation

We take yacc.c as a typical benchmark and have run the three algorithms on the real programs in the assignment 2 test directory. The programs show the same pattern, so yacc is used to analyze algorithm performance. When only one or two machine registers are available, competition for registers is high and live range splitting is likely to happen.

3.1 Memory access performance for the post-algorithm

In yacc, 9 live ranges are taken as split candidates; after the benefit-cost computation, only 2 of them are finally split, saving 29 memory accesses in total. Considering that yacc is a real program whose total memory access count is orders of magnitude larger, the improvement is very poor. We tried other test programs and gathered similar results. Two factors contribute to the breakdown of the post-algorithm. First, we have to admit that the graph coloring algorithm is already successful at register allocation: very few live ranges are long enough and skewed enough to be recognized as split candidates. Second, basic blocks are very short, usually 3-5 instructions each, so the portal compensation code is a heavy burden; only two of the nine candidates in yacc pass the benefit-cost model, and the other seven cannot find enough benefit in splitting to offset the portal cost. Besides these two negative factors, another observation suggested that we stop pursuing the post-algorithm and try other approaches: the split does not occur on loop boundaries and does not let other variables use free registers inside loops, so the benefit of the post-algorithm cannot be large. All of the performance data pointed us toward the pre-algorithm.

3.2 Memory access performance for the pre-algorithm

In yacc, the pre-algorithm aggressively splits live ranges before register allocation and achieves somewhat better performance. In one procedure, 18282 memory accesses happen without the pre-algorithm versus 17915 with it; in another test case, the counts are 1111 without the pre-algorithm and 963 with it. This is an encouraging result, showing that live range splitting does help register allocation. Still, the improvement is very small, no more than 1% of the total memory accesses. We analyzed the algorithm's behavior in detail and found a couple of factors behind the poor improvement. First, a long live range is usually full of holes, and it is not uncommon for 90% of a long live range to consist of holes, so splitting long live ranges should help; one reason it does not is that machine registers are so few that, in the blocks inside loop boundaries, the registers are already fully utilized by the graph coloring algorithm, leaving almost no room for the split live ranges. Second, even when the cores produced by splitting a long live range are assigned registers, the compensation cost in the portals offsets the benefit. If we increase the number of machine registers, register pressure inside loops decreases and the split cores can grab some unassigned machine registers; but as the number of registers increases, the graph coloring algorithm by itself already reduces memory accesses dramatically. The 17915/18282 case above, measured with two registers, becomes 953/1415 when five registers are available. In other words, live range splitting can improve register allocation, but that improvement does not raise total performance by much.

3.3 Memory access performance for the inter-algorithm

The inter-algorithm shows performance data similar to the pre-algorithm. The split does not bring significant performance improvement, because the cores produced by the split are still not allocated registers. In the pre-algorithm's 17915/18282 case, the memory access count under the inter-algorithm is 17893, a very small improvement.

4 Conclusion

Live range splitting is hard. Although splitting live ranges is intuitively a good optimization idea, and although there are many feasible ways to implement it, none of the three algorithms we developed gains as much as we expected. The algorithms are still useful in some sense, but they are program-sensitive: for some programs they work better and achieve some degree of optimization, though the profit is not noticeable, while in other cases they can even hurt, with the overhead of splitting not offset by what the split gains. The graph coloring algorithm appears to work well in the core parts of a program, usually inside loop boundaries, so even when a long live range is split, its cores cannot grab machine registers and little improvement is obtained. Furthermore, the portal compensation cost of storing and reloading values to preserve correctness is a heavy burden on the benefit-cost model of live range splitting. Our three algorithms nevertheless provide some hints about live range splitting: although the performance improvement is not significant, the prevalence of holes in long live ranges keeps live range splitting a challenging issue for register allocation research.

References

[1] F. C. Chow and J. L. Hennessy. The Priority-Based Coloring Approach to Register Allocation. ACM TOPLAS, October 1990.

[2] G. J. Chaitin et al. Register Allocation via Coloring. Computer Languages, 1981.

[3] Guei-Yuan Lueh. Fusion-Based Register Allocation. School of Computer Science, Carnegie Mellon University.

[4] William M. Waite. Global Register Allocation. Department of Electrical and Computer Engineering, University of Colorado.