Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault. IBM Toronto Lab. Oct 17th, 2005


Architecture Cloning For PowerPC Processors
Edwin Chan, Raul Silvera, Roch Archambault
edwinc@ca.ibm.com
IBM Toronto Lab
Oct 17th, 2005

Outline
  - Motivation
  - Implementation Details
  - Results

Scenario
  - Previously, the IBM XL compiler offered only 2 ways to create an executable compatible with multiple PowerPC processors:
    - Generate generic instructions
      - Unable to take advantage of the latest hardware features
      - Suboptimal performance on all platforms
    - Recompile the application for each architecture
      - Recompilation takes a long time
      - Adds build complexity, more support headaches, and a longer time to ship
  - Example: ISV (Independent Software Vendor)

Our Approach
  - Architecture Cloning
    - Introduced in the latest version of the XL compiler
    - Allows the compiler to target more than one PowerPC processor
    - Additional targets supported: Power4, Power5, and PPC970
    - Generates different instructions optimized for each target
    - Inserts a runtime check in the program to select the appropriate code path for the hardware platform
  - To enable architecture cloning, compile with -qipa and specify -qipa=clonearch=target on the link step

Outline
  - Motivation
  - Implementation Details
  - Results

How Architecture Cloning Works
  - Architecture cloning is divided into 2 phases:
    - Analysis phase
    - Transformation phase

Analysis Phase
  - Goal: minimize the impact of architecture cloning on link time and executable size by reducing the number of procedures to clone
  - Examines each node in the call graph to eliminate candidates
    - First, it identifies the procedures that cannot be cloned
      - Ex. procedures not compiled with -qipa, etc.
    - Finally, it avoids cloning unprofitable procedures
      - Ex. procedures marked as having low calling frequency in the call graph, etc.
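The two elimination steps above can be sketched in C as a single pass over the call-graph nodes. This is only an illustration under stated assumptions: the ProcNode structure, its field names, and the min_frequency cutoff are hypothetical and are not taken from the XL compiler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical call-graph node; the field names are illustrative only. */
typedef struct {
    const char *name;
    bool compiled_with_ipa;   /* false: no IPA information, so it cannot be cloned */
    unsigned call_frequency;  /* estimated calling frequency from the call graph */
    bool clone;               /* output: selected as a cloning candidate */
} ProcNode;

/* Marks the cloning candidates and returns how many were selected.
   min_frequency is an assumed cutoff for "low calling frequency". */
size_t select_clone_candidates(ProcNode *nodes, size_t n, unsigned min_frequency)
{
    size_t selected = 0;
    for (size_t i = 0; i < n; i++) {
        /* Step 1: eliminate procedures that cannot be cloned. */
        bool clonable = nodes[i].compiled_with_ipa;
        /* Step 2: eliminate procedures unlikely to repay the code growth. */
        bool profitable = nodes[i].call_frequency >= min_frequency;

        nodes[i].clone = clonable && profitable;
        if (nodes[i].clone)
            selected++;
    }
    return selected;
}
```

Every procedure that fails either test stays as a single generic copy, which is what keeps link time and executable size down.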

How To Assist the Analysis
  - Users can tell the compiler which procedures it should or should not clone
    - With the compiler suboptions -qipa=cloneproc=procname and -qipa=nocloneproc=procname
    - Helpful in cases where 10% of the code is executed 90% of the time
  - When PDF (Profile-Directed Feedback) is used, the calling frequency is known, so the analysis is more accurate
    - A more aggressive analysis is performed: it selects procedures starting from the hottest until a threshold is reached
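A minimal sketch of hottest-first selection, in C. The slide does not say what the threshold measures, so this example assumes it is a share of the total profiled call count (the coverage parameter). ProfiledProc and select_hottest are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical profile record for one procedure. */
typedef struct {
    const char *name;
    unsigned long count;  /* profiled call count */
    int clone;            /* output: 1 if selected for cloning */
} ProfiledProc;

/* qsort comparator: descending by call count. */
static int by_count_desc(const void *a, const void *b)
{
    unsigned long ca = ((const ProfiledProc *)a)->count;
    unsigned long cb = ((const ProfiledProc *)b)->count;
    return (ca < cb) - (ca > cb);
}

/* Sorts procs hottest-first and selects procedures until the selected
   ones account for at least `coverage` (0.0 to 1.0) of all profiled
   calls. Returns the number of procedures selected. */
size_t select_hottest(ProfiledProc *procs, size_t n, double coverage)
{
    unsigned long total = 0, covered = 0;
    size_t selected = 0;

    for (size_t i = 0; i < n; i++) {
        total += procs[i].count;
        procs[i].clone = 0;
    }
    qsort(procs, n, sizeof procs[0], by_count_desc);

    for (size_t i = 0; i < n && total > 0; i++) {
        if ((double)covered >= coverage * (double)total)
            break;  /* threshold reached: leave the colder procedures alone */
        procs[i].clone = 1;
        covered += procs[i].count;
        selected++;
    }
    return selected;
}
```

With a skewed profile, a high coverage target is reached after only a few procedures, which matches the 10%/90% case mentioned above.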

Transformation Phase
  - Inserts a platform detection routine at the program's entry point
  - Performs procedure cloning on the candidates
  - Updates the call graph and inserts runtime checks in the program to select the right path
  - Puts the cloned procedures in a separate compilation unit

Insert Platform Detection Routine
  - Lets the generated binary determine the platform at runtime
  - Identify the entry point of the program from the call graph
  - Insert a platform detection routine at the beginning of the entry point
    - This routine obtains processor and OS information from the system
    - The returned result is stored in a global variable to be used for the runtime checks
  - Ex.
        int main() {
            system_arch = xl_platform_detection()
            ..
            if (system_arch == Pwr4)
                foo@pwr4()
            else
                foo()
            ..
        }
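The pattern in the example can be written as portable C roughly as follows. This is a sketch, not the compiler's generated code: detect_platform is a hypothetical stand-in for the inserted detection routine (it takes the processor name as an argument instead of querying the system), and foo_pwr4 stands in for the clone that the slide writes as foo@pwr4.

```c
#include <string.h>

typedef enum { ARCH_GENERIC, ARCH_PWR4, ARCH_PWR5, ARCH_PPC970 } Arch;

/* Global set once at program entry and read by every runtime check. */
Arch system_arch = ARCH_GENERIC;

/* Hypothetical stand-in for the compiler-inserted detection routine.
   The real routine obtains processor and OS information from the
   system; here the name is passed in to keep the sketch portable. */
Arch detect_platform(const char *cpu_name)
{
    if (strcmp(cpu_name, "pwr4") == 0)   return ARCH_PWR4;
    if (strcmp(cpu_name, "pwr5") == 0)   return ARCH_PWR5;
    if (strcmp(cpu_name, "ppc970") == 0) return ARCH_PPC970;
    return ARCH_GENERIC;  /* unknown hardware falls back to generic code */
}

/* Original procedure and its Power4 clone (illustrative bodies). */
const char *foo_generic(void) { return "generic path"; }
const char *foo_pwr4(void)    { return "pwr4 path"; }

/* A call site after transformation: the runtime check picks the path. */
const char *call_foo(void)
{
    if (system_arch == ARCH_PWR4)
        return foo_pwr4();
    return foo_generic();
}
```

A caller would run `system_arch = detect_platform(...)` once at the start of main and then use call_foo() wherever the original program called foo(). Falling back to the generic copy for unrecognized hardware is what keeps the binary compatible with older processors.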

Procedure Cloning
  - Why create duplicate procedure copies?
    - For TPO to apply different architecture-specific optimizations on each copy
    - For the TOBEY backend to generate different instructions and scheduling for each copy
  - The call graph is traversed from the top down to find the candidates, then for each one:
    - remap the parameters and duplicate the body of the procedure
    - add a suffix to the cloned procedure to indicate its target
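The suffixing step can be illustrated with a small C helper. The '@' separator is taken from the foo@pwr4 name used in the platform detection example; whether the compiler builds its names exactly this way is an assumption, and make_clone_name is a hypothetical function.

```c
#include <stddef.h>
#include <stdio.h>

/* Builds "<proc>@<target>" in buf. Returns 0 on success, or -1 if the
   name does not fit in buf (or formatting fails). */
int make_clone_name(char *buf, size_t size, const char *proc, const char *target)
{
    int n = snprintf(buf, size, "%s@%s", proc, target);
    if (n < 0 || (size_t)n >= size)
        return -1;  /* output was truncated or an encoding error occurred */
    return 0;
}
```

Because '@' cannot appear in a C identifier, a name of this form cannot collide with any procedure the user wrote.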

Update Call Graph
  - Attempts to divide the call graph into different sub-graphs
    - one sub-graph contains the cloned procedures
    - the other sub-graph contains the original procedures
  - In other words, the cloned callers invoke the cloned procedures directly instead of calling the original procedures
  - The decision for selecting the code path is moved as high as possible in the call graph
    - Therefore fewer runtime checks are inserted, and they are unlikely to be placed in the hot procedures
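The effect of moving the check up the call graph can be sketched in C. All names here (run, compute, kernel, and the runtime_checks counter) are hypothetical; the counter exists only to make the number of executed checks visible.

```c
typedef enum { ARCH_GENERIC, ARCH_PWR4 } Arch;

Arch system_arch = ARCH_GENERIC;
unsigned long runtime_checks = 0;  /* counts executed checks, for illustration */

/* Original sub-graph: compute -> kernel. */
static long kernel(long x) { return x + 1; }
static long compute(long n)
{
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc = kernel(acc);
    return acc;
}

/* Cloned sub-graph: the cloned caller invokes the cloned callee directly,
   so the hot loop contains no runtime check. */
static long kernel_pwr4(long x) { return x + 1; }
static long compute_pwr4(long n)
{
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc = kernel_pwr4(acc);
    return acc;
}

/* The single check sits above both sub-graphs. */
long run(long n)
{
    runtime_checks++;
    return (system_arch == ARCH_PWR4) ? compute_pwr4(n) : compute(n);
}
```

Calling run(1000) executes one check regardless of the iteration count. Had the check been placed at the call to kernel inside the loop, it would execute 1000 times.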

Final Step of Transformation Phase
  - Put the cloned procedures in a separate compilation unit
    - TPO applies architecture-specific optimizations differently to those cloned procedures
    - TPO sends a separate Wcode with a different architecture setting for this compilation unit to TOBEY
    - TOBEY generates and schedules the instructions based on the architecture setting from the given Wcode
  - The resulting code is partitioned in memory such that the procedures for each target are contiguous
    - With demand paging, this minimizes the performance impact of code growth

Outline
  - Motivation
  - Implementation Details
  - Results

Runtime Comparison: Power4
  [Chart: runtime % speedup vs. generic instructions on a Pwr4 machine. Series: clonearch vs. ppc, and pwr4 vs. ppc. Benchmarks: bzip2, crafty, gzip, mcf, parser, twolf, ammp, art, equake, facerec, fma3d, galgel, lucas, sixtrack, swim. Y-axis: -10% to 30%.]

Runtime Comparison: Power5
  [Chart: runtime % speedup vs. generic instructions on a Pwr5 machine. Series: clonearch vs. ppc, and pwr5 vs. ppc. Benchmarks: bzip2, crafty, gzip, mcf, parser, twolf, ammp, art, equake, facerec, fma3d, galgel, lucas, sixtrack, swim. Y-axis: -10% to 35%.]

Runtime Comparison: PPC970 VMX
  [Chart: runtime % speedup vs. generic instructions on a PPC970 machine. Series: clonearch vs. ppc, and ppc970 vmx vs. ppc. Benchmarks: bzip2, crafty, gzip, mcf, parser, twolf, ammp, art, equake, facerec, fma3d, galgel, lucas, sixtrack, swim. Y-axis: -10% to 35%.]

Observations
  - Architecture cloning delivers performance similar to a binary optimized for one platform in most benchmarks across all 3 platforms
    - crafty and parser are under investigation
  - Some benchmarks benefit tremendously from architecture-specific instructions and scheduling
    - Ex. facerec, fma3d, lucas

Conclusions
  - Architecture Cloning:
    - Takes advantage of the latest PowerPC processor features
    - Also maintains compatibility with older PowerPC processors
    - All within a single code base and a single executable

Questions?