Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

Similar documents
15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Impact of Cache Coherence Protocols on the Processing of Network Traffic

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Continuous Adaptive Object-Code Re-optimization Framework

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault IBM Toronto Lab Oct 17 th, 2005

Many Cores, One Thread: Dean Tullsen University of California, San Diego

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures. Hassan Fakhri Al-Sukhni

Workloads, Scalability and QoS Considerations in CMP Platforms

Optimizing Memory Bandwidth

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Exploiting Streams in Instruction and Data Address Trace Compression

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

Decoupled Zero-Compressed Memory

Microarchitecture Overview. Performance

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Instruction Based Memory Distance Analysis and its Application to Optimization

Low-Complexity Reorder Buffer Architecture*

Porting GCC to the AMD64 architecture p.1/20

Instruction Set Architectures

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Execution-based Prediction Using Speculative Slices

Dynamic Translation for EPIC Architectures

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Microarchitecture Overview. Performance

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

COMPILER OPTIMIZATION ORCHESTRATION FOR PEAK PERFORMANCE

Precise Instruction Scheduling

Computer System. Performance

The V-Way Cache : Demand-Based Associativity via Global Replacement

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Build a program in Release mode and an executable file project.exe is created in Release folder. Run it.

Analysis of Program Based on Function Block

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Power Estimation of UVA CS754 CMP Architecture

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

A Study of the Performance Potential for Dynamic Instruction Hints Selection

DynamoRIO Examples Part I Outline

A Co-designed HW/SW Approach to General Purpose Program Acceleration Using a Programmable Functional Unit

Data Access History Cache and Associated Data Prefetching Mechanisms

Pentium 4 Processor Block Diagram

The Gap Between the Virtual Machine and the Real Machine. Charles Forgy Production Systems Tech

APPENDIX Summary of Benchmarks

Narrow Width Dynamic Scheduling

X86 Addressing Modes Chapter 3" Review: Instructions to Recognize"

Breaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell Gu-Yeon Wei, David Brooks

The Instruction Set. Chapter 5

SimPoint 3.0: Faster and More Flexible Program Analysis

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Low Level Programming Lecture 2. International Faculty of Engineerig, Technical University of Łódź

Data Prefetch and Software Pipelining. Stanford University CS243 Winter 2006 Wei Li 1

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

x86 assembly CS449 Spring 2016

Program Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

Instruction Set Architectures

Design of Experiments - Terminology

Efficient, Transparent, and Comprehensive Runtime Code Manipulation

ATOS introduction ST/Linaro Collaboration Context

Cache Optimization by Fully-Replacement Policy

CSC 8400: Computer Systems. Machine-Level Representation of Programs

EECE.3170: Microprocessor Systems Design I Summer 2017 Homework 4 Solution

Towards the Hardware"

A Power and Temperature Aware DRAM Architecture

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

Lecture 15 Intel Manual, Vol. 1, Chapter 3. Fri, Mar 6, Hampden-Sydney College. The x86 Architecture. Robb T. Koether. Overview of the x86

Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors

Introduction to IA-32. Jo, Heeseung

INTRODUCTION TO IA-32. Jo, Heeseung

Efficient Program Compilation through Machine Learning Techniques

Dynamically Controlled Resource Allocation in SMT Processors

Complex Instruction Set Computer (CISC)

2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set

CSC 2400: Computer Systems. Towards the Hardware: Machine-Level Representation of Programs

Efficient Architecture Support for Thread-Level Speculation

Computer Science 246. Computer Architecture

Chapter-5 Memory Hierarchy Design

Procedure Calls. Young W. Lim Sat. Young W. Lim Procedure Calls Sat 1 / 27

Computing Architectural Vulnerability Factors for Address-Based Structures

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

The von Neumann Machine

Cache Insertion Policies to Reduce Bus Traffic and Cache Conflicts

Using MMX Instructions to Perform Simple Vector Operations

x86 assembly CS449 Fall 2017

Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium Processors

Lab 3: Defining Data and Symbolic Constants

Outline. Speculative Register Promotion Using Advanced Load Address Table (ALAT) Motivation. Motivation Example. Motivation

Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Computer Labs: Profiling and Optimization

Efficient Program Compilation through Machine Learning Techniques

Transcription:

I Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL Inserting Prefetches IA-32 Execution Layer - 1

Agenda IA-32EL Brief Overview Prefetching in Loops IA-32EL Prefetching in Loops Implementation Performance # s Related and Future Work Inserting Prefetches IA-32 Execution Layer - 2

What is IA-32 EL? A dynamic binary translator Enables the execution of IA-32 applications on Intel Itanium Processor (IPF) platforms Only user-level, 32-bit protected flat applications Runs above the IPF Operating System Entirely in user mode Invoked by the OS (entirely transparent to user, OS must be 32bit aware) 99% of the translator is OS-independent, with a small OS-specific glue-layer SW only solution NO IA-32 instructions executed on dedicated IA-32 IPF hardware Inserting Prefetches IA-32 Execution Layer - 3

What is IA-32 EL? (2) Three phase translation Interpreter, Cold, Hot Interpretation phase The first n times a basic-block is executed it is executed by the interpreter Saves cost of translation of blocks that are executed <= n times Cold phase minimal optimizations pre-prepared per IA-32 instruction IPF machine code used/taken count instrumentation Hot phase cold block whose used-count reaches a threshold trace-hyperblock formation sequence of cold blocks that flow into each other with high prob May form Loop per IA-32 instruction Intermdiate Language (IL) optimizations on IL (including prefetch), scheduling IL IPF machine code Inserting Prefetches IA-32 Execution Layer - 4

Prefetching Early transfer of data from memory into the data cache so at access time it is in the cache => know far enough ahead the data access address. Loops may provide opportunity Inserting Prefetches IA-32 Execution Layer - 5

Software Prefetching in Loops issuing a prefetch instruction in iteration i, in order that data accessed in iteration i+k will be cache resident in iteration i+k ld8 [r35] lfetch [r35+distance(access[i],access[i+k])] Determine k Determine distance Inserting Prefetches IA-32 Execution Layer - 6

Software Prefetching in Loops (2) k (#iterations ahead) Is a function of the #cycles in an iteration of the loop, and the cache-miss penalty cache_miss_penalty/iteration_cycles Assume miss goes to memory distance (from current access to prefetch access) Need to determine how the access changes over the iterations When register(s) used in the address computation are changing by a constant in each iteration Strided access => distance = k*stride Inserting Prefetches IA-32 Execution Layer - 7

Loops in IA-32EL Hyperblock recognition phase Sequence of cold (basic) blocks whose used/taken counters indicate a high probability of flow in one direction 80% 90% B A 20% C 10% Merge basic blocks into hyperblock A B D D E F 100% 90% F G Inserting Prefetches IA-32 Execution Layer - 8

Analysis for Inserting Prefetches Determine strides use only info from analysis of IA-32 instructions in the loop For IL generation, IA-32 instructions in the loop are decoded Record information about all IA-32 registers used in address calculations, in order to determine how those registers change in the loop-hyper-block. At end of IL generation use the information to insert prefetch ILs No IA-32 prefetch instructions from source binary are translated Inserting Prefetches IA-32 Execution Layer - 9

Analysis for Inserting Prefetches (2) IA-32 integer registers are classified as one of the following for each data access: 1) Invariant 2) Changed by add/sub of explicit constant, invariant register, or invariant memory 3) Loaded from apparent invariant memory both location and content 4) Changed in a manner that can not be determined Inserting Prefetches IA-32 Execution Layer - 10

Analysis for Inserting Prefetches (3) IA-32 memory access of form [base_reg, index_reg*scale] offset Candidate for constant stride prefetching base_reg is invariant, index_reg is changing by constant (explicit, register or memory) or vice-versa Inserting Prefetches IA-32 Execution Layer - 11

Prefetch Example(1) Address is being changed by explicit constant fld qword ptr [eax] // eax is += 4 addp4 r35 = r.eax, r0 // zxt32 ldfd r.fp.stack0 = [r35] add r32 = r35,(k*4) // k is the computed #iterations // ahead to prefetch for the fld lfetch.nt1 [r32] Inserting Prefetches IA-32 Execution Layer - 12

Prefetch Examples (2) Address is being changed by register fld qword ptr [eax] // eax is += edx addp4 r35 = r.eax, r0 ldfd r.fp.stack0 = [r35] shladd r32 = r.edx,z,r35 // z is the log2 of the // computed #iterations ahead to // prefetch for the fld lfetch.nt1 [r32] Inserting Prefetches IA-32 Execution Layer - 13

Prefetch Examples (3) Address is being changed by invariant memory fld qword ptr [eax] ::::::: add eax, [esp+0xc] // [esp+0xc] is most likely // loop invaraint addp4 r35 = r.eax, r0 ldfd r.fp.stack0 = [r35] ::: add r34 = r.esp + 0xc ld4 r33 = [r34] add r.eax = r.eax, r33 shladd r32 = r33,z,r35 // z is the log2 of the computed // #iterations ahead to prefetch for the fld lfetch.nt1 [r32] Inserting Prefetches IA-32 Execution Layer - 14

Prefetch Examples (4) Unoptimized code Mov edx, DWORD PTR [ebp-32] Add edx, edi Mov eax, DWORD PTR [ebp-32] Add eax, DWORD PTR [ebp-96] Fld st(0) Mov esi, DWORD PTR [ebp+020h] Fmul QWORD PTR [esi+eax*8-8] Mov eax, DWORD PTR [ebp+034h] Fadd QWORD PTR [eax+edx*8-8] Fstp QWORD PTR [eax+edx*8-8] Inc DWORD PTR[ebp-32] Dec ecx Jns $-35 Inserting Prefetches IA-32 Execution Layer - 15

Analysis for Inserting Prefetches (4) IA-32 memory access of form [base_reg, index_reg*scale] offset Candidate for speculative stride prefetching base_reg is loaded and that load is an explicit constant stride e.g. array of pointers Speculatively load pointer of iteration i+k Prefetch the location that pointer i+k points to Inserting Prefetches IA-32 Execution Layer - 16

Prefetch Examples (5) Speculative Prefetch mov eax, dword ptr[ebp] fld qword ptr [eax] add ebp, 4 add r32 = r.ebp,(k*4) // k is the computed #iterations ahead // to prefetch for the fld ld4.s r32 = [r32] lfetch.nt1 [r32] Inserting Prefetches IA-32 Execution Layer - 17

Analysis for Inserting Prefetches (5) IA-32 memory access of form [base_reg, index_reg*scale] offset Candidate for dynamic stride detection prefetch If (stride on iteration[i-1] == stride on iteration[i]) Prefetch using stride FP access which is not constant or speculative, and in heavy FP loop Inserting Prefetches IA-32 Execution Layer - 18

Prefetch Examples (6) Dynamic Prefetch fld qword ptr [eax] // compute current stride sub r32 = r.eax, r.prev_addr // compare current stride to previous stride cmp.eq.unc p44, p.true = r32, r.prev_stride // z is the log2 of the computed // #iterations ahead to prefetch for the fld shladd r33 = r.prev_stride, z, r.eax // record current address or r.prev_addr = 0x0, r.eax // record current stride or r.prev_stride = 0x0, r32 (p44) lfetch.nt1 [r33] Inserting Prefetches IA-32 Execution Layer - 19

Analysis for Inserting Prefetches (6) Candidates grouped can be serviced by same prefetch when the address range of the group does not exceed 0.5 of the L3 cache line size [eax] [eax]+8 Inserting Prefetches IA-32 Execution Layer - 20

Analysis for Inserting Prefetches (7) Select candidate groups to prefetch for Determine how many iterations prefetch can cover Explicit constant stride total access length over n iterations is <= 0.25 L3 cache line size Prefetch is predicated accordingly or omitted in unrolled iterations Determine how many iterations ahead to prefetch Iteration_cycles is guestimate derived from the estimates of the cycles of the cold blocks that compose the hyperblock Inserting Prefetches IA-32 Execution Layer - 21

CPU2K Performance Results ~15% improvement on FP for IA-32 Binaries with NO prefetch instructions ~10% improvement on FP for IA-32 Binaries with prefetch instructions (average_wpf in graph) Naïve translation of IA-32 prefetch instructions VS Generated prefetches only No significant affect on INT L3 Cache miss cost percentage Inserting Prefetches IA-32 Execution Layer - 22

CPU2K FP Prefetch Impact. %. %. %. %. %. %. % -. % -. % wupwise swim mgrid applu mesa galgel art equake facerec ammp lucas fma spec# improvement l miss decrease sixtrack d aspi average average_wpf Inserting Prefetches IA-32 Execution Layer - 23

CPU2K INT Prefetch Impact. %. %. %. %. %. % gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip twolf average -. % spec# improvement l miss decrease Inserting Prefetches IA-32 Execution Layer - 24

Performance # s Overhead of analysis for inserting prefetches is negligible Dynamic stride prefetch significant gains on 183.equake and 189.lucas No losses on others Speculative prefetch no gains or losses needs further analysis Inserting Prefetches IA-32 Execution Layer - 25

Related Work Based on prefetch literature Dynamic and Speculative stride are new Other Approach Instrument IA32-EL hot loops to gather stride info dynamically After specified number of instrumented iterations- Regenerate the hot loop, using the discovered strides for prefetching Inserting Prefetches IA-32 Execution Layer - 26

Future Work Sample hardware performance registers to determine strides and misses (DATA_EAR events) Use the collected data Offline binary optimizers do this More complex static stride recognition Strides that result from a number of operations within the loop Prefetch for low iteration nested inner loops based on outer loop Inserting Prefetches IA-32 Execution Layer - 27

Backup Inserting Prefetches IA-32 Execution Layer - 28

CPU2K FP L3 Miss Events l miss events.wupwise.swim.mgrid.applu.mesa.galgel.art.equake.facerec.ammp.lucas.fma.sixtrack d.aspi Inserting Prefetches IA-32 Execution Layer - 29

CPU2K INT L3 Miss Events l miss events.gzip.vpr.gcc.mcf.crafty.parser.eon.perlbmk.gap.vortex.bzip.twolf Inserting Prefetches IA-32 Execution Layer - 30

Translation Framework Predication Loop unrolling IA-32 Decoding FP stack, EFLAGS, misalign Address CSE & disambiguation Value Tracking, Pre-fetching FP: FRA, fxchg, *+=fma, norm SSE format tracking Build Dependency Graph Dead-Code Removal Reorder + Speculation Register Renaming Prepare for State Reconstruction IA-32 Image Trace Selection IL Generation Scheduling and more optimization Encoding Code Interpreter Translated Code Manager Translated Code Caches Counters Basic Blocks Templates and fixups Use, edge, and misalignment counters Code Discovery and Traversal Analyze EFLAGS & FP Stack Initial (cold) Code Generation including instrumentation Translated Code Blocks Inserting Prefetches IA-32 Execution Layer - 31

IA-32EL System View BTGeneric Initialize Translator Check versions Translate Execute Reconstruct IA-32 state Allocate Memory Simulate IA-32 System Service Exception Load BTGeneric Check versions BTLib Delegate to Operating System Delegate to BTGeneric Run IA-32 Application Handle Exception OS OS OS OS Inserting Prefetches IA-32 Execution Layer - 32