An Assembler for the MSSP Distiller
Eric Zimmerman
University of Illinois, Urbana-Champaign


Abstract

It is important to have a means of manually testing a potential optimization before laboring to fully implement it in the MSSP distiller. To achieve this, we have written a tool to generate disassembled output from a distilled program. This allows the compiler architect to test potential enhancements by making changes at the assembly code level. The tool then reassembles the modified code into the distiller's intermediate representation. Finally, the rebuilt instruction stream is run on the MSSP simulator, where the performance impact of the optimization can be gauged.

Introduction

The MSSP distiller is an optimizing compiler that generates approximate code through profile-guided transformations. Approximate code is run on a high performance master processor in an MSSP configuration [Zilles 2002]. Because the distiller operates on a program binary without access to its source code, a program's behavior must be determined empirically from the execution profile before good optimizations can be made. While some optimizations like efficient code layout have obvious performance benefits, the potential benefit of other enhancements is not as clear.

After the distilled code is generated, it is difficult to modify. Instruction opcodes, register references, and branch offsets are fixed and encoded in the 32-bit instruction words used by the MSSP simulation infrastructure. We have written a tool to simplify the process of modifying distilled code by allowing changes to be made at the assembly level. This will allow quicker prototyping and evaluation of optimization strategies in the distiller.

Our MSSP simulation infrastructure implements the HP Alpha Instruction Set Architecture (ISA). The Alpha is a 64-bit RISC architecture designed for high-speed implementations.
The instructions are very simple, with all memory operations in the form of loads and stores, and all data manipulation done between registers [Witek, et al. 1998].

In many ways this tool is like a typical assembler targeting an Alpha processor, although there are some key distinctions. An important property of MSSP programs is that they exist in two versions: original and distilled. Various mappings are required to compare distilled execution on the master processor to original execution on the slaves; special care is required to preserve the mapping between these two versions.

Assembly Format

The assembly format is similar to that of traditional Alpha assembly code. Several annotations are added to the code to preserve the internal attributes of the distilled program. For example, to represent the mapping between distilled program addresses and original program addresses, each distilled block has the corresponding program counter (PC) of the original program as part of its header label. It is also important to preserve information about which basic blocks serve as speculative entries, function entries, path stubs, task boundaries, and verification checkpoints.

This information is written in the header label tag and the hexadecimal flag field of the block header. A sample function listing is shown in Figure 1.

Figure 1. Example of assembly format

    func_entry_0x120012ca0: (flags: 0x0)
    entry_block_0x120012ca0: (flags: 0x3)
    fork_0x120012ca0: (flags: 0x40)
        fork 0,verify_0x120012ca0
    spec_entry_0x120012ca0: (flags: 0x808810)
        ldq r27,16(r10)
        lda r4,4108(r4)
        cmplt r4,r0,r17
        bne r17,fork_0x120012ca0
    block_0x120012cdc: (flags: 0x26)
        jsr r26,(r27)
        Callee: func_entry_0x120030e00
        Callee: func_entry_0x120031260
        Callee: func_entry_0x1200327b0
    block_0x120012cf4: (flags: 0x400000)
        master_end
    verify_0x120012ca0: (flags: 0x80)
        verify

A. Block properties are represented in a 32-bit hexadecimal flag field. Block header names serve as a more readable indicator of the block type (e.g. entry_block). Also, the address of each block in the original program is included in its label.

B. The target of the fork instruction is the verify block at the end of the listing. Control flow references are resolved during the assembly and code layout steps.

C. Empty blocks are skipped during code layout, so their block flags are "pushed" to the next non-empty block. In this instance, the spec_entry block appears to hold speculative entry code, when in fact it is an empty predecessor that has forwarded its block flags to a normal block. The user can add instructions to a speculative entry block by creating a block outside the control flow path. Flags must be set for speculative entry, and the block must be terminated with an unconditional branch to the successor.

D. Indirect jumps and calls (here, the jsr instruction) are annotated with a list of potential targets in order to restore control flow edges in the IR.

Most instructions are disassembled using the md_print_insn function from the SimpleScalar supporting library [Burger and Austin 1997]. Branch target offsets are replaced with their corresponding block label, leaving the assembler to perform the work of recalculating instruction offsets.
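
The block headers in Figure 1 follow a regular pattern, so an assembler front end could recover the block kind, the original program address, and the flag word from each header line. The following C++ fragment is a minimal sketch of that idea, not the tool's actual parser: the BlockHeader struct, the parse_block_header function, and the regular expression are assumptions based only on the sample listing.

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// Hypothetical record for one block header line; field names are illustrative.
struct BlockHeader {
    std::string kind;       // label prefix, e.g. "spec_entry" or "block"
    std::uint64_t orig_pc;  // address of the block in the original program
    std::uint32_t flags;    // 32-bit block property flags
};

// Parses a header such as "spec_entry_0x120012ca0: (flags: 0x808810)".
// Returns false (leaving 'out' untouched) when the line is not a block header.
bool parse_block_header(const std::string& line, BlockHeader& out) {
    // Assumed shape: <kind>_0x<hex pc>: (flags: 0x<hex flags>)
    static const std::regex pattern(
        R"re(\s*([A-Za-z_]+)_0x([0-9a-fA-F]{1,16}):\s*\(flags:\s*0x([0-9a-fA-F]{1,8})\)\s*)re");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return false;
    }
    assert(m.size() == 4);  // whole match plus three capture groups
    out.kind = m[1].str();
    out.orig_pc = std::stoull(m[2].str(), nullptr, 16);
    out.flags = static_cast<std::uint32_t>(std::stoul(m[3].str(), nullptr, 16));
    return true;
}
```

Applied to the header "spec_entry_0x120012ca0: (flags: 0x808810)", this sketch yields the kind spec_entry, the original PC 0x120012ca0, and the flags 0x808810. Instruction lines such as "ldq r27,16(r10)" do not match and are left for the instruction parser.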

A special set of control instructions is the group of indirect jumps and calls. Since the target offset is based on the contents of a register, the jump destination is not clear from the assembly itself. Instead the distiller performs an analysis to generate a list of possible jump destinations. During disassembly, we iterate through this list, placing each target block label on a line below the indirect jump or call instruction.

Reading in Assembly

Once the desired changes are made to the disassembled code, the assembler must read in the file and reconstruct the distiller's internal representation (IR) of the modified trace. This procedure is outlined in Figure 2.

Figure 2. Reconstructing IR from assembly file.

    For each function in assembly file:
        For each basic block in function:
            Build a new block with correct flags
            Add block to parent function
            Encode instructions for block according to machine definition
            If needed, add a fall-through edge to succeeding block
            If last instruction has a control flow target, record target label to be resolved later
    Match control flow references with target blocks, adding CFG edges
    For each function read in:
        Perform code layout on internal representation

The Alpha instructions are defined and implemented in the MSSP simulator's machine definition files. In order to parse the disassembled instructions, we used this information to record each instruction name, opcode, assembly format string, and input/output register set in a structure. When reading in the assembly file, the scanner searches this structure for the current instruction name. After locating the name, the assembler can find the format of the current instruction to properly encode the input and output register and immediate parameters.

The last important role of the assembler is to order the basic blocks into the correct control flow arrangement.
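
Figure 2 records control flow targets as labels and resolves them only after every block has been read. A minimal C++ sketch of that deferred resolution follows, assuming a label-indexed std::map of defined blocks and a std::multimap of pending references. The Block struct and the LabelResolver class are hypothetical stand-ins, not the distiller's own classes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the distiller's basic block class.
struct Block {
    std::string label;
    std::vector<Block*> successors;  // outgoing CFG edges
};

class LabelResolver {
public:
    // Called once per block as it is read from the assembly file.
    void define(Block* block) {
        assert(block != nullptr);
        targets_[block->label] = block;
    }

    // Called when 'source' ends in a control instruction naming 'target_label'.
    void reference(Block* source, const std::string& target_label) {
        assert(source != nullptr);
        pending_.insert(std::make_pair(target_label, source));
    }

    // Walks the map of target blocks and adds one edge per pending source that
    // names that target. Returns labels that were referenced but never defined.
    std::vector<std::string> resolve() {
        for (auto& target : targets_) {
            auto range = pending_.equal_range(target.first);
            for (auto it = range.first; it != range.second; ++it) {
                it->second->successors.push_back(target.second);
            }
        }
        std::vector<std::string> unresolved;
        for (auto it = pending_.begin(); it != pending_.end();
             it = pending_.upper_bound(it->first)) {
            if (targets_.count(it->first) == 0) {
                unresolved.push_back(it->first);
            }
        }
        pending_.clear();
        return unresolved;
    }

private:
    std::map<std::string, Block*> targets_;       // label -> defined block
    std::multimap<std::string, Block*> pending_;  // target label -> source blocks
};
```

A multimap is used for the pending references because several source blocks may branch to the same label. Returning the unresolved labels is a design choice of this sketch that lets a caller report typos in hand-edited assembly.
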
Because the assembly code has each branch and function call offset replaced with a string referencing its target, the assembler must provide a way to resolve each target label reference to its block object in the internal representation. This is done by storing for each basic block a <string, basic block> pair that associates the target block's text label with its C++ class object. These pairs are stored in an STL map indexed by the target label string. Each block that terminates with a control instruction also encodes a <string, basic block> pair associating the source block object with the text label of the target block. The source block pairs are stored in an STL multimap since there can be multiple source edges to a given target. Branch edges are resolved by iterating through the map of target blocks and adding an edge for each source block in the source multimap that references that target. Call edges are inserted in a similar manner.

Most basic blocks need a fall-through control flow edge to the succeeding block. Exceptions to this include blocks ending in jump, tail call, master end, and verify instructions. Unconditional branches are treated specially, since they are inserted only when the code layout routine cannot place the fall-through block immediately after its predecessor. Upon encountering an unconditional branch instruction, the assembler does not insert the instruction in the basic block, but rather adds a control flow edge to its target. Inserting the branch instruction is left to the code layout phase.

After basic blocks have been created, instructions encoded, and control flow set, the code layout phase can write the new distilled trace to memory. Finally, the MSSP simulator executes the trace and reports the performance statistics. To demonstrate the integrity of the reconstructed trace, we tested reassembled code and obtained equivalent results when tasks were not modified.

Experimental Results

To illustrate a situation where the assembler is useful, we identified a candidate function sub_penal from the SPEC CPU2000 twolf benchmark (Figure 3). This function has a loop that takes the same inner branch on all but the first and last iteration. Because of the inner branch's unique structure, we were able to "peel" the initial branch above the loop and the final branch below the loop by modifying the disassembled code of the distilled trace. To stay consistent with the original code, we replicated the fork instruction at the head of each peeled iteration. We took advantage of this peeled loop structure by inserting a check in the spec_entry block to branch to the current iteration's code. For example, this avoids needlessly repeating loop iterations when a task misspeculation occurs on the final branch.
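
The peeling transformation was applied to the disassembled distilled code, but its effect is easier to see at the source level. The sketch below restates the loop of sub_penal with the first and last iterations peeled. The Bin struct, the function names, and passing binwidth as a parameter are simplifications made for this illustration, and the peeled form assumes LoBin < HiBin (the equal case is handled separately in the original function).

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for twolf's bin record; only the fields used here.
struct Bin {
    int nupenalty;
    int left;
    int right;
};

// The loop as written in the benchmark: the inner branch is re-tested on
// every iteration even though it only changes on the first and last one.
void sub_penal_loop(std::vector<Bin>& bins, int startx, int endx,
                    int lo, int hi, int binwidth) {
    for (int bin = lo; bin <= hi; ++bin) {
        Bin& b = bins[bin];
        if (bin == lo) {
            b.nupenalty -= b.right - startx;
        } else if (bin == hi) {
            b.nupenalty -= endx - b.left;
        } else {
            b.nupenalty -= binwidth;
        }
    }
}

// Peeled form: the first iteration is hoisted above the loop and the last
// one is sunk below it, so the loop body no longer contains the inner branch.
void sub_penal_peeled(std::vector<Bin>& bins, int startx, int endx,
                      int lo, int hi, int binwidth) {
    assert(lo < hi);  // precondition of this sketch
    bins[lo].nupenalty -= bins[lo].right - startx;  // peeled first iteration
    for (int bin = lo + 1; bin < hi; ++bin) {
        bins[bin].nupenalty -= binwidth;            // branch-free middle
    }
    bins[hi].nupenalty -= endx - bins[hi].left;     // peeled last iteration
}
```

Both functions leave the bins in the same state. The peeled version simply makes the three cases separate straight-line regions, which is what allows a speculative entry check to jump directly to the region for the current iteration.
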
Another performance bottleneck we identified was that every iteration applied a pair of absolute value operations. For each absolute value, the compiler inserted conditional branches to determine the sign of the operand. Our profile data show these branches suffer frequent mispredictions. We modified the disassembled loop to replace the absolute value computations with conditional move instructions.

Finally, we assembled these changes and tested them in the simulator. In our test run of one hundred million instructions, sub_penal accounted for only 4.3% of the dynamic function calls; however, the performance impact of these changes was a 1.4% speedup overall.

When used with good profile data, the assembler can be used to quickly evaluate the effectiveness of a performance hypothesis. This will be useful as the distiller is extended to handle more aggressive optimizations for approximate code.
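
The absolute value rewrite mentioned above can be illustrated in C++. The first function below uses the branching form; the second computes the negated value unconditionally and then selects between the two, which is the shape that maps onto a conditional move. The function names and the Alpha instruction pair shown in the comment are assumptions for illustration: the text does not list the exact instructions used, and a C++ compiler is free to emit either form.

```cpp
#include <cassert>
#include <cstdint>

// Branching form: the sign test becomes a conditional branch, which costs a
// pipeline flush whenever the predictor guesses the sign wrong.
std::int64_t abs_branch(std::int64_t x) {
    if (x < 0) {
        x = -x;
    }
    return x;
}

// Select form: both candidate results exist before the choice is made, so
// no control flow depends on the sign. On Alpha this could correspond to
// (an assumed pairing, not taken from the paper):
//     subq   r31, rX, rT    ; rT = 0 - x   (r31 always reads as zero)
//     cmovlt rX,  rT, rX    ; if x < 0 then x = rT
std::int64_t abs_select(std::int64_t x) {
    assert(x != INT64_MIN);  // negating the minimum value would overflow
    const std::int64_t negated = -x;
    return (x < 0) ? negated : x;
}
```

The select form always executes the negation, so it trades a small amount of extra work for the removal of a hard-to-predict branch. That trade is only worthwhile when profile data show the branch mispredicts often.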

Figure 3. C code listing of function sub_penal from SPEC CPU2000 twolf benchmark.

    sub_penal( startx, endx, block, LoBin, HiBin )
    int startx, endx, block, LoBin, HiBin ;
    {
        BINPTR bptr ;
        int bin ;

        if( LoBin == HiBin ) {
            bptr = binptr[block][LoBin] ;
            bptr->nupenalty -= endx - startx ;
        } else {
            for( bin = LoBin ; bin <= HiBin ; bin++ ) {
                bptr = binptr[block][bin] ;
                if( bin == LoBin ) {
                    bptr->nupenalty -= bptr->right - startx ;
                } else if( bin == HiBin ) {
                    bptr->nupenalty -= endx - bptr->left ;
                } else {
                    bptr->nupenalty -= binwidth ;
                }
            }
        }
    }

References

Burger, D. C. and T. M. Austin. "The SimpleScalar Tool Set, Version 2.0." Technical Report CS-TR-97-1342, University of Wisconsin-Madison, June 1997.

Standard Performance Evaluation Corporation. SPEC benchmarks. http://www.spec.org.

Witek, Richard T. and Alpha Architecture Committee. Alpha Architecture Reference Manual. 3rd edition. Maynard: Digital Press, 1998.

Zilles, Craig. "Master/Slave Speculative Parallelization and Approximate Code." Ph.D. dissertation, Computer Sciences Department, University of Wisconsin-Madison, Aug. 2002.