Where we are. Instruction selection. Abstract Assembly. CS 4120 Introduction to Compilers

CS 4120 Introduction to Compilers
Andrew Myers, Cornell University
Lecture 8: Instruction Selection, 5 Oct 20

Where we are
  intermediate code                 (produced by syntax-directed translation)
  -> canonical intermediate code    (reordering with traces)
  -> abstract assembly code         (tiling, dynamic programming)
  -> assembly code                  (register allocation)

Abstract Assembly
Abstract assembly = assembly code with an infinite register set.
Canonical intermediate code is already abstract assembly code, except for expression trees:
  MOVE(e1, e2)          ->  mov e1, e2
  JUMP(e)               ->  jmp e
  CJUMP(e1 OP e2, l)    ->  cmp e1, e2; [jne|je|jg|...] l
  CALL(e, e1, ...)      ->  push e1; ...; call e
  LABEL(l)              ->  l:

Instruction selection
Conversion to abstract assembly is the problem of instruction selection for a single IR statement node.
The full abstract assembly code is obtained by gluing together the translated instructions from each of the statements.
Problem: there is more than one way to translate a given statement. How do we choose?
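
To make the expression forms above concrete, here is a minimal sketch of an IR expression hierarchy, written to match the IR_Expr/IR_Plus names that appear in the munch() code later in this lecture; the fields and constructors are assumptions, and the statement forms (MOVE, JUMP, CJUMP, CALL, LABEL) would form a parallel IR_Stmt hierarchy.

    // Sketch of canonical IR expression trees. Class names follow the
    // IR_Expr/IR_Plus naming used later in the lecture; the fields and
    // constructors are illustrative assumptions.
    abstract class IR_Expr { }

    class IR_Temp extends IR_Expr {              // TEMP(t): a register/temporary
        String name;
        IR_Temp(String name) { this.name = name; }
    }

    class IR_Const extends IR_Expr {             // CONST(i): an integer constant
        long value;
        IR_Const(long value) { this.value = value; }
    }

    class IR_Mem extends IR_Expr {               // MEM(e): contents of memory at address e
        IR_Expr addr;
        IR_Mem(IR_Expr addr) { this.addr = addr; }
    }

    class IR_Plus extends IR_Expr {              // e1 + e2
        IR_Expr lhs, rhs;
        IR_Plus(IR_Expr lhs, IR_Expr rhs) { this.lhs = lhs; this.rhs = rhs; }
    }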

Example
MOVE(TEMP(t1), TEMP(t1) + MEM(TEMP(fp) + 4))
Which translation should we pick?
  mov %rbp, t2
  add $4, t2
  mov (t2), t3
  add t3, t1
or, as a single instruction:
  add 4(%rbp), t1

x86-64 ISA
We need to map IR trees to actual machine instructions, so we need to know how the instructions work.
x86-64 is a two-address CISC architecture (inherited from the 4004, 8008, 8086, ...).
A typical instruction has:
  an opcode: mov, add, sub, shl, shr, mul, div, jmp, jcc, push, pop, test, enter, leave, etc.
  a destination (which may also be an operand): r, n, (r), k(r), (r1,r2), (r1,r2,w), k(r1,r2,w)
  a source: any legal destination, or a constant $k
Examples (opcode src, src/dest):
  mov $1, %rax
  add %rcx, %rbx
  sub %rbp, %rsi
  add %edi, (%rcx,%rdi,8)
  je label
  jmp 4(%rbp)

AT&T vs Intel syntax
Intel syntax: opcode dest, src; registers rax, rbx, rcx, ..., r8, r9, ..., r15; constants k; memory operands [n], [r+k], [r1+w*r2], ...
AT&T syntax (gnu assembler default): opcode src, dest; registers %rax, %rbx, ...; constants $k; memory operands n, k(r1), (r1,r2,w), ...

Tiling
Idea: each x86-64 instruction performs the computation for a piece of the IR tree: a tile.
For the example above:
  mov %rbp, t2
  add $4, t2
  mov (t2), t3
  add t3, t1
Tiles are connected by new temporary registers (t2, t3) that hold the result of each tile.
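
The idea of a tile can be made concrete with a small data-structure sketch, using the IR_Expr classes sketched above. This is a hypothetical representation, not the lecture's actual classes, which instead hard-code tiles inside munch() methods (shown later).

    import java.util.List;

    // Hypothetical representation of a tile: it knows which IR subtrees it
    // covers, how many nodes it covers, and what instructions it emits.
    interface Tile {
        boolean matches(IR_Expr e);                 // does this tile cover the root of e?
        int size();                                 // number of IR nodes the tile covers
        String emit(IR_Expr e, List<String> code);  // append instructions, return result temp
    }

    // Example: the subtree  TEMP(t) + MEM(TEMP(fp) + CONST(k))  covered by the
    // single instruction  add k(%rbp), t  (AT&T syntax, as on the Example slide).
    class AddFrameSlotTile implements Tile {
        public boolean matches(IR_Expr e) {
            if (!(e instanceof IR_Plus)) return false;
            IR_Plus p = (IR_Plus) e;
            if (!(p.lhs instanceof IR_Temp) || !(p.rhs instanceof IR_Mem)) return false;
            IR_Expr addr = ((IR_Mem) p.rhs).addr;
            return addr instanceof IR_Plus
                && ((IR_Plus) addr).lhs instanceof IR_Temp    // the frame-pointer temp
                && ((IR_Plus) addr).rhs instanceof IR_Const;  // the constant offset
        }
        public int size() { return 6; }  // +, TEMP, MEM, +, TEMP(fp), CONST
        public String emit(IR_Expr e, List<String> code) {
            IR_Plus p = (IR_Plus) e;
            String t = ((IR_Temp) p.lhs).name;
            long k = ((IR_Const) ((IR_Plus) ((IR_Mem) p.rhs).addr).rhs).value;
            code.add("add " + k + "(%rbp), " + t);  // accumulate into t, as in the one-instruction tiling above
            return t;
        }
    }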

Some tiles
  MOVE(TEMP(t1), e2)             ->  mov t2, t1        (t2 holds the value of e2)
  t1 + t2                        ->  mov t1, tf
                                     add t2, tf        (tf a fresh temporary)
  MOVE(MEM(t1 + t2), CONST(i))   ->  mov $i, (t1,t2)

Problem
How do we pick tiles that cover the IR statement tree with minimum execution time?
We need a good selection of tiles:
  small tiles, to make sure we can tile every tree
  large tiles, for efficiency
Usually we want to pick large tiles: fewer instructions.
But instructions are not the same as cycles: RISC core instructions take 1 cycle; other instructions may take more.
  add %rax, 4(%rcx)    is equivalent to    mov 4(%rcx), %rdx
                                           add %rax, %rdx
                                           mov %rdx, 4(%rcx)

More handy tiles lea instruction computes a memory address but doesn t actually load from memory t 2 MUL CONST(k ) lea (,t 2 ),t f lea (,t 2,k ),t f t 2 (t f a fresh temporary) (k one of 2,4,8) CS 420 Introduction to Compilers 3 Greedy tiling Assume larger tiles = better Greedy algorithm: start from top of tree and use largest tile that matches tree Tile remaining subtrees recursively 4 FP 8 MUL 4 FP 2 CS 420 Introduction to Compilers 4 How good is it? Very rough approximation on modern pipelined architectures: execution time is number of tiles Greedy tiling (Appel: maximal munch ) finds an optimal but not necessarily optimum tiling: cannot combine two tiles into a lower-cost tile We can find the optimum tiling using dynamic programming! CS 420 Introduction to Compilers 5 Instruction Selection Current step: converting canonical intermediate code into abstract assembly implement each IR statement with a sequence of one or more assembly instructions sub-trees of IR statement are broken into tiles associated with one or more assembly instructions CS 420 Introduction to Compilers 6

Tiles Another example mov, add, imm8 Tiles capture compiler s understanding of instruction set Each tile: sequence of instructions that update a fresh temporary (may need extra mov s) and associated IR tree All outgoing edges are temporaries x = x ; CS 420 Introduction to Compilers 7 CS 420 Introduction to Compilers 8 x = x ; Example ebp: Pentium frame pointer register Alternate (non-risc) tiling x = x ; mov, [ebpx] mov, add, mov [ebpx], add [ebpx], CONST(k) CS 420 Introduction to Compilers 9 CS 420 Introduction to Compilers 20

expression tiles mov, add, mov, add, imm32 t3 CONST(i) statement tiles Intel Architecture Manual, Vol 2, 3-7: add eax, imm32 add, imm32 add, imm8 add, r32 add r32, r32 CONST(k) CS 420 Introduction to Compilers 2 CS 420 Introduction to Compilers 22 Designing tiles Only add tiles that are useful to compiler Many instructions will be too hard to use effectively or will offer no advantage Need tiles for all single-node trees to guarantee that every tree can be tiled, e.g. More handy tiles lea instruction computes a memory address but doesn t actually load from memory t f lea t f, [ t 2 ] (t f a fresh temporary) mov, add, t3 t3 t 2 t f t 2 * lea t f, [ k *t 2 ] CONST(k ) (k one of 2,4,8,6) CS 420 Introduction to Compilers 23 CS 420 Introduction to Compilers 24

Matching for RISC As defined in lecture, have (cond, destination) Appel: (op, e, e2, destination) where op is one of ==,!=, <, <=, =>, > Our translates easily to RISC ISAs that have explicit comparison result < t3 NAME(n) cmplt, t3, br MIPS, n CS 420 Introduction to Compilers 25 Condition code ISA Appel s corresponds more directly to Pentium conditional jumps < NAME(n) cmp, jl n set condition codes test condition codes However, can handle Pentium-style jumps with lecture IR with appropriate tiles CS 420 Introduction to Compilers 26 Branches How to tile a conditional jump? Fold comparison operator into tile EQ t 2 l (l2) l (l2) test jnz l cmp, t 2 je l CS 420 Introduction to Compilers 27 Fixed-register instructions mul Sets eax to low 32 bits of eax * operand, edx to high 32 bits jecxz label Jump to label if ecx is zero add eax, Add to eax No fixed registers in IR except TEMP(FP)! CS 420 Introduction to Compilers 28

Strategies for fixed regs Use extra mov s and temporaries mov eax, mul t3 mov, eax Don t use instruction (jecxz) Let assembler figure out when to use (add eax, ), bias register allocator * t3 CS 420 Introduction to Compilers 29 Implementation Maximal Munch: start from statement node Find largest tile covering top node and matching all children Invoke recursively on all children of tile Generate code for this tile (code for children will have been generated already in recursive calls) How to find matching tiles? CS 420 Introduction to Compilers 30 Implementing tiles Explicitly building every tile: tedious Easier to write subroutines for matching Pentium source, destination operands Reuse matcher for all opcodes dest source source TEMP(t) source = CONST(i) CONST(i) CS 420 Introduction to Compilers 3 Matching tiles abstract class IR_Stmt { Assembly munch(); class IR_Move extends IR_Stmt { IR_Expr src, dst; Assembly munch() { if (src instanceof IR_Plus && ((IR_Plus)src).lhs.equals(dst) && is_regmem32(dst) { Assembly e = (IR_Plus)src).rhs.munch(); return e.append(new AddIns(dst, e.target())); else if... CS 420 Introduction to Compilers 32

Tile Specifications Improving instruction selection Previous approach simple, efficient, but hard-codes tiles and their priorities Another option: explicitly create data structures representing each tile in instruction set Tiling performed by a generic tree-matching and code generation procedure Can generate from instruction set description generic back end! For RISC instruction sets, over-engineering CS 420 Introduction to Compilers 33 Greedy tiling may not generate best code Always selects largest tile, not necessarily fastest instruction May pull nodes up into tiles when better to leave below Can do better using dynamic programming algorithm CS 420 Introduction to Compilers 34 Timing model Finding optimum tiling Idea: associate cost with each tile (proportional to # cycles to execute) caveat: cost is fictional on modern architectures Estimate of total execution time is sum of costs of all tiles Total cost: 5 2 FP CS 420 Introduction to Compilers 35 x FP x 2 Goal: find minimum total cost tiling of tree Algorithm: for every node, find minimum total cost tiling of that node and sub-tree. Lemma: once minimum cost tiling of all children of a node is known, can find minimum cost tiling of the node by trying out all possible tiles matching the node Therefore: start from leaves, work upward to top node CS 420 Introduction to Compilers 36

Dynamic programming: a[i] FP * CONST(4) CONST(a) mov, [bp a] mov, [bp i] mov t3, [ 4*] FP CONST(i) Recursive implementation Any dynamic programming algorithm equivalent to a memoized version of same algorithm that runs top-down For each node, record best tile for node Start at top, recurse: First, check in table for best tile for this node If not computed, try each matching tile to see which one has lowest cost Store lowest-cost tile in table and return Finally, use entries in table to emit code CS 420 Introduction to Compilers 37 CS 420 Introduction to Compilers 38 Greedy # Memoization class IR_Move extends IR_Stmt { IR_Expr src, dst; Assembly best; // initialized to null int opttilecost() { if (best!= null) return best.cost(); if (src instanceof IR_Plus && ((IR_Plus)src).lhs.equals(dst) && is_regmem32(dst)) { int src_cost = ((IR_Plus)src).rhs.optTileCost(); int cost = src_cost CISC COST; if (cost < best.cost()) best = new AddIns(dst, e.target); consider all other tiles return best.cost(); A small tweak to greedy algorithm! CS 420 Introduction to Compilers 39 Problems with model Modern processors: execution time not sum of tile times instruction order matters Processors is pipelining instructions and executing different pieces of instructions in parallel bad ordering (e.g. too many memory operations in sequence) stalls processor pipeline processor can execute some instructions in parallel (super-scalar) cost is merely an approximation instruction scheduling needed CS 420 Introduction to Compilers 40

Summary Can specify code generation process as a set of tiles that relate IR trees to instruction sequences Instructions using fixed registers problematic but can be handled using extra temporaries Greedy algorithm implemented simply as recursive traversal Dynamic programming algorithm generates better code, also can be implemented recursively using memoization Real optimization will require instruction scheduling CS 420 Introduction to Compilers 4