CS577 Modern Language Processors, Spring 2018. Lecture: Interpreters
(c) 2006-2018 Andrew Tolmach, Portland State University


MAKING INTERPRETERS EFFICIENT

- VM programs have an explicitly specified binary representation, typically called bytecode.
- Most VMs can execute bytecode directly by interpretation.
- Interpretation is typically 1-2 orders of magnitude slower than compiled code (though this depends on the interpreter, the compiler, and the target machine), so serious VMs usually do JIT compilation too.
- Still, it is worthwhile to make interpreters efficient, and also desirable to keep them portable.

TARGET HARDWARE: FACTS OF LIFE

- Accessing memory is slow! Even an L1 cache hit typically costs several cycles; cache misses cost 10s-100s of cycles. Must try to keep data in registers if possible. Small changes in data or code layout can have big effects.
- Machines are deeply pipelined! Conditional branches are bad; dynamic-target branches are probably worse. Should try to utilize hardware support tricks (e.g. branch target buffers). Recent machines (last 5 years) may have significantly improved branch prediction.

COSTS OF INTERPRETATION

Interpreting an instruction requires:

- Dispatching the instruction: getting control to the code corresponding to the instruction.
- Accessing the operands: getting the values of the parameters and arguments (and storing the result).
- Actually performing the computation. (Note: the longer this takes, the smaller the percentage overhead of interpretation!)

NAIVE INTERPRETER: SAMPLE INSTRUCTIONS

    u4 stack[stacksize];
    u4 *sp;

    void interp(Method *method) {
      u1 *pc = method->code;
      while (1) {
        switch (*pc) {
          case ICONST_0: { *(++sp) = (u4) 0; pc++; break; }
          case ISTORE_0: { locals[0] = *(sp--); pc++; break; }
          case IADD: {
            int32 v2 = (int32) (*(sp--));
            int32 v1 = (int32) (*sp);
            *sp = (u4) (v1 + v2);
            pc++; break;
          }
          ...
        }
      }
    }

SPEEDING UP OPERAND ACCESS

First, let's consider just the cost of accessing stack elements: loads/stores to memory and sp adjustment.

C code:

    case ICONST_0: { *(++sp) = (u4) 0; pc++; break; }

X86 (32-bit) machine code (obtained using gcc -S):

    // initially -64(%ebp) = pc
    movl  _sp, %eax       // load value of sp from C global
    leal  4(%eax), %ebx   // new sp
    movl  %ebx, _sp       // save new sp value into C global
    movl  $0, 4(%eax)     // *(new sp) = 0
    incl  -64(%ebp)       // pc++
    jmp   top             // break

RISCV VERSION

RISCV-32 machine code (obtained using gcc -S):

    // initially s11 has %hi(sp), s9 has pc
    lw   a5,%lo(sp)(s11)   // load value of sp from C global
    addi s9,s9,1           // pc++
    addi a2,a5,4           // new sp
    sw   a2,%lo(sp)(s11)   // store new sp into C global
    sw   zero,4(a5)        // *(new sp) = 0
    j    top

RISCV features: load/store architecture, about 30 registers.

UTILIZING REGISTERS

Keeping sp in a global memory location looks like a terrible idea, since it requires one load and one store per bytecode executed. Let's make it a local of interp instead. Then the C compiler should be able to keep it in a register (if there are any available).

New RISCV machine code:

    // initially s9 = pc, s11 = sp
    sw   zero,4(s11)   // *(new sp) = 0
    addi s9,s9,1       // pc++
    addi s11,s11,4     // new sp = sp + 4
    j    top           // break
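For concreteness, here is a minimal C sketch of this change (mine, not the original slide's code); it reuses the types and locals array of the naive interpreter, and STACKSIZE / MAXLOCALS are placeholder constants:

    u4 stack[STACKSIZE];            /* placeholder bound */
    u4 locals[MAXLOCALS];           /* placeholder bound */

    void interp(Method *method) {
      u1 *pc = method->code;        /* already a local */
      u4 *sp = stack;               /* now a local too, so it can live in a register */
      while (1) {
        switch (*pc) {
          case ICONST_0: { *(++sp) = (u4) 0; pc++; break; }
          case ISTORE_0: { locals[0] = *(sp--); pc++; break; }
          /* ... other opcodes unchanged ... */
        }
      }
    }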

X86-32 VERSION

New X86-32 machine code:

    // initially -28(%ebp) = sp, -64(%ebp) = pc
    addl  $4, -28(%ebp)    // ++sp
    movl  -28(%ebp), %ecx  // load value of sp
    movl  $0, (%ecx)       // *sp = 0
    incl  -64(%ebp)        // pc++
    jmp   top

Now sp lives in the local stack frame rather than in a global, but there still isn't a free register for it.

MORE CODE WITH LOCAL sp

C code:

    case IADD: {
      int32 v2 = (int32) (*(sp--));
      int32 v1 = (int32) (*sp);
      *sp = (u4) (v1 + v2);
      pc++; break;
    }

RISCV code:

    // initially s11 = sp, s9 = pc
    lw   a5,-4(s11)    // ld *(sp-1)
    lw   a2,0(s11)     // ld *sp
    addi s11,s11,-4    // sp--
    addi s9,s9,1       // pc++
    add  a5,a5,a2      // add
    sw   a5,0(s11)     // store to *(new sp)
    j    top           // break

Next obvious problem is that nearly every instruction loads and/or stores stack entries.

STACK CACHING

Idea: what if we cache the top-of-stack in a local variable s0? (Assume that sp points to the top of the remainder of the stack.) This saves one load and one store for IADD:

    case IADD: {
      int32 v2 = (int32) s0;
      int32 v1 = (int32) (*(sp--));
      s0 = (u4) (v1 + v2);
      pc++; break;
    }

Approximate RISCV code:

    // initially s11 = sp, s9 = pc, x9 = slot0
    lw   a2,0(s11)     // ld *sp
    addi s11,s11,-4    // sp--
    addi s9,s9,1       // pc++
    add  x9,x9,a2      // slot0 += a2
    j    top           // break

CACHING ONE SLOT

But it is a wash for the other two instructions, because we have to keep s0 up-to-date.

    case ICONST_0: { *(++sp) = s0; s0 = (u4) 0; pc++; break; }

Approximate RISCV code (still one store):

    // initially s11 = sp, s9 = pc, x9 = slot0
    sw   x9,4(s11)     // *(new sp) = slot0
    addi s9,s9,1       // pc++
    addi s11,s11,4     // new sp = sp + 4
    mv   x9,zero       // slot0 = 0
    j    top           // break

CACHING ONE SLOT (CONTINUED)

    case ISTORE_0: { locals[0] = s0; s0 = *(sp--); pc++; break; }

Approximate RISCV code (still one load and one store):

    // initially s11 = sp, s9 = pc, x8 = base of locals, x9 = slot0
    sw   x9,0(x8)      // locals[0] = slot0
    addi s9,s9,1       // pc++
    lw   x9,0(s11)     // slot0 = *sp
    addi s11,s11,-4    // sp--
    j    top           // break

CACHING TWO SLOTS

What if we keep two elements in local variables (registers) named s1 (top of stack) and s0 (next-to-top of stack)?

    case ISTORE_0: { locals[0] = s1; s1 = s0; s0 = *(sp--); pc++; break; }
    case ICONST_0: { *(++sp) = s0; s0 = s1; s1 = (u4) 0; pc++; break; }
    case IADD: {
      int32 v2 = (int32) s1;
      int32 v1 = (int32) s0;
      s1 = (u4) (v1 + v2);
      s0 = *(sp--);
      pc++; break;
    }

This just pushes off the problem: no improvement in the number of loads and stores needed.

New idea: let's keep a different number of cached stack slots at different points during execution.

GENERALIZED STACK CACHING

The interpreter operates in one of several different states, corresponding to how many stack slots are cached. Each instruction (potentially) causes a transition to a different state, according to what it does to the stack.

For example: ICONST_0 moves to a state where more slots are cached; ISTORE_0 moves to one where fewer slots are cached; IADD moves to a state where one slot is cached.

GENERALIZED STACK CACHING (2)

For the JVM, 3 states are sufficient to handle all instruction types:

- State 0: no slots cached.
- State 1: top of stack is cached in variable s0.
- State 2: top of stack is cached in variable s1; next-to-top in s0.

In all states, sp points to the remainder of the stack beyond the cached slots. Sample code follows (in practice we may organize it differently)...

    case IADD: {
      switch (state) {
        case 0: { int32 v2 = (int32) (*(sp--));
                  int32 v1 = (int32) (*(sp--));
                  s0 = (u4) (v1+v2); state = 1; break; }
        case 1: { int32 v2 = (int32) s0;
                  int32 v1 = (int32) (*(sp--));
                  s0 = (u4) (v1+v2); state = 1; break; }
        case 2: { int32 v2 = (int32) s1;
                  int32 v1 = (int32) s0;
                  s0 = (u4) (v1+v2); state = 1; break; }
      }
      pc++; break;
    }
    case ICONST_0: {
      switch (state) {
        case 0: s0 = 0; state = 1; break;
        case 1: s1 = 0; state = 2; break;
        case 2: *(++sp) = s0; s0 = s1; s1 = 0; state = 2; break;
      }
      pc++; break;
    }
    case ISTORE_0: {
      switch (state) {
        case 0: locals[0] = *(sp--); state = 0; break;
        case 1: locals[0] = s0; state = 0; break;
        case 2: locals[0] = s1; state = 1; break;
      }
      pc++; break;
    }

EXAMPLE SEQUENCE

Consider a typical expression like b = a + 3, where we assume a is local variable 0 and b is local variable 1. (Assume we start with state = 0.)

    Bytecode    Corresponding executed code
    ILOAD_0     s0 = locals[0];  state = 1;
    ICONST_3    s1 = 3;          state = 2;
    IADD        s0 = s1 + s0;    state = 1;
    ISTORE_1    locals[1] = s0;  state = 0;

We do only the essential loads and stores: no stack traffic at all!
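As a concrete aside (not on the original slide), this whole sequence occupies only four bytes in the standard JVM encoding, since each of these stack instructions is a single opcode byte with no explicit operands:

    /* For reference: standard JVM opcode values for b = a + 3. */
    u1 code[] = {
      0x1A,   /* iload_0  */
      0x06,   /* iconst_3 */
      0x60,   /* iadd     */
      0x3C    /* istore_1 */
    };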

3-STATE TRANSITION DIAGRAM

More generally, instructions are classified by a pair: (# of stack slots they consume, # of stack slots they produce). For example:

    ISTORE_0   1,0
    ICONST_0   0,1
    IADD       2,1

[Figure: three-state transition diagram. States 0, 1, and 2 (number of cached slots); edges are labeled with (consumed, produced) pairs such as 0,0, 0,1, 1,0, 1,1, 2,0, 2,1, 2,2, and 3,0, showing the state reached after executing an instruction of each class from each starting state.]
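The transitions in the diagram can be summarized by a small rule. The helper below is my own sketch (not from the slides): it reproduces the transitions used in the sample code above, assuming at most two slots are ever cached and that excess produced slots are spilled to memory.

    /* Hypothetical sketch: next cache state for an instruction that consumes
       c slots and produces p slots, given the current state. */
    #define MAX_CACHED 2

    static int next_state(int state, int c, int p) {
      int have = state > c ? state : c;   /* slots needed beyond the cache come from memory */
      int next = have - c + p;
      return next > MAX_CACHED ? MAX_CACHED : next;   /* extra results are spilled */
    }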

DYNAMIC VS. STATIC CACHING

So far we've described dynamic stack caching, where the interpreter keeps track of its current state. In practice, we implement this by having three complete sets of instruction implementations and dispatching to the correct one based on current state as well as opcode (more on this later).

But it may seem like we should be able to predict the state at each program point statically (before execution). If so, we could simply have three variants of each opcode, and select the right one at compile time. This would be more efficient.

Only problem: at join points in the code, the state may differ depending on the path by which the join point was reached. Must choose a convention for which state to use there, and add compensation code to the other branches; this is complex in practice.
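One plausible way to organize the "three complete sets of instruction implementations" mentioned above (a sketch of mine, not the course's actual code) is to fold the state into the dispatch index, so a single switch (or, later, a single dispatch table) selects the right specialized snippet. This assumes opcodes fit in one byte, as in the JVM.

    #define OP(st, opcode) (((st) << 8) | (opcode))   /* combined dispatch index */

    switch (OP(state, *pc)) {
      case OP(0, IADD):   /* no slots cached: both operands come from memory */
        ...
      case OP(1, IADD):   /* top of stack cached in s0 */
        ...
      case OP(2, IADD):   /* top two slots cached in s1 and s0 */
        ...
      /* ... and so on for every other opcode ... */
    }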

ASIDE: WHY USE STACK-BASED VMs?

Nearly all hardware processors use registers: each HW instruction is parameterized by its argument/result registers. Why is this good for hardware? Because the opcode and the argument registers can be decoded in parallel, and values can quickly be fetched from a small, fast register file.

Why not try this in software machines too?

- Parameters must be fetched from the byte stream and decoded serially; for stack instructions, parameters are implicit.
- Instructions with parameters take more space.
- Software registers cannot easily be stored in hardware registers, because the latter can't be indexed. So software registers end up living in an in-memory array (just like stack slots).

On the other hand, register architectures require fewer instructions, hence less dispatch. So maybe a worthwhile idea after all...
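To make the contrast concrete, here is a hypothetical register-VM add (the opcode name and byte encoding are invented for illustration): the three register indices must be fetched from the byte stream one at a time, and the "registers" themselves live in an in-memory array.

    u4 regs[MAXREGS];   /* software registers live in memory, like stack slots */

    case RADD: {                   /* hypothetical encoding: opcode, rd, rs1, rs2 */
      u1 rd  = pc[1];              /* each operand byte fetched and decoded serially */
      u1 rs1 = pc[2];
      u1 rs2 = pc[3];
      regs[rd] = (u4) ((int32) regs[rs1] + (int32) regs[rs2]);
      pc += 4;                     /* 4-byte instruction vs. 1 byte for IADD */
      break;
    }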

SPEEDING UP INSTRUCTION DISPATCH

What does the RISCV code look like now?

    // s10 = table, s9 = pc
    top:
      lbu  a1,0(s9)           // *pc
      li   a5,184
      bgtu a1,a5,undefined    // if > 184 or < 0, branch to "undefined"
      slli a5,a1,2            // scale *4
      add  a5,s10,a5          // add base of table
      lw   a5,0(a5)           // get snippet address
      jr   a5                 // jump to snippet

    table:
      .word nop_snippet
      .word aconst_null_snippet
      .word iconst_m1_snippet
      .word iconst_0_snippet
      ...
      .word goto_w_snippet

    undefined:
      ...issue error and die...

THREADED CODE

Obvious performance problems:

- Unnecessary bounds check.
- Two jumps per dispatch (counting the one back to top at the end of the previous instruction).

First fix: (Indirect) Threaded Code. If we can code our own indirect jumps, we could:

- Remove the bounds check.
- Replicate the dispatch at the end of every snippet, thus removing one jump.

This is not possible in ANSI Standard C, but can be done in gcc using the && (label-as-value) operator.

INDIRECT THREADED CODE

    void interp(Method *method) {
      static void *dispatch_table[] =
        { &&NOP, &&ACONST_NULL, &&ICONST_M1, ..., &&JSR_W };
      u1 *pc = method->code;
      ...
      goto *(dispatch_table[*pc]);

    NOP:
      pc++;
      goto *(dispatch_table[*pc]);

    ACONST_NULL:
      *(++sp) = (u4) 0;
      pc++;
      goto *(dispatch_table[*pc]);
      ...
    }

USING PROCESSOR BRANCH SUPPORT

One extra reason why indirect threaded code improves performance may be that it makes better use of hardware support for branch prediction.

Many pipelined processors contain a branch target buffer (BTB) that dynamically remembers the last target of each branch instruction (including conditional and indirect branches). The next time the branch instruction is executed, the processor pre-fetches from the address predicted by the buffer.

A naive interpreter makes terrible use of this feature, because a single instruction dispatches to all the snippets, so prediction accuracy is 0. The indirect threaded code version does somewhat better, because the dispatches are distributed, and certain bytecode instruction sequences are quite common, so prediction accuracy may be > 0. But a fundamental prediction mismatch between the VM and the target hardware remains.

DIRECT THREADING

Each instruction dispatch still requires two fetches: one to get the bytecode and a second to get the snippet address.

New idea: what if we represent each instruction opcode by the address of its snippet?

    void interp() {
      void *codeaddrs[] = ...;     /* fill this with snippet addrs */
      void **pc = codeaddrs;       /* initialize to start */
      goto **pc;

    ACONST_NULL:
      *(++sp) = (u4) 0;
      pc++;
      goto **pc;
      ...
    }

Now we need only one fetch per instruction!

REWRITING BYTE CODE

But notice that we're no longer interpreting the original bytecode any more. We must rewrite it before execution. Simple in principle, but there are details: e.g. what should we do with the parameter bytes following the opcode?

If we're going to rewrite the bytecode, there are many opportunities to improve things, e.g.:

- Combine code for similar opcodes (e.g. constant loading).
- Short-circuit constant pool references (important in the full language).
- Perform static stack caching.
- Etc., etc.

A more radical rewrite idea: dispatch to each snippet using a subroutine call instruction. This may pay off on processors that pre-fetch from the return address on the hardware stack!
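As a rough illustration (my sketch, not the course's implementation), a rewriting pass for direct threading might walk the bytecode once, replacing each opcode byte with its snippet address and widening any parameter bytes to full words so pc can advance uniformly. The names method->code_length, param_bytes, MAXCODE, and the uintptr_t cast (from <stdint.h>) are assumptions for the sketch; dispatch_table is as in the indirect-threaded interpreter.

    void *threaded[MAXCODE];   /* rewritten program: one word per opcode or parameter */

    void rewrite(Method *method) {
      u1 *in  = method->code;
      u1 *end = method->code + method->code_length;   /* assumed length field */
      void **out = threaded;
      while (in < end) {
        u1 op = *in++;
        *out++ = dispatch_table[op];               /* opcode byte -> snippet address */
        for (int i = 0; i < param_bytes[op]; i++)  /* assumed per-opcode operand count */
          *out++ = (void *) (uintptr_t) *in++;     /* widen each parameter byte to a word */
      }
    }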

REDUCING DISPATCHES

Another way to reduce dispatch time is to do fewer dispatches. One basic approach is to combine sequences of instructions that occur frequently into "macro" or "super" instructions. For example, the following sequence pattern is very common:

    ILOAD n
    ICONST i
    IADD
    ISTORE n

In fact, the JVM designers already invented a combined instruction for this (IINC), but the same idea works for other sequences.

Another approach is to use a register architecture, which typically requires many fewer instructions (although each instruction gets more parameters).
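For illustration, a snippet for a superinstruction covering this pattern might look as follows (the opcode name and its byte encoding are invented; the real JVM's IINC has a similar effect). The whole sequence now costs a single dispatch.

    case ILOAD_ICONST_IADD_ISTORE: {     /* hypothetical combined opcode */
      u1 n    = pc[1];                   /* local variable index */
      int32 i = (signed char) pc[2];     /* small signed constant */
      locals[n] = (u4) ((int32) locals[n] + i);
      pc += 3;                           /* one dispatch instead of four */
      break;
    }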

BUILDING COMBINED INSTRUCTIONS

This can be done in several ways:

Statically, for multiple programs:
- Essentially a refinement of the VM definition, possibly tuned to the workload of a particular set of programs.
- Can construct such specialized VMs semi-automatically from a generic VM.
- The specialized VM can be compiled with cross-snippet optimization.

Statically, for a single program:
- The encoding is sent with the program.
- Static encodings also have the benefit of reducing program size, allowing quicker transmission.

Dynamically, by building superinstructions on the fly from snippet code:
- This is beginning to resemble a compiler!