The ILOC Virtual Machine (Lab 1 Background Material) Comp 412

COMP 412, FALL 2018

The ILOC Virtual Machine (Lab 1 Background Material)
Comp 412

[Figure: source code -> Front End -> IR -> Optimizer -> IR -> Back End -> target code]

Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

What is the execution model for an ILOC program?

ILOC is the assembly language of a simple, idealized RISC processor. (RISC: Reduced Instruction Set Processor.)

The ILOC Virtual Machine
- Separate code memory and data memory, sometimes called a Harvard architecture
- Sizes of data memory & register set are configurable; code memory is large enough to hold your program
- Simple, in-order execution model

The ILOC Instruction Set
- Arithmetic operations work on values held in registers
- Load & store move values between registers and memory

To debug the output of your labs, you will use an ILOC simulator, a program that mimics the operation of the ILOC virtual machine; that is, it is an interpreter for ILOC code.
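To keep the pieces straight, here is a minimal sketch (in Python; this is not the course simulator) of the virtual machine's state under the Harvard-style split described above. The class name, register count, and memory size are illustrative assumptions, and data memory is treated as word-addressed for simplicity.

    class ILOCMachine:
        """Illustrative ILOC VM state: code and data live in separate memories."""
        def __init__(self, num_registers=64, data_words=65536):
            self.code = []                   # code memory: holds the program's operations
            self.data = [0] * data_words     # data memory, separate from code memory
            self.regs = [0] * num_registers  # the configurable register set
            self.pc = 0                      # program counter, indexes code memory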

The ILOC Subset (see also the Lab 1 handout and Appendix A in EaC2e)

Pay attention to the meanings of the ILOC operations:

    Syntax                      Meaning                    Latency
    load   r1     => r2         r2 <- MEM(r1)              3
    store  r1     => r2         MEM(r2) <- r1              3
    loadi  c      => r2         r2 <- c                    1
    add    r1, r2 => r3         r3 <- r1 + r2              1
    sub    r1, r2 => r3         r3 <- r1 - r2              1
    mult   r1, r2 => r3         r3 <- r1 * r2              1
    lshift r1, r2 => r3         r3 <- r1 << r2             1
    rshift r1, r2 => r3         r3 <- r1 >> r2             1
    output c                    prints MEM(c) to stdout    1
    nop                         idles for one cycle        1

ILOC is an abstract assembly language. Each operation, except nop, uses (or reads) one or more values. Each operation, except output and nop, defines a value. loadi reads its value from the instruction stream. load reads both a register and a memory location. store reads two registers and writes a memory location. add, sub, mult, lshift, and rshift read two registers and write one register.
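As a reading aid, the table's semantics can be expressed as a small Python dispatch over the machine state sketched earlier. The Op record and step function are illustrative names, not the simulator's API, and latency is ignored here; the cycle-level timing model is described on the slides that follow.

    from collections import namedtuple

    # One decoded operation; src1/src2 are the operands before '=>', dst is
    # the name after it. For loadi and output, src1 holds the constant c.
    Op = namedtuple("Op", "name src1 src2 dst", defaults=(None, None, None))

    def step(m, op):
        """Apply one ILOC operation to machine state m (timing ignored)."""
        r, mem = m.regs, m.data
        if   op.name == "load":   r[op.dst] = mem[r[op.src1]]       # r_dst <- MEM(r_src1)
        elif op.name == "store":  mem[r[op.dst]] = r[op.src1]       # MEM(r_dst) <- r_src1
        elif op.name == "loadi":  r[op.dst] = op.src1               # constant from the instruction stream
        elif op.name == "add":    r[op.dst] = r[op.src1] + r[op.src2]
        elif op.name == "sub":    r[op.dst] = r[op.src1] - r[op.src2]
        elif op.name == "mult":   r[op.dst] = r[op.src1] * r[op.src2]
        elif op.name == "lshift": r[op.dst] = r[op.src1] << r[op.src2]
        elif op.name == "rshift": r[op.dst] = r[op.src1] >> r[op.src2]
        elif op.name == "output": print(mem[op.src1])               # prints MEM(c) to stdout
        elif op.name == "nop":    pass                              # idles for one cycle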

ILOC Execution: A Simple ILOC Program

[Figure: ex1.iloc -> 412alloc -> ex1a.iloc -> sim -> results on stdout]

    % cat ex1.iloc
    // add two numbers
    add r0,r1 => r2
    % 412alloc ex1.iloc > ex1a.iloc
    % sim ex1a.iloc -i 0 10 18
    28
    Executed 7 instructions and 7 operations in 11 cycles.

The -i option initializes memory, starting at location 0, with the values 10 and 18.

Before Execution of the ILOC Program Starts

[Figure: the data memory, the register set, and the processor's pipeline; this diagram repeats, updated, on each cycle slide below]

Invoked with the command line:

    % sim -i 0 10 18 < ex1.iloc

Code is loaded into instruction memory starting at word 0.

The virtual machine runs through the code, in order.
- The basic unit of execution is a cycle. A cycle consists of a fetch phase and an execute phase, so execution looks like (fetch, execute), (fetch, execute), ...
- Fetch retrieves the next operation from code memory, advancing sequentially through the straight-line code.
- Execute performs the specified operation. It performs one step on each active operation; multi-cycle operations (e.g., load and store in Lab 1) are divided into multiple steps.
- Execution (on the processor's functional unit) uses a pipeline of operation steps. Load and store proceed through three stages, or steps, in the pipeline.

The illustrated example should make this clearer; a sketch of the cycle loop in code appears below.
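Before the walkthrough, here is a hedged sketch of that cycle loop, written to match the behavior the next slides illustrate. The PipeOp record, the stall test, and the choice to apply an operation's effect at issue time are simplifying assumptions, not the simulator's internals; execute can be any callable that applies an operation's effect (e.g., the step function above).

    from collections import namedtuple

    PipeOp = namedtuple("PipeOp", "name reads dst")   # reads: set of registers used
    LOAD_STORE_STEPS = 3                              # load & store occupy 3 pipeline slots

    def must_stall(op, in_flight):
        # Hold op in slot 1 if an unfinished load defines a register it reads,
        # or if it touches data memory while a store is still in the pipeline.
        for other, _done in in_flight:
            if other.name == "load" and other.dst in op.reads:
                return True
            if other.name == "store" and op.name in ("load", "store", "output"):
                return True
        return False

    def run(code, execute):
        cycle, pc, slot1, in_flight = 0, 0, None, []
        while pc < len(code) or slot1 is not None or in_flight:
            # Fetch phase: retire finished ops, then fill slot 1 if it is free.
            in_flight = [(op, done) for op, done in in_flight if done > cycle]
            if slot1 is None and pc < len(code):
                slot1, pc = code[pc], pc + 1
            # Execute phase: the op in slot 1 issues unless it must stall.
            if slot1 is not None:
                if must_stall(slot1, in_flight):
                    print(f"{cycle}: [ stall ]")
                else:
                    execute(slot1)            # effect applied at issue, for simplicity
                    if slot1.name in ("load", "store"):
                        in_flight.append((slot1, cycle + LOAD_STORE_STEPS))
                    slot1 = None
            cycle += 1
        return cycle                          # 11 cycles for the example that follows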

Cycle 0: Fetch Phase

First, the processor fetches and decodes the operation at the current value of the program counter.

Cycle 0: Execute Phase

Next, it executes the operation. In this case, that places the value 18 into register r0.
Trace output: 0: [loadi 18 => r0 (18)]

Cycle 1: Fetch Phase

The processor advances the PC and the pipeline. (Since loadi is a 1-cycle operation, it discards that operation.) It fetches the next operation.

Cycle 1: Execute Phase

Next, it executes the loadi, which places 0 in r1.
Trace output: 1: [loadi 0 => r1 (0)]

Cycle 2: Fetch Phase

The processor advances the PC and the pipeline. (Since loadi is a 1-cycle operation, it discards that operation.) It fetches the next operation.

Cycle 2: Execute Phase

The load begins operation.
Trace output: 2: [load r1 (addr: 0) => r1 (10)]

Cycle 3: Fetch Phase

The processor advances the PC and the pipeline (a pipelined functional unit). The load moves to slot 2 and the add fills slot 1.

Cycle 3: Execute Phase

The load continues to execute. The add needs the result of the load, so the processor stalls it.
Trace output: 3: [ stall ]
(A stall means the processor holds the op for another cycle.)

Cycle 4: Fetch Phase

The processor advances the pipeline. Since the add is stalled, it remains in the first pipeline slot.

Cycle 4: Execute Phase

The load completes and the value 10 is written into r1. The add continues to stall, waiting on r1.
Trace output: 4: [ stall ] *2

Cycle 5: Fetch Phase

The processor advances the pipeline. The load rolls out of the bottom. The add remains in slot 1.

Cycle 5: Execute Phase

The add executes and writes the value 28 into r2.
Trace output: 5: [add r0 (18), r1 (10) => r2 (28)]

Cycle 6: Fetch Phase

The processor advances the pipeline and fetches the next operation.

Cycle 6: Execute Phase

The processor executes the loadi operation, which writes 0 into r0.
Trace output: 6: [loadi 0 => r0 (0)]

Cycle 7: Fetch Phase

The processor advances the pipeline and fetches the next operation.

Cycle 7: Execute Phase

The processor begins execution of the 3-cycle store operation.
Trace output: 7: [store r2 (28) => r0 (addr: 0)]

Cycle 8: Fetch Phase

The processor advances the pipeline (moving the store to slot 2) and fetches the next operation.

Cycle 8: Execute Phase

The store continues to execute. The output stalls, since it reads from data memory and the in-progress store writes to data memory.
Trace output: 8: [ stall ]

Cycle 9: Fetch Phase

The processor advances the pipeline. The store moves to slot 3. The stalled output operation remains in slot 1, waiting for the store to finish.

Cycle 9: Execute Phase

The store writes 28 into memory location 0 at the end of the cycle. The output remains stalled.
Trace output: 9: [ stall ] *7

Cycle 10: Fetch Phase

The processor advances the pipeline. The store falls out of the bottom of the pipeline. The output stays in slot 1.

Cycle 10: Execute Phase

The output operation writes the contents of memory location 0 to stdout.
Trace output: 10: [output 0 (28)]
output generates => 28

Cycle 11: Fetch Phase

The processor advances the pipeline and fetches the next operation. Since the next slot in the instruction memory is invalid, the processor halts.

ILOC Execution

This execution is captured in the trace provided by the simulator. Compare the simulator's trace output against the preceding slides.

    % cat ex1.iloc
    // add two numbers
    add r0,r1 => r2
    % sim -t ex1a.iloc -i 0 10 18
    ILOC Simulator, Version 12-201-1
    Interlock settings: memory registers branches
    0: [loadi 18 => r0 (18)]
    1: [loadi 0 => r1 (0)]
    2: [load r1 (addr: 0) => r1 (10)]
    3: [ stall ]
    4: [ stall ] *2
    5: [add r0 (18), r1 (10) => r2 (28)]
    6: [loadi 0 => r0 (0)]
    7: [store r2 (28) => r0 (addr: 0)]
    8: [ stall ]
    9: [ stall ] *7
    10: [output 0 (28)]
    output generates => 28
    Executed 7 instructions and 7 operations in 11 cycles.

The Memory Model in the ILOC Virtual Machine

[Figure: code memory ("big enough" to hold the program), the ILOC processor, and a big data memory]

In data memory, locations 0 to 32,767 are reserved for storage from the input program: its variables, arrays, and objects. The programmer needs space. Locations 32,768 and beyond are reserved for the allocator to use for spilled values.
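A small hedged example of how an allocator might pick addresses under this layout. The base address matches the slide; the 4-byte word size and the one-slot-per-spilled-value policy are assumptions for illustration.

    SPILL_BASE = 32768   # first data-memory location reserved for the allocator
    WORD_SIZE = 4        # assumed word size, in bytes

    def spill_address(slot_index):
        """Address of the slot_index-th spill slot in the allocator's region."""
        return SPILL_BASE + WORD_SIZE * slot_index

    # e.g., the first three spilled values land at 32768, 32772, and 32776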

Does Real Hardware Work This Way?

In fact, the ILOC model is fairly close to reality. Real processors have a fetch, decode, execute cycle:
- Fetch brings operations into a buffer in the decode unit
- Decode deciphers the bits and sends control signals to the functional unit(s)
- Execute clocks the functional unit through one pipeline cycle

Fetch, decode, execute is construed as a single cycle. In reality, the units run concurrently:
- The fetch unit works to deliver enough operations to the decode unit, where "enough" is defined, roughly, as one op per functional unit per cycle
- The decode unit is, essentially, combinatorial logic (&, therefore, fast)
- The execute unit performs complex operations; multiply and divide are algorithmically complex, so pipeline units break long operations into smaller subtasks

A More Realistic Drawing: Separate Fetch, Decode, & Execute

[Figure: data memory, register set, and functional unit, with a Fetch Unit feeding a Decode Unit that drives the control lines]

What about processors like core i7 or ARM?

[Figure: one processing core, with a fetch unit, decode unit, control lines, registers, and functional units]

Modern processors typically have unified instruction and data memory. They:
- Operate on a fetch-decode-execute cycle
- Have complex, cache-based memory hierarchies
- Have multiple pipelined functional units
- Have multiple cores

(Modified Harvard Architecture: separate pathways for code and data, but one store.)

What about processors like core i7 or ARM?

Modern processors often have multiple functional units.

[Figure: Functional Unit 0 and Functional Unit 1 sharing one register set]

- For Lab 1, the ILOC simulator has one functional unit
- In Lab 3, the simulator will have two functional units: some operations run on unit 0, some run on unit 1, and some run on either unit 0 or unit 1
- The basic model is the same: fetch, then execute
- The number of operations executed in a single cycle depends on the order in which they are encountered and the dependences between operations (see the sketch below)

The Lab 3 documentation addresses these issues for ILOC. The Lab 3 simulator trace shows the action in both units.
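As a rough illustration of that issue rule, here is a hedged sketch of dual issue in program order. It reuses the PipeOp record from the cycle-loop sketch; the unit-restriction function and the dependence test are assumptions for illustration, not the Lab 3 simulator's actual rules.

    from collections import namedtuple

    PipeOp = namedtuple("PipeOp", "name reads dst")   # as in the cycle-loop sketch

    def independent(a, b):
        # Simplified register dependence test (RAW/WAW); memory is ignored here.
        return a.dst is None or (a.dst not in b.reads and a.dst != b.dst)

    def issue_width_2(code, runs_on):
        """runs_on(op) -> set of unit numbers ({0}, {1}, or {0, 1}) op may use."""
        cycle, pc = 0, 0
        while pc < len(code):
            first = code[pc]; pc += 1
            issued = {min(runs_on(first)): first}         # place op on a legal unit
            if pc < len(code):
                second = code[pc]
                free = runs_on(second) - issued.keys()
                if free and independent(first, second):   # both constraints met:
                    issued[min(free)] = second; pc += 1   # dual issue this cycle
            print(cycle, {u: op.name for u, op in sorted(issued.items())})
            cycle += 1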

What about processors like core i7 or ARM?

What happens to the execution model with multiple functional units?
- One operation executes on each functional unit
- The complication arises in the processor's fetch and decode units: the fetch unit must retrieve several operations, and fetch & decode must collaborate to decide where they execute
- A fixed, position-based scheme leads to a VLIW system; a dynamic scheme leads to superscalar systems (VLIW is Very Long Instruction Word computer)
- A more complex decode unit costs more transistors and more power

Processors with multiple functional units need code with multiple independent (unrelated) operations in each cycle: Instruction-Level Parallelism (or ILP). See Lab 3 in COMP 412.

What about processors like core i7 or ARM?

When the number of functional units gets large:
- At some point, the network to connect register sets to functional units gets too deep; transmission time through the multiplexor can come to dominate processor cycle time, so more functional units would slow down the processor's fundamental clock speed
- Architects have adopted partitioned register designs that have multiple register sets, with limited bandwidth between the register sets

[Figure: partitioned register sets, each serving its own pair of functional units]

This adds a new problem to code generation: the placement of operands. The compiler needs to place each operation on a functional unit that can access its data, or it needs to insert code to transfer the data (& ensure that a register is available for it in the new register set). And the fetch and decode units get even more complex. A small sketch of the placement check appears below.
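A hedged sketch of that placement decision, under assumed data structures: reg_bank maps each register to the bank (register set) that holds it, and units_of_bank maps a bank to the functional units that can read it. A result of None signals that transfer code must be inserted first.

    def place(op, reg_bank, units_of_bank, default_unit=0):
        """Pick a functional unit that can reach all of op's operands, if any."""
        banks = {reg_bank[r] for r in op.reads}
        if not banks:
            return default_unit   # no register operands: any unit works
        if len(banks) > 1:
            return None           # operands split across banks: insert transfers
        return min(units_of_bank[banks.pop()])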

What's Next After Multiple Functional Units?

As processor complexity grows, the yield on performance for a given expenditure of chip real estate (or power) shrinks:
- A core with eight functional units might be bigger than four cores with two functional units each
- The interconnects between fetch, decode, register sets, (caches,) and functional units become even more complex

At some point, it is easier to put more cores on a chip than bigger cores. Stamp out more, simpler cores rather than fewer, complex cores:
- An easier design problem
- Lower power consumption
- A better ratio of performance to chip area (and power)

A great idea, if the programmer, language, and compiler can find:
- Enough thread-level parallelism to keep all the cores busy
- Enough instruction-level parallelism (within each thread) to keep the functional units busy

What About Multiple Cores?

[Figure: two cores (Core 0 and Core 1), each with its own fetch unit, decode unit, and functional units]

Modern multicore processors have 2 to many (6, 12, or more) cores:
- They require lots of parallelism for best performance
- The major limitation is memory bandwidth: does each core see bandwidth proportional to 1/(# cores)? Bandwidth may impose some practical limits on the use of all those cores

What's Next After Multiple Functional Units?

What happens to the execution model in a multicore processor?
- Execution within a thread follows the single-core model: fetch, decode, & execute, with (possibly) multiple functional units; single threads have simple behavior
- Individual threads operate independently
- The language (& processor) usually provide synchronization between threads; synchronization is needed to share data and communicate control (see COMP 322 and COMP 421, and the sketch below)
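To ground that last point, here is a minimal sketch of thread-level synchronization, using Python's threading module as a stand-in for whatever mechanism the language and processor actually provide.

    import threading

    counter = 0
    lock = threading.Lock()

    def work(n):
        global counter
        for _ in range(n):
            with lock:          # synchronization: one thread at a time in here
                counter += 1    # a read-modify-write on shared data

    threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)              # 400000; without the lock, updates could be lost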