The ILOC Virtual Machine (Lab 1 Background Material) Comp 412

COMP 412, FALL 2018

The ILOC Virtual Machine (Lab 1 Background Material)
Comp 412

[Figure: source code -> Front End -> IR -> Optimizer -> IR -> Back End -> target code]

Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

What is the execution model for an ILOC program?

ILOC is the assembly language of a simple, idealized RISC processor. (RISC: Reduced Instruction Set Processor.)

The ILOC Virtual Machine
- Separate code memory and data memory, sometimes called a Harvard architecture
- Sizes of data memory & register set are configurable; code memory is large enough to hold your program
- Simple, in-order execution model

The ILOC Instruction Set
- Arithmetic operations work on values held in registers
- Load & store move values between registers and memory

To debug the output of your labs, you will use an ILOC simulator, a program that mimics the operation of the ILOC virtual machine; that is, it is an interpreter for ILOC code.
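To keep the pieces straight, here is a minimal sketch (in Python; this is not the course simulator) of the virtual machine's state under the Harvard-style split described above. The class name, register count, and memory size are illustrative assumptions, and data memory is treated as word-addressed for simplicity.

    class ILOCMachine:
        """Illustrative ILOC VM state: code and data live in separate memories."""
        def __init__(self, num_registers=64, data_words=65536):
            self.code = []                   # code memory: holds the program's operations
            self.data = [0] * data_words     # data memory, separate from code memory
            self.regs = [0] * num_registers  # the configurable register set
            self.pc = 0                      # program counter, indexes code memory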

The ILOC Subset (see also the Lab 1 handout and Appendix A in EaC2e)

Pay attention to the meanings of the ILOC operations:

    Syntax                      Meaning                    Latency
    load   r1     => r2         r2 <- MEM(r1)              3
    store  r1     => r2         MEM(r2) <- r1              3
    loadi  c      => r2         r2 <- c                    1
    add    r1, r2 => r3         r3 <- r1 + r2              1
    sub    r1, r2 => r3         r3 <- r1 - r2              1
    mult   r1, r2 => r3         r3 <- r1 * r2              1
    lshift r1, r2 => r3         r3 <- r1 << r2             1
    rshift r1, r2 => r3         r3 <- r1 >> r2             1
    output c                    prints MEM(c) to stdout    1
    nop                         idles for one cycle        1

ILOC is an abstract assembly language. Each operation, except nop, uses (or reads) one or more values. Each operation, except output and nop, defines a value. loadi reads its value from the instruction stream. load reads both a register and a memory location. store reads two registers and writes a memory location. add, sub, mult, lshift, and rshift read two registers and write one register.
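As a reading aid, the table's semantics can be expressed as a small Python dispatch over the machine state sketched earlier. The Op record and step function are illustrative names, not the simulator's API, and latency is ignored here; the cycle-level timing model is described on the slides that follow.

    from collections import namedtuple

    # One decoded operation; src1/src2 are the operands before '=>', dst is
    # the name after it. For loadi and output, src1 holds the constant c.
    Op = namedtuple("Op", "name src1 src2 dst", defaults=(None, None, None))

    def step(m, op):
        """Apply one ILOC operation to machine state m (timing ignored)."""
        r, mem = m.regs, m.data
        if   op.name == "load":   r[op.dst] = mem[r[op.src1]]       # r_dst <- MEM(r_src1)
        elif op.name == "store":  mem[r[op.dst]] = r[op.src1]       # MEM(r_dst) <- r_src1
        elif op.name == "loadi":  r[op.dst] = op.src1               # constant from the instruction stream
        elif op.name == "add":    r[op.dst] = r[op.src1] + r[op.src2]
        elif op.name == "sub":    r[op.dst] = r[op.src1] - r[op.src2]
        elif op.name == "mult":   r[op.dst] = r[op.src1] * r[op.src2]
        elif op.name == "lshift": r[op.dst] = r[op.src1] << r[op.src2]
        elif op.name == "rshift": r[op.dst] = r[op.src1] >> r[op.src2]
        elif op.name == "output": print(mem[op.src1])               # prints MEM(c) to stdout
        elif op.name == "nop":    pass                              # idles for one cycle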

ILOC Execution: A Simple ILOC Program

[Figure: ex1.iloc -> 412alloc -> ex1a.iloc -> sim -> results on stdout]

    % cat ex1.iloc
    // add two numbers
    add r0,r1 => r2
    % 412alloc ex1.iloc > ex1a.iloc
    % sim ex1a.iloc -i 0 10 18
    28
    Executed 7 instructions and 7 operations in 11 cycles.

The -i option initializes memory, starting at location 0, with the values 10 and 18.

Before Execution of the ILOC Program Starts

[Figure: the data memory, the register set, and the processor's pipeline; this diagram repeats, updated, on each cycle slide below]

Invoked with the command line:

    % sim -i 0 10 18 < ex1.iloc

Code is loaded into instruction memory starting at word 0.

The virtual machine runs through the code, in order.
- The basic unit of execution is a cycle. A cycle consists of a fetch phase and an execute phase, so execution looks like (fetch, execute), (fetch, execute), ...
- Fetch retrieves the next operation from code memory, advancing sequentially through the straight-line code.
- Execute performs the specified operation. It performs one step on each active operation; multi-cycle operations (e.g., load and store in Lab 1) are divided into multiple steps.
- Execution (on the processor's functional unit) uses a pipeline of operation steps. Load and store proceed through three stages, or steps, in the pipeline.

The illustrated example should make this clearer; a sketch of the cycle loop in code appears below.
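Before the walkthrough, here is a hedged sketch of that cycle loop, written to match the behavior the next slides illustrate. The PipeOp record, the stall test, and the choice to apply an operation's effect at issue time are simplifying assumptions, not the simulator's internals; execute can be any callable that applies an operation's effect (e.g., the step function above).

    from collections import namedtuple

    PipeOp = namedtuple("PipeOp", "name reads dst")   # reads: set of registers used
    LOAD_STORE_STEPS = 3                              # load & store occupy 3 pipeline slots

    def must_stall(op, in_flight):
        # Hold op in slot 1 if an unfinished load defines a register it reads,
        # or if it touches data memory while a store is still in the pipeline.
        for other, _done in in_flight:
            if other.name == "load" and other.dst in op.reads:
                return True
            if other.name == "store" and op.name in ("load", "store", "output"):
                return True
        return False

    def run(code, execute):
        cycle, pc, slot1, in_flight = 0, 0, None, []
        while pc < len(code) or slot1 is not None or in_flight:
            # Fetch phase: retire finished ops, then fill slot 1 if it is free.
            in_flight = [(op, done) for op, done in in_flight if done > cycle]
            if slot1 is None and pc < len(code):
                slot1, pc = code[pc], pc + 1
            # Execute phase: the op in slot 1 issues unless it must stall.
            if slot1 is not None:
                if must_stall(slot1, in_flight):
                    print(f"{cycle}: [ stall ]")
                else:
                    execute(slot1)            # effect applied at issue, for simplicity
                    if slot1.name in ("load", "store"):
                        in_flight.append((slot1, cycle + LOAD_STORE_STEPS))
                    slot1 = None
            cycle += 1
        return cycle                          # 11 cycles for the example that follows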

Cycle 0: Fetch Phase

First, the processor fetches and decodes the operation at the current value of the program counter.

Cycle 0: Execute Phase

Next, it executes the operation. In this case, that places the value 18 into register r0.
Trace output: 0: [loadi 18 => r0 (18)]

Cycle 1: Fetch Phase

The processor advances the PC and the pipeline. (Since loadi is a 1-cycle operation, it discards that operation.) It fetches the next operation.

Cycle 1: Execute Phase

Next, it executes the loadi, which places 0 in r1.
Trace output: 1: [loadi 0 => r1 (0)]

Cycle 2: Fetch Phase

The processor advances the PC and the pipeline. (Since loadi is a 1-cycle operation, it discards that operation.) It fetches the next operation.

Cycle 2: Execute Phase

The load begins operation.
Trace output: 2: [load r1 (addr: 0) => r1 (10)]

Cycle 3: Fetch Phase

The processor advances the PC and the pipeline (a pipelined functional unit). The load moves to slot 2 and the add fills slot 1.

Cycle 3: Execute Phase

The load continues to execute. The add needs the result of the load, so the processor stalls it.
Trace output: 3: [ stall ]
(A stall means the processor holds the op for another cycle.)

Cycle 4: Fetch Phase

The processor advances the pipeline. Since the add is stalled, it remains in the first pipeline slot.

Cycle 4: Execute Phase

The load completes and the value 10 is written into r1. The add continues to stall, waiting on r1.
Trace output: 4: [ stall ] *2

Cycle 5: Fetch Phase

The processor advances the pipeline. The load rolls out of the bottom. The add remains in slot 1.

Cycle 5: Execute Phase

The add executes and writes the value 28 into r2.
Trace output: 5: [add r0 (18), r1 (10) => r2 (28)]

Cycle 6: Fetch Phase

The processor advances the pipeline and fetches the next operation.

Cycle 6: Execute Phase

The processor executes the loadi operation, which writes 0 into r0.
Trace output: 6: [loadi 0 => r0 (0)]

Cycle 7: Fetch Phase

The processor advances the pipeline and fetches the next operation.

Cycle 7: Execute Phase

The processor begins execution of the 3-cycle store operation.
Trace output: 7: [store r2 (28) => r0 (addr: 0)]

Cycle 8: Fetch Phase

The processor advances the pipeline (moving the store to slot 2) and fetches the next operation.

Cycle 8: Execute Phase

The store continues to execute. The output stalls, since it reads from data memory and the in-progress store writes to data memory.
Trace output: 8: [ stall ]

Cycle 9: Fetch Phase

The processor advances the pipeline. The store moves to slot 3. The stalled output operation remains in slot 1, waiting for the store to finish.

Cycle 9: Execute Phase

The store writes 28 into memory location 0 at the end of the cycle. The output remains stalled.
Trace output: 9: [ stall ] *7

Cycle 10: Fetch Phase

The processor advances the pipeline. The store falls out of the bottom of the pipeline. The output stays in slot 1.

Cycle 10: Execute Phase

The output operation writes the contents of memory location 0 to stdout.
Trace output: 10: [output 0 (28)]
output generates => 28

Cycle 11: Fetch Phase

The processor advances the pipeline and fetches the next operation. Since the next slot in the instruction memory is invalid, the processor halts.

ILOC Execution

This execution is captured in the trace provided by the simulator. Compare the simulator's trace output against the preceding slides.

    % cat ex1.iloc
    // add two numbers
    add r0,r1 => r2
    % sim -t ex1a.iloc -i 0 10 18
    ILOC Simulator, Version 12-201-1
    Interlock settings: memory registers branches
    0: [loadi 18 => r0 (18)]
    1: [loadi 0 => r1 (0)]
    2: [load r1 (addr: 0) => r1 (10)]
    3: [ stall ]
    4: [ stall ] *2
    5: [add r0 (18), r1 (10) => r2 (28)]
    6: [loadi 0 => r0 (0)]
    7: [store r2 (28) => r0 (addr: 0)]
    8: [ stall ]
    9: [ stall ] *7
    10: [output 0 (28)]
    output generates => 28
    Executed 7 instructions and 7 operations in 11 cycles.

The Memory Model in the ILOC Virtual Machine

[Figure: code memory ("big enough" to hold the program), the ILOC processor, and a big data memory]

In data memory, locations 0 to 32,767 are reserved for storage from the input program: its variables, arrays, and objects. The programmer needs space. Locations 32,768 and beyond are reserved for the allocator to use for spilled values.
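A small hedged example of how an allocator might pick addresses under this layout. The base address matches the slide; the 4-byte word size and the one-slot-per-spilled-value policy are assumptions for illustration.

    SPILL_BASE = 32768   # first data-memory location reserved for the allocator
    WORD_SIZE = 4        # assumed word size, in bytes

    def spill_address(slot_index):
        """Address of the slot_index-th spill slot in the allocator's region."""
        return SPILL_BASE + WORD_SIZE * slot_index

    # e.g., the first three spilled values land at 32768, 32772, and 32776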

Does Real Hardware Work This Way?

In fact, the ILOC model is fairly close to reality. Real processors have a fetch, decode, execute cycle:
- Fetch brings operations into a buffer in the decode unit
- Decode deciphers the bits and sends control signals to the functional unit(s)
- Execute clocks the functional unit through one pipeline cycle

Fetch, decode, execute is construed as a single cycle. In reality, the units run concurrently:
- The fetch unit works to deliver enough operations to the decode unit, where "enough" is defined, roughly, as one op per functional unit per cycle
- The decode unit is, essentially, combinatorial logic (&, therefore, fast)
- The execute unit performs complex operations; multiply and divide are algorithmically complex, so pipeline units break long operations into smaller subtasks

A More Realistic Drawing: Separate Fetch, Decode, & Execute

[Figure: data memory, register set, and functional unit, with a Fetch Unit feeding a Decode Unit that drives the control lines]

What about processors like core i7 or ARM?

[Figure: one processing core, with a fetch unit, decode unit, control lines, registers, and functional units]

Modern processors typically have unified instruction and data memory. They:
- Operate on a fetch-decode-execute cycle
- Have complex, cache-based memory hierarchies
- Have multiple pipelined functional units
- Have multiple cores

(Modified Harvard Architecture: separate pathways for code and data, but one store.)

What about processors like core i7 or ARM?

Modern processors often have multiple functional units.

[Figure: Functional Unit 0 and Functional Unit 1 sharing one register set]

- For Lab 1, the ILOC simulator has one functional unit
- In Lab 3, the simulator will have two functional units: some operations run on unit 0, some run on unit 1, and some run on either unit 0 or unit 1
- The basic model is the same: fetch, then execute
- The number of operations executed in a single cycle depends on the order in which they are encountered and the dependences between operations (see the sketch below)

The Lab 3 documentation addresses these issues for ILOC. The Lab 3 simulator trace shows the action in both units.
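As a rough illustration of that issue rule, here is a hedged sketch of dual issue in program order. It reuses the PipeOp record from the cycle-loop sketch; the unit-restriction function and the dependence test are assumptions for illustration, not the Lab 3 simulator's actual rules.

    from collections import namedtuple

    PipeOp = namedtuple("PipeOp", "name reads dst")   # as in the cycle-loop sketch

    def independent(a, b):
        # Simplified register dependence test (RAW/WAW); memory is ignored here.
        return a.dst is None or (a.dst not in b.reads and a.dst != b.dst)

    def issue_width_2(code, runs_on):
        """runs_on(op) -> set of unit numbers ({0}, {1}, or {0, 1}) op may use."""
        cycle, pc = 0, 0
        while pc < len(code):
            first = code[pc]; pc += 1
            issued = {min(runs_on(first)): first}         # place op on a legal unit
            if pc < len(code):
                second = code[pc]
                free = runs_on(second) - issued.keys()
                if free and independent(first, second):   # both constraints met:
                    issued[min(free)] = second; pc += 1   # dual issue this cycle
            print(cycle, {u: op.name for u, op in sorted(issued.items())})
            cycle += 1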

What about processors like core i7 or ARM?

What happens to the execution model with multiple functional units?
- One operation executes on each functional unit
- The complication arises in the processor's fetch and decode units: the fetch unit must retrieve several operations, and fetch & decode must collaborate to decide where they execute
- A fixed, position-based scheme leads to a VLIW system; a dynamic scheme leads to superscalar systems (VLIW is Very Long Instruction Word computer)
- A more complex decode unit costs more transistors and more power

Processors with multiple functional units need code with multiple independent (unrelated) operations in each cycle: Instruction-Level Parallelism (or ILP). See Lab 3 in COMP 412.

What about processors like core i7 or ARM?

When the number of functional units gets large:
- At some point, the network to connect register sets to functional units gets too deep; transmission time through the multiplexor can come to dominate processor cycle time, so more functional units would slow down the processor's fundamental clock speed
- Architects have adopted partitioned register designs that have multiple register sets, with limited bandwidth between the register sets

[Figure: partitioned register sets, each serving its own pair of functional units]

This adds a new problem to code generation: the placement of operands. The compiler needs to place each operation on a functional unit that can access its data, or it needs to insert code to transfer the data (& ensure that a register is available for it in the new register set). And the fetch and decode units get even more complex. A small sketch of the placement check appears below.
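A hedged sketch of that placement decision, under assumed data structures: reg_bank maps each register to the bank (register set) that holds it, and units_of_bank maps a bank to the functional units that can read it. A result of None signals that transfer code must be inserted first.

    def place(op, reg_bank, units_of_bank, default_unit=0):
        """Pick a functional unit that can reach all of op's operands, if any."""
        banks = {reg_bank[r] for r in op.reads}
        if not banks:
            return default_unit   # no register operands: any unit works
        if len(banks) > 1:
            return None           # operands split across banks: insert transfers
        return min(units_of_bank[banks.pop()])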

What's Next After Multiple Functional Units?

As processor complexity grows, the yield on performance for a given expenditure of chip real estate (or power) shrinks:
- A core with eight functional units might be bigger than four cores with two functional units each
- The interconnects between fetch, decode, register sets, (caches,) and functional units become even more complex

At some point, it is easier to put more cores on a chip than bigger cores. Stamp out more, simpler cores rather than fewer, complex cores:
- An easier design problem
- Lower power consumption
- A better ratio of performance to chip area (and power)

A great idea, if the programmer, language, and compiler can find:
- Enough thread-level parallelism to keep all the cores busy
- Enough instruction-level parallelism (within each thread) to keep the functional units busy

What About Multiple Cores?

[Figure: two cores (Core 0 and Core 1), each with its own fetch unit, decode unit, and functional units]

Modern multicore processors have 2 to many (6, 12, or more) cores:
- They require lots of parallelism for best performance
- The major limitation is memory bandwidth: does each core see bandwidth proportional to 1/(# cores)? Bandwidth may impose some practical limits on the use of all those cores

What's Next After Multiple Functional Units?

What happens to the execution model in a multicore processor?
- Execution within a thread follows the single-core model: fetch, decode, & execute, with (possibly) multiple functional units; single threads have simple behavior
- Individual threads operate independently
- The language (& processor) usually provide synchronization between threads; synchronization is needed to share data and communicate control (see COMP 322 and COMP 421, and the sketch below)
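To ground that last point, here is a minimal sketch of thread-level synchronization, using Python's threading module as a stand-in for whatever mechanism the language and processor actually provide.

    import threading

    counter = 0
    lock = threading.Lock()

    def work(n):
        global counter
        for _ in range(n):
            with lock:          # synchronization: one thread at a time in here
                counter += 1    # a read-modify-write on shared data

    threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)              # 400000; without the lock, updates could be lost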