The Hardware/Software Interface CSE351 Spring 2015


The Hardware/Software Interface, CSE351 Spring 2015, Lecture 26
Instructor: Katelin Bailey
Teaching Assistants: Kaleo Brandt, Dylan Johnson, Luke Nelson, Alfian Rizqi, Kritin Vij, David Wong, and Shan Yang

Roadmap

C:
    car *c = malloc(sizeof(car));
    c->miles = 100;
    c->gals = 17;
    float mpg = get_mpg(c);
    free(c);

Java:
    Car c = new Car();
    c.setMiles(100);
    c.setGals(17);
    float mpg = c.getMPG();

Assembly language:
    get_mpg:
        pushq %rbp
        movq  %rsp, %rbp
        ...
        popq  %rbp
        ret

Machine code:
    0111010000011000
    100011010000010000000010
    1000100111000010
    110000011111101000011111

(Diagram: C and Java compile down through assembly language and machine code to the computer system, with the OS in between. Course topics: memory, data, & addressing; integers & floats; machine code & C; x86 assembly; procedures & stacks; arrays & structs; memory & caches; processes; virtual memory; memory allocation; Java vs. C.)

Implementing Programming Languages

There are many choices in how to implement programming models. We've talked about compilation; languages can also be interpreted: execute the program line by line from the original source code.
- Simpler: little or no compiler, so less translation
- More transparent to debug: less translation between source and execution
- Easier to run on different architectures: the program runs in a simulated environment that exists only inside the interpreter process
- But slower and harder to optimize, and all errors surface at run time

Interpreting languages has a long history: Lisp, an early programming language, was interpreted. Interpreters are still in common use: Python, JavaScript, Ruby, Matlab, PHP, Perl, ...

Interpreted vs. Compiled in Practice

Really a continuum, a choice to be made: more or less work done by the interpreter vs. the compiler.

    Interpreted <---- Lisp ---- Java ---- C ----> Compiled

Java programs are usually run by a virtual machine. JVMs interpret an intermediate language called Java bytecode, and many JVMs compile bytecode to native machine code just-in-time (JIT compilation). Java is sometimes compiled ahead of time (AOT), like C.

Virtual Machine Model

    High-Level Language Program
        | (bytecode compiler, at compile time)
        v
    Virtual Machine Language (bytecode)
        | (virtual machine interpreter or JIT compiler, at run time)
        v
    Native Machine Language

An ahead-of-time compiler can instead translate the high-level program directly to native machine language.

Java Bytecode

Like assembly code for the JVM, but works on all JVMs:
- hardware-independent
- typed (unlike real assembly)
- strong JVM protections

(Diagram: each method's frame holds a variable table, where slot 0 holds the pointer "this", followed by the method's other arguments and then its other local variables, plus an operand stack and a reference to the constant pool.)

JVM Operand Stack

The JVM is a stack machine: there are no registers or named stack locations; all operations use the operand stack. In the mnemonics, "i" stands for integer, "a" for reference, "b" for byte, "c" for char, "d" for double, ...

Bytecode:
    iload_1   // push 1st argument from the variable table onto the stack
    iload_2   // push 2nd argument from the variable table onto the stack
    iadd      // pop top 2 elements from the stack, add them together, and
              //   push the result back onto the stack
    istore_3  // pop the result and put it into the third slot in the table

The same computation compiled to x86:
    mov 8(%ebp), %eax
    mov 12(%ebp), %edx
    add %edx, %eax
    mov %eax, -8(%ebp)
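To make the stack-machine model concrete, here is a toy interpreter sketch in C. This is not the JVM: the opcode names and encodings and the run function are invented for illustration, but the dispatch loop shows how iload/iadd/istore-style bytecode executes against an operand stack and a variable table.

    #include <stdio.h>

    /* Toy stack-machine interpreter (not the JVM; opcodes are made up). */
    enum { ILOAD, IADD, ISTORE, HALT };

    void run(const int *code, int *vars) {
        int stack[16], sp = 0;                 /* the operand stack */
        for (int pc = 0; ; ) {
            switch (code[pc++]) {
            case ILOAD:                        /* push a variable-table slot */
                stack[sp++] = vars[code[pc++]];
                break;
            case IADD:                         /* pop two, push their sum */
                sp--;
                stack[sp - 1] += stack[sp];
                break;
            case ISTORE:                       /* pop into a table slot */
                vars[code[pc++]] = stack[--sp];
                break;
            case HALT:
                return;
            }
        }
    }

    int main(void) {
        int vars[4] = {0, 7, 35, 0};           /* slots 1 and 2 hold arguments */
        int code[] = {ILOAD, 1, ILOAD, 2, IADD, ISTORE, 3, HALT};
        run(code, vars);
        printf("vars[3] = %d\n", vars[3]);     /* prints 42 */
        return 0;
    }

A real JVM adds typed opcodes, safety checks, and usually a JIT compiler, but an interpreter's core execution loop has this shape.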

A Simple Java Method

Method java.lang.String getEmployeeName():
    0 aload_0       // "this" object is stored at slot 0 in the variable table
    1 getfield #5   // <Field java.lang.String name>; takes 3 bytes.
                    //   Pop an element from the top of the stack, retrieve its
                    //   specified instance field, and push the field onto the
                    //   stack. "name" is the fifth field of the object.
    4 areturn       // return the object at the top of the stack

In the .class file, this method is the byte sequence: 2A B4 00 05 B0
(aload_0 = 2A, getfield 00 05 = B4 00 05, areturn = B0)

http://en.wikipedia.org/wiki/Java_bytecode_instruction_listings

Class File Format

Every class in Java source code is compiled to its own class file. There are 10 sections in the Java class file structure:
- Magic number: 0xCAFEBABE (legible hex, from James Gosling, Java's inventor)
- Version of class file format: the minor and major versions of the class file
- Constant pool: the set of constant values for the class
- Access flags: for example, whether the class is abstract, static, final, etc.
- This class: the name of the current class
- Super class: the name of the super class
- Interfaces: any interfaces in the class
- Fields: any fields in the class
- Methods: any methods in the class
- Attributes: any attributes of the class (for example, the name of the source file)

A .jar file collects together all of the class files needed for the program, plus any additional resources (e.g. images).
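As a quick sanity check of the first two sections above, a small C sketch (not from the slides; the file name Employee.class is just an example) can read the magic number and version that begin every class file:

    #include <stdio.h>
    #include <stdint.h>

    /* Read the 8-byte header of a compiled .class file:
       u4 magic, u2 minor version, u2 major version (big-endian). */
    int main(void) {
        FILE *f = fopen("Employee.class", "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char b[8];
        if (fread(b, 1, 8, f) != 8) { fclose(f); return 1; }
        fclose(f);

        uint32_t magic = (uint32_t)b[0] << 24 | (uint32_t)b[1] << 16 |
                         (uint32_t)b[2] << 8  | (uint32_t)b[3];
        unsigned minor = b[4] << 8 | b[5];
        unsigned major = b[6] << 8 | b[7];

        printf("magic   = 0x%08X\n", (unsigned)magic); /* expect 0xCAFEBABE */
        printf("version = %u.%u\n", major, minor);
        return 0;
    }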

Disassembled

    javac Employee.java
    javap -c Employee

Compiled from Employee.java:

    class Employee extends java.lang.Object {
        public Employee(java.lang.String, int);
        public java.lang.String getEmployeeName();
        public int getEmployeeNumber();
    }

    Method Employee(java.lang.String, int)
       0 aload_0
       1 invokespecial #3 <Method java.lang.Object()>
       4 aload_0
       5 aload_1
       6 putfield #5 <Field java.lang.String name>
       9 aload_0
      10 iload_2
      11 putfield #4 <Field int idNumber>
      14 aload_0
      15 aload_1
      16 iload_2
      17 invokespecial #6 <Method void storeData(java.lang.String, int)>
      20 return

    Method java.lang.String getEmployeeName()
       0 aload_0
       1 getfield #5 <Field java.lang.String name>
       4 areturn

    Method int getEmployeeNumber()
       0 aload_0
       1 getfield #4 <Field int idNumber>
       4 ireturn

    Method void storeData(java.lang.String, int)
    ...

Other Languages for JVMs

JVMs run on so many computers that compilers have been built to translate many other languages to Java bytecode:
- AspectJ, an aspect-oriented extension of Java
- ColdFusion, a scripting language compiled to Java
- Clojure, a functional Lisp dialect
- Groovy, a scripting language
- JavaFX Script, a scripting language for web apps
- JRuby, an implementation of Ruby
- Jython, an implementation of Python
- Rhino, an implementation of JavaScript
- Scala, an object-oriented and functional programming language
- and many others, even including C!

Microsoft's C# and .NET Framework

C# has motivations similar to Java's. Its virtual machine is called the Common Language Runtime, and Common Intermediate Language is the bytecode for C# and the other languages in the .NET framework.

Roadmap

(The same roadmap diagram as the start of lecture: C and Java code compiling down through assembly language and machine code to the computer system, with the OS in between. One entry is now added to the end of the topic list: bonus topic, parallelism.)

Bonus Topic: Parallel Processing

When can we execute things in parallel?
- Parallelism: use extra resources to solve a problem faster (dividing up the work)
- Concurrency: correctly and efficiently manage access to shared resources (many requests, limited resources)

(Thanks to Dan Grossman for the succinct definitions.)

What is parallel processing?

A brief introduction to the key ideas of parallel processing:
- instruction-level parallelism
- data-level parallelism
- thread-level parallelism

Exploiting Parallelism

Of the computing problems for which performance is important, many have inherent parallelism.

Computer games: graphics, physics, sound, AI, etc. can be done separately. Furthermore, there is often parallelism within each of these:
- the color of each pixel on the screen can be computed independently
- non-contacting objects can be updated/simulated independently
- the artificial intelligence of non-human entities can be done independently

Search engine queries:
- every query is independent
- searches are (ehm, pretty much) read-only!

Instruction-Level Parallelism

    add  %r2 <- %r3, %r4
    or   %r2 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    addi %r7 <- %r6, 0x5
    sub  %r8 <- %r8, %r4

Dependences limit reordering:
- RAW: read after write
- WAW: write after write
- WAR: write after read

When can we reorder instructions? When should we? Renaming %r2 to %r5 in the second instruction removes the write-after-write hazard and allows this schedule:

    add  %r2 <- %r3, %r4
    or   %r5 <- %r2, %r4
    lw   %r6 <- 0(%r4)
    sub  %r8 <- %r8, %r4
    addi %r7 <- %r6, 0x5

Superscalar processors: multiple instructions executing in parallel at the *same* pipeline stage. Take 352 to learn more.

Data Parallelism

Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

This operates on one element at a time.


Data Parallelism with SIMD

Same loop as before:

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

With Single Instruction, Multiple Data (SIMD) hardware, one instruction operates on MULTIPLE elements at once (e.g., four additions per instruction).
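For comparison, here is what the SIMD version might look like with x86 SSE2 intrinsics. This is a sketch, not from the slides: it assumes length is a multiple of 4 and that the arrays do not overlap.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Four 32-bit additions per instruction. */
    void array_add_simd(int A[], int B[], int C[], int length) {
        for (int i = 0; i < length; i += 4) {
            __m128i a = _mm_loadu_si128((__m128i *)&A[i]); /* load 4 ints */
            __m128i b = _mm_loadu_si128((__m128i *)&B[i]);
            __m128i c = _mm_add_epi32(a, b);               /* 4 adds at once */
            _mm_storeu_si128((__m128i *)&C[i], c);         /* store 4 results */
        }
    }

Modern compilers will often auto-vectorize simple loops like array_add, but only when they can prove the iterations are independent.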

Is it always that easy?

Not always. A more challenging example:

    unsigned sum_array(unsigned *array, int length) {
        unsigned total = 0;
        for (int i = 0; i < length; ++i) {
            total += array[i];
        }
        return total;
    }

Is there parallelism here? Each loop iteration uses data from the previous iteration (the running total), so the iterations cannot simply run side by side.

Restructure the code for SIMD

    // one option...
    unsigned sum_array2(unsigned *array, int length) {
        unsigned total, i;
        unsigned temp[4] = {0, 0, 0, 0};

        // process chunks of 4 at a time; note the parentheses:
        // length & ~0x3 rounds length down to a multiple of 4
        for (i = 0; i < (length & ~0x3); i += 4) {
            temp[0] += array[i];
            temp[1] += array[i+1];
            temp[2] += array[i+2];
            temp[3] += array[i+3];
        }

        // add the 4 sub-totals
        total = temp[0] + temp[1] + temp[2] + temp[3];
        return total;
    }

The four running sub-totals are independent of each other, so they can be computed in parallel. (Any leftover elements when length is not a multiple of 4 still need a cleanup loop, omitted here.)

What are threads?

An independent thread of control within a process: like multiple processes within one process, but sharing the same virtual address space.

Each thread has its own:
- logical control flow (program counter)
- stack

All threads in a process use the same virtual address space.

Threads are lighter-weight than processes: context switching is faster, and the system can support more threads.

Thread-Level Parallelism: Multicore Processors

Two (or more) complete processors, fabricated on the same silicon chip, execute instructions from two (or more) programs/threads at the same time.

(Figure: die photo of the dual-core IBM Power5, cores #1 and #2.)

Multicores are everywhere

Laptops, desktops, servers: almost any machine from the past few years has at least 2 cores.

Game consoles:
- Xbox 360: 3 PowerPC cores; Xbox One: 8 AMD cores
- PS3: 9 Cell cores (1 master, 8 special SIMD cores); PS4: 8 custom AMD x86-64 cores
- Wii U: 2 Power cores

Smartphones:
- iPhone 4S, 5: dual-core ARM CPUs
- Galaxy S II, III, IV: dual-core ARM or Snapdragon

Why multicores now?

The number of transistors we can put on a chip is growing exponentially, but single-core performance is no longer growing along with transistor count. So let's use those transistors to add more cores and do more at once.

As programmers, do we care?

What happens if we run this program on a multicore?

    void array_add(int A[], int B[], int C[], int length) {
        int i;
        for (i = 0; i < length; ++i) {
            C[i] = A[i] + B[i];
        }
    }

(Figure: two cores, #1 and #2; an unmodified sequential program runs on just one of them.)

What if we want one program to run on multiple processors (cores)?

We have to explicitly tell the machine exactly how to do this. This is called parallel programming or concurrent programming. There are many parallel/concurrent programming models; we will look at a relatively simple one: fork-join parallelism (see the sketch below).
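As a preview, here is a fork-join sketch of array_add using POSIX threads. This is illustrative, not the course's exact example: NTHREADS, struct chunk, and worker are invented names, and each thread simply handles one contiguous slice of the arrays.

    #include <pthread.h>

    #define NTHREADS 4

    struct chunk { int *A, *B, *C; int lo, hi; };

    static void *worker(void *arg) {
        struct chunk *ch = arg;
        for (int i = ch->lo; i < ch->hi; ++i)
            ch->C[i] = ch->A[i] + ch->B[i];
        return NULL;
    }

    void array_add_parallel(int A[], int B[], int C[], int length) {
        pthread_t tid[NTHREADS];
        struct chunk ch[NTHREADS];

        for (int t = 0; t < NTHREADS; ++t) {   /* fork: one chunk per thread */
            ch[t] = (struct chunk){ A, B, C,
                                    t * length / NTHREADS,
                                    (t + 1) * length / NTHREADS };
            pthread_create(&tid[t], NULL, worker, &ch[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)     /* join: wait for all threads */
            pthread_join(tid[t], NULL);
    }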

How does this help performance?

Parallel speedup measures the improvement from parallelization:

    speedup(p) = (time for best serial version) / (time for version with p processors)

What can we realistically expect?

Reason #1: Amdahl's Law

In general, the whole computation is not (easily) parallelizable: serial regions limit the potential parallel speedup.

(Figure: execution timeline alternating between serial regions and parallelizable regions.)

Reason #1: Amdahl's Law

Suppose a program takes 1 unit of time to execute serially, and a fraction of the program, s, is inherently serial (unparallelizable). Then:

    New Execution Time = (1 - s)/p + s

For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New Execution Time = 0.9T/3 + 0.1T = 0.4T, so the speedup is T/0.4T = 2.5x
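The arithmetic generalizes directly; a few lines of C (a sketch, with an invented function name) compute the bound for any serial fraction and processor count:

    #include <stdio.h>

    /* Amdahl's Law: speedup bound for serial fraction s on p processors. */
    double amdahl_speedup(double s, int p) {
        return 1.0 / ((1.0 - s) / p + s);
    }

    int main(void) {
        printf("%.2f\n", amdahl_speedup(0.1, 3));    /* 2.50, as above */
        printf("%.2f\n", amdahl_speedup(0.1, 1000)); /* ~9.91: never beats 1/s = 10 */
        return 0;
    }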

Reason #2: Overhead

Forking and joining is not instantaneous:
- it involves communicating between processors
- it may involve calls into the operating system
- it depends on the implementation

    New Execution Time = (1 - s)/p + s + overhead(p)

Multicore: what should worry us?

Concurrency: what if we're sharing resources, memory, etc.?
- Cache coherence: what if two cores have the same data in their own caches? How do we keep those copies in sync?
- Memory consistency, ordering, interleaving, synchronization: with multiple cores, we can have truly concurrent execution of threads. In what order do their memory accesses appear to happen? Do the orders seen by different cores/threads agree?
- Concurrency bugs: when it all goes wrong, they are hard to reproduce and hard to debug.

Summary

- Multicore: more than one processor on the same chip. Almost all devices now have multicore processors, a result of Moore's Law and power constraints.
- Exploiting multicore requires parallel programming. Automatically extracting parallelism is too hard for the compiler in general, but the compiler can do much of the bookkeeping for us.
- Fork-join model of parallelism: at a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread.
- Expect a speedup of less than p on p processors:
  - Amdahl's Law: speedup is limited by the serial portion of the program
  - Overhead: forking and joining are not free