Advances and Future Challenges in Binary Translation and Optimization
- Gerald Parks
1 Advances and Future Challenges in Binary Translation and Optimization
Erik R. Altman, Kemal Ebcioğlu, Michael Gschwind (Senior Member, IEEE), and Sumedh Sathaye
Presented by Holly Ferguson
2 Introduction: Can you define these terms?
- Dynamic binary translation: just-in-time (JIT) compilation from the binary code of one architecture to another.
- Dynamic optimization: run-time improvement of code.
- [Figure: Running Program -> Input Code Stream -> Translate and Optimize -> Output Code. The translate-and-optimize step stands in for a regular hardware architecture solution, so the output can instead target a variety of hardware architecture solutions.]
- Keywords: binary translation, compilers, dynamic optimization, ILP, Java, JIT, VLIW
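The translate-and-optimize flow above can be sketched in a few lines of Python. This is only an illustration, not the design of any system discussed here: the three-instruction source "ISA", the `translate_block` and `execute` helpers, and the dictionary used as a translation cache are all assumptions made up for the example. Each block is translated once into Python source, compiled, and then reused on later visits to the same entry point.

```python
# Toy dynamic binary translator (hypothetical ISA, for illustration only).
# Source instructions: ("addi", reg, imm), ("jnz", reg, target), ("halt",).
# A block ends at the first jnz or halt.

def translate_block(program, pc):
    """Emit Python source for the block starting at pc and compile it."""
    lines = ["def block(regs):"]
    while True:
        op = program[pc]
        if op[0] == "addi":
            lines.append(f"    regs[{op[1]!r}] += {op[2]}")
        elif op[0] == "jnz":
            # Return the next source pc: branch target or fall-through.
            lines.append(
                f"    return {op[2]} if regs[{op[1]!r}] != 0 else {pc + 1}")
            break
        else:  # halt
            lines.append("    return None")
            break
        pc += 1
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["block"]

def execute(program, regs):
    tcache = {}        # entry pc -> compiled block (assumed cache layout)
    translations = 0
    pc = 0
    while pc is not None:
        if pc not in tcache:           # translate only on first encounter
            tcache[pc] = translate_block(program, pc)
            translations += 1
        pc = tcache[pc](regs)          # run translated code, get next pc
    return regs, translations

# Count r0 down from 3, adding 10 to r1 on each iteration.
program = [
    ("addi", "r1", 10),
    ("addi", "r0", -1),
    ("jnz", "r0", 0),
    ("halt",),
]
regs, translations = execute(program, {"r0": 3, "r1": 0})
```

The loop body runs three times but is translated only once, which is the reuse a translation cache is meant to provide.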
6 Purpose: What binary translation addresses
- Allows the architecture to become a layer of software.
- As software, it fixes the problems of running legacy software directly.
- Enables optimizations outside existing hardware boundaries.
- Attracts both commercial and research interest.
- Is done automatically at run time, without programmer involvement.
- Saves power, since memory uses less power than logic (for non-superscalar designs).
- The paper analyzes these questions using projects including DAISY, Crusoe, Dynamo, and LaTTe.
- Question for the next slides: why is BT a negative?
7 Negatives: Why else is BT a negative?
BT and the VMM have additional drawbacks:
- IT applications, commonality, virtual IT shops, and so on may mean disruptive behavior.
- VMM debugging is difficult: the target is several times removed from the source, and behavior is nondeterministic.
- Takes memory and resources from the source-architecture machine.
- Takes cycles from source-architecture programs.
- New design territory (2001), so calibration is difficult.
- The start of program execution is slow, since all code is interpreted and translated to target-architecture code.
- Overtaking hardware is a large concern as well.
Terms used on the following slides: Virtual Machine Monitor (VMM); Code Morphing Software (CMS), the Crusoe VMM; Translation Cache (TCache).
8 Negatives: Why else is BT a negative?
Difficult issues in managing BT, which DAISY, Crusoe, Dynamo, and LaTTe focus on:
- Self-modifying code
- Precise exceptions
- Address translation
- Self-referential code
- Management of translations
- Real-time behavior
- Boot and BIOS code
- Reliability and correctness
- Code reuse
BT systems are talked about in terms of (IEEE, Nov. 2001):
- Interpreting versus translating/optimizing
- Emulating a virtual versus a real machine
- Full system versus user mode only
- OS independent versus OS dependent
- Translating to a different versus the same architecture (referred to as source, or legacy, and target architecture)
- Emulating a single versus multiple source architectures
9 Positives: Why is BT a positive?
Attractions to the study of binary translation:
- Pursuit of architecture-independent computing using the run-anywhere object code idea.
- A server farm with a variety of customers needs a variety of architectures.
- A fixed hardware mix gives a static total and determines a static breakdown of machines, which means limited utilization and thus increased cost.
- BT is a solution for better utilization: as a layer of software, it allows dynamic configuration of the many machines in a farm.
- The paper analyzes these questions using projects including DAISY, Crusoe, Dynamo, and LaTTe.
13 Positives: Why else is BT a positive?
- BT permits optimizations under the covers when the user runs the program.
- BT is not limited by, and can cross, boundaries such as indirect calls, function returns, shared libraries, and system calls.
- With BT, intelligence is in software, not hardware. This means smaller chips with higher yield.
- When better algorithms are found, only a software patch is needed to install them and update the VMM.
- A software patch is sufficient to fix a bug in the VMM.
- Bugs such as nonworking opcodes can be worked around by changing the VMM software.
- Legacy binaries whose source code is unavailable can be optimized.
- Translated basic blocks can be laid out contiguously in a natural order, which improves instruction cache performance.
- Future architectural improvements are transparent to the user.
- VLIWs of different sizes and generations remain compatible.
[Figure: Running Program -> Input Code Stream -> Translate and Optimize -> Output Code, in place of a regular hardware architecture solution.]
15 CVM: What is the best solution?
With software convergence, BT and JIT optimizations admit the possibility of a convergence virtual machine (CVM).
JVM (Java Virtual Machine):
- Write-once, run-anywhere model.
- Existing C/C++ applications and operating systems, including Linux, do not run on the JVM.
- For security and safety guarantees, it gives up the universal ability to handle other hardware problems.
CVM:
- Similar to the JVM (same goal), with different tradeoffs.
- Research aims to let the same OS and application object code run on different platforms with a CVM.
- Is universal through JIT compilation and virtual device emulation, with less protection than a modern RISC processor.
"The Internet has recently been changing the software landscape [radically] and has been implicitly encouraging write-once, run-anywhere software and interoperability of different hardware platforms, as exemplified by the recent popularity of technologies such as XML, Simple Object Access Protocol (SOAP) [17], and Java." (Altman et al.)
16 CVM: What are future implications?
[Figure: two software stacks, listed top to bottom.]
- Linux and apps under CVM: Linux apps compiled for CVM; Linux compiled for CVM; CVM for PowerPC or x86; PowerPC or x86 hardware.
- Linux and apps under PowerPC/x86: Linux apps compiled for PowerPC or x86; Linux compiled for PowerPC (a different build for x86); PowerPC or x86 hardware.
- Using a CVM and object code in XML format, an OS can be booted from the web.
18 DAISY, Crusoe, Dynamo, LaTTe
19 DAISY: IBM DAISY (source architecture: PowerPC)
[Figure: DAISY system block diagram with the DAISY VLIW processor, L3 cache, 6xx bus, memory controller, memory, PCI bus, DAISY flash ROM, PowerPC flash ROM, disk, video, network, and keyboard.]
- Most DAISY work uses PowerPC as the source architecture.
- By allowing operations from multiple paths in its groups, DAISY is not dependent on good branch prediction, and indeed makes forward progress along all possible paths.
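20 DAISY: IBM DAISY (source architecture: PowerPC)
[Figure: DAISY translation flowchart. Recoverable steps: interpret an operation and add it to group X; at a stopping point, set Y = X and start a new group X; if the entry point was previously translated, execute the group's VLIW translation; once an entry point has been interpreted 30 times, translate and schedule group Y into VLIW instructions.]
[Figure, continued: stopping-point tests labeled "frequently executed code", "> 24 ops", "ILP > 3", "ILP > 10", "> 180" (unit garbled in the transcript), and "at a good stopping point". A good stopping point is, for example, a loopback.]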
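One reading of the flowchart is that an entry point is interpreted until it has been seen often enough, and only then translated. The sketch below illustrates just that policy, assuming the "30x" label is the interpretation threshold. The class and method names are hypothetical, and group formation and VLIW scheduling are left out entirely.

```python
from collections import Counter

class HotEntryPolicy:
    """Decide per entry point whether to interpret, translate, or reuse."""

    def __init__(self, threshold=30):
        self.threshold = threshold       # assumed from the "30x" label
        self.interp_counts = Counter()   # entry point -> times interpreted
        self.translated = set()          # entry points that have a translation

    def on_entry(self, entry_point):
        if entry_point in self.translated:
            return "execute_translation"
        if self.interp_counts[entry_point] >= self.threshold:
            self.translated.add(entry_point)
            return "translate"
        self.interp_counts[entry_point] += 1
        return "interpret"

policy = HotEntryPolicy(threshold=30)
actions = [policy.on_entry(0x1000) for _ in range(32)]
```

With a threshold of 30, the first 30 visits are interpreted, the next visit triggers translation, and every later visit runs the translation. Code that is rarely executed never pays the translation cost.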
21 DAISY: IBM DAISY (source architecture: PowerPC)
- DAISY renaming: scratchpad registers r36 to r63.
- Load-store (LS) telescoping allows dependence chains to be significantly shortened.
22 CRUSOE: Transmeta Crusoe (source architecture: x86)
[Figure: software stack, top to bottom: x86 applications; Windows, Linux, BIOS; Code Morphing software; Crusoe hardware.]
[Figure: a 128-bit molecule holds up to 4 atoms (ops), for example FADD to the floating-point unit, ADD to the integer ALU, LD to the load/store unit, and BRCC to the branch unit. The pipeline is in order.]
[Figure: Crusoe working registers and shadow registers hold the x86 architectural state; the Crusoe processor block diagram shows 64 registers.]
- When a translated group completes, the contents of all working x86 registers are copied at once to the shadow register set. This shadow copy allows Crusoe to recover efficiently from exceptions.
- Performance: DAISY reaches 3 to 4 PowerPC instructions per cycle; Transmeta claims the 667-MHz Crusoe TM5400 is equivalent to a 500-MHz Pentium III.
- Different intended use: DAISY targets big machines, Crusoe targets small and mobile machines.
- Total memory: gigabytes for DAISY, megabytes for Crusoe.
- Memory for the translator itself and its translations: 100 MB+ for DAISY, 16 MB for Crusoe.
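The working/shadow register scheme can be illustrated with a small software model. This is a sketch under stated assumptions, not Transmeta's implementation: the `ShadowedRegisters` class, the `run_group` helper, and the use of a Python exception to stand in for a hardware fault are all hypothetical.

```python
class ShadowedRegisters:
    def __init__(self, initial):
        self.working = dict(initial)   # updated speculatively by translated code
        self.shadow = dict(initial)    # last committed x86 state

    def commit(self):
        # Copy all working registers to the shadow set at once.
        self.shadow = dict(self.working)

    def rollback(self):
        # Discard speculative updates and restore the committed state.
        self.working = dict(self.shadow)

def run_group(regs, group):
    """Run one translated group; commit on success, roll back on a fault."""
    try:
        group(regs.working)
    except ArithmeticError:
        regs.rollback()
        return False
    regs.commit()
    return True

def good_group(r):
    r["eax"] += 1
    r["ebx"] = r["eax"] * 2

def faulting_group(r):
    r["eax"] = 99                # speculative update that must be undone
    r["ebx"] = r["eax"] // 0     # raises ZeroDivisionError

regs = ShadowedRegisters({"eax": 1, "ebx": 0})
ok_first = run_group(regs, good_group)
ok_second = run_group(regs, faulting_group)
```

After the faulting group, the working registers again hold the state committed by the first group, so execution can resume from a precise point.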
23 CRUSOE: Transmeta Crusoe (source architecture: x86)
- 8 KB of on-chip local memory for data and 8 KB for instructions, so that CMS can remain available for quick access without disturbing x86 code and data.
- 16-way associativity of the L1 DCache minimizes conflicts between x86 code data and the data used by CMS.
- The memory controller is integrated into the TM5400 because it is part of the standard PC architecture.
- Unlike DAISY, Crusoe has optimizations such as strength reduction and aggressive dead code elimination.
- The TM5400 has fewer transistors (7 million) than AMD and Intel microprocessors, showing that BT allows a reduction in hardware complexity.
[Figure: TM5400 Crusoe chip, with labels "Alias", "Butler", and "AB".]
[Figure: hardware aliasing check. ldp (X) is a speculative load that records an address and size; stam (Y) is a store under alias mask. If no alias is detected, execution continues; if one is detected, an exception is raised to the Crusoe VMM.]
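The alias check described in the figure can be modeled in software. The sketch below is an assumption-laden illustration: it borrows the `ldp` and `stam` names from the slide, but the `AliasHardware` class, the list of protected ranges, and the `AliasException` type are hypothetical and say nothing about how the real hardware stores or clears its entries.

```python
class AliasException(Exception):
    """Raised when a checked store overlaps a protected load."""

class AliasHardware:
    def __init__(self):
        self.protected = []   # (addr, size) ranges recorded by ldp

    def ldp(self, memory, addr, size=4):
        """Speculative load: return the bytes and protect their range."""
        self.protected.append((addr, size))
        return bytes(memory[addr:addr + size])

    def stam(self, memory, addr, data):
        """Store that first checks for overlap with protected ranges."""
        end = addr + len(data)
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < end:
                # In the figure, this case raises an exception to the VMM.
                raise AliasException(hex(addr))
        memory[addr:end] = data

memory = bytearray(64)
hw = AliasHardware()
value = hw.ldp(memory, 16)                    # load hoisted above later stores
hw.stam(memory, 32, b"\x01\x02\x03\x04")      # no overlap: execution continues
try:
    hw.stam(memory, 18, b"\xff\xff")          # overlaps protected bytes 16..19
    aliased = False
except AliasException:
    aliased = True
```

The point of the check is that a translator may reorder a load above a store speculatively, and rely on the exception path to recover in the rare case where the two addresses really overlap.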
24 DYNAMO: HP Dynamo
[Figure: software stack, top to bottom: HPUX applications; Dynamo software; HPUX; PA-RISC.]
- Dynamo runs in the virtual address space of a single HPUX process. This is problematic where it exposes Dynamo to the application, which makes it less transparent.
- Potential advantage: Dynamo's translations last for only one program invocation, whereas DAISY and Crusoe translations last over many invocations of a program.
- Potential advantage: because it runs in the virtual address space of a single HPUX process, Dynamo has no need to deal with address translation or with groups spanning multiple pages. This reduces the number of synchronous exceptions that must be handled correctly and allows more aggressive code optimizations.
26 DYNAMO: HP Dynamo
Some of Dynamo's optimizations:
- Copy propagation
- Constant propagation
- Strength reduction
- Loop-invariant code motion
- Loop unrolling
DAISY versus Dynamo:
- DAISY translates groups as paths or trees.
- Dynamo's groups are paths, never trees or other forms. This limits code explosion at the cost of losses in exploiting parallelism.
- If the overhead cost is greater than what the optimization is helping, Dynamo bails out and executes the original code directly instead of the translation.
- If the overhead cost is less than what the optimization is helping, Dynamo continues with the VMM.
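The bail-out rule can be written down as a small bookkeeping sketch. Everything here is hypothetical: the `BailOutMonitor` class, the cycle counts, and the idea of comparing running totals are illustrative assumptions, not how Dynamo actually measures overhead or benefit.

```python
class BailOutMonitor:
    """Track optimization overhead against the savings it produces."""

    def __init__(self):
        self.overhead = 0   # cycles spent selecting and optimizing groups
        self.savings = 0    # cycles saved by running optimized groups

    def record(self, overhead=0, savings=0):
        self.overhead += overhead
        self.savings += savings

    def decision(self):
        # Overhead larger than the benefit: run the original code directly.
        if self.overhead > self.savings:
            return "run_original_code"
        return "continue_optimizing"

monitor = BailOutMonitor()
monitor.record(overhead=5000, savings=1200)   # optimization not paying off yet
early = monitor.decision()
monitor.record(overhead=500, savings=9000)    # hot path now dominates
late = monitor.decision()
```

Because Dynamo optimizes code for the same architecture it runs on, falling back to the original binary is always available as an option.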
28 LaTTe: IBM LaTTe (target architecture: SPARC)
Java JIT compilers, such as LaTTe, use dynamic translation and optimization to move from Java Virtual Machine code to RISC code. [3]
[Figure: Bytecode -> CFG of pseudo SPARC code -> CFG of real SPARC code -> native SPARC code.]
- Bytecode translation: the Java stack is mapped to symbolic registers.
- Register allocation and optimizations: symbolic registers are allocated to machine registers.
- Code emission: the binary image is generated from the CFG, which determines the locations of basic blocks.
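The first step, mapping the Java operand stack to symbolic registers, can be sketched as follows. This is a simplified illustration and not LaTTe's algorithm: the `bytecode_to_registers` function is hypothetical, it handles only a handful of integer opcodes in straight-line code, and it names the symbolic register for stack slot i as `s<i>`.

```python
def bytecode_to_registers(bytecode):
    """Map JVM-style stack operations to three-address code.

    Stack slot i becomes symbolic register s<i>, so pushes and pops
    disappear and every operation names its operands explicitly.
    """
    code = []
    depth = 0   # current operand-stack depth
    for op, *args in bytecode:
        if op == "iload":        # push a local variable
            code.append(f"s{depth} = local{args[0]}")
            depth += 1
        elif op == "iconst":     # push a constant
            code.append(f"s{depth} = {args[0]}")
            depth += 1
        elif op in ("iadd", "imul"):
            sign = "+" if op == "iadd" else "*"
            code.append(f"s{depth - 2} = s{depth - 2} {sign} s{depth - 1}")
            depth -= 1
        elif op == "istore":     # pop into a local variable
            depth -= 1
            code.append(f"local{args[0]} = s{depth}")
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return code

# local2 = (local0 + local1) * 4
result = bytecode_to_registers([
    ("iload", 0), ("iload", 1), ("iadd",),
    ("iconst", 4), ("imul",), ("istore", 2),
])
```

A later register-allocation pass would then assign machine registers to `s0`, `s1`, and so on, which is the second step listed above.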
29 LaTTe: IBM LaTTe (target architecture: SPARC)
LaTTe and Java performance advantages over BT for traditional architectures:
- Lightweight monitor optimized for single-threaded programs.
- Efficient exception handling: instead of inserting check code, it uses hardware-generated signals to detect exceptions such as out-of-bounds memory accesses.
- Efficient garbage collection, memory management, and sophisticated JIT compilation techniques.
- Converts virtual method calls to direct method calls, or inlines them by including a specific check (a conditional branch) for the most frequently occurring method invoked from a particular call site.
- Register allocation converts the JVM's stack-based model, with push and pop operations, to the register-based model used in RISC machines such as SPARC.
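The guarded inlining of virtual calls can be shown at source level. The sketch below is only an analogy in Python, assuming `Square` is the most frequent receiver at the call site; the class names and both functions are hypothetical, and a JIT would perform this rewrite on compiled code rather than on source.

```python
class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

class Circle(Shape):
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r   # crude pi keeps the arithmetic exact

def total_area_virtual(shapes):
    """Baseline: every call is dispatched dynamically."""
    return sum(s.area() for s in shapes)

def total_area_guarded(shapes):
    """Same result, with Square.area inlined behind a class check."""
    total = 0
    for s in shapes:
        if type(s) is Square:          # guard for the most frequent receiver
            total += s.side * s.side   # inlined body of Square.area
        else:
            total += s.area()          # fall back to the virtual call
    return total

shapes = [Square(2), Square(3), Circle(1), Square(4)]
virtual_total = total_area_virtual(shapes)
guarded_total = total_area_guarded(shapes)
```

The exact-type test matters: a subclass of `Square` that overrides `area` fails the guard and takes the fallback path, so the inlined body is never applied to the wrong method.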
30 LaTTe: IBM LaTTe (target architecture: SPARC)
LaTTe and Java performance advantages over BT for traditional architectures:
- Difference: LaTTe uses BT to improve VM performance and, in a way, to mitigate the gap between the source architecture (the JVM) and the underlying target architecture.
- Tree regions are its optimization unit. LaTTe seeks tree groups in the input code, instead of the output.
- Two-pass register allocation: first a backward sweep conveys upward the information for all symbolic registers live at the group exits, then a forward sweep performs the actual register allocation based on hints from the backward sweep.
[Figure: three control flow graphs over blocks A, B, C, and D: (a) original control flow graph, (b) CFG of DAISY-transformed code, (c) CFG of LaTTe-transformed code.]
31 Afterthought: Additional questions asked
1. Can a binary translation machine have generally better performance than a well-designed superscalar?
2. Can all real-time problems be avoided?
3. What memory management schemes are best over a wide range of TCache sizes?
4. In full-system BT, how can the amount of VMM memory change after startup?
5. If the VMM gives memory back to the system because of an unusually large working set for the TCache, it has no way to transparently steal the memory from the OS running above it.
6. Should the target architecture ever be exposed for users to access directly, bypassing the source architecture layer and translation by the VMM?
Important considerations:
- Operation semantics
- Data formats (condition codes, floating point, etc.)
- Address translation
- Special-purpose registers
- More registers than the source architecture (for the VMM scratchpad)
32 Recent Work: FPGA / Warp Processing (2008)
- Using desktop, server, and scientific-computing applications, results gave similar speedups.
- For example, compared to a four-processor 400-MHz ARM11 system, warp processing obtained average speedups of 169X.
33 Recent Work: MTCrossBit (2011, from CrossBit)
- A dynamic binary translation (DBT) system based on multithreaded optimization.
- Existing DBT techniques are built for a single-threaded execution environment and increase hardware complexity or run-time overhead.
- MTCrossBit is a multithreaded DBT framework with no associated hardware. It uses a helper thread to build hot traces, which reduces overhead.
- The main and helper threads run on different cores to use multi-core resources efficiently.
- Two methods: (1) dual-special-parallel translation caches and (2) a new lock-free thread communication mechanism, assembly language communication (ASLC).
- Supported guest platforms include SimpleScalar, IA32, MIPS, and SPARC. The IA32 host platform is fully supported, along with PowerPC and others.
34 Recent Work: MTCrossBit (2011, framework)
- MTCrossBit builds hot traces concurrently in the helper thread to boost performance, as opposed to sequential execution in a single-threaded execution environment.
- However, a multithreaded framework has unavoidable problems such as mutual exclusion, access to translated basic blocks, and communication between threads.
35 Recent Work: CPU framework into GPUs ()
- Overhead has always been an issue with DBT.
- A GPU (a multi-core processor used as a co-processor) can execute the hot spots of binary code in parallel, which can reduce the overhead of DBT.
- One solution is to construct a virtual execution environment that accelerates DBT on CPU/GPU-based architectures.
- Using the hot spots of binary code and their related information, the framework converts the sequential code into PTX form and executes it on GPUs.
- There is no need to rewrite the source code, and binary compatibility issues between different GPUs are also resolved.
- This usually gives a 10X speedup compared to x86 native platforms, and more with larger inputs.
36 Recent Work: CPU framework into GPUs ()
[Figure: workflow of GXBit, first and second execution phases (extracts hot spots and converts them to GPU form).]
[Figure: the translation framework.]
37 Bibliography: Sources consulted
[1] Altman et al., "Advances and Future Challenges in Binary Translation and Optimization," Proceedings of the IEEE, vol. 89, no. 11, November.
[2] K. Diefendorff, "Power4 focuses on memory bandwidth," Microprocessor Rep., vol. 13, October.
[3] "Dynamic Binary Translation and Optimization," December 2000: < micro33/tutorial/tutorial.html>, <
[4] F. Vahid, G. Stitt, R. Lysecky et al., "Warp Processing: Dynamic Translation of Binaries to FPGA Circuits," Computer, vol. 41, issue 7.
[5] Guan HaiBing et al., "MTCrossBit: A dynamic binary translation system based on multithreaded optimization," Science China, vol. 54, no. 10, October.
[6] Erzhou Zhu, Haibing Guan, Guoxing Dong, Yindong Yang, Hongbo Yang, "A Translation Framework for Executing the Sequential Binary Code on CPU/GPU Based Architectures," Journal of Software, vol. 6, no. 12, December.
38 Questions/Comments?
Outline: Introduction, Purpose, Negatives, Positives, CVM, DAISY, Crusoe, Dynamo, LaTTe, Afterthought, Recent Work, Bibliography
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationArchitectural Approaches for Dynamic Translation and Reconfiguration
Architectural Approaches for Dynamic Translation and Reconfiguration Brian F. Veale *, John K. Antonio *, and Monte P. Tull * School of Computer Science School of Electrical and Computer Engineering University
More informationJust-In-Time Compilers & Runtime Optimizers
COMP 412 FALL 2017 Just-In-Time Compilers & Runtime Optimizers Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved.
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 16: Virtual Machine Monitors Geoffrey M. Voelker Virtual Machine Monitors 2 Virtual Machine Monitors Virtual Machine Monitors (VMMs) are a hot
More informationLec 25: Parallel Processors. Announcements
Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza
More informationEC 413 Computer Organization
EC 413 Computer Organization Review I Prof. Michel A. Kinsy Computing: The Art of Abstraction Application Algorithm Programming Language Operating System/Virtual Machine Instruction Set Architecture (ISA)
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationAdvanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University
Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationCHAPTER 5 A Closer Look at Instruction Set Architectures
CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 199 5.2 Instruction Formats 199 5.2.1 Design Decisions for Instruction Sets 200 5.2.2 Little versus Big Endian 201 5.2.3 Internal
More informationVirtualization. Dr. Yingwu Zhu
Virtualization Dr. Yingwu Zhu Virtualization Definition Framework or methodology of dividing the resources of a computer into multiple execution environments. Types Platform Virtualization: Simulate a
More informationNew Challenges in Microarchitecture and Compiler Design
New Challenges in Microarchitecture and Compiler Design Contributors: Jesse Fang Tin-Fook Ngai Fred Pollack Intel Fellow Director of Microprocessor Research Labs Intel Corporation fred.pollack@intel.com
More informationEvolution of Computers & Microprocessors. Dr. Cahit Karakuş
Evolution of Computers & Microprocessors Dr. Cahit Karakuş Evolution of Computers First generation (1939-1954) - vacuum tube IBM 650, 1954 Evolution of Computers Second generation (1954-1959) - transistor
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationEECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)
Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static
More informationCOMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy
COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More informationComputer Architecture Area Fall 2009 PhD Qualifier Exam October 20 th 2008
Computer Architecture Area Fall 2009 PhD Qualifier Exam October 20 th 2008 This exam has nine (9) problems. You should submit your answers to six (6) of these nine problems. You should not submit answers
More informationUntyped Memory in the Java Virtual Machine
Untyped Memory in the Java Virtual Machine Andreas Gal and Michael Franz University of California, Irvine {gal,franz}@uci.edu Christian W. Probst Technical University of Denmark probst@imm.dtu.dk July
More informationCSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics
CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics Computing system: performance, speedup, performance/cost Origins and benefits of scalar instruction pipelines and caches
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationRISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.
COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped
More informationThe Slide does not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams.
The Slide does not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams. Operating System Services User Operating System Interface
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More information55:132/22C:160, HPCA Spring 2011
55:132/22C:160, HPCA Spring 2011 Second Lecture Slide Set Instruction Set Architecture Instruction Set Architecture ISA, the boundary between software and hardware Specifies the logical machine that is
More informationA Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning By: Roman Lysecky and Frank Vahid Presented By: Anton Kiriwas Disclaimer This specific
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationHigh-Level Language VMs
High-Level Language VMs Outline Motivation What is the need for HLL VMs? How are these different from System or Process VMs? Approach to HLL VMs Evolutionary history Pascal P-code Object oriented HLL VMs
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationHigh-Performance Processors Design Choices
High-Performance Processors Design Choices Ramon Canal PD Fall 2013 1 High-Performance Processors Design Choices 1 Motivation 2 Multiprocessors 3 Multithreading 4 VLIW 2 Motivation Multiprocessors Outline
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationRun-time Program Management. Hwansoo Han
Run-time Program Management Hwansoo Han Run-time System Run-time system refers to Set of libraries needed for correct operation of language implementation Some parts obtain all the information from subroutine
More informationLecture 4: Instruction Set Architecture
Lecture 4: Instruction Set Architecture ISA types, register usage, memory addressing, endian and alignment, quantitative evaluation Reading: Textbook (5 th edition) Appendix A Appendix B (4 th edition)
More informationTwo hours. No special instructions. UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date. Time
Two hours No special instructions. UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE System Architecture Date Time Please answer any THREE Questions from the FOUR questions provided Use a SEPARATE answerbook
More informationComputer Architecture. Fall Dongkun Shin, SKKU
Computer Architecture Fall 2018 1 Syllabus Instructors: Dongkun Shin Office : Room 85470 E-mail : dongkun@skku.edu Office Hours: Wed. 15:00-17:30 or by appointment Lecture notes nyx.skku.ac.kr Courses
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationLatches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter
IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more
More informationEEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)
1 EEC 581 Computer Architecture Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW) Chansu Yu Electrical and Computer Engineering Cleveland State University Overview
More informationArchitectural Support for Operating Systems
Architectural Support for Operating Systems Today Computer system overview Next time OS components & structure Computer architecture and OS OS is intimately tied to the hardware it runs on The OS design
More informationLecture 9: Multiple Issue (Superscalar and VLIW)
Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationAssembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam
Assembly Language Lecture 2 - x86 Processor Architecture Ahmed Sallam Introduction to the course Outcomes of Lecture 1 Always check the course website Don t forget the deadline rule!! Motivations for studying
More informationBEAMJIT: An LLVM based just-in-time compiler for Erlang. Frej Drejhammar
BEAMJIT: An LLVM based just-in-time compiler for Erlang Frej Drejhammar 140407 Who am I? Senior researcher at the Swedish Institute of Computer Science (SICS) working on programming languages,
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More informationInteraction of JVM with x86, Sparc and MIPS
Interaction of JVM with x86, Sparc and MIPS Sasikanth Avancha, Dipanjan Chakraborty, Dhiral Gada, Tapan Kamdar {savanc1, dchakr1, dgada1, kamdar}@cs.umbc.edu Department of Computer Science and Electrical
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationChapter 2. OS Overview
Operating System Chapter 2. OS Overview Lynn Choi School of Electrical Engineering Class Information Lecturer Prof. Lynn Choi, School of Electrical Eng. Phone: 3290-3249, Kong-Hak-Kwan 411, lchoi@korea.ac.kr,
More informationLast class: OS and Architecture. OS and Computer Architecture
Last class: OS and Architecture OS and Computer Architecture OS Service Protection Interrupts System Calls IO Scheduling Synchronization Virtual Memory Hardware Support Kernel/User Mode Protected Instructions
More informationLast class: OS and Architecture. Chapter 3: Operating-System Structures. OS and Computer Architecture. Common System Components
Last class: OS and Architecture Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation
More information