ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang

Abstract

Binary translation is an important component of translating virtual machines. ABI virtual machines such as FX!32 [1], SUN Wabi [2], and SHADE [3] use binary translators to translate application binaries whose ISA differs from that of the hardware platform, so that they can execute on that platform. Some systems also use a binary translator as a component of dynamic optimization. In this paper, we study the effectiveness of binary translation between two ISAs with different register file organizations. We chose the MIPS instruction set, which has a flat register file, as the source ISA for its simplicity and generality, and the CRAY-2 instruction set, which has a hierarchical register file, as the target ISA.

1. Introduction

In virtual machine architecture design, emulation is an important technique for making a (sub)system present the same interface and characteristics as another system. By executing and tracing programs, an emulator can help build better computer hardware and software. As one way of implementing an emulator, binary translation is not only efficient for repeatedly executed instructions but also solves the software compatibility problem, which is becoming increasingly serious. For example, one of the major impediments to adopting a VLIW (or any new ILP machine architecture) has been its inability to run existing binaries of established architectures. Code scheduling plays an important role in increasing the ILP available in a program, but aggressive scheduling techniques have high register requirements. In addition, aggressive processor configurations tend to increase the number of registers required by software-pipelined loops.
The flat register file organization traditionally used in microprocessor design does not scale well when the required capacity and the number of access ports are high. The authors of [4] present an alternative register file design for future aggressive VLIW processors: a two-level hierarchical register file, which combines high capacity and a high number of access ports with low access time. Higher capacity reduces spill code and allows the compiler to perform aggressive code scheduling and optimization. Our goal is to develop a binary translator that maps a flat register file configuration to a hierarchical register file configuration, and to characterize the requirements for translating between the two platforms.

This paper is organized as follows. The next section discusses our implementation of the project's major modules. Section 3 presents our experimental setup and results. Section 4 discusses related work, and finally we present our conclusions and future work in Section 5.

2. Methodology

2.1 Overview

Figure 1 shows an overview of the components we developed for this project.

- The compiler that compiles Java or C benchmarks to MIPS assembly code was already available. In this project we use a simple Java compiler that can compile limited Java programs to MIPS assembly code. The language features it supports include integer operations, logic operations, function calls (including recursive calls), variable comparison, string variables, result output (printf), variable assignment, branches (if...else), and while loops. Although it supports only a subset of Java or C, this is sufficient for us because our focus is on binary translation.
- We developed simple in-order simulators for MIPS and CRAY-2 to obtain profile information, as well as statistics for comparing the two platforms. These simulators also serve as correctness verification.
- The parser assembles the MIPS assembly code into binary format and writes the resulting binary to a file.
- The translator translates the input MIPS binary into a CRAY-2 binary file and performs optimizations during translation. The translator is the main emphasis of this project and is discussed in detail below.
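The parser's assembly-to-binary step can be illustrated with a minimal sketch of MIPS R-type encoding. The field layout and FUNCT values below are the standard MIPS32 ones; the function name and the restriction to a few opcodes are ours.

```python
# A minimal sketch of the parser's encoding step for MIPS R-type
# instructions. Standard MIPS32 field layout:
# opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6).

FUNCT = {"add": 0x20, "sub": 0x22, "and": 0x24, "or": 0x25, "slt": 0x2A}

def encode_rtype(op, rd, rs, rt):
    """Encode one R-type instruction into a 4-byte word (opcode 0)."""
    return (0 << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (0 << 6) | FUNCT[op]

# add $3, $1, $2
print(hex(encode_rtype("add", 3, 1, 2)))  # prints 0x221820
```

The translator can then decode the same fixed-width fields back out of each 4-byte word, which is what makes a static MIPS-side scan straightforward.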
Figure 1: Overview of the components in the MIPS to CRAY-2 binary translator

2.2 Instruction sets

MIPS instructions have a fixed length of 4 bytes. MIPS has 34 registers: besides the 32 general-purpose registers, LO and HI are special registers that hold the 64-bit results of multiplication and division instructions. The CRAY-2 instruction set has variable-length instructions and uses word addressing. It has 8 first-level registers and 64 second-level registers. Due to time constraints, we implement only a subset of the MIPS and CRAY-2 instruction sets, and only user-level instructions. Appendix A gives detailed information about our source and target instruction sets.

2.3 Binary translator

Our translator is a static translator with three levels. Level 1 is a baseline translator with no optimization. The level 2 translator enhances the baseline by constructing basic blocks and adding a register allocation technique to reduce spills to the second-level registers. The level 3 translator enhances level 2 by constructing superblocks and applying several superblock optimization techniques. Appendix B gives detailed information on the mapping from MIPS instructions to CRAY-2 instructions.

Level 1 translator. Since the CRAY-2 has more second-level registers than MIPS has registers, all MIPS registers are identity-mapped to CRAY-2 second-level registers. A second-level register is loaded into a first-level register whenever it is used, and the result is written back to a second-level register as soon as it is generated. The MIPS instructions are scanned and converted into CRAY-2 instructions; the resulting CRAY-2 instructions are then scanned again to fix the targets of all control-flow instructions. This scheme lets us translate statically, without any runtime information, but not very efficiently.

Level 2 translator. We construct control flow graphs of basic blocks, each containing several MIPS instructions. A new basic block is started whenever we encounter a branch or direct jump whose target block has not yet been constructed. A new graph is started whenever the translator encounters a jump-and-link instruction (i.e., a call). Graph construction terminates at an indirect jump (i.e., a return), at a direct jump or branch whose targets have already been constructed, or at the end of the program. After all graphs are constructed, a graph-coloring algorithm allocates the MIPS registers to the 8 first-level registers with minimum register spills. The translator then walks the graphs, converts MIPS instructions to CRAY-2 instructions, and adjusts the targets of all control-flow instructions.
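The level 1 scheme above can be sketched as follows, assuming a toy three-address form. MIPS register i maps identically to second-level register Bi; every use goes through a first-level register (A0-A2 here, chosen arbitrarily). The `A`/`B` names and `<-` moves are illustrative placeholders, not real CRAY-2 mnemonics.

```python
# A minimal sketch of the naive (level 1) translation: each MIPS ALU
# instruction becomes spill-in moves, a compute step in first-level
# registers, and a write-back to the second level. The textual output
# format is a placeholder for real CRAY-2 encodings.

def translate_level1(mips_instr):
    """Expand one MIPS ALU instruction into a CRAY-2-style sequence."""
    op, rd, rs, rt = mips_instr            # e.g. ("add", 8, 9, 10)
    return [
        f"A0 <- B{rs}",                    # load operand: 2nd level -> 1st level
        f"A1 <- B{rt}",
        f"A2 <- A0 {op} A1",               # compute in first-level registers
        f"B{rd} <- A2",                    # write result back to 2nd level
    ]

seq = translate_level1(("add", 8, 9, 10))
print(seq)  # four CRAY-2-style instructions for one MIPS add
```

An instruction with an immediate operand (e.g. ADDI) would need one further step, loading the constant into a register first, since CRAY-2 arithmetic operations take only register operands. This 4x expansion per ALU instruction is exactly the inter-level move overhead that the level 2 allocator removes.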
Because we translate the binary code statically and assume that indirect jumps occur only at function returns, we do not need to handle them specially.

Level 3 translator. After transforming the MIPS instructions into graphs of basic blocks and applying the graph-coloring algorithm, superblocks [5] are generated from the flow graph of basic blocks and the edge profile information obtained by the simulator. We generate superblocks conservatively; that is, we do not duplicate a basic block when the downstream block is the target of more than one block. Beyond this, the criterion for enlarging a superblock is that the ratio of execution frequencies between its two paths exceeds 2.0. If both the conservative condition and the ratio requirement are met, we enlarge the superblock along the path with the higher execution frequency. An unconditional branch is removed and its target block merged directly when the conservative condition is met. After the superblocks are generated, the translator walks the graph of superblocks, now with a larger scope, and converts MIPS instructions to CRAY-2 instructions. Many code optimization techniques can then be applied to the CRAY-2 superblocks. As our translator is static, a couple of optimizations such as code sinking, peephole optimization, and dead code elimination are applied at this level. Figure 2 summarizes the relationships between the three translation levels: level 1 is the naive translator, level 2 adds graph-coloring register allocation over basic blocks, and level 3 adds superblock formation with code sinking, dead code elimination, and peephole optimization, all producing a binary CRAY-2 sequence.

Figure 2: Translator diagram

As compared in Section 3, each level gives different performance and tradeoffs.

2.4 Simulator

For a more precise comparison between the code before and after translation, we constructed two simple simulators, one for MIPS and one for CRAY-2. Each simulator supports only in-order execution with a single functional unit. The simulators can generate edge profile information along with other statistics such as the total number of instructions simulated, IPC, the total number of bytes of instructions
simulated, the total number of loads and stores, and the total number of inter- and intra-level register moves. These simulators also serve as correctness verification.

3. Experimental setup and results

Since we translate only a subset of the MIPS instruction set, we use a benchmark generated by our compiler. This benchmark contains approximately 1850 MIPS assembly instructions and performs several different tasks, e.g., a recursive Fibonacci algorithm, a dot product, a tight mathematical loop, and a nested loop. The benchmark runs for more than one million cycles before finishing, which is sufficient to evaluate our translator at each level. The register configurations used for both simulators are as follows.

                                                MIPS simulator   CRAY-2 simulator
  Number of first-level registers               32               8
  Number of second-level registers              0                64
  Latency of memory and of 1st- and
  2nd-level registers (cycles)                  1                1

Table 1: Register configuration for both simulators

We generate the MIPS binary of the benchmark, translate it with each translator level, and run the result on the simulator, comparing the standard output to verify correctness. The following figures summarize our experimental results.

Figure 4 shows the code size of the binary files. The code size increases after translation, for two reasons. First, even though most CRAY-2 instructions occupy only 2 bytes while every MIPS instruction occupies 4 bytes, the most frequently used MIPS instructions, such as ADDI, ORI, LUI, LW, and SW, require more than one CRAY-2 instruction to emulate. An important drawback of the CRAY-2 instruction set is that arithmetic and memory operations take only register operands and cannot operate on immediate values directly. To translate such instructions we therefore need at least one additional instruction to load the immediate value into a temporary register; this is a 4-byte instruction and thus increases the code size. Also, some MIPS instructions have no directly corresponding CRAY-2 instruction, so we need more than one instruction to achieve the same result. Second, the small number of first-level registers in the CRAY-2 architecture introduces overhead, because extra inter-level register move instructions are required, each consuming 4 bytes. Therefore the code and byte-code expansion are quite high.

Figure 4: Comparison of code size in bytes of the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

From Figure 4, the code generated by the naive (level 1) translator is about 4 times the source code size, mainly because of the huge number of data transfers between the two register levels. This expansion can be reduced sharply by graph-coloring register allocation, which exploits the first-level registers: in our benchmark the code size shrinks by a factor of about 2.2 at optimization level 2. Optimization level 3 gives slightly smaller code than level 2 due to the removal of some redundant code.

Figure 5 shows the simulation cycle results. Because simulation cycles are proportional to execution time at a fixed clock frequency, they are an ideal metric for performance. From Figure 5 we can see that
the performance of the code generated by the naive translator degrades by approximately 4x relative to the MIPS code, due to translation overhead. With optimization level 2, performance improves substantially due to the reduction in inter-level register move instructions, as shown in Figures 6 and 7.

Figure 5: Comparison of simulation cycles for the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

The optimizations applied at level 3 do not yield large improvements in code size or execution time, for two reasons. (1) The superblocks we form are too conservative, which limits their size and the scope for optimization; this is also why the code size does not increase after superblock formation. (2) Optimization schemes such as peephole optimization, dead code elimination, and code sinking are not powerful enough to reduce both size and execution time. Because peephole optimization has already been performed on the MIPS code, translation introduces few new opportunities for it. Code sinking does not reduce code size, though it can reduce execution time. We believe inlining would help performance considerably, but due to time limitations we did not implement or evaluate it.

Figures 6 and 7 show the experimental results for register reads and writes. Because MIPS has only one level of registers, we compare the accesses to the first-level registers in the CRAY-2 with the register accesses in MIPS. Figure 6 shows a trend for register reads and writes similar to that in Figures 4 and 5. The graph-coloring register allocation in the level 2 translation effectively reduces register reads and writes. However, as mentioned before, a fair number of CRAY-2 instructions cannot operate on immediate values while the corresponding MIPS instructions can, so some increase in register accesses is inevitable.

Figure 6: Comparison of the number of first-level register reads and writes for the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

Figure 7 shows the accesses to the second-level registers for the three translation levels. The drop in accesses after graph coloring is expected, since the algorithm effectively allocates the first-level registers.

Figure 8 shows the memory load and store results. The numbers change little, except at level 3, where some reductions come from eliminating redundant load and store instructions and from moving some loads and stores to off-trace blocks; the improvement is minor. Note that the number of memory operations does not increase after translation, because we assume the same bus bandwidth, so one CRAY-2 memory operation (load or store) achieves the same effect as one MIPS memory operation. On the other hand, because we assume memory loads and stores have only one-cycle latency, reducing them does not affect total execution time much.

Figure 7: Comparison of the number of second-level register reads and writes for the binaries generated by the level 1, level 2, and level 3 translators.
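The level 2 reduction in register traffic seen in Figures 6 and 7 comes from graph-coloring allocation onto the 8 first-level registers. A minimal greedy sketch (not the exact algorithm we used, whose details are omitted here; node names and edges are illustrative):

```python
# A hedged sketch of graph-coloring register allocation with k colors:
# virtual registers that are live at the same time interfere and must
# receive different first-level registers; uncolorable nodes are spilled
# (i.e., kept in second-level registers). A simple greedy ordering is
# used in place of the classic Chaitin simplify/select phases.

def color_registers(nodes, interferes, k=8):
    """Assign each virtual register a color in 0..k-1 or spill it."""
    color, spilled = {}, []
    # Color the most-constrained (highest-degree) nodes first.
    for n in sorted(nodes, key=lambda n: -len(interferes.get(n, ()))):
        taken = {color[m] for m in interferes.get(n, ()) if m in color}
        free = [c for c in range(k) if c not in taken]
        if free:
            color[n] = free[0]
        else:
            spilled.append(n)      # stays in a second-level register
    return color, spilled

# v0-v1 and v1-v2 interfere; v0 and v2 may share a register.
g = {"v0": {"v1"}, "v1": {"v0", "v2"}, "v2": {"v1"}}
cols, spills = color_registers(g.keys(), g, k=8)
print(cols, spills)
```

With 8 colors and short live ranges nothing spills here; the spill list grows only when more than 8 virtual registers are simultaneously live, which is exactly when the inter-level moves of Figure 7 reappear.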
Figure 8: Comparison of the number of memory loads and stores for the MIPS binary and the binaries generated by the level 1, level 2, and level 3 translators.

4. Related work

Over the last decades, several emulators have performed dynamic binary translation very successfully, including SHADE [3], DAISY [10], FX!32 [1], and SUN Wabi [2]. SHADE, a fast user-level instruction set simulator, runs on SPARC systems and simulates the SPARC (versions 8 and 9) and MIPS I instruction sets. It emulates a target system by dynamically cross-compiling the target machine code to run on the host. The dynamically compiled code can optionally include profiling code that traces the execution of the application running on the virtual target machine; the profiling is extensible and may be dynamically controlled by the user. Another interesting point is that SHADE has a translation cache and a translation TLB, so translations are cached for later reuse to amortize compilation cost. DAISY, from the IBM Watson Research Center, targets full-system simulation; unlike SHADE, it runs at both user and OS level. At run time, DAISY dynamically translates code for a PowerPC processor into code for an underlying VLIW processor. Its translated VLIW code is also saved and cached, so that when the same PowerPC code is encountered later, its VLIW translation can execute immediately without retranslation. FX!32 and SUN Wabi are both ABI VMs whose host ISAs differ from the Intel x86 ISA. FX!32 translates any 32-bit x86 application that runs under Windows NT 4.0 on an Intel x86 microprocessor to run on an Alpha microprocessor also running Windows NT 4.0, while SUN Wabi translates some commonly used x86 applications to run on Unix. The translation in SUN Wabi is thus more involved, which is reflected in its primary goal: it supports only some commonly used Windows applications and emphasizes correctness. FX!32 does no translation while the application is executing; rather, it captures an execution profile during the first run and stores it on disk. Translation is done offline, using the run-time profile data to translate and optimize, and the result is placed in a translation cache. Later executions use the translated code in the cache and continue profiling. SUN Wabi uses dynamic translation, exploiting its advantages: adaptive optimization can exceed static quality, and application code can be generated dynamically. It retranslates each time the ABI is initiated, so there is no persistence between executions. SUN Wabi translates per interval and stores the result in a code cache with FIFO management.

5. Conclusions and future work

Several conclusions can be drawn from this experiment. First, a large number of first-level registers is desirable. In our experiment, the small number of first-level registers becomes the bottleneck, and we do not fully utilize the second-level registers. Because we implement neither register renaming nor speculative execution, we need only as many second-level registers as MIPS has registers; however, the small number of first-level registers introduces extra work such as spills. Even if we implemented register renaming and speculative execution, we do not expect much improvement, because the increase in inter-level register moves would offset the benefit of renaming and speculation. Moreover, in our experiment there is only one execution unit, so instructions cannot execute in parallel.
This structure is not well suited to hierarchical register files. In modern hierarchical register designs there is more than one functional unit, and each functional unit has its own first-level register file. Even in such multi-functional-unit designs, however, the translator needs an effective way to distribute register usage evenly; otherwise the small number of first-level registers will still be the bottleneck.
The graph-coloring register allocation algorithm can reduce the inter-level register move overhead substantially, but it is time-consuming and impractical for dynamic translation. Second, the conservative superblock formation does not bring much benefit: although it limits size expansion, it prevents us from discovering more optimization opportunities. Third, the fact that a fair number of CRAY-2 instructions cannot operate on immediate values directly brings much overhead; it increases code size and register accesses, and consequently increases execution time.

For future work, integrating the simulator with the translator and constructing a translation cache would allow dynamic optimization to be performed and could improve the quality of the translator beyond what we see in this project. Since a hierarchical register file may help the performance of VLIW architectures, as introduced in [4], it would also be interesting to study the effect of organizing CRAY-2 instructions in bundles and running them in VLIW fashion; this would require extending the simulator into a VLIW simulator.
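The conservative superblock-enlargement rule discussed above (Section 2.3) can be sketched as follows; the block names, successor/predecessor maps, and edge frequencies are illustrative.

```python
# A minimal sketch of conservative superblock growth: extend the trace
# along the hotter successor only when that successor has a unique
# predecessor (no tail duplication of join blocks) and the hot path is
# at least RATIO times more frequent than the alternative.

RATIO = 2.0

def grow_superblock(entry, succs, preds, freq):
    """Return the list of basic blocks merged into one superblock."""
    trace, cur = [entry], entry
    while True:
        out = succs.get(cur, [])
        if not out:
            break
        # Pick the most frequently taken outgoing edge.
        out = sorted(out, key=lambda b: -freq[(cur, b)])
        hot = out[0]
        alt_f = freq[(cur, out[1])] if len(out) > 1 else 0.0
        # Conservative condition: never duplicate a join block or loop back.
        if hot in trace or len(preds.get(hot, [])) != 1:
            break
        if alt_f and freq[(cur, hot)] / alt_f < RATIO:
            break
        trace.append(hot)
        cur = hot
    return trace

succs = {"B0": ["B1", "B2"], "B1": ["B3"]}
preds = {"B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
freq = {("B0", "B1"): 90, ("B0", "B2"): 10, ("B1", "B3"): 90}
print(grow_superblock("B0", succs, preds, freq))  # ['B0', 'B1']
```

Here growth stops at B3 because it is a join block with two predecessors; a less conservative formation would tail-duplicate B3 into the trace, which is exactly the size/opportunity tradeoff noted in the conclusion.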
More informationEITF20: Computer Architecture Part 5.1.1: Virtual Memory
EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationLIA. Large Installation Administration. Virtualization
LIA Large Installation Administration Virtualization 2 Virtualization What is Virtualization "a technique for hiding the physical characteristics of computing resources from the way in which other systems,
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationInstruction Level Parallelism
Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationCrusoe Reference. What is Binary Translation. What is so hard about it? Thinking Outside the Box The Transmeta Crusoe Processor
Crusoe Reference Thinking Outside the Box The Transmeta Crusoe Processor 55:132/22C:160 High Performance Computer Architecture The Technology Behind Crusoe Processors--Low-power -Compatible Processors
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationVirtual Memory: From Address Translation to Demand Paging
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014
More informationCS3350B Computer Architecture MIPS Introduction
CS3350B Computer Architecture MIPS Introduction Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada Thursday January
More informationA Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines
A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode
More informationChapter 7 The Potential of Special-Purpose Hardware
Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationSlides for Lecture 6
Slides for Lecture 6 ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary 28 January,
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationUNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.
UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationCENG3420 Lecture 03 Review
CENG3420 Lecture 03 Review Bei Yu byu@cse.cuhk.edu.hk 2017 Spring 1 / 38 CISC vs. RISC Complex Instruction Set Computer (CISC) Lots of instructions of variable size, very memory optimal, typically less
More informationInstructions: Language of the Computer
CS359: Computer Architecture Instructions: Language of the Computer Yanyan Shen Department of Computer Science and Engineering 1 The Language a Computer Understands Word a computer understands: instruction
More informationComputer Architecture. Chapter 2-2. Instructions: Language of the Computer
Computer Architecture Chapter 2-2 Instructions: Language of the Computer 1 Procedures A major program structuring mechanism Calling & returning from a procedure requires a protocol. The protocol is a sequence
More informationBuilding a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano
Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code
More informationUniprocessors. HPC Fall 2012 Prof. Robert van Engelen
Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures
More informationBus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao
Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao Abstract In microprocessor-based systems, data and address buses are the core of the interface between a microprocessor
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationLec 13: Linking and Memory. Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University. Announcements
Lec 13: Linking and Memory Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University PA 2 is out Due on Oct 22 nd Announcements Prelim Oct 23 rd, 7:30-9:30/10:00 All content up to Lecture on Oct
More informationECE232: Hardware Organization and Design
ECE232: Hardware Organization and Design Lecture 4: MIPS Instructions Adapted from Computer Organization and Design, Patterson & Hennessy, UCB From Last Time Two values enter from the left (A and B) Need
More informationThe Implications of Multi-core
The Implications of Multi- What I want to do today Given that everyone is heralding Multi- Is it really the Holy Grail? Will it cure cancer? A lot of misinformation has surfaced What multi- is and what
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More information4. Hardware Platform: Real-Time Requirements
4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture
More informationCS 252 Graduate Computer Architecture. Lecture 15: Virtual Machines
CS 252 Graduate Computer Architecture Lecture 15: Virtual Machines Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252
More informationTen Reasons to Optimize a Processor
By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor
More informationParallelism of Java Bytecode Programs and a Java ILP Processor Architecture
Australian Computer Science Communications, Vol.21, No.4, 1999, Springer-Verlag Singapore Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture Kenji Watanabe and Yamin Li Graduate
More informationInstruction Set Principles. (Appendix B)
Instruction Set Principles (Appendix B) Outline Introduction Classification of Instruction Set Architectures Addressing Modes Instruction Set Operations Type & Size of Operands Instruction Set Encoding
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More information4.1 Paging suffers from and Segmentation suffers from. Ans
Worked out Examples 4.1 Paging suffers from and Segmentation suffers from. Ans: Internal Fragmentation, External Fragmentation 4.2 Which of the following is/are fastest memory allocation policy? a. First
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationComputer Organization MIPS ISA
CPE 335 Computer Organization MIPS ISA Dr. Iyad Jafar Adapted from Dr. Gheith Abandah Slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE 232 MIPS ISA 1 (vonneumann) Processor Organization
More informationCHAPTER 5 A Closer Look at Instruction Set Architectures
CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 199 5.2 Instruction Formats 199 5.2.1 Design Decisions for Instruction Sets 200 5.2.2 Little versus Big Endian 201 5.2.3 Internal
More informationLecture 4: MIPS Instruction Set
Lecture 4: MIPS Instruction Set No class on Tuesday Today s topic: MIPS instructions Code examples 1 Instruction Set Understanding the language of the hardware is key to understanding the hardware/software
More informationComputer Systems Architecture I. CSE 560M Lecture 3 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 3 Prof. Patrick Crowley Plan for Today Announcements Readings are extremely important! No class meeting next Monday Questions Commentaries A few remaining
More informationCPS104 Computer Organization Lecture 1. CPS104: Computer Organization. Meat of the Course. Robert Wagner
CPS104 Computer Organization Lecture 1 Robert Wagner Slides available on: http://www.cs.duke.edu/~raw/cps104/lectures 1 CPS104: Computer Organization Instructor: Robert Wagner Office: LSRC D336, 660-6536
More informationCSCE 5610: Computer Architecture
HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationComputer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationComputer Architecture Review. Jo, Heeseung
Computer Architecture Review Jo, Heeseung Computer Abstractions and Technology Jo, Heeseung Below Your Program Application software Written in high-level language System software Compiler: translates HLL
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationA Study of Workstation Computational Performance for Real-Time Flight Simulation
A Study of Workstation Computational Performance for Real-Time Flight Simulation Summary Jeffrey M. Maddalon Jeff I. Cleveland II This paper presents the results of a computational benchmark, based on
More informationCS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 15: Caching: Demand Paged Virtual Memory
CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 15: Caching: Demand Paged Virtual Memory 15.0 Main Points: Concept of paging to disk Replacement policies
More informationMicroprocessor Architecture Dr. Charles Kim Howard University
EECE416 Microcomputer Fundamentals Microprocessor Architecture Dr. Charles Kim Howard University 1 Computer Architecture Computer System CPU (with PC, Register, SR) + Memory 2 Computer Architecture ALU
More informationCPS104 Computer Organization Lecture 1
CPS104 Computer Organization Lecture 1 Robert Wagner Slides available on: http://www.cs.duke.edu/~raw/cps104/lectures 1 CPS104: Computer Organization Instructor: Robert Wagner Office: LSRC D336, 660-6536
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More information