instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals
|
|
- Rachel Gardner
- 5 years ago
- Views:
Transcription
1 Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte, C. Krakowski Institute for Process Control Automation and Robotics University of Karlsruhe, Germany Abstract We propose handling of external real-time events through multithreading and describe the microarchitecture of our multithreaded Java microcontroller, called Komodo microcontroller. Real-time Java threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed Komodo microcontroller supports multiple ISTs with zerocycle context switching overhead. We evaluate the basic performance and optimize the internal structures of our microcontroller using real-time applications as benchmarks on a software simulator. We determine the correlation between the fetch bandwidth and the instruction window size. Our results show a performance gain of about. by latency reducing over the single threaded model. Keywords: multithreading, microcontroller, Java microprocessor, real-time, interrupt service threads Introduction A multithreaded processor is characterized by the ability to simultaneously execute instructions of dierent threads within the processor pipeline. Multiple on-chip register sets are employed to reach an extremely fast context switch. Multithreading techniques are proposed in processor architecture research to mask latencies of instructions of the presently executing thread by execution of instructions of other threads []. Recently, multithreading has also been proposed for event-handling of internal events ([], []) by taking advantage of its fast context switching ability. However, to date multithreading has never been explored in context of microcontrollers for handling of external hardware events. Contemporary microprocessors and microcontrollers activate Interrupt-Service-Routines (ISRs) for event handling. Assuming a multithreaded processor core allows to trigger so-called Interrupt-Service-Threads (ISTs) instead of ISRs for event handling. In this case an occurring event activates an assigned thread. Using threads instead of ISRs in combination with a multithreaded processor core has several advantages: Thread activation is extremely fast due to the multiple register sets on the multithreaded processor chip, thus enabling single-cycle event reaction. Because of the parallel execution, ISTs are not blocked by ISTs with higher priority, in contrast to ISRs. Latencies within the pipeline can be utilized by other concurrently executing threads or ISTs. A exible context switch between event triggered ISTs and other threads is made possible by the IST scheme. In contrast, ISRs can only be interrupted by other ISRs, but not by threads.
2 Threads allow exible priority schemes. Our project explores the suitability ofmul- tithreading techniques in embedded real-time systems. We propose multithreading as an event handling mechanism that allows ecient handling of simultaneous overlapping events with hard real-time requirements. Moreover, we explore our methods in context of a realtime Java system. State-of-the-art embedded systems are programmed in assembly or C/C++ languages. Java features several advantages with respect to real-time systems: The object orientation of the Java language supports easy programming, reusability, and robustness. Java bytecode is portable, of small size, and secure due to the bytecode verier in the Java-Virtual- Machine (JVM). However, to date the use of Java is hardly possible in embedded real-time systems because of its high hardware requirements and unpredictable real-time behavior. Both problems are solved with the design of a multithreaded Java-microcontroller enhanced by an adapted JVM. A Java-microprocessor, like the picojava-i and II from Sun ([], [5]), executes Java bytecode instructions directly in hardware raising the performance of Java applications and decreasing the hardware requirements compared to a Java interpreter or Just-In-Time (JIT) compiler. In enhancement to the picojava approach the real-time issue is supported by applying the multithreading technique to reach a zero cycle context switching overhead and by hardware support for advanced real-time scheduling schemes. The next section presents our Komodo microcontroller and section three our performance results. The proposed Komodo microcontroller The Komodo microcontroller is a multithreaded Java microcontroller which supports multiple ISTs with zero-cycle context switching overhead and several priority schemes. A high priority IST may execute with full speed, lower priority ISTs and non-real-time threads are restricted to the latency pipeline slots of the high priority IST. In the actual design, the microcontroller holds the contexts of up to four threads. If there are more then four threads to handle, our architecture allows up to three real-time threads, which are directly mapped to hardware threads. The remaining threads must be non real-time and are scheduled within the fourth hardware thread. To scale up for larger systems with more than three real-time threads, we propose a parallel execution on several microcontrollers connected by a middleware platform. This topic is covered in another paper [6]. Because of its application for embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple scalar processor. As shown in Fig., the microcontroller core consists of an instructionfetch unit, a decode unit, a memory access unit (MEM) and an execution unit (ALU). Four stack register sets are provided on the processor chip. A signal unit triggers IST execution due to external signals. memory interface address instructions micro-ops ROM data path PC IW MEM instruction fetch PC PC PC IW IW IW priority manager instruction decode stack register sets ALU signal unit Figure : Block diagram of the Komodo microcontroller The instruction fetch unit holds four program counters (PC) with dedicated status bits (e.g. thread active/suspended), each PC is assigned to another thread. Four byte por- extern signals
3 tions are fetched over the memory interface and put in the according instruction window (IW). Several instructions may be contained in the fetch portion, because of the average bytecode length of.8 bytes. Instructions are fetched depending on the status bits and ll levels of the IWs. The instruction decode unit contains the above mentioned IWs, dedicated status bits (e.g. priority) and counters. A priority manager decides subject to the bits and counters from which IW the next instruction will be decoded. We dened several priority schemes [7] to handle real-time requirements. The priority manager applies one of the implemented thread priority schemes for IW selection. Abytecode instruction is decoded either to a single microop, a sequence of micro-ops, or a trap routine is called. Each opcode is propagated through the pipeline together with its thread id. Opcodes from multiple threads can be simultaneously present in the dierent pipeline stages. The instructions for memory access are executed by the MEM unit. If the memory interface only permits one access each cycle, an arbiter is needed for instruction fetch and data access. All other instructions are executed by the ALU unit. Both units (MEM and ALU) can take several cycles to complete an instruction execution. After that, the result is written back to the stack register set of the according thread. External signals are delivered to the signal unit from the peripheral components of the microcontroller core as e.g. timer, counter, or serial interface. By the occurrence of such a signal the corresponding IST is activated. As soon as an IST activation ends its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again. To avoid pipeline stalls, instructions from other threads can be fed into the pipeline. Idle times may result from branches or memory accesses. The decode unit predicts the latency after such an instruction, and proceeds with instructions from other IWs. There is no overhead for such a context switch. No save/restore of registers or removal of instructions from the pipeline is needed, because each thread has it's own stack register set. Because of the unpredictability of cache accesses, a non-cached memory access is preferred for real-time microcontrollers. The emerging load latencies are bridged by scheduling instructions of other threads by the priority manager. Therefore a cache is omitted from our Komodo microcontroller. Performance evaluation For the evaluation of our design, we use an execution-based simulator of our Komodo microcontroller with an adapted JVM to run several benchmarks (e.g. CoeineMark.0, a FFT algorithm, a PID element, and an impulse counter). This JVM is yet not optimized with quick-opcodes or something else. Therefore, the benchmarks are used to apply a real bytecode mix for microarchitecture optimization and not to compare the performance of the Komodo microcontroller with other approaches. First of all, we want to prove the presumption of a performance gain by bridging latencies. Our rst experiment assumes processor clock cycles of 00MHz or less and fast SRAM storage for memory thus eliminating load/store latencies. Under these assumptions the only latencies stem from branches, that take two cycles overhead in the pipeline to resolve and fetch from the new PC. The results pointed out a gain of.8 with: gain = cycles needed for sequential execution cycles needed by multithreaded execution The second experiment additionally regards latencies of data accesses. We assume a large data memory with DRAM modules instead of costly SRAM chips. Due to the need for available instructions we assume a small instruction memory implemented with SRAM chips or as an on-chip memory. This yields instruction fetch without latency. We vary the share of load/store instructions and their latency. The
4 results of this situation are shown in gure. The dierent graphs show the gain with respect to memory latency (in cycles). Each graph represents a program with another percentage of l/s-instructions (0, 0, 0, 50, 70, 90%) respectively the impulse counter (imp), PID element (pid), and fast fourier transformation algorithm (t). The results in Fig show a,5,5, latency of l/s-instructions,5,5,5 0,5 share of l/s-instructions in % imp pid fft Memory latency Figure : Benchmark with dierent share and latency of l/s-instructions signicant performance gain which is reached only by the multithreading ability of latency hiding and not by parallel processing. As a third performance test, we determine the number of threads needed to bridge nearly all latencies in typical Java bytecode programs subject to the latency of data access. The "zero" latency graph in Figure shows the gain yielded by branch latency hiding only, while the other graphs concern branch and load/store latency hiding. The gure shows an increasing gain until the number of threads reaches the latency of load/store-instructions. At this point, nearly all latencies are bridged and the processor is fully utilized. After the investigation the basic performance abilities of our architecture, we take a closer look at some hardware parameters. Henceforth, we assume a single memory interface which prefers data access over instruction fetch and a bandwidth of bit. We assume latency of zero cycles for instruction fetch, one cycle for data access, and two cycles for branches. Four threads are active and scheduled by a xed priority until they are 0, Number of threads Figure : threads Benchmark with dierent number of blocked. The share of load/store-instructions is 0% which is typical for many applications. For a zero cycle context switch, it is necessary to keep the instruction windows stued. To reach this goal, several instruction fetch strategies are investigated: only the ll level of the IWs is taken into account, the ll level and the priority of the thread are evaluated, ll level and recently taken branches are taken into account, because the IW is empty after a branch, or a combination of all three techniques is used. Figure shows the dierence between these strategies subject to the IW size. As you can see, there is only a minimal discrepancy and therefore the simplest fetch strategy (only ll level) is sucient. To nd an optimal size of the instruction windows, we run several tests with dierent sizes. We found out, that the window size correlates with the instruction fetch bandwidth, as shown in gure 5. The minimum of the window size is about the fetch bandwidth plus two, necessary to avoid a deadlock that arises when no complete bytecode is in the buer and there
5 ,5,,5,,5 fill level fill level & priority fill level & branches fill level & priority & branches Window size in bytes Figure : Benchmark with dierent fetch strategies is not enough space to fetch a new portion. A detailed view shows an optimum size of about fetch bandwidth plus six, with regard to hardware costs and performance.,6,, 0,8 0,6 0, 0, 0 Fetch bandwidth in bytes Window size in bytes Figure 5: Benchmark with dierent window size and fetch bandwidth Conclusions This paper describes the basic idea of ISTs to handle external events and the appropriate multithreaded Java microcontroller. The microarchitecture is outlined and several benchmark results are shown. On one hand the performance gain possible through multithreading is proven and on the other hand some hardware parameters are optimized. A complete system with hardware, operating system (JVM), middleware and application (automated guided vehicles) is under construction. We are working on a detailed hardware design in VHDL of the Komodo microcontroller and further improve the adapted JVM. References [] J. Silc, B. Robic, and T. Ungerer. Processor Architecture: From Dataow to Superscalar and Beyond. Springer-Verlag, Heidelberg, 999. [] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). ISCA'6 Proceedings, Atlanta, Georgia, Vol 7, No, pp , May 999. [] S. W. Keckler, A. Chang, W. S. Lee, W. J. Dally. Concurrent Event Handling through Multithreading. IEEE Transactions on computers, Vol 8, No 9, pp , September 999 [] J. O'Connor and M. Tremblay. PicoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, pages 5{5, March/April 997. [5] Sun. PicoJava-II Microarchitecture Guide. Sun microsystems, Part No.: , March 999. [6] U. Brinkschulte, C. Krakowski, J. Kreuzinger, R. Marston, and T. Ungerer. The Komodo Project: Thread-Based Event Handling Supported by a Multithreaded Java Microcontroller. 5th EUROMICRO Conference, Milano, September 999. [7] U. Brinkschulte, C. Krakowski, J. Kreuzinger, T. Ungerer. A Multithreaded Java Microcontroller for Thread-oriented Real-time Event- Handling. 999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, Ca., pp. -9, October 999. The maximum length of a bytecode instruction executed direct by the Komodo microcontroller is three.
Real-time Scheduling on Multithreaded Processors
Real-time Scheduling on Multithreaded Processors J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer Institute for Computer Design, and Fault Tolerance University of Karlsruhe D-76128 Karlsruhe, Germany
More informationReal-time Scheduling on Multithreaded Processors
Real-time Scheduling on Multithreaded Processors J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer U. Brinkschulte, C. Krakowski Institute for Computer Design, Institute for Process Control, and Fault
More informationInterrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller
Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller U. Brinkschulte, C. Krakowski J. Kreuzinger, Th. Ungerer Institute of Process Control,
More informationA Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event-Handling
A Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event-Handling U. Brinkschulte, C. Krakowski J. Kreuzinger, Th. Ungerer Institute of Process Control, Institute of Computer Design Automation
More informationA Real-Time Java System on a Multithreaded Java Microcontroller
A Real-Time Java System on a Multithreaded Java Microcontroller M. Pfeffer, S. Uhrig, Th. Ungerer Institute for Computer Science University of Augsburg D-86159 Augsburg fpfeffer, uhrig, ungererg @informatik.uni-augsburg.de
More informationA Microkernel Architecture for a Highly Scalable Real-Time Middleware
A Microkernel Architecture for a Highly Scalable Real-Time Middleware U. Brinkschulte, C. Krakowski,. Riemschneider. Kreuzinger, M. Pfeffer, T. Ungerer Institute of Process Control, Institute of Computer
More informationA Scheduling Technique Providing a Strict Isolation of Real-time Threads
A Scheduling Technique Providing a Strict Isolation of Real-time Threads U. Brinkschulte ¾, J. Kreuzinger ½, M. Pfeffer, Th. Ungerer ½ Institute for Computer Design and Fault Tolerance University of Karlsruhe,
More informationPriority manager. I/O access
Implementing Real-time Scheduling Within a Multithreaded Java Microcontroller S. Uhrig 1, C. Liemke 2, M. Pfeffer 1,J.Becker 2,U.Brinkschulte 3, Th. Ungerer 1 1 Institute for Computer Science, University
More informationThe Komodo Project: Thread-based Event Handling Supported by a Multithreaded Java Microcontroller
The Komodo Project: Thread-based Event Handling Supported by a Multithreaded Java Microcontroller J. Kreuzinger, R. Marston, Th. Ungerer Dept. of Computer Design and Fault Tolerance University of Karlsruhe
More informationMPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors
MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development
More information[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School
References [1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School of Computer Science, McGill University, May 1993. [2] C. Young, N. Gloy, and M. D. Smith, \A Comparative
More informationReal-time Garbage Collection for a Multithreaded Java Microcontroller
Real-time Garbage Collection for a Multithreaded Java Microcontroller S. Fuhrmann, M. Pfeffer, J. Kreuzinger, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe D-76128
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationJOP: A Java Optimized Processor for Embedded Real-Time Systems. Martin Schoeberl
JOP: A Java Optimized Processor for Embedded Real-Time Systems Martin Schoeberl JOP Research Targets Java processor Time-predictable architecture Small design Working solution (FPGA) JOP Overview 2 Overview
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationpicojava I Java Processor Core DATA SHEET DESCRIPTION
picojava I DATA SHEET DESCRIPTION picojava I is a uniquely designed processor core which natively executes Java bytecodes as defined by the Java Virtual Machine (JVM). Most processors require the JVM to
More information8085 Microprocessor Architecture and Memory Interfacing. Microprocessor and Microcontroller Interfacing
8085 Microprocessor Architecture and Memory 1 Points to be Discussed 8085 Microprocessor 8085 Microprocessor (CPU) Block Diagram Control & Status Signals Interrupt Signals 8085 Microprocessor Signal Flow
More informationEvaluation of Branch Prediction Strategies
1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III
More informationPerformance Characteristics. i960 CA SuperScalar Microprocessor
Performance Characteristics of the i960 CA SuperScalar Microprocessor s. McGeady Intel Corporation Embedded Microprocessor Focus Group i960 CA - History A highly-integrated microprocessor for embedded
More informationsignature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1
CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme
More informationInstr. execution impl. view
Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction
More informationTypes of Interrupts:
Interrupt structure Introduction Interrupt is signals send by an external device to the processor, to request the processor to perform a particular task or work. Mainly in the microprocessor based system
More informationCARUSO Project Goals and Principal Approach
CARUSO Project Goals and Principal Approach Uwe Brinkschulte *, Jürgen Becker #, Klaus Dorfmüller-Ulhaas +, Ralf König #, Sascha Uhrig +, and Theo Ungerer + * Department of Computer Science, University
More informationChapter 12. CPU Structure and Function. Yonsei University
Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor
More informationWhat is Pipelining? RISC remainder (our assumptions)
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationParallelism of Java Bytecode Programs and a Java ILP Processor Architecture
Australian Computer Science Communications, Vol.21, No.4, 1999, Springer-Verlag Singapore Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture Kenji Watanabe and Yamin Li Graduate
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 3 Fundamentals in Computer Architecture Computer Architecture Part 3 page 1 of 55 Prof. Dr. Uwe Brinkschulte,
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationSupporting Multithreading in Configurable Soft Processor Cores
Supporting Multithreading in Configurable Soft Processor Cores Roger Moussali, Nabil Ghanem, and Mazen A. R. Saghir Department of Electrical and Computer Engineering American University of Beirut P.O.
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationMultithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors
Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors Manvi Agarwal and S.K. Nandy CADL, SERC, Indian Institute of Science, Bangalore, INDIA {manvi@rishi.,nandy@}serc.iisc.ernet.in
More informationACCELERATING 2D GRAPHIC APPLICATIONS WITH LOW ENERGY OVERHEAD
ACCELERATING 2D GRAPHIC APPLICATIONS WITH LOW ENERGY OVERHEAD Oliveira, L.; Neves, B; Carro, L Instituto de Informática Universidade Federal do Rio Grande do Sul {loliveira,bsneves,carro}@inf.ufrgs.br
More informationChapter 14 - Processor Structure and Function
Chapter 14 - Processor Structure and Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 14 - Processor Structure and Function 1 / 94 Table of Contents I 1 Processor Organization
More informationEE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May Branch Prediction
EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May 2000 Lecture #3: Wednesday, 5 April 2000 Lecturer: Mattan Erez Scribe: Mahesh Madhav Branch Prediction
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationComputer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can
More informationCPE300: Digital System Architecture and Design
CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining
More informationSimultaneous Multithreading Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors
Simultaneous Multithreading Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors JURIJ ŠILC BORUT ROBIČ THEO UNGERER Computer Systems Department Faculty of Computer and Information
More informationConcurrent Event Handling through Multithreading
IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 9, SEPTEMBER 1999 903 Concurrent Event Handling through Multithreading Stephen W. Keckler, Member, IEEE, Andrew Chang, Student Member, IEEE, Whay S. Lee, Sandeep
More informationThis course provides an overview of the SH-2 32-bit RISC CPU core used in the popular SH-2 series microcontrollers
Course Introduction Purpose: This course provides an overview of the SH-2 32-bit RISC CPU core used in the popular SH-2 series microcontrollers Objectives: Learn about error detection and address errors
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationJob Posting (Aug. 19) ECE 425. ARM7 Block Diagram. ARM Programming. Assembly Language Programming. ARM Architecture 9/7/2017. Microprocessor Systems
Job Posting (Aug. 19) ECE 425 Microprocessor Systems TECHNICAL SKILLS: Use software development tools for microcontrollers. Must have experience with verification test languages such as Vera, Specman,
More informationUNIT- 5. Chapter 12 Processor Structure and Function
UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers
More informationPipelining. Pipeline performance
Pipelining Basic concept of assembly line Split a job A into n sequential subjobs (A 1,A 2,,A n ) with each A i taking approximately the same time Each subjob is processed by a different substation (or
More informationAdvanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017
Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation
More informationThe Use of Multithreading for Exception Handling
The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32
More informationProcessors. Young W. Lim. May 12, 2016
Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version
More informationNew Advances in Micro-Processors and computer architectures
New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationStatic Branch Prediction
Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already
More informationREAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS
Malaysian Journal of Computer Science, Vol. 9 No. 1, June 1996, pp. 12-17 REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS Mohammed Samaka School of Computer Science Universiti Sains Malaysia
More informationA hardware operating system kernel for multi-processor systems
A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationDesign and Implementation of a FPGA-based Pipelined Microcontroller
Design and Implementation of a FPGA-based Pipelined Microcontroller Rainer Bermbach, Martin Kupfer University of Applied Sciences Braunschweig / Wolfenbüttel Germany Embedded World 2009, Nürnberg, 03.03.09
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationIntroduction to Computer Systems and Operating Systems
Introduction to Computer Systems and Operating Systems Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Topics Covered 1. Computer History 2. Computer System
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationQUESTION BANK. EE 6502 / Microprocessor and Microcontroller. Unit I Processor. PART-A (2-Marks)
QUESTION BANK EE 6502 / Microprocessor and Microcontroller Unit I- 8085 Processor PART-A (2-Marks) YEAR/SEM : III/V 1. What is meant by Level triggered interrupt? Which are the interrupts in 8085 level
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationEmbedded Systems: Hardware Components (part II) Todor Stefanov
Embedded Systems: Hardware Components (part II) Todor Stefanov Leiden Embedded Research Center, Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationARM Simulation using C++ and Multithreading
International Journal of Innovative Technology and Exploring Engineering (IJITEE) ARM Simulation using C++ and Multithreading Suresh Babu S, Channabasappa Baligar Abstract: - This project is to be produced
More informationIn embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency.
Lesson 1 Course Notes Review of Computer Architecture Embedded Systems ideal: low power, low cost, high performance Overview of VLIW and ILP What is ILP? It can be seen in: Superscalar In Order Processors
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More information15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationronny@mit.edu www.cag.lcs.mit.edu/scale Introduction Architectures are all about exploiting the parallelism inherent to applications Performance Energy The Vector-Thread Architecture is a new approach
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationCaches. Hiding Memory Access Times
Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY
More informationAdvanced Parallel Architecture Lesson 3. Annalisa Massini /2015
Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html
More information2. List the five interrupt pins available in INTR, TRAP, RST 7.5, RST 6.5, RST 5.5.
DHANALAKSHMI COLLEGE OF ENGINEERING DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING EE6502- MICROPROCESSORS AND MICROCONTROLLERS UNIT I: 8085 PROCESSOR PART A 1. What is the need for ALE signal in
More informationCMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1
CMCS 611-101 Advanced Computer Architecture Lecture 8 Control Hazards and Exception Handling September 30, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer
More informationTHE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION
THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION Radu Balaban Computer Science student, Technical University of Cluj Napoca, Romania horizon3d@yahoo.com Horea Hopârtean Computer Science student,
More informationLecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability
Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood
More informationLecture-15 W-Z: Increment-Decrement Address Latch:
Lecture-15 W-Z: (W) and (Z) are two 8-bit temporary registers not accessible to the user. They are exclusively used for the internal operation by the microprocessor. These registers are used either to
More informationProcessing Unit CS206T
Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct
More informationMicroprocessors and Microcontrollers. Assignment 1:
Microprocessors and Microcontrollers Assignment 1: 1. List out the mass storage devices and their characteristics. 2. List the current workstations available in the market for graphics and business applications.
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationEE 3170 Microcontroller Applications
EE 317 Microcontroller Applications Lecture 5 : Instruction Subset & Machine Language: Introduction to the Motorola 68HC11 - Miller 2.1 & 2.2 Based on slides for ECE317 by Profs. Davis, Kieckhafer, Tan,
More informationA Scalable Multiprocessor for Real-time Signal Processing
A Scalable Multiprocessor for Real-time Signal Processing Daniel Scherrer, Hans Eberle Institute for Computer Systems, Swiss Federal Institute of Technology CH-8092 Zurich, Switzerland {scherrer, eberle}@inf.ethz.ch
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS
ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS 1.Define Computer Architecture Computer Architecture Is Defined As The Functional Operation Of The Individual H/W Unit In A Computer System And The
More informationChapter 5. Introduction ARM Cortex series
Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1
More informationQ.1 Explain Computer s Basic Elements
Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some
More information