instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals

Size: px
Start display at page:

Download "instruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals"

Transcription

1 Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte, C. Krakowski Institute for Process Control Automation and Robotics University of Karlsruhe, Germany Abstract We propose handling of external real-time events through multithreading and describe the microarchitecture of our multithreaded Java microcontroller, called Komodo microcontroller. Real-time Java threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed Komodo microcontroller supports multiple ISTs with zerocycle context switching overhead. We evaluate the basic performance and optimize the internal structures of our microcontroller using real-time applications as benchmarks on a software simulator. We determine the correlation between the fetch bandwidth and the instruction window size. Our results show a performance gain of about. by latency reducing over the single threaded model. Keywords: multithreading, microcontroller, Java microprocessor, real-time, interrupt service threads Introduction A multithreaded processor is characterized by the ability to simultaneously execute instructions of dierent threads within the processor pipeline. Multiple on-chip register sets are employed to reach an extremely fast context switch. Multithreading techniques are proposed in processor architecture research to mask latencies of instructions of the presently executing thread by execution of instructions of other threads []. Recently, multithreading has also been proposed for event-handling of internal events ([], []) by taking advantage of its fast context switching ability. However, to date multithreading has never been explored in context of microcontrollers for handling of external hardware events. Contemporary microprocessors and microcontrollers activate Interrupt-Service-Routines (ISRs) for event handling. Assuming a multithreaded processor core allows to trigger so-called Interrupt-Service-Threads (ISTs) instead of ISRs for event handling. In this case an occurring event activates an assigned thread. Using threads instead of ISRs in combination with a multithreaded processor core has several advantages: Thread activation is extremely fast due to the multiple register sets on the multithreaded processor chip, thus enabling single-cycle event reaction. Because of the parallel execution, ISTs are not blocked by ISTs with higher priority, in contrast to ISRs. Latencies within the pipeline can be utilized by other concurrently executing threads or ISTs. A exible context switch between event triggered ISTs and other threads is made possible by the IST scheme. In contrast, ISRs can only be interrupted by other ISRs, but not by threads.

2 Threads allow exible priority schemes. Our project explores the suitability ofmul- tithreading techniques in embedded real-time systems. We propose multithreading as an event handling mechanism that allows ecient handling of simultaneous overlapping events with hard real-time requirements. Moreover, we explore our methods in context of a realtime Java system. State-of-the-art embedded systems are programmed in assembly or C/C++ languages. Java features several advantages with respect to real-time systems: The object orientation of the Java language supports easy programming, reusability, and robustness. Java bytecode is portable, of small size, and secure due to the bytecode verier in the Java-Virtual- Machine (JVM). However, to date the use of Java is hardly possible in embedded real-time systems because of its high hardware requirements and unpredictable real-time behavior. Both problems are solved with the design of a multithreaded Java-microcontroller enhanced by an adapted JVM. A Java-microprocessor, like the picojava-i and II from Sun ([], [5]), executes Java bytecode instructions directly in hardware raising the performance of Java applications and decreasing the hardware requirements compared to a Java interpreter or Just-In-Time (JIT) compiler. In enhancement to the picojava approach the real-time issue is supported by applying the multithreading technique to reach a zero cycle context switching overhead and by hardware support for advanced real-time scheduling schemes. The next section presents our Komodo microcontroller and section three our performance results. The proposed Komodo microcontroller The Komodo microcontroller is a multithreaded Java microcontroller which supports multiple ISTs with zero-cycle context switching overhead and several priority schemes. A high priority IST may execute with full speed, lower priority ISTs and non-real-time threads are restricted to the latency pipeline slots of the high priority IST. In the actual design, the microcontroller holds the contexts of up to four threads. If there are more then four threads to handle, our architecture allows up to three real-time threads, which are directly mapped to hardware threads. The remaining threads must be non real-time and are scheduled within the fourth hardware thread. To scale up for larger systems with more than three real-time threads, we propose a parallel execution on several microcontrollers connected by a middleware platform. This topic is covered in another paper [6]. Because of its application for embedded systems, the processor core of the Komodo microcontroller is kept at the hardware level of a simple scalar processor. As shown in Fig., the microcontroller core consists of an instructionfetch unit, a decode unit, a memory access unit (MEM) and an execution unit (ALU). Four stack register sets are provided on the processor chip. A signal unit triggers IST execution due to external signals. memory interface address instructions micro-ops ROM data path PC IW MEM instruction fetch PC PC PC IW IW IW priority manager instruction decode stack register sets ALU signal unit Figure : Block diagram of the Komodo microcontroller The instruction fetch unit holds four program counters (PC) with dedicated status bits (e.g. thread active/suspended), each PC is assigned to another thread. Four byte por- extern signals

3 tions are fetched over the memory interface and put in the according instruction window (IW). Several instructions may be contained in the fetch portion, because of the average bytecode length of.8 bytes. Instructions are fetched depending on the status bits and ll levels of the IWs. The instruction decode unit contains the above mentioned IWs, dedicated status bits (e.g. priority) and counters. A priority manager decides subject to the bits and counters from which IW the next instruction will be decoded. We dened several priority schemes [7] to handle real-time requirements. The priority manager applies one of the implemented thread priority schemes for IW selection. Abytecode instruction is decoded either to a single microop, a sequence of micro-ops, or a trap routine is called. Each opcode is propagated through the pipeline together with its thread id. Opcodes from multiple threads can be simultaneously present in the dierent pipeline stages. The instructions for memory access are executed by the MEM unit. If the memory interface only permits one access each cycle, an arbiter is needed for instruction fetch and data access. All other instructions are executed by the ALU unit. Both units (MEM and ALU) can take several cycles to complete an instruction execution. After that, the result is written back to the stack register set of the according thread. External signals are delivered to the signal unit from the peripheral components of the microcontroller core as e.g. timer, counter, or serial interface. By the occurrence of such a signal the corresponding IST is activated. As soon as an IST activation ends its assigned real-time thread is suspended and its status is stored. An external signal may activate the same thread again. To avoid pipeline stalls, instructions from other threads can be fed into the pipeline. Idle times may result from branches or memory accesses. The decode unit predicts the latency after such an instruction, and proceeds with instructions from other IWs. There is no overhead for such a context switch. No save/restore of registers or removal of instructions from the pipeline is needed, because each thread has it's own stack register set. Because of the unpredictability of cache accesses, a non-cached memory access is preferred for real-time microcontrollers. The emerging load latencies are bridged by scheduling instructions of other threads by the priority manager. Therefore a cache is omitted from our Komodo microcontroller. Performance evaluation For the evaluation of our design, we use an execution-based simulator of our Komodo microcontroller with an adapted JVM to run several benchmarks (e.g. CoeineMark.0, a FFT algorithm, a PID element, and an impulse counter). This JVM is yet not optimized with quick-opcodes or something else. Therefore, the benchmarks are used to apply a real bytecode mix for microarchitecture optimization and not to compare the performance of the Komodo microcontroller with other approaches. First of all, we want to prove the presumption of a performance gain by bridging latencies. Our rst experiment assumes processor clock cycles of 00MHz or less and fast SRAM storage for memory thus eliminating load/store latencies. Under these assumptions the only latencies stem from branches, that take two cycles overhead in the pipeline to resolve and fetch from the new PC. The results pointed out a gain of.8 with: gain = cycles needed for sequential execution cycles needed by multithreaded execution The second experiment additionally regards latencies of data accesses. We assume a large data memory with DRAM modules instead of costly SRAM chips. Due to the need for available instructions we assume a small instruction memory implemented with SRAM chips or as an on-chip memory. This yields instruction fetch without latency. We vary the share of load/store instructions and their latency. The

4 results of this situation are shown in gure. The dierent graphs show the gain with respect to memory latency (in cycles). Each graph represents a program with another percentage of l/s-instructions (0, 0, 0, 50, 70, 90%) respectively the impulse counter (imp), PID element (pid), and fast fourier transformation algorithm (t). The results in Fig show a,5,5, latency of l/s-instructions,5,5,5 0,5 share of l/s-instructions in % imp pid fft Memory latency Figure : Benchmark with dierent share and latency of l/s-instructions signicant performance gain which is reached only by the multithreading ability of latency hiding and not by parallel processing. As a third performance test, we determine the number of threads needed to bridge nearly all latencies in typical Java bytecode programs subject to the latency of data access. The "zero" latency graph in Figure shows the gain yielded by branch latency hiding only, while the other graphs concern branch and load/store latency hiding. The gure shows an increasing gain until the number of threads reaches the latency of load/store-instructions. At this point, nearly all latencies are bridged and the processor is fully utilized. After the investigation the basic performance abilities of our architecture, we take a closer look at some hardware parameters. Henceforth, we assume a single memory interface which prefers data access over instruction fetch and a bandwidth of bit. We assume latency of zero cycles for instruction fetch, one cycle for data access, and two cycles for branches. Four threads are active and scheduled by a xed priority until they are 0, Number of threads Figure : threads Benchmark with dierent number of blocked. The share of load/store-instructions is 0% which is typical for many applications. For a zero cycle context switch, it is necessary to keep the instruction windows stued. To reach this goal, several instruction fetch strategies are investigated: only the ll level of the IWs is taken into account, the ll level and the priority of the thread are evaluated, ll level and recently taken branches are taken into account, because the IW is empty after a branch, or a combination of all three techniques is used. Figure shows the dierence between these strategies subject to the IW size. As you can see, there is only a minimal discrepancy and therefore the simplest fetch strategy (only ll level) is sucient. To nd an optimal size of the instruction windows, we run several tests with dierent sizes. We found out, that the window size correlates with the instruction fetch bandwidth, as shown in gure 5. The minimum of the window size is about the fetch bandwidth plus two, necessary to avoid a deadlock that arises when no complete bytecode is in the buer and there

5 ,5,,5,,5 fill level fill level & priority fill level & branches fill level & priority & branches Window size in bytes Figure : Benchmark with dierent fetch strategies is not enough space to fetch a new portion. A detailed view shows an optimum size of about fetch bandwidth plus six, with regard to hardware costs and performance.,6,, 0,8 0,6 0, 0, 0 Fetch bandwidth in bytes Window size in bytes Figure 5: Benchmark with dierent window size and fetch bandwidth Conclusions This paper describes the basic idea of ISTs to handle external events and the appropriate multithreaded Java microcontroller. The microarchitecture is outlined and several benchmark results are shown. On one hand the performance gain possible through multithreading is proven and on the other hand some hardware parameters are optimized. A complete system with hardware, operating system (JVM), middleware and application (automated guided vehicles) is under construction. We are working on a detailed hardware design in VHDL of the Komodo microcontroller and further improve the adapted JVM. References [] J. Silc, B. Robic, and T. Ungerer. Processor Architecture: From Dataow to Superscalar and Beyond. Springer-Verlag, Heidelberg, 999. [] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, Y. N. Patt. Simultaneous Subordinate Microthreading (SSMT). ISCA'6 Proceedings, Atlanta, Georgia, Vol 7, No, pp , May 999. [] S. W. Keckler, A. Chang, W. S. Lee, W. J. Dally. Concurrent Event Handling through Multithreading. IEEE Transactions on computers, Vol 8, No 9, pp , September 999 [] J. O'Connor and M. Tremblay. PicoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, pages 5{5, March/April 997. [5] Sun. PicoJava-II Microarchitecture Guide. Sun microsystems, Part No.: , March 999. [6] U. Brinkschulte, C. Krakowski, J. Kreuzinger, R. Marston, and T. Ungerer. The Komodo Project: Thread-Based Event Handling Supported by a Multithreaded Java Microcontroller. 5th EUROMICRO Conference, Milano, September 999. [7] U. Brinkschulte, C. Krakowski, J. Kreuzinger, T. Ungerer. A Multithreaded Java Microcontroller for Thread-oriented Real-time Event- Handling. 999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), Newport Beach, Ca., pp. -9, October 999. The maximum length of a bytecode instruction executed direct by the Komodo microcontroller is three.

Real-time Scheduling on Multithreaded Processors

Real-time Scheduling on Multithreaded Processors Real-time Scheduling on Multithreaded Processors J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer Institute for Computer Design, and Fault Tolerance University of Karlsruhe D-76128 Karlsruhe, Germany

More information

Real-time Scheduling on Multithreaded Processors

Real-time Scheduling on Multithreaded Processors Real-time Scheduling on Multithreaded Processors J. Kreuzinger, A. Schulz, M. Pfeffer, Th. Ungerer U. Brinkschulte, C. Krakowski Institute for Computer Design, Institute for Process Control, and Fault

More information

Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller

Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller Interrupt Service Threads - A New Approach to Handle Multiple Hard Real-Time Events on a Multithreaded Microcontroller U. Brinkschulte, C. Krakowski J. Kreuzinger, Th. Ungerer Institute of Process Control,

More information

A Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event-Handling

A Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event-Handling A Multithreaded Java Microcontroller for Thread-Oriented Real-Time Event-Handling U. Brinkschulte, C. Krakowski J. Kreuzinger, Th. Ungerer Institute of Process Control, Institute of Computer Design Automation

More information

A Real-Time Java System on a Multithreaded Java Microcontroller

A Real-Time Java System on a Multithreaded Java Microcontroller A Real-Time Java System on a Multithreaded Java Microcontroller M. Pfeffer, S. Uhrig, Th. Ungerer Institute for Computer Science University of Augsburg D-86159 Augsburg fpfeffer, uhrig, ungererg @informatik.uni-augsburg.de

More information

A Microkernel Architecture for a Highly Scalable Real-Time Middleware

A Microkernel Architecture for a Highly Scalable Real-Time Middleware A Microkernel Architecture for a Highly Scalable Real-Time Middleware U. Brinkschulte, C. Krakowski,. Riemschneider. Kreuzinger, M. Pfeffer, T. Ungerer Institute of Process Control, Institute of Computer

More information

A Scheduling Technique Providing a Strict Isolation of Real-time Threads

A Scheduling Technique Providing a Strict Isolation of Real-time Threads A Scheduling Technique Providing a Strict Isolation of Real-time Threads U. Brinkschulte ¾, J. Kreuzinger ½, M. Pfeffer, Th. Ungerer ½ Institute for Computer Design and Fault Tolerance University of Karlsruhe,

More information

Priority manager. I/O access

Priority manager. I/O access Implementing Real-time Scheduling Within a Multithreaded Java Microcontroller S. Uhrig 1, C. Liemke 2, M. Pfeffer 1,J.Becker 2,U.Brinkschulte 3, Th. Ungerer 1 1 Institute for Computer Science, University

More information

The Komodo Project: Thread-based Event Handling Supported by a Multithreaded Java Microcontroller

The Komodo Project: Thread-based Event Handling Supported by a Multithreaded Java Microcontroller The Komodo Project: Thread-based Event Handling Supported by a Multithreaded Java Microcontroller J. Kreuzinger, R. Marston, Th. Ungerer Dept. of Computer Design and Fault Tolerance University of Karlsruhe

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator, ACAPS Technical Memo 64, School References [1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School of Computer Science, McGill University, May 1993. [2] C. Young, N. Gloy, and M. D. Smith, \A Comparative

More information

Real-time Garbage Collection for a Multithreaded Java Microcontroller

Real-time Garbage Collection for a Multithreaded Java Microcontroller Real-time Garbage Collection for a Multithreaded Java Microcontroller S. Fuhrmann, M. Pfeffer, J. Kreuzinger, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe D-76128

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

JOP: A Java Optimized Processor for Embedded Real-Time Systems. Martin Schoeberl

JOP: A Java Optimized Processor for Embedded Real-Time Systems. Martin Schoeberl JOP: A Java Optimized Processor for Embedded Real-Time Systems Martin Schoeberl JOP Research Targets Java processor Time-predictable architecture Small design Working solution (FPGA) JOP Overview 2 Overview

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

picojava I Java Processor Core DATA SHEET DESCRIPTION

picojava I Java Processor Core DATA SHEET DESCRIPTION picojava I DATA SHEET DESCRIPTION picojava I is a uniquely designed processor core which natively executes Java bytecodes as defined by the Java Virtual Machine (JVM). Most processors require the JVM to

More information

8085 Microprocessor Architecture and Memory Interfacing. Microprocessor and Microcontroller Interfacing

8085 Microprocessor Architecture and Memory Interfacing. Microprocessor and Microcontroller Interfacing 8085 Microprocessor Architecture and Memory 1 Points to be Discussed 8085 Microprocessor 8085 Microprocessor (CPU) Block Diagram Control & Status Signals Interrupt Signals 8085 Microprocessor Signal Flow

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information

Performance Characteristics. i960 CA SuperScalar Microprocessor

Performance Characteristics. i960 CA SuperScalar Microprocessor Performance Characteristics of the i960 CA SuperScalar Microprocessor s. McGeady Intel Corporation Embedded Microprocessor Focus Group i960 CA - History A highly-integrated microprocessor for embedded

More information

signature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1

signature i-1 signature i instruction j j+1 branch adjustment value if - path initial value signature i signature j instruction exit signature j+1 CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme

More information

Instr. execution impl. view

Instr. execution impl. view Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction

More information

Types of Interrupts:

Types of Interrupts: Interrupt structure Introduction Interrupt is signals send by an external device to the processor, to request the processor to perform a particular task or work. Mainly in the microprocessor based system

More information

CARUSO Project Goals and Principal Approach

CARUSO Project Goals and Principal Approach CARUSO Project Goals and Principal Approach Uwe Brinkschulte *, Jürgen Becker #, Klaus Dorfmüller-Ulhaas +, Ralf König #, Sascha Uhrig +, and Theo Ungerer + * Department of Computer Science, University

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

What is Pipelining? RISC remainder (our assumptions)

What is Pipelining? RISC remainder (our assumptions) What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture

Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture Australian Computer Science Communications, Vol.21, No.4, 1999, Springer-Verlag Singapore Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture Kenji Watanabe and Yamin Li Graduate

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 3 Fundamentals in Computer Architecture Computer Architecture Part 3 page 1 of 55 Prof. Dr. Uwe Brinkschulte,

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Supporting Multithreading in Configurable Soft Processor Cores

Supporting Multithreading in Configurable Soft Processor Cores Supporting Multithreading in Configurable Soft Processor Cores Roger Moussali, Nabil Ghanem, and Mazen A. R. Saghir Department of Electrical and Computer Engineering American University of Beirut P.O.

More information

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors

Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors Multithreaded Architectural Support for Speculative Trace Scheduling in VLIW Processors Manvi Agarwal and S.K. Nandy CADL, SERC, Indian Institute of Science, Bangalore, INDIA {manvi@rishi.,nandy@}serc.iisc.ernet.in

More information

ACCELERATING 2D GRAPHIC APPLICATIONS WITH LOW ENERGY OVERHEAD

ACCELERATING 2D GRAPHIC APPLICATIONS WITH LOW ENERGY OVERHEAD ACCELERATING 2D GRAPHIC APPLICATIONS WITH LOW ENERGY OVERHEAD Oliveira, L.; Neves, B; Carro, L Instituto de Informática Universidade Federal do Rio Grande do Sul {loliveira,bsneves,carro}@inf.ufrgs.br

More information

Chapter 14 - Processor Structure and Function

Chapter 14 - Processor Structure and Function Chapter 14 - Processor Structure and Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 14 - Processor Structure and Function 1 / 94 Table of Contents I 1 Processor Organization

More information

EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May Branch Prediction

EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May Branch Prediction EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May 2000 Lecture #3: Wednesday, 5 April 2000 Lecturer: Mattan Erez Scribe: Mahesh Madhav Branch Prediction

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

Simultaneous Multithreading Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors

Simultaneous Multithreading Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors Simultaneous Multithreading Blending Thread-level and Instruction-level Parallelism in Advanced Microprocessors JURIJ ŠILC BORUT ROBIČ THEO UNGERER Computer Systems Department Faculty of Computer and Information

More information

Concurrent Event Handling through Multithreading

Concurrent Event Handling through Multithreading IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 9, SEPTEMBER 1999 903 Concurrent Event Handling through Multithreading Stephen W. Keckler, Member, IEEE, Andrew Chang, Student Member, IEEE, Whay S. Lee, Sandeep

More information

This course provides an overview of the SH-2 32-bit RISC CPU core used in the popular SH-2 series microcontrollers

This course provides an overview of the SH-2 32-bit RISC CPU core used in the popular SH-2 series microcontrollers Course Introduction Purpose: This course provides an overview of the SH-2 32-bit RISC CPU core used in the popular SH-2 series microcontrollers Objectives: Learn about error detection and address errors

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Job Posting (Aug. 19) ECE 425. ARM7 Block Diagram. ARM Programming. Assembly Language Programming. ARM Architecture 9/7/2017. Microprocessor Systems

Job Posting (Aug. 19) ECE 425. ARM7 Block Diagram. ARM Programming. Assembly Language Programming. ARM Architecture 9/7/2017. Microprocessor Systems Job Posting (Aug. 19) ECE 425 Microprocessor Systems TECHNICAL SKILLS: Use software development tools for microcontrollers. Must have experience with verification test languages such as Vera, Specman,

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Pipelining. Pipeline performance

Pipelining. Pipeline performance Pipelining Basic concept of assembly line Split a job A into n sequential subjobs (A 1,A 2,,A n ) with each A i taking approximately the same time Each subjob is processed by a different substation (or

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

The Use of Multithreading for Exception Handling

The Use of Multithreading for Exception Handling The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

New Advances in Micro-Processors and computer architectures

New Advances in Micro-Processors and computer architectures New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Static Branch Prediction

Static Branch Prediction Static Branch Prediction Branch prediction schemes can be classified into static and dynamic schemes. Static methods are usually carried out by the compiler. They are static because the prediction is already

More information

REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS

REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS Malaysian Journal of Computer Science, Vol. 9 No. 1, June 1996, pp. 12-17 REAL-TIME MULTITASKING KERNEL FOR IBM-BASED MICROCOMPUTERS Mohammed Samaka School of Computer Science Universiti Sains Malaysia

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

Design and Implementation of a FPGA-based Pipelined Microcontroller

Design and Implementation of a FPGA-based Pipelined Microcontroller Design and Implementation of a FPGA-based Pipelined Microcontroller Rainer Bermbach, Martin Kupfer University of Applied Sciences Braunschweig / Wolfenbüttel Germany Embedded World 2009, Nürnberg, 03.03.09

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Introduction to Computer Systems and Operating Systems

Introduction to Computer Systems and Operating Systems Introduction to Computer Systems and Operating Systems Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Topics Covered 1. Computer History 2. Computer System

More information

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

QUESTION BANK. EE 6502 / Microprocessor and Microcontroller. Unit I Processor. PART-A (2-Marks)

QUESTION BANK. EE 6502 / Microprocessor and Microcontroller. Unit I Processor. PART-A (2-Marks) QUESTION BANK EE 6502 / Microprocessor and Microcontroller Unit I- 8085 Processor PART-A (2-Marks) YEAR/SEM : III/V 1. What is meant by Level triggered interrupt? Which are the interrupts in 8085 level

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Embedded Systems: Hardware Components (part II) Todor Stefanov

Embedded Systems: Hardware Components (part II) Todor Stefanov Embedded Systems: Hardware Components (part II) Todor Stefanov Leiden Embedded Research Center, Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

ARM Simulation using C++ and Multithreading

ARM Simulation using C++ and Multithreading International Journal of Innovative Technology and Exploring Engineering (IJITEE) ARM Simulation using C++ and Multithreading Suresh Babu S, Channabasappa Baligar Abstract: - This project is to be produced

More information

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency.

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency. Lesson 1 Course Notes Review of Computer Architecture Embedded Systems ideal: low power, low cost, high performance Overview of VLIW and ILP What is ILP? It can be seen in: Superscalar In Order Processors

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

ronny@mit.edu www.cag.lcs.mit.edu/scale Introduction Architectures are all about exploiting the parallelism inherent to applications Performance Energy The Vector-Thread Architecture is a new approach

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

2. List the five interrupt pins available in INTR, TRAP, RST 7.5, RST 6.5, RST 5.5.

2. List the five interrupt pins available in INTR, TRAP, RST 7.5, RST 6.5, RST 5.5. DHANALAKSHMI COLLEGE OF ENGINEERING DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING EE6502- MICROPROCESSORS AND MICROCONTROLLERS UNIT I: 8085 PROCESSOR PART A 1. What is the need for ALE signal in

More information

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1 CMCS 611-101 Advanced Computer Architecture Lecture 8 Control Hazards and Exception Handling September 30, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer

More information

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION Radu Balaban Computer Science student, Technical University of Cluj Napoca, Romania horizon3d@yahoo.com Horea Hopârtean Computer Science student,

More information

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood

More information

Lecture-15 W-Z: Increment-Decrement Address Latch:

Lecture-15 W-Z: Increment-Decrement Address Latch: Lecture-15 W-Z: (W) and (Z) are two 8-bit temporary registers not accessible to the user. They are exclusively used for the internal operation by the microprocessor. These registers are used either to

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

Microprocessors and Microcontrollers. Assignment 1:

Microprocessors and Microcontrollers. Assignment 1: Microprocessors and Microcontrollers Assignment 1: 1. List out the mass storage devices and their characteristics. 2. List the current workstations available in the market for graphics and business applications.

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

EE 3170 Microcontroller Applications

EE 3170 Microcontroller Applications EE 317 Microcontroller Applications Lecture 5 : Instruction Subset & Machine Language: Introduction to the Motorola 68HC11 - Miller 2.1 & 2.2 Based on slides for ECE317 by Profs. Davis, Kieckhafer, Tan,

More information

A Scalable Multiprocessor for Real-time Signal Processing

A Scalable Multiprocessor for Real-time Signal Processing A Scalable Multiprocessor for Real-time Signal Processing Daniel Scherrer, Hans Eberle Institute for Computer Systems, Swiss Federal Institute of Technology CH-8092 Zurich, Switzerland {scherrer, eberle}@inf.ethz.ch

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS 1.Define Computer Architecture Computer Architecture Is Defined As The Functional Operation Of The Individual H/W Unit In A Computer System And The

More information

Chapter 5. Introduction ARM Cortex series

Chapter 5. Introduction ARM Cortex series Chapter 5 Introduction ARM Cortex series 5.1 ARM Cortex series variants 5.2 ARM Cortex A series 5.3 ARM Cortex R series 5.4 ARM Cortex M series 5.5 Comparison of Cortex M series with 8/16 bit MCUs 51 5.1

More information

Q.1 Explain Computer s Basic Elements

Q.1 Explain Computer s Basic Elements Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some

More information