Designing a Dual Core Processor

Designing a Dual Core Processor

Manfred Georg
Applied Research Laboratory
Department of Computer Science and Engineering
Washington University in St. Louis
St. Louis, MO, USA
mgeorg@arl.wustl.edu

Abstract

In this paper we present the design of a dual core processor. Two simple five stage pipelined processing cores are combined on a single chip. A bus based memory hierarchy is used to communicate between low level caches and higher memories. Requests over the bus are queued until they can be serviced. Basic hardware stalling and hazard avoidance through data forwarding are built into each processing core. A TestAndSet operation is provided for easily synchronizing the processing cores.

1 Introduction

The development of increasingly compact and efficient chip manufacturing techniques has led to increased capabilities on chips. Shrinking feature sizes not only improve performance intrinsically, but also allow for larger, more complicated chip designs. Recently, an increasing amount of chip space has been devoted to caches in an effort to minimize memory latency. Complicated processor designs, such as superscalar processors, require substantially more space than their predecessors. These processors are geared towards increasing the instruction level parallelism in programs. Often there is no way, or little incentive, to explicitly parallelize a program at the application layer. In these cases, it is best to have a single fast processor which is able to exploit instruction level parallelism. However, many applications are easily split into multiple tasks at the application layer. When a workload is split into multiple threads, it is more effective to have several slower processors than a single fast one. A typical superscalar processor such as the Alpha EV6 takes as much as four times more chip area than its simpler, single issue counterpart, the Alpha EV5 [7].

2 Related Work

There has been much work, and many comparisons and proposals, in the area of multiprocessing. Multiple processors have been used extensively, and a wide variety of programming models accommodate the use of multiple processing cores [4]. In their seminal work, Olukotun et al. propose the use of multiple cores on a single chip for general purpose processors [8]. Kumar et al. present a comparison between the EV5 and EV6 Alpha cores in a heterogeneous chip multiprocessor (CMP). They are able to both improve performance [7] and reduce power consumption [5] using heterogeneous mixes of processing cores on a single chip. Furthermore, Kumar et al. are also able to better utilize resources such as memory, the bus, and floating point units through the sharing of resources between on-chip processing cores [6]. Crowley et al. present a comparison of different parallel processing techniques in network interfaces: fine grained multithreading (FGMT), chip multiprocessing (CMP), and simultaneous multithreading (SMT) [3]. Wun et al. use the Intel IXP 2850 network processor, which is extremely parallel and includes heterogeneous elements, to speed up easily parallelizable algorithms in computational biology [11]. In addition to adding complexity, deeply pipelined processors use more power and generate more heat. For this reason, even if cost/performance for a single thread is the only metric used, there is still a trade off between complexity and operational cost [9]. Industry has reacted to this phenomenon; every major chip manufacturer has now created a multi-core line of chips.
For example, IBM's POWER4 chip combines two PowerPC processors on a single chip and uses a high throughput crossbar switch for communication between the cores and the L2 cache [10]. Whenever multiple processing units are used in conjunction, communication becomes an important issue. The use of multiple cores within the same chip allows for tightly coupled communication that is not feasible between more distant components.

Adve and Gharachorloo discuss the benefits of different shared memory consistency models, arguing that sequential consistency is unnecessarily restrictive and causes performance degradation in multiprocessors [1]. Zhang and Asanović present a processor design which leverages simple, previously designed and optimized processing cores by shrinking them and placing many of them in a grid network on a single chip [12]. Cache coherence is improved through the use of victim replication and the sharing of L2 caches between nearby processors. Barroso et al. observed that designing increasingly complex processors to exploit instruction level parallelism is becoming cost prohibitive, while the use of many simple cores, along with simpler manufacturing techniques, can produce competitive processors able to outperform modern processors in on-line transaction processing (OLTP) [2].

3 Instruction Set Architecture

I use a small subset of the MIPS DLX instruction set. In the spirit of RISC, I have reduced the instruction set as much as possible while trying not to compromise functionality. A list of all instructions in the instruction set can be found in Table 1. All instructions are 32 bits long. As recommended by the principle of orthogonality, all R type instructions also have an I type analog. All calculations which are possible using the original DLX instruction set are also possible in the new instruction set, using a small number of instructions. The set-if instructions are replaced with corresponding branch instructions which test for less than zero or less than or equal to zero. Some functionality is lost in the removal of jump instructions, namely the ability to jump through a register, which is needed for function calls. However, this is the only place where a significant degradation in functionality is observed. We add a TestAndSet instruction to the instruction set so that we can easily synchronize the two processing cores.

There are several restrictions which apply to the entire instruction set. First, all memory locations are addressed in words (4 byte groups), with no provisions for byte or half word integers. This implies that memory accesses are implicitly word aligned. Furthermore, there are no packing or unpacking instructions. Additionally, for simplicity, all numbers are considered to be two's complement signed integers.

4 Five Stage Pipeline

Each core uses a fairly standard five stage pipeline design. The architecture is based on a subset of the MIPS DLX instruction set. There is a single branch delay slot.

4.1 Instruction Fetch

The first stage of the pipeline is concerned with the fetching of instructions. The address of the next instruction is stored in the Program Counter (PC). This register is incremented by 1 each clock cycle. Additionally, it can be set arbitrarily by a successful branch instruction. The instruction at this address is read from the memory hierarchy, which is discussed later.
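As a concrete illustration of this stage, the following VHDL fragment shows one way the PC logic just described could be written. It is a sketch, not the paper's actual source: the entity and port names are assumptions, and a stall input is included in anticipation of Section 5.1.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Program counter for the instruction fetch stage (Section 4.1).
    -- The PC advances by 1 each cycle (memory is word addressed), can be
    -- overwritten by a taken branch, and holds its value while stalled.
    entity pc_unit is
      port (
        clk           : in  std_logic;
        stall         : in  std_logic;             -- hold PC while stalled
        branch_taken  : in  std_logic;             -- from the decode stage
        branch_target : in  unsigned(31 downto 0);
        pc            : out unsigned(31 downto 0)
      );
    end entity pc_unit;

    architecture rtl of pc_unit is
      signal pc_reg : unsigned(31 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if stall = '0' then
            if branch_taken = '1' then
              pc_reg <= branch_target;  -- successful branch overwrites the PC
            else
              pc_reg <= pc_reg + 1;     -- word addressing, so increment by 1
            end if;
          end if;
        end if;
      end process;
      pc <= pc_reg;
    end architecture rtl;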
4.2 Register Decode

The second stage of the pipeline is a decoding phase. In the first part, the instruction is parsed depending on its format. There are two possible instruction formats, register-register (R) and register-immediate (I); the latter is also used for branch instructions. Figure 1 shows the layout of each instruction type. In general, the operation code is 5 bits, each register field is 5 bits (32 registers), and the immediate is 17 bits. The second part of this phase is the fetching of the register contents. Two registers are fetched from the register file at the same time. The first register, register zero, always contains the value zero. This makes it easier to perform a number of simple operations which are degenerate cases of other operations, such as negation (subtraction from zero) or a register copy (XOR with zero). The destination register is not decoded or fetched; it is passed through the pipeline unaltered until it is needed in the last stage.

Branches are also evaluated in this stage. However, because there is not enough time after register fetching to perform intense calculations, branch conditions are always comparisons with zero. If a branch is taken, then the PC is overwritten with a new value. If a branch is not taken, then the PC will have been incremented as usual, and we continue with the pipelined execution as if a null operation had been executed. Effectively, this means that the processor always predicts that no branching will occur.

Since the immediate value is only 17 bits long and all internal numbers must be 32 bits long, the immediate is sign extended to 32 bits. This allows later stages to not differentiate between values which came from registers and values which came from the immediate.
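The field extraction and sign extension described above can be made concrete with a small VHDL sketch. The exact bit positions of each field are an assumption (the paper only gives the field widths), and the names are illustrative rather than the paper's code.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Decode-stage field extraction for the formats in Figure 1 (a sketch;
    -- bit positions are assumed: opcode in the high bits, immediate low).
    entity decode_fields is
      port (
        instr  : in  std_logic_vector(31 downto 0);
        opcode : out std_logic_vector(4 downto 0);
        r1     : out std_logic_vector(4 downto 0);  -- first source register
        r2     : out std_logic_vector(4 downto 0);  -- second source (R type)
        rd     : out std_logic_vector(4 downto 0);  -- destination (R type slot)
        imm32  : out std_logic_vector(31 downto 0)  -- sign extended immediate
      );
    end entity decode_fields;

    architecture rtl of decode_fields is
    begin
      opcode <= instr(31 downto 27);
      r1     <= instr(26 downto 22);
      r2     <= instr(21 downto 17);  -- for I type this slot holds Rd instead
      rd     <= instr(16 downto 12);  -- R type position; bits 11..0 unused
      -- The 17 bit immediate is sign extended to 32 bits so that later
      -- stages need not distinguish register values from immediates.
      imm32  <= std_logic_vector(resize(signed(instr(16 downto 0)), 32));
    end architecture rtl;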

4.3 Computation

The computation stage is the phase in which instructions are actually executed in the Arithmetic Logic Unit (ALU). The ALU is able to add and subtract numbers, perform bitwise operations, and perform bitwise shifts. It is also used to compute memory addresses through addition.

Table 1. Instruction Set

    Instruction   Description                                        Format
    ADD           add                                                R
    ADDI          add immediate                                      I
    AND           and                                                R
    ANDI          and immediate                                      I
    OR            or                                                 R
    ORI           or immediate                                       I
    SLA           shift left arithmetic                              R
    SLAI          shift left arithmetic immediate                    I
    SUB           subtract                                           R
    SUBI          subtract immediate                                 I
    XOR           xor                                                R
    XORI          xor immediate                                      I
    BEQZ          branch if register equal to zero                   I
    BNEZ          branch if register not equal to zero               I
    BLTZ          branch if register is less than zero               I
    BLEZ          branch if register is less than or equal to zero   I
    LW            load word                                          I
    SW            store word                                         I
    TAS           Test And Set                                       I

Figure 1. The instruction format for each type of instruction

    R type:  op code (5) | R1 (5) | R2 (5) | Rd (5) | Unused (12)
    I type:  op code (5) | R1 (5) | Rd (5) | Immediate (17)

4.4 Memory

In the memory stage, registers are able to interact with the memory hierarchy by being loaded or stored. For a normal operation, which does not concern memory, nothing happens during this stage and the result from the ALU is simply passed through. However, for a load or a store operation, the result from the ALU is the memory address. For a store operation, the value to be stored, which was retrieved from a register, is also provided by the previous stage. The write back register, when applicable, is also passed unmodified through this stage.

4.5 Write Back

The fifth and final stage of the pipeline is the write back stage. In this stage, the result from the ALU or from memory is stored into the destination register within the register file.

5 Hazard Avoidance

Any time a pipeline is used to execute instructions in parallel, there is the possibility of problems arising due to data dependencies. A hazard occurs when an instruction in the pipeline has an unfulfilled dependence on another instruction which has not yet completed execution. Although complicated pipelines can produce a number of different kinds of hazards, our simple pipeline only gives rise to the read after write hazard. In this case, a read requires the result of an operation which has not yet been written back to the register file.

5.1 Stalling

The first step in solving this problem is to add the ability to stall to the processor. Each stage of the processor can be stalled independently. While stalled, a particular stage maintains the same output, regardless of how the input changes. This allows a later stage in the pipeline to wait for a condition, such as the writing of a register, while earlier stages are stalled. The use of stalls is absolutely vital when caches are used, since cache misses will delay the arrival of data significantly and unpredictably.
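A minimal VHDL sketch of such a stallable stage boundary follows; the generic-width pipeline register and its names are illustrative assumptions, not the paper's code.

    library ieee;
    use ieee.std_logic_1164.all;

    -- A stallable pipeline register (Section 5.1). While stall is asserted
    -- the register holds its output regardless of input changes; otherwise
    -- it latches the input each cycle. One sits between adjacent stages.
    entity stage_reg is
      generic (WIDTH : natural := 32);
      port (
        clk   : in  std_logic;
        stall : in  std_logic;
        d     : in  std_logic_vector(WIDTH - 1 downto 0);
        q     : out std_logic_vector(WIDTH - 1 downto 0)
      );
    end entity stage_reg;

    architecture rtl of stage_reg is
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if stall = '0' then
            q <= d;    -- normal operation: advance the pipeline
          end if;      -- stalled: keep the previous output
        end if;
      end process;
    end architecture rtl;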

5.2 Data Forwarding

Although certain situations necessitate stalling, it is sometimes possible to obtain a result directly from a later stage without waiting for it to be written to a register. The most obvious example is an operation which requires a result that the ALU produced for the instruction now in the memory stage. Results from the ALU are not modified in the memory stage; therefore, the result that will be written to the register file is already present in the pipeline. By placing some data lines and multiplexers in front of the ALU, we can allow it to calculate on values that have not yet been written to registers. This minimizes the number of stalls that are necessary.
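The multiplexing in front of one ALU input can be sketched in VHDL as follows. The index comparison and signal names describe one plausible wiring and are assumptions, not the paper's implementation.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Forwarding multiplexer for one ALU operand (Section 5.2). If the
    -- instruction in the memory stage is about to write the register this
    -- operand reads, use the in-flight ALU result instead of the stale
    -- value read from the register file.
    entity forward_mux is
      port (
        src_reg     : in  std_logic_vector(4 downto 0);   -- operand's register
        mem_dest    : in  std_logic_vector(4 downto 0);   -- memory stage's Rd
        mem_writes  : in  std_logic;                      -- will write back
        regfile_val : in  std_logic_vector(31 downto 0);  -- read in decode
        mem_result  : in  std_logic_vector(31 downto 0);  -- ALU result in MEM
        alu_operand : out std_logic_vector(31 downto 0)
      );
    end entity forward_mux;

    architecture rtl of forward_mux is
    begin
      alu_operand <= mem_result
                     when mem_writes = '1' and src_reg = mem_dest
                          and mem_dest /= "00000"  -- register zero: never forward
                     else regfile_val;
    end architecture rtl;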
6 Communication

The use of two cores on a chip would be trivially simple were it not for the need to communicate between them. The pipeline is simply duplicated for each core, and the two copies run in parallel at the same frequency. However, the need for the caches in each processor to communicate with main memory and keep coherent state complicates matters.

6.1 Bus

Communication between the caches and memory is performed over a 34 bit wide bus, 2 bits of which carry an operation code. This bus is synchronous, allowing communication to initiate quickly without any preamble. Arbitration is done through a simple slotting algorithm which alternates between the caches. To simplify matters, the bus is held by an individual cache until its request has been completed. However, since the bus is not wide enough to carry an entire cache line at once, cache lines must be sent over it serially.

6.2 Cache

Each cache is a standard one-way associative (direct mapped) cache. The low bits of the addressed word are used to look up the proper entry in the cache. The high bits of the address are then compared against the high address bits stored with the cache entry. If these bits match, and the dirty bit has not been set (signifying that the entry is valid), then the cache entry is used. Otherwise, a cache miss occurs and the information must be retrieved from memory. Each entry in the cache is a cache line of 4 words, equaling 128 bits. There are 4 dirty bits, allowing each word which comprises the cache line to be independently invalidated. Words are invalidated when a different cache performs a write at their address. Since all communication with memory is performed through the bus, all cache modules can hear each other's write requests and set dirty bits appropriately.

6.3 Cache Controller

Since requests can arrive at the cache faster than they can be served on the bus, we require a queue of outgoing requests which must be serviced. Any number of write requests may be in the queue at one time (up to a maximum of 8), since write requests do not need to stall the processor. However, a read or TestAndSet request will stall the processor until a response is received. In the process, the entire queue of requests will be drained. Every time an entry is written to the cache, the corresponding entry, if clean, is updated. Furthermore, the data is also placed in the queue to be transmitted over the bus to memory. In this way, each write to the cache is also immediately written through to memory.

6.4 Memory

Memory is accessed through the bus. It is assumed that all memory operations on a word require four cycles to complete and can be pipelined. Therefore, the bus is idle for four cycles between a request and the response. However, in this interval the bus is held and cannot be used for any other request.

There are three different kinds of memory requests: a read, a write, and a TestAndSet. For any of these requests, the cache module sends the address of the operation to the memory over the bus. A read request retrieves an entire cache line at once, which, after the delay, is sent on the bus in four consecutive cycles. A write request only acts on a single word of data. This word is sent on the bus in the second cycle of the request, after which the bus must remain empty for three cycles while the write operation completes within main memory. A TestAndSet is similar to a read, except that it only acts on a single word, and thus only requires one cycle to send information back. Naturally, a TestAndSet operation atomically both returns the current value of the memory address and sets that value to one.
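The atomic read-and-set behavior on the memory side can be sketched in VHDL as below. This models the semantics only, under assumed names and sizes (a small word-addressed RAM, single-cycle response); the paper's four-cycle latency and bus protocol are not modeled.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Memory-side TestAndSet (Section 6.4): in one operation, return the
    -- current word and overwrite it with one. Because the bus is held for
    -- the whole request, no other core can interleave an access.
    entity tas_memory is
      port (
        clk     : in  std_logic;
        tas_req : in  std_logic;
        addr    : in  unsigned(7 downto 0);           -- 256 words, for the sketch
        old_val : out std_logic_vector(31 downto 0)   -- value before the set
      );
    end entity tas_memory;

    architecture rtl of tas_memory is
      type ram_t is array (0 to 255) of std_logic_vector(31 downto 0);
      signal ram : ram_t := (others => (others => '0'));
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if tas_req = '1' then
            old_val               <= ram(to_integer(addr));  -- test: read out
            ram(to_integer(addr)) <= x"00000001";            -- set: write one
          end if;
        end if;
      end process;
    end architecture rtl;

A core acquires a lock when the returned value is zero; storing zero back with SW releases it.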

7 Simulation Details

The entire dual core processor was written in VHDL and simulated using Xilinx ModelSim. Little effort was made to use realistic gate and wire delays or to fine-tune the clock frequency. However, an effort was made to balance the load in each pipeline stage. Furthermore, main memory was assumed to require several clock cycles to respond, and the cost of the wires from memory to the cache controllers was minimized. These are some essential points which should be considered when creating a more realistic model.

8 Conclusion

The trend in computer design has always been towards exploiting more parallelism. However, we are quickly approaching the point where it no longer makes sense to create more complicated designs that exploit ever more elaborate and less significant amounts of instruction level parallelism. The solution is to build systems in which parallelism is explicitly given at the application layer. In such cases, the use of multiple simple cores to boost performance is much more effective than a single complex core. I present a simple dual core design, leveraging two five stage pipelined processing cores combined on a single chip. A bus based memory hierarchy is used to communicate between low level caches and higher memories. Requests over the bus are queued until they can be serviced. Basic hardware stalling and hazard avoidance through data forwarding are built into each processing core. Furthermore, a TestAndSet operation is provided for easily synchronizing the processing cores.

References

[1] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Technical Report, Digital Western Research Laboratory, September 1995.
[2] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, and S. Qadeer. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In International Symposium on Computer Architecture, May 2000.
[3] P. Crowley, M. E. Fiuczynski, J.-L. Baer, and B. N. Bershad. Characterizing Processor Architectures for Programmable Network Interfaces. In International Conference on Supercomputing, May 2000.
[4] W. Gropp and E. Lusk. A Taxonomy of Programming Models for Symmetric Multiprocessors and SMP Clusters. In Programming Models for Massively Parallel Computers, 1995.
[5] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architecture: The Potential for Processor Power Reduction. In International Symposium on Microarchitecture, December 2003.
[6] R. Kumar, N. P. Jouppi, and D. M. Tullsen. Conjoined-Core Chip Multiprocessing. In International Symposium on Microarchitecture, 2004.
[7] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload Performance. In International Symposium on Computer Architecture, June 2004.
[8] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single-Chip Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[9] V. Srinivasan, D. Brooks, M. Gschwind, and P. Bose. Optimizing Pipelines for Power and Performance. In International Symposium on Microarchitecture, November 2002.
[10] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 System Microarchitecture. IBM Journal of Research and Development, 2002.
[11] B. Wun, J. Buhler, and P. Crowley. Exploiting Coarse-Grained Parallelism to Accelerate Protein Motif Finding with a Network Processor. In International Conference on Parallel Architectures and Compilation Techniques, September 2005.
[12] M. Zhang and K. Asanović. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. In International Symposium on Computer Architecture, 2005.
