Intel Core Microarchitecture

Size: px
Start display at page:

Download "Intel Core Microarchitecture"

Transcription

1 Intel Core Microarchitecture Marco Morosini Matteo Larocca AY 2005/2006 Multimedia System Architectures Presentation Outlook New solutions for old problems Architecture Overview Architecture details New features Conclusions 1

2 New solutions for old problems P4 Scalability is achieved through clockspeed increasing Core scalability through multicore approach A CPU 20% slower saves up to 70% power. Two cores at reduced speed are better than fullspeed one. W/cm The Sun Pentium4 561 Prescott Pentium4 1,4 Willamette Pentium 133 Netburst Cons High clockspeed was achieved through NetBurst technology: long pipeline (P4 has 31 stages) means low latency, high clockspeed. It s getting harder and harder to increase clockspeed, Netburst ends here. 2

3 Core Microarchitecture Core microarchitecture has been designed from scratch to overcome two problems: Excessive power consumption Difficulties in improving performance Microarchitecture 1/5 Core's designers took everything that has already been proven to work and added more of it Violet: fetch stage Orange: decode stage Yellow: reorder stage Blue: execution stage Green: memory access stage. 3

4 Microarchitecture 2/5 Front end: fetch + decode 3 simple decoders (1 more WRT P6) transform 1 x86 instruction into 1 micro-op per clock cycle. 1 complex decoder handles the x86 instructions that translate into 2-4 micro-ops. It can output up to 4 micro-ops per clock cycle. Now more instructions can use simple decoders (i.e. SSE, memory access, and so on.) The micro-ops buffer can receive up to 7 micro-ops per cycle. Microarchitecture 3/5 Core represents the current apex of OOOE design, where as much code and data stream optimization as possible is carried out in silicon. ROB has now 96 entry (up from 40) The goal of the increased performances of the front end and of Core new features is to feed up constantly the OOOE hardware and keep the instruction window and execution core full of code. PRF renames register file in order to avoid Write after Read hazard. 4

5 Microarchitecture 4/5 Enlarged Reservation Station 3 out of 6 issue port for Execution units bit integer execution units: 1 64-bit complex integer unit (CIU) 2 simple integer units (SIUs) Because the 64-bit integer ALUs are on separate issue ports, Core can sustain a total throughput of three 64-bit integer operations per cycle. 2 floating-point execution units bit SSE execution units. Microarchitecture 5/5 3 issue ports dedicated to memory system operations. Thanks to the presence of the MOB, store data and store address operation can be performed in parallel. L2 Cache (4MB, 16 way) is shared between different Cores. Each Core has its own L1 cache. The L1 Caches can exchange data directly. 5

6 Macrofusion 1/2 It is the ability to fuse certain types of x86 instructions together in the predecode phase and send them through a single decoder to be translated into a single micro-op. Not all the x86 instructions can be Macrofused: tipically, Compare and Test instructions can be macrofused with branch instructions Macrofusion 2/2 Up to 1 macrofusion per cycle, performed by any one of the four decoders. Benefits: Simplifies ROB and RS works, since there are fewer micro-ops in-flight for the core to track. This means also power saving. Increases Core's execution width, because a single ALU can execute what is essentially two x86 instructions simultaneously 6

7 Microfusion 1/2 Micro-ops fusion, has some effects similar to macro-fusion ones, but it functions differently Basically, a simple/fast decoder takes in a single x86 instruction that would normally translate into two micro-ops, and it produces a fused pair of micro-ops that are tracked by the ROB using a single entry. When they reach the reservation station, the two members of this fused pair are allowed to issue separately, either in parallel through two different issue ports or serially through the same port, depending on the situation. Microfusion 2/2 Example: Store Instruction Store instructions are broken down into two uops: store-address uop: calculates the address where the data is to be stored store-data uop: writes the data to be stored into the outgoing store data buffer Because the two operations are inherently parallel and are performed by two separate execution units on two separate issue ports, these two uops can be tracked by ROB as one and executed in parallel. 7

8 Digital media boost 1/2 Before Core, SIMD 128bit instructions have to be broken down into two micro ops. Before Core, datapaths are 64bit wide. Digital media boost 2/2 With 128bit datapaths, Core can handle SIMD instructions with only one micro op. Therefore the new design, not only eliminate the latency disadvantage, but it also improves decode, dispatch, and scheduling bandwidth because half as many micro-ops are generated for 128-bit vector instructions. 8

9 Power saving The Core microarchitecture includes an on-die digital thermal sensor, for safety and reliability features; it is also rumored that it may be used to increase the frequency of the MPU when the sensor determines that there is thermal headroom. Each of the two cores is managed independently, and many entire blocks can be put to sleep, such as the microcode sequencer. Most internal buses are gated for power savings. So if a bus is not sending out a full data load each cycle, part of the bus can be put to sleep. Normally, increasing the IPC of a design means higher power consumption, whether the extra resources are used or not. With the extensive power and clock gating, Core MPU, the designers only pay for what is used, which makes a high IPC design much more attractive. Memory Disambiguation 1/2 Out-of-order processors must first put instructions back in program order before officially writing their results out because you can't modify a memory location until all of the previous instructions that read that location have completed execution. Intel architectures before Core are built around a conservative rules. Load is not allowed to be executed before previous store instructions with an undefined address. 9

10 Memory Disambiguation 2/2 Academic researches demonstrate that over 97% of the memory accesses in a processor's instruction window are about unrelated locations. Core tries to execute out of order store, if something wrongs happens, a roll-back will be done. Conclusions Core looks like it has what it takes to carry Intel forward for at least another five years. It will help the software industry gradually make the transition to multithreaded code. It is an ambitious design that improves on its predecessors in nearly every way, especially the fanatical focus on power efficiency (Intel promises 40% more performance for Conroe at 40% less power compared to Pentium D). 10

11 Bibliography Into the Core: Intel's next-generation microarchitecture by Jon Hannibal Stokes core.ars Intel's Next Generation Microarchitecture Unveiled by David Kanter rticleid=rwt Wikipedia 11

Inside Intel Core Microarchitecture

Inside Intel Core Microarchitecture White Paper Inside Intel Core Microarchitecture Setting New Standards for Energy-Efficient Performance Ofri Wechsler Intel Fellow, Mobility Group Director, Mobility Microprocessor Architecture Intel Corporation

More information

Next Generation Technology from Intel Intel Pentium 4 Processor

Next Generation Technology from Intel Intel Pentium 4 Processor Next Generation Technology from Intel Intel Pentium 4 Processor 1 The Intel Pentium 4 Processor Platform Intel s highest performance processor for desktop PCs Targeted at consumer enthusiasts and business

More information

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance

Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance Page 1 Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance Ofri Wechsler Copyright Intel Corporation 2006. *Third-party brands and names are the property of their

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Inside Intel Core Microarchitecture By Gabriel Torres on April 12, 2006 Page 1 of 7

Inside Intel Core Microarchitecture By Gabriel Torres on April 12, 2006 Page 1 of 7 http://www.hardwaresecrets.com/printpage/313/1 31-10-2007 18:21 1 of 1 By Gabriel Torres on April 12, 2006 Page 1 of 7 AMD Athlon Black Edition Dual-core 5000+ Premium Performance Great Value. Do you Dare?

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

The Pentium II/III Processor Compiler on a Chip

The Pentium II/III Processor Compiler on a Chip The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01. Hyperthreading ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf Hyperthreading is a design that makes everybody concerned believe that they are actually using

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Pentium 4 Processor Block Diagram

Pentium 4 Processor Block Diagram FP FP Pentium 4 Processor Block Diagram FP move FP store FMul FAdd MMX SSE 3.2 GB/s 3.2 GB/s L D-Cache and D-TLB Store Load edulers Integer Integer & I-TLB ucode Netburst TM Micro-architecture Pipeline

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 9: Limits of ILP, Case Studies Lecture Outline Speculative Execution Implementing Precise Interrupts

More information

Pipelined Processor Design. EE/ECE 4305: Computer Architecture University of Minnesota Duluth By Dr. Taek M. Kwon

Pipelined Processor Design. EE/ECE 4305: Computer Architecture University of Minnesota Duluth By Dr. Taek M. Kwon Pipelined Processor Design EE/ECE 4305: Computer Architecture University of Minnesota Duluth By Dr. Taek M. Kwon Concept Identification of Pipeline Segments Add Pipeline Registers Pipeline Stage Control

More information

Superscalar Machines. Characteristics of superscalar processors

Superscalar Machines. Characteristics of superscalar processors Superscalar Machines Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any performance

More information

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1

Parallelism. Execution Cycle. Dual Bus Simple CPU. Pipelining COMP375 1 Pipelining COMP375 Computer Architecture and dorganization Parallelism The most common method of making computers faster is to increase parallelism. There are many levels of parallelism Macro Multiple

More information

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 10 Prof. Patrick Crowley Plan for Today Questions Dynamic Execution III discussion Multiple Issue Static multiple issue (+ examples) Dynamic multiple issue

More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Introduction. The Intel "P8" CPU architecture in a nutshell. print this page

Introduction. The Intel P8 CPU architecture in a nutshell. print this page 12-11-2007 16:04 1 of 13 print this page Intel Core versus AMD's K8 architecture Date: May 1, 2006 Type: CPU & Chipset Manufacturer: Intel Author: Johan De Gelas Page 1 Introduction Wide Dynamic Execution,

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale

Datapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale Datapoint 2200 IA-32 Nicholas FitzRoy-Dale At the forefront of the computer revolution - Intel Difficult to explain and impossible to love - Hennessy and Patterson! Released 1970! 2K shift register main

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Chapter 4 The Processor (Part 4)

Chapter 4 The Processor (Part 4) Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline

More information

Design and Implementation of a Super Scalar DLX based Microprocessor

Design and Implementation of a Super Scalar DLX based Microprocessor Design and Implementation of a Super Scalar DLX based Microprocessor 2 DLX Architecture As mentioned above, the Kishon is based on the original DLX as studies in (Hennessy & Patterson, 1996). By: Amnon

More information

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

BOBCAT: AMD S LOW-POWER X86 PROCESSOR ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011 AMD Bobcat Small, Efficient, Low Power x86 core Excellent Performance Synthesizable with smaller number of custom arrays

More information

This Material Was All Drawn From Intel Documents

This Material Was All Drawn From Intel Documents This Material Was All Drawn From Intel Documents A ROAD MAP OF INTEL MICROPROCESSORS Hao Sun February 2001 Abstract The exponential growth of both the power and breadth of usage of the computer has made

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

XT Node Architecture

XT Node Architecture XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core

More information

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) CS252 Graduate Computer Architecture Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) April 4, 2001 Prof. David A. Patterson Computer Science 252 Spring 2001 Lec

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

MICROPROCESSOR TECHNOLOGY

MICROPROCESSOR TECHNOLOGY MICROPROCESSOR TECHNOLOGY Assis. Prof. Hossam El-Din Moustafa Lecture 20 Ch.10 Intel Core Duo Processor Architecture 2-Jun-15 1 Chapter Objectives Understand the concept of dual core technology. Look inside

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

Pentium IV-XEON. Computer architectures M

Pentium IV-XEON. Computer architectures M Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Chapter. Out of order Execution

Chapter. Out of order Execution Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr

Ti Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 5

ECE 571 Advanced Microprocessor-Based Design Lecture 5 ECE 571 Advanced Microprocessor-Based Design Lecture 5 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 6 February 2018 Announcements HW#1 graded HW#2 due Thursday 1 HW#1 Review

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III 2005-4-12 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor

Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor Lecture 12 Architectures for Low Power: Transmeta s Crusoe Processor Motivation Exponential performance increase at a low cost However, for some application areas low power consumption is more important

More information

Advanced Instruction-Level Parallelism

Advanced Instruction-Level Parallelism Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

COSC 243. Computer Architecture 2. Lecture 12 Computer Architecture 2. COSC 243 (Computer Architecture)

COSC 243. Computer Architecture 2. Lecture 12 Computer Architecture 2. COSC 243 (Computer Architecture) COSC 243 1 Overview This Lecture Architectural topics CISC RISC Multi-core processors Source: Chapter 15,16 and 17(10 th edition) 2 Moore s Law Gordon E. Moore Co-founder of Intel April 1965 The complexity

More information

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. CS 320 Ch. 16 SuperScalar Machines A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. A superpipelined machine is one in which a

More information

Static vs. Dynamic Scheduling

Static vs. Dynamic Scheduling Static vs. Dynamic Scheduling Dynamic Scheduling Fast Requires complex hardware More power consumption May result in a slower clock Static Scheduling Done in S/W (compiler) Maybe not as fast Simpler processor

More information

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Portland State University ECE 588/688. IBM Power4 System Microarchitecture

Portland State University ECE 588/688. IBM Power4 System Microarchitecture Portland State University ECE 588/688 IBM Power4 System Microarchitecture Copyright by Alaa Alameldeen 2018 IBM Power4 Design Principles SMP optimization Designed for high-throughput multi-tasking environments

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last

More information

Pentium Pro Case Study ECE/CS 752 Fall 2017

Pentium Pro Case Study ECE/CS 752 Fall 2017 Pentium Pro Case Study ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti Pentium Pro Case Study Microarchitecture

More information

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101 18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Last lecture. Some misc. stuff An older real processor Class review/overview.

Last lecture. Some misc. stuff An older real processor Class review/overview. Last lecture Some misc. stuff An older real processor Class review/overview. HW5 Misc. Status issues Answers posted Returned on Wednesday (next week) Project presentation signup at http://tinyurl.com/470w14talks

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,

More information

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in

More information

Superscalar Processor

Superscalar Processor Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU

More information

Agenda. Pentium III Processor New Features Pentium 4 Processor New Features. IA-32 Architecture. Sunil Saxena Principal Engineer Intel Corporation

Agenda. Pentium III Processor New Features Pentium 4 Processor New Features. IA-32 Architecture. Sunil Saxena Principal Engineer Intel Corporation IA-32 Architecture Sunil Saxena Principal Engineer Corporation September 11, 2000 Copyright 2000 Corporation. Linux Supercluster Users Conference Agenda Pentium III Processor New Features Pentium 4 Processor

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

b) Register renaming c) CDB, register file, and ROB d) 0,1,X (output of a gate is never Z)

b) Register renaming c) CDB, register file, and ROB d) 0,1,X (output of a gate is never Z) 1) a) Issuing stores to memory and (maybe) writing results back to register file (depends if we have a distinct physical register file). Instruction dispatch is usually done in program order, but can be

More information