Hyperthreading Technology


Hyperthreading Technology
Aleksandar Milenkovic
Electrical and Computer Engineering Department
University of Alabama in Huntsville
milenka@ece.uah.edu
www.ece.uah.edu/~milenka/

Outline
- What is hyperthreading?
- Trends in microarchitecture
- Exploiting thread-level parallelism
- Hyperthreading architecture
- Microarchitecture choices and trade-offs

3/2/2005 A. Milenkovic

What is hyperthreading?
- SMT: simultaneous multithreading
- Makes one physical processor appear as multiple logical processors to the OS and software
- Shipped in the Intel Xeon for the server market in early 2002 and in the Pentium 4 for the consumer market in November 2002
- Motivation: boost performance by up to 25% at the cost of about a 5% increase in die area

Trends in microarchitecture
- Higher clock speeds
  - To achieve a high clock frequency, make the pipeline deeper (superpipelining)
  - Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles
- ILP: instruction-level parallelism
  - Extract parallelism within a single program
  - Superscalar processors have multiple execution units working in parallel
  - The challenge is to find enough instructions that can be executed concurrently
  - Out-of-order execution: instructions are sent to execution units based on instruction dependencies rather than program order
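The "one physical processor appears as multiple logical processors" point can be observed directly from software. The sketch below is a minimal, Linux-only illustration (it parses `/proc/cpuinfo`, whose `physical id`/`core id` fields identify the core behind each logical processor, and falls back to the logical count elsewhere); it is not part of the original slides.

```python
# Illustrative sketch: on an HT-enabled system the OS reports more
# logical processors than there are physical cores. Linux-specific;
# falls back gracefully when /proc/cpuinfo is unavailable or sparse.
import os

def logical_cpu_count():
    return os.cpu_count() or 1

def physical_core_count():
    """Count unique (physical id, core id) pairs in /proc/cpuinfo."""
    try:
        cores = set()
        phys = core = None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    core = line.split(":")[1].strip()
                    cores.add((phys, core))
        # Some VMs omit these fields; then assume no SMT is visible.
        return len(cores) or logical_cpu_count()
    except OSError:
        return logical_cpu_count()  # not Linux: no topology info

logical = logical_cpu_count()
physical = physical_core_count()
print(f"{logical} logical processors on {physical} physical cores")
```

With hyperthreading enabled the two counts differ by a factor of two; with it disabled (or absent) they match.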

Trends in microarchitecture
- Cache hierarchies
  - The processor-memory speed gap keeps growing
  - Use caches to reduce memory latency
  - Multiple levels of caches: smaller and faster closer to the processor core
- Thread-level parallelism
  - Multiple programs execute concurrently
  - Web servers have an abundance of software threads
  - Users multitask: surfing the web, listening to music, encoding/decoding video streams, etc.

Exploiting thread-level parallelism
- CMP: chip multiprocessing
  - Multiple processors, each with a full set of architectural resources, reside on the same die
  - Processors may share an on-chip cache, or each can have its own cache
  - Examples: HP Mako, IBM Power4
  - Challenges: power, die area (cost)
- Time-slice multithreading
  - The processor switches between software threads after a predefined time slice
  - Can minimize the effects of long-lasting events
  - Still, some execution slots are wasted
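The wasted-slot problem with time-slice multithreading can be made concrete with a toy simulation. The model below is hypothetical (the 3-cycle miss penalty, 4-cycle slice, and instruction streams are invented for illustration, not taken from any real processor): a single-issue pipeline round-robins between two threads, and any stall within a thread's slice leaves execution slots empty.

```python
# Toy model: a single-issue pipeline runs two threads under
# time-slice multithreading. 'C' is a 1-cycle compute op; 'M' is a
# cache miss that stalls the pipeline for MISS_STALL extra cycles.
# Stall cycles inside a slice are wasted execution slots.
MISS_STALL = 3

def run_time_sliced(threads, slice_len=4):
    """Return (total_cycles, busy_cycles) for round-robin time slicing."""
    streams = [list(t) for t in threads]
    total = busy = 0
    turn = 0
    while any(streams):
        s = streams[turn % len(streams)]
        turn += 1
        remaining = slice_len
        while remaining > 0 and s:
            op = s.pop(0)
            if op == "C":
                busy += 1
                total += 1
                remaining -= 1
            else:  # 'M': the issue slot plus the stall cycles are lost
                total += 1 + MISS_STALL
                remaining -= 1 + MISS_STALL
    return total, busy

total, busy = run_time_sliced(["CCMC", "CMCC"])
print(f"{busy}/{total} slots used ({busy / total:.0%} utilization)")
```

Even this tiny example uses well under half of the available cycles; an SMT design instead lets the other thread fill the stalled slots.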

Exploiting thread-level parallelism
- Switch-on-event multithreading
  - The processor switches between software threads on an event (e.g., a cache miss)
  - Works well, but the parallelism is still coarse-grained (e.g., data dependences and branch mispredictions still waste cycles)
- SMT: simultaneous multithreading
  - Multiple software threads execute on a single processor without switching
  - Has the potential to maximize performance relative to the transistor count and power

Hyperthreading architecture
- One physical processor appears as multiple logical processors
- The HT implementation on the NetBurst microarchitecture has 2 logical processors
- Each logical processor has its own architectural state:
  - general-purpose registers
  - control registers
  - APIC: advanced programmable interrupt controller
- The logical processors share the processor execution resources
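The split between duplicated architectural state and shared execution resources can be sketched in a few lines. All names below (`ArchState`, `SharedExecUnit`, the two-entry "programs") are invented for illustration and do not correspond to Intel's actual structures.

```python
# Illustrative sketch: each logical processor duplicates only the
# architectural state (registers, next-instruction pointer), while a
# single shared execution unit does the work for both.
from dataclasses import dataclass, field

@dataclass
class ArchState:
    """Per-logical-processor state: registers and next-instruction pointer."""
    regs: dict = field(default_factory=dict)
    next_ip: int = 0

class SharedExecUnit:
    """One physical execution resource serving every logical processor."""
    def execute(self, state, program):
        # Issue one instruction from this logical processor, then
        # return, so the logical processors can interleave every cycle.
        if state.next_ip < len(program):
            dst, val = program[state.next_ip]
            state.regs[dst] = state.regs.get(dst, 0) + val
            state.next_ip += 1

eu = SharedExecUnit()                  # shared physical resource
lp0, lp1 = ArchState(), ArchState()    # duplicated architectural state
prog0 = [("eax", 1), ("eax", 2)]       # thread 0: accumulate into eax
prog1 = [("ebx", 10)]                  # thread 1: write ebx
for _ in range(2):                     # interleave, SMT-style
    eu.execute(lp0, prog0)
    eu.execute(lp1, prog1)
print(lp0.regs, lp1.regs)
```

Because only the small architectural state is duplicated, the die-area cost is low, which is the point of the "about 5% more area" trade-off above.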

Hyperthreading architecture
- Shared resources: the main processor resources
  - caches, branch predictors, execution units, buses, control logic
- Duplicated resources
  - register alias tables (map the architectural registers to physical rename registers)
  - next-instruction pointer and associated control logic
  - return stack pointer
  - instruction streaming buffer and trace cache fill buffers

Die Size and Complexity (figure)

Resource sharing schemes
- Partition: dedicate equal resources to each logical processor
  - Good when utilization is expected to be high and somewhat unpredictable
- Threshold: flexible resource sharing with a limit on maximum resource usage
  - Good for small resources with bursty utilization, where micro-ops stay in the structure for short, predictable periods
- Full sharing: flexible sharing with no limits
  - Good for large structures with variable working-set sizes

Shared vs. partitioned queues (figure: shared and partitioned organizations)
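The three schemes can be contrasted with a toy allocator for a shared queue. The 12-entry capacity and the threshold of 8 are illustrative values, not the actual NetBurst structure sizes.

```python
# Toy model of the three sharing schemes for a 12-entry queue shared
# by two logical processors. Entries are granted greedily in order.
CAPACITY = 12

def grant(scheme, requests, threshold=8):
    """Return the entries granted to each logical processor."""
    if scheme == "partition":
        caps = [CAPACITY // len(requests)] * len(requests)
    elif scheme == "threshold":
        caps = [threshold] * len(requests)
    elif scheme == "full":
        caps = [CAPACITY] * len(requests)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    granted, free = [], CAPACITY
    for want, cap in zip(requests, caps):
        got = min(want, cap, free)
        granted.append(got)
        free -= got
    return granted

# One busy logical processor (wants 10) next to a quiet one (wants 2):
print(grant("partition", [10, 2]))  # fair, but entries go unused
print(grant("threshold", [10, 2]))  # busy LP gets more, never everything
print(grant("full", [10, 2]))       # busy LP can take whatever is free
```

Partitioning guarantees fairness at the cost of idle entries, full sharing maximizes utilization but lets one logical processor starve the other, and the threshold scheme sits in between, which matches the per-structure choices on the next slides.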

NetBurst Pipeline (figure: partitioned and threshold-shared stages)

Shared vs. partitioned resources
- Partitioned resources
  - E.g., the major pipeline queues
- Threshold-shared resources
  - A threshold limits the number of resource entries a logical processor can occupy
  - E.g., the scheduler
- Fully shared resources
  - E.g., caches
  - Modest interference in practice
  - A benefit if the threads share code and/or data

Scheduler occupancy (figure)

Shared vs. partitioned cache (figure)

Performance Improvements (figure)