Simultaneous Multithreading on Pentium 4


Hyper-Threading: Simultaneous Multithreading on Pentium 4
Presented by: Thomas Repantis (trep@cs.ucr.edu)
CS203B - Advanced Computer Architecture, Spring 2004

Overview
Multiple threads executing on a single processor without switching.
1. Threads
2. SMT
3. Hyper-Threading on P4
4. OS and Compiler Support
5. Performance for Different Applications

Threads
Process: a task being run by the computer.
Context: describes a process's current state of execution (registers, flags, PC, ...).
Thread: a light-weight process; it has its own PC and SP, but shares the address space and global variables with the other threads of its process.
Each process consists of at least one thread. Threads allow faster context switching and fine-grained multitasking.
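As a concrete illustration of the shared address space, here is a minimal Python sketch (the names `worker` and `run` are just for illustration): each thread runs on its own stack with its own program counter, so the loop variables are private, while a global like `counter` is visible to every thread of the process and therefore needs a lock.

```python
import threading

counter = 0             # global variable: shared by every thread in the process
lock = threading.Lock()

def worker(iterations):
    # Each thread executes this with its own PC and stack, so `iterations`
    # and the loop variable are private; `counter` is shared state.
    global counter
    for _ in range(iterations):
        with lock:      # shared data requires synchronization
            counter += 1

def run(num_threads=4, iterations=1000):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()        # joining is cheap compared to full process switches
    return counter
```

All four threads increment the same global, so `run(4, 1000)` returns 4000.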

Single-Threaded CPU
Many bubbles in the instruction issue slots and in the pipeline!

Single-Threaded SMP
The executing processes are doubled, but the bubbles are doubled as well!

Superthreaded CPU
Each issue slot and each pipeline stage can contain instructions of only one thread at a time.

Hyper-Threaded CPU (SMT)
Instructions of different threads can be scheduled in the same stage during the same cycle.

SMT vs. Tera MTA
Each processor of the Tera MTA has 128 streams, each with its own PC and 32 registers. Each stream is assigned to a thread. Instructions from different streams can be in the pipeline of the same processor simultaneously. However, in the Tera MTA only a single thread issues instructions on any given cycle.
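The distinction can be sketched with a toy issue model (purely illustrative; this is not the actual Tera or Pentium 4 issue logic): fine-grained multithreading in the Tera MTA style fills each cycle's issue slots from a single stream chosen round-robin, while SMT may fill the same cycle's slots from several threads.

```python
def fine_grained_issue(ready, slots, cycle):
    # Tera MTA style: exactly one thread is active per cycle,
    # chosen round-robin; only its instructions may issue.
    active = cycle % len(ready)
    return [(active, instr) for instr in ready[active][:slots]]

def smt_issue(ready, slots):
    # SMT style: issue slots in one cycle may be filled
    # from any mix of ready threads.
    issued = []
    for tid, queue in enumerate(ready):
        for instr in queue:
            if len(issued) == slots:
                return issued
            issued.append((tid, instr))
    return issued
```

With `ready = [["a1"], ["b1", "b2"], ["c1"]]` and 3 slots, the SMT model issues instructions from three different threads in one cycle, while the fine-grained model on cycle 0 issues only thread 0's single instruction and leaves two slots empty.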

SMT Benefits
SMT:
Gives the OS the illusion of several (currently two) logical processors.
Makes efficient use of resources.
Overcomes the barrier of the limited amount of ILP within a single thread.
Is implemented by dividing processor resources into replicated, partitioned, and shared.

Replicated Resources
Each logical processor has an independent:
Instruction Pointer
Register Renaming Logic
Instruction TLB
Return Stack Predictor
Advanced Programmable Interrupt Controller
Other architectural registers

Partitioned Resources
Each logical processor gets exactly half of:
Re-order buffers (ROBs)
Load/Store buffers
Several queues (e.g. scheduling, uop (micro-operation) queues)
Partitioning prevents a logical processor from monopolizing these resources.

Statically Partitioned Queue
Specific positions are assigned to each logical processor.

Dynamically Partitioned Queue
A limit is imposed on the number of positions each logical processor can use, but no specific positions are assigned.
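The two partitioning policies can be sketched as follows (a hypothetical model, not the actual P4 queue hardware): a statically partitioned queue gives each logical processor its own fixed set of slots, while a dynamically partitioned queue shares one buffer but caps each processor's occupancy so that neither can monopolize it.

```python
class StaticQueue:
    # Static partitioning: each logical processor owns a fixed
    # half of the slots; the halves never mix.
    def __init__(self, size):
        self.half = size // 2
        self.slots = [[], []]        # one sub-queue per logical processor

    def enqueue(self, proc, uop):
        if len(self.slots[proc]) < self.half:
            self.slots[proc].append(uop)
            return True
        return False                 # that processor's partition is full

class DynamicQueue:
    # Dynamic partitioning: one shared buffer, but each logical
    # processor is capped at `limit` entries.
    def __init__(self, size, limit):
        self.size, self.limit = size, limit
        self.entries = []            # (proc, uop) pairs in a shared buffer

    def enqueue(self, proc, uop):
        in_use = sum(1 for p, _ in self.entries if p == proc)
        if len(self.entries) < self.size and in_use < self.limit:
            self.entries.append((proc, uop))
            return True
        return False                 # queue full, or this processor hit its cap
```

In the static queue a processor is rejected as soon as its own half fills, even if the other half is empty; in the dynamic queue a busy processor can use more than half the slots, up to its cap.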

Shared Resources
The logical processors share the SMT-unaware resources:
Execution Units
Microarchitectural registers (GPRs, FPRs)
Caches: trace cache, L1, L2, L3
Sharing:
+ enables efficient use of resources, but...
- allows a thread to monopolize a resource (e.g. cache thrashing).

Pentium 4
32-bit NetBurst microarchitecture (hyper-pipelined)
Hyper-Threading technology
2.4 to 3.4 GHz clock frequency
800 MHz system bus
0.13-micron process technology
8 KB L1 data cache, 12K-uop trace cache (serving as the L1 instruction cache), 256 KB to 1 MB L2 cache, 2 MB L3 cache

Front-End Pipeline
(a) Trace Cache Hit (b) Trace Cache Miss

Out-Of-Order Execution Engine Pipeline

Implementation Goals Achieved
Minimal die-area cost (less than 5% additional die area).
A stall of one logical processor does not stall the other (buffering queues between pipeline logic blocks).
When only one thread is running, speed should be the same as without Hyper-Threading (partitioned resources are dedicated to it).

Single- and Multi-Task Modes
Partitioned resources are dedicated to one of the logical processors when the other is HALTed.

Operating System Optimizations
When the OS schedules threads to logical processors, it should:
HALT an inactive logical processor, to avoid wasting resources on idle loops (continuously checking for available work).
Schedule threads to logical processors on different physical processors rather than the same one (when possible), to avoid contending for the same physical execution resources.
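The second rule can be sketched as a toy placement algorithm (hypothetical; real kernels use far more elaborate schedulers): given a map from logical CPUs to physical packages, prefer an idle physical package for each new thread, and only fall back to a Hyper-Threading sibling once every package is busy.

```python
def assign_threads(num_threads, logical_to_physical):
    # logical_to_physical[c] is the physical package of logical CPU c,
    # e.g. [0, 0, 1, 1] for two Hyper-Threaded packages.
    assignment = []
    used_physical = set()
    free = list(range(len(logical_to_physical)))
    for _ in range(num_threads):
        # First choice: a logical CPU whose physical package is still idle.
        pick = next((c for c in free
                     if logical_to_physical[c] not in used_physical), None)
        if pick is None and free:
            pick = free[0]           # all packages busy: use a sibling
        if pick is None:
            break                    # more threads than logical CPUs
        free.remove(pick)
        used_physical.add(logical_to_physical[pick])
        assignment.append(pick)
    return assignment
```

With two Hyper-Threaded packages (`[0, 0, 1, 1]`), two threads land on logical CPUs 0 and 2, one per package, rather than on the two siblings of one package.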

OS Optimizations
The Linux kernel (2.6 series) distinguishes between logical and physical processors:
H-T-aware passive and active load balancing
H-T-aware task pickup
H-T-aware affinity
H-T-aware wakeup

Compiler Optimizations
Intel 8.0 C++ and FORTRAN compilers:
Automatic optimizations:
Vectorization
Advanced instruction selection
Programmer-controlled optimizations:
Insertion of Streaming SIMD Extensions 3 (SSE3) instructions
Insertion of OpenMP directives
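OpenMP directives express a fork-join parallel loop at the C or FORTRAN level; as a language-neutral sketch of the same pattern (an analogy, not OpenMP itself, and `parallel_for` is a name chosen here for illustration), the iteration space is split among worker threads and the results are joined:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(func, data, num_workers=2):
    # Fork-join pattern analogous to `#pragma omp parallel for`:
    # iterations are distributed across worker threads, and the
    # caller blocks until all of them complete (the implicit join).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(func, data))
```

The result order matches the original iteration order, just as the sequential loop would produce it.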

Performance gain from automatic optimizations
SPEC CPU 2000 shows significant speedups not only from the Hyper-Threading-specific (QxP) optimizations but also from the general Pentium 4 (QxN) ones.

Performance gain from manual optimizations
SPEC OMPM 2001 shows the speedup achieved by automatic optimizations in combination with OpenMP directives.

Thread-level Parallelism of Desktop Applications
Unlike server workloads, interactive desktop applications focus on response time rather than end-to-end throughput.
The average response-time improvement on a dual- versus a uni-processor was measured at 22%.
The application programmer has to exploit multi-threading explicitly.
More than 2 processors yield little further improvement.

Performance in Client-Server Applications
While H-T brings neither gain nor degradation in API-call and user-application workloads, it achieves considerable speedups in multi-threaded workloads.

Performance in File Server Workloads
Good speedups in multi-threaded workloads, whether they mix filesystem and socket calls or issue socket calls only.

Performance in Online Transaction Processing
21% performance gain in both the 1- and 2-processor configurations.

Performance in Web Serving
16 to 28% performance gain.

Conclusions
Hyper-Threading enables thread-level parallelism by duplicating the architectural state of the processor while sharing one set of processor execution resources.
When scheduling threads, the OS sees two logical processors.
While not providing the performance achieved by adding a second physical processor, Hyper-Threading can offer a 30% improvement.
Resource contention limits the performance benefits for certain applications.
Performance gains are most evident in multi-threaded workloads, which are usually found in servers.

References
1. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Vol. 06, Issue 01, 2002.
2. D. Tullsen et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA, 1995.
3. J. Stokes, "Introduction to Multithreading, Superthreading and Hyperthreading," Ars Technica, 2002.
4. K. Smith et al., "Support for the Intel Pentium 4 Processor with Hyper-Threading Technology in Intel 8.0 Compilers," Intel Technology Journal, Vol. 08, Issue 01, 2004.
5. D. Vianney, "Hyper-Threading Speeds Linux," IBM developerWorks, 2003.
6. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, pp. 608-615, 2003.
7. "Hyper-Threading Technology on the Intel Xeon Processor Family for Servers," Intel White Paper, 2004.
8. K. Flautner et al., "Thread-Level Parallelism and Interactive Performance of Desktop Applications," ASPLOS, 2000.

Thank you! Questions/Comments?