Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Similar documents
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Appendix D. Controller Implementation

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

This Unit: Dynamic Scheduling. Can Hardware Overcome These Limits? Scheduling: Compiler or Hardware. The Problem With In-Order Pipelines

Uniprocessors. HPC Prof. Robert van Engelen

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Instruction and Data Streams

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

1&1 Next Level Hosting

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Arquitectura de Computadores

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1

Fundamentals of. Chapter 1. Microprocessor and Microcontroller. Dr. Farid Farahmand. Updated: Tuesday, January 16, 2018

Traditional queuing behaviour in routers. Scheduling and queue management. Questions. Scheduling mechanisms. Scheduling [1] Scheduling [2]

n Explore virtualization concepts n Become familiar with cloud concepts

Outline. CSCI 4730 Operating Systems. Questions. What is an Operating System? Computer System Layers. Computer System Layers

Isn t It Time You Got Faster, Quicker?

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

A collection of open-sourced RISC-V processors

Operating System Concepts. Operating System Concepts

UNIVERSITY OF MORATUWA

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria.

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago

Software development of components for complex signal analysis on the example of adaptive recursive estimation methods.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Lecture 28: Data Link Layer

Elementary Educational Computer

COMPUTER ORGANIZATION AND DESIGN

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen

SCI Reflective Memory

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Determined by ISA and compiler. Determined by CPU hardware

Session Initiated Protocol (SIP) and Message-based Load Balancing (MBLB)

Τεχνολογία Λογισμικού

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018

GPUMP: a Multiple-Precision Integer Library for GPUs

Goals of the Lecture UML Implementation Diagrams

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

Lazy Type Changes in Object-oriented Database. Shan Ming Woo and Barbara Liskov MIT Lab. for Computer Science December 1999

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization

CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li University of Chicago

. Written in factored form it is easy to see that the roots are 2, 2, i,

Evaluation scheme for Tracking in AMI

n Learn how resiliency strategies reduce risk n Discover automation strategies to reduce risk

Panel for Adobe Premiere Pro CC Partner Solution

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs

A Study on the Performance of Cholesky-Factorization using MPI

CS2410 Computer Architecture. Flynn s Taxonomy

Computer Architecture ELEC2401 & ELEC3441

Cache-Optimal Methods for Bit-Reversals

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago

Using the Keyboard. Using the Wireless Keyboard. > Using the Keyboard

ΕΠΛ 605 Εργαστήριο 5. Παναγιώτα Νικολάου 11/10/18. Slides from: Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin

Computer Architecture ELEC3441

Lecture 1: Introduction and Fundamental Concepts 1

Multiprocessors. HPC Prof. Robert van Engelen

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Quality of Service. Spring 2018 CS 438 Staff - University of Illinois 1

Chapter 5: Processor Design Advanced Topics. Microprogramming: Basic Idea

Lecture 1: Introduction and Strassen s Algorithm

Computer Graphics Hardware An Overview

UNIVERSITY OF MORATUWA

Review: The ACID properties

Xiaozhou (Steve) Li, Atri Rudra, Ram Swaminathan. HP Laboratories HPL Keyword(s): graph coloring; hardness of approximation

The Magma Database file formats

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Lower Bounds for Sorting

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

A Parallel Reconfigurable Architecture for Real-Time Stereo Vision

Quiz: Bad Chinese Food Challenge. Three aspects of design. Lessons from the market. From market to silicon. Eighty percent of success is showing up.

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

Recall: What is an operating system? Very Brief History of OS. Very Brief History of OS. CS162 Operating Systems and Systems Programming Lecture 2

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

3D Model Retrieval Method Based on Sample Prediction

Transcription:

Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1

Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig Itel s implemetatio of Simultaeous Multithreadig Techology (SMT). Itroduced i the Foster MP-based Xeo ad the 3.06 GHz Northwood-based Petium 4 i 2002. Multiple Threads 2

Threadig A thread is the smallest sequece of programmed istructios that ca be maaged idepedetly. A thread is composed of multiple istructios. Threads exist withi the same process ad could share the same resources while differet processes do ot share the same resources. Multi-threadig is a type of executio model that allows multiple threads to exist withi the cotext of a process such that they execute idepedetly but share their process resources. A thread maitais a list of iformatio relevat to its executio icludig the priority schedule, exceptio hadlers, a set of CPU registers, ad stack state i the address space of its hostig process. Multithreadig o A Chip Multithreadig helps fid a way to hide true data depedecy stalls, cache miss stalls, ad brach stalls by fidig istructios (from other process threads) that are idepedet of these stallig istructios. Hardware multithreadig icrease the utilizatio of resources o a chip by allowig multiple processes (threads) to share the fuctioal uits of a sigle processor. Processor must duplicate the state hardware for each thread a separate register file, PC, istructio buffer, ad store buffer for each thread. The caches, TLBs, BHT, BTB, ca be shared (although the miss rates may icrease if they are ot sized accordigly). The memory ca be shared through virtual memory mechaisms. Hardware must support efficiet thread cotext switchig. 3

Overview Hyper-threadig Uses processors resources i a highly efficiet way. Cosists of two logical processors for every physical core. Separate threads ca be executed o each logical processor simultaeously. Here is a dumb video Multi-threadig is ehaced by Multi-core techology which allows the parallel executio of software threads across multiple processor cores. Simultaeous Multithreadig (SMT) A variatio o multithreadig that uses the resources of a multiple-issue, dyamically scheduled processor to exploit both program ILP ad thread-level parallelism (TLP) Most superscaler processors have more machie level parallelism tha most programs ca effectively use (i.e., tha have ILP). With register reamig ad dyamic schedulig, multiple istructios from idepedet threads ca be issued without regard to depedecies amog them Need separate reame tables (RUUs) for each thread or eed to be able to idicate which thread the etry belogs to. Need the capability to commit from multiple threads i oe cycle. Itel s Petium 4 SMT is called hyper-threadig Supports just two threads (doubles the architecture state). 4

Advatages ad Disadvatages Advatages of Hyper-threadig: Performace boost by as much as 30% whe compared to idetical processig core without hyper-threadig. Disadvatages of Hyper-threadig: Possible icrease i core size of about 5 percet caused by the duplicatio of certai sectios of the CPU core (Itel's claim). Overall power cosumptio is higher. Icreases cache thrashig (the repeated displacig ad loadig of cache blocks), which ARM claims could be as much as 42%. Hyper-Threadig Architecture Makes a sigle physical processor appear as multiple logical processors. Each logical processor has a copy of architecture state. Logical processors share a sigle set of physical executio resources. From a architecture perspective we have to worry about the logical processors usig shared resources. Caches, executio uits, brach predictors, cotrol logic, ad buses. 5

HT vs. Dual Processors Dual Processor System resources are duplicated. Hyper-Threadig Architectural state is duplicated Data, segmet, cotrol, ad debug registers. Resources shared by the logical CPUs Executio egie, caches, system-bus iterface ad firmware. This ca icreases the performace of the processor up too 30%. Implemetig Multithreadig Two mai questios whe Multithreadig: Thread schedulig policy Whe to switch threads? Pipelie partitioig How do threads share the pipelie? The choices deped o How much sacrifice you are willig to take i sigle thread performace. What kid of latecies you are willig to tolerate. 6

OS Support The OS must support HT sice it maages the threads. Types of Multithreadig Fie-Grai switch threads o every istructio issue. Roud-robi thread iterleavig (skippig stalled threads). Processor must be able to switch threads o every clock cycle. Advatage ca hide throughput losses that come from both short ad log stalls. Disadvatage slows dow the executio of a idividual thread sice a thread that is ready to execute without stalls is delayed by istructios from other threads. Coarse-Grai switches threads oly o costly stalls (e.g., L2 cache misses). Advatages thread switchig does t have to be essetially free ad much less likely to slow dow the executio of a idividual thread. Disadvatage limited, due to pipelie start-up costs, i its ability to overcome throughput loss. Pipelie must be flushed ad refilled o thread switches. 7

Types of Multithreadig Coarse-grai multithreadig (CGMT). Fie-grai multithreadig (FGMT). Simultaeous multithreadig (SMT). Fie-Grai (FGMT) Sacrifices sigificat sigle thread performace. Tolerates latecies (L2 misses, mis-predicted braches). Thread schedulig Roud-robi thread switchig after each cycle. Pipelie partitioig No flushig, dyamic. Multiple threads i pipelie at oce. Lots of threads eeded. Fidig a home i GPU s. 8

Software Implemetatio From a software aspect - sychroizatio of objects is ofte required whe implemetig software apps with multi-threadig. These objects are used to protect memory from beig modified by multiple threads at the same time. A mutex is oe such object. It is a lock which ca be locked by a thread, ad ay successive attempt to lock it by aother thread or by the same thread, will be blocked util the mutex is ulocked, thus keepig a item from beig accessed more tha oce at the same time. Computer Scietists like to refer to pieces of code protected by objects like mutexes as Critical Sectios. Software Implemetatio It is best to keep Critical Sectios as short as possible to allow the applicatio to be as parallel as possible, because the larger the Critical Sectio the greater the chace that multiple threads will access it at the same time, thus causig stalls. Itel provides several helpful templates to make implemetatio easier. 9

Coclusios Hardware support for multi-, hyper-, ad simultaeous threadig exists i all major processors. Operatig system must support multi-threadig. Types of multi-threadig Coarse-grai. Fie-grai. Simultaeous. Idividual latecies are sacrificed for the good of the etire program (or process). 10