Multi-core Programming Evolution

Based on slides from the Intel Software College and on Multi-Core Programming: Increasing Performance through Software Multi-threading by Shameem Akhter and Jason Roberts.

Evolution of Multi-Core Technology
Single-threading: only one task is processed at a time.
Multitasking and multithreading: two or more tasks execute at one time by means of context switching (functionality).
HT Technology: two threads execute simultaneously on the same processor core.
Multi-core technology: the computational work of an application is divided and spread over multiple execution cores (performance).

Formal Definition of Threads, Processors, and Hyper-Threading in Terms of Resources
Thread: the basic unit of CPU utilization; a program counter, CPU state information, and a stack.
Processor: the architecture state (general-purpose CPU registers, interrupt controller registers), plus caches, buses, execution units, and branch prediction logic.
Logical processor: duplicates the architecture state of the processor; the execution resources can be shared among the logical processors.
Hyper-Threading: Intel's version of Simultaneous Multi-Threading (SMT); it makes a single physical processor appear as multiple logical processors that the OS can schedule onto.

HyperThreading and Multicore
Hyper-Threading: instructions from the logical processors are persistent and execute on shared execution resources; it is up to the microarchitecture how and when to interleave the instructions. When one thread stalls (including on cache misses and branch mispredictions), another can proceed.
Multicore (chip multiprocessing): two or more execution cores in a single processor, i.e. two or more processors on a single die. The cores can share an on-chip cache, and multicore can be combined with SMT.
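As a small illustration of the "OS can schedule to logical processors" point above, the following sketch (assuming a Linux/glibc system; sysconf with _SC_NPROCESSORS_ONLN is a common but not universal extension) queries how many logical processors the OS sees:

```c
/* Sketch (assumes Linux/glibc): the OS exposes every logical processor,
 * whether a full core or an HT sibling, as a schedulable CPU. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN); /* logical CPUs currently online */
    printf("OS sees %ld logical processors\n", logical);
    return 0;
}
```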

Figures: Single Processor vs. Multiprocessor; Hyper-Threading vs. Multi-core Processor (block diagrams).

Figures: Multicore with Shared Cache; Multicore with Hyper-Threading Technology.

Defining Multi-Core Technology
With multi-core technology, two or more complete computational engines (each with its own CPU state, interrupt logic, cache, and execution units) are placed in a single processor. The computer splits the computational work of a threaded application and spreads it over the multiple execution cores, so more tasks get completed in less time, increasing the performance and responsiveness of the system. (Figure: computational work distributed between multiple cores.)

Evolution to Dual Core (figures): basic processor (pipelined, 1 instruction/cycle); superscalar (parallel functional units).

Evolution to Dual Core (figure): dual core, double the transistors.

Scope
Instruction level: encoding/pipelined execution to increase throughput.
Instruction level: scheduling, superscalar.
Outermost loops / whole program: data and functional parallelism.

HT Technology and Dual-Core (figures): dual processors, a processor with HT, a dual-core (DC) processor, and a dual-core processor with HT, each showing its cores and logical processors, their APICs, and the system bus. APIC (Advanced Programmable Interrupt Controller): solves interrupt-routing efficiency issues in multiprocessor computer systems.

Hyper-Threading vs. Dual Core (figure): running both threads without multitasking or Hyper-Threading Technology, with multitasking, and with dual-core processors; each step saves time.

Driving Greater Parallelism (figure): normalized performance vs. the initial Intel Pentium 4 processor, 2000-2008 (forecast); dual/multi-core performance reaches roughly 10x, while single-core performance reaches roughly 3x.

Power vs. Frequency
Baseline: one core at frequency f dissipates P0 ~ f^2.
One core at 2f: P ~ (2f)^2 = 4 P0, a 4x power increase.
Two cores per die/socket, each at f: P ~ 2 f^2 = 2 P0, only a 2x power increase.
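The 4x-vs-2x comparison follows directly from the slide's quadratic model; the tiny sketch below (the power() helper and the P ~ cores * f^2 model are assumptions made only for illustration) just evaluates it:

```c
/* Evaluates the slide's simple P ~ cores * f^2 power model. */
#include <stdio.h>

static double power(double freq, int cores) { return cores * freq * freq; }

int main(void)
{
    double p0 = power(1.0, 1);                                   /* one core at f */
    printf("one core at 2f : %.0fx P0\n", power(2.0, 1) / p0);   /* -> 4x P0      */
    printf("two cores at f : %.0fx P0\n", power(1.0, 2) / p0);   /* -> 2x P0      */
    return 0;
}
```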

How HT Technology and Dual-Core Work (figure): the physical processor cores are visible to the OS as logical processors; the diagram shows how Thread 1 and Thread 2 are allocated to the physical processor's resources over time, and the resulting throughput.

Today? Multi-Core (figure): compared with one large core, a small core draws roughly 1/4 the power and delivers roughly 1/2 the performance, so several small cores with a shared cache are more power efficient and allow better power and thermal management.

The Future? Many-core (figure): a heterogeneous multi-core platform combining general-purpose cores, special-purpose (SP) hardware, and an interconnect fabric.

Taking full advantage of multi-core requires multi-threaded software
Increased responsiveness and worker productivity: application responsiveness improves when different tasks run in parallel.
Improved performance in parallel environments when running computations on multiple processors.
More computation per cubic foot of data center; web-based apps are often multi-threaded by nature.

Threading, the most common parallel technique
The OS creates a process for each program it loads; additional threads can then be created within that process.
Each thread has its own stack and instruction pointer (IP); all threads share the process's code and data in a single shared address space, with thread-local storage if required.
Contrast this with distributed-address parallelism, e.g. the Message Passing Interface (MPI).
(Figure: one process containing data, code, and threads thread1() through threadN(), each with its own stack and IP; a minimal sketch in code follows below.)

Independent Programs vs. Parallel Programs (figure): switching between independent applications requires a process switch, while switching between threads of one application requires only a thread switch. Threads allow one application to utilize the power of multiple processors.
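A minimal pthreads sketch of this threading model (the names counter, worker, and tls_id are illustrative, not from the slides): all threads live in one address space and share the globals, each thread gets its own stack, and __thread (a common compiler extension) gives thread-local storage.

```c
/* Shared address space vs. per-thread state: compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static int counter = 0;                       /* shared data (code/data segment) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static __thread int tls_id;                   /* thread-local storage            */

static void *worker(void *arg)
{
    int local = *(int *)arg;                  /* private: lives on this thread's stack */
    tls_id = local;

    pthread_mutex_lock(&lock);
    counter += 1;                             /* shared: must be synchronized */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    int ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);

    printf("counter = %d\n", counter);        /* prints 4 */
    return 0;
}
```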

Multi-Threading on Single-Core vs. Multi-Core
Optimal application performance on multi-core architectures is achieved by effectively using threads to partition software workloads.
On a single core, multi-threading can only be used to hide latency. For example, rather than blocking the UI while printing, spawn a printing thread and free the UI thread to respond to the user.
Multi-threaded applications running on multi-core platforms have different design considerations than multi-threaded applications running on single-core platforms. On a single core, one can make assumptions that simplify writing and debugging the code; these may not be valid on multi-core platforms, e.g. in areas such as memory caching and thread priority.

Memory Caching Example
Each core may have its own cache, and the caches may get out of sync.
Example: two threads on a dual-core processor read and write neighboring memory locations. Because of data locality, the values may sit in the same cache line, so the memory system marks the line invalid after a write. The result is a cache miss even though the data being read is still valid. This is not an issue on a single core, since there is only one cache. (A sketch of this effect follows below.)
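The caching example above is usually called false sharing. In the sketch below (the array names, iteration count, and the assumed 64-byte cache line size are illustrative), two threads bump adjacent counters; the padded version typically runs noticeably faster on a multi-core machine because the counters no longer share a line.

```c
/* Two threads increment neighboring counters.  Without padding the counters
 * likely sit in one cache line and the cores keep invalidating each other's
 * copy (false sharing); padding to an assumed 64-byte line avoids it.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 50000000L

struct padded { volatile uint64_t value; char pad[64 - sizeof(uint64_t)]; };

static volatile uint64_t adjacent[2];   /* neighboring locations: may falsely share */
static struct padded     separate[2];   /* one (assumed) cache line apart           */

static void *bump_adjacent(void *arg)
{
    long i = (long)arg;
    for (long n = 0; n < ITERS; n++) adjacent[i]++;
    return NULL;
}

static void *bump_separate(void *arg)
{
    long i = (long)arg;
    for (long n = 0; n < ITERS; n++) separate[i].value++;
    return NULL;
}

int main(void)
{
    pthread_t t[2];

    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump_adjacent, (void *)i);
    for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);

    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump_separate, (void *)i);
    for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);

    printf("adjacent: %llu, separate: %llu\n",
           (unsigned long long)adjacent[0], (unsigned long long)separate[0].value);
    return 0;
}
```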

Thread Priorities Example
Consider an application with two threads of different priorities. To improve performance, the developer assumes that the higher-priority thread will run without interference from the lower-priority thread. On a single core this is valid, since the scheduler will not yield the CPU to the lower-priority thread. On multi-core, the OS may schedule the two threads on different cores, so they may run simultaneously, making the code unstable. (The sketch below illustrates the broken assumption.)
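The following sketch (the record fields and function names are invented for illustration; priority setup is omitted since it is beside the point) shows the unsafe pattern: the writer relies on its higher priority instead of a lock to keep the reader from observing a half-updated record, which can appear to work on one core but races on multi-core.

```c
/* Illustration of the broken assumption: on a single core the high-priority
 * updater is not preempted by the low-priority logger, so the logger never
 * sees a half-updated record; on multi-core both run at once on different
 * cores and the unsynchronized access races.  Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static struct record { int a, b; } rec;   /* intended invariant: a == b */

static void *high_prio_updater(void *arg)
{
    (void)arg;
    for (int i = 1; i <= 1000000; i++) {
        rec.a = i;                        /* logger on another core may run here */
        rec.b = i;                        /* ...and observe a != b               */
    }
    return NULL;
}

static void *low_prio_logger(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        if (rec.a != rec.b)
            printf("torn read: a=%d b=%d\n", rec.a, rec.b);
    return NULL;
}

int main(void)
{
    pthread_t hi, lo;                     /* fix: use a mutex, not priorities */
    pthread_create(&hi, NULL, high_prio_updater, NULL);
    pthread_create(&lo, NULL, low_prio_logger, NULL);
    pthread_join(hi, NULL);
    pthread_join(lo, NULL);
    return 0;
}
```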