CMSC 22200 Computer Architecture
Lecture 12: Multi-Core
Prof. Yanjing Li, University of Chicago

Administrative Stuff
- Lab 4
  - Due: 11:49pm, Saturday
  - Two late days with penalty
- Exam 1
  - Grades out on Thursday

Where We Are in the Lecture Schedule
- ISA
- Uarch
  - Datapath, control
  - Single cycle, multi cycle
- Pipelining: basic, dependency handling, branch prediction
- Advanced uarch: OoO, SIMD, VLIW, superscalar
- Caches
- Multi-core
- Virtual memory, DRAM

Lecture Outline
- Multi-core
  - Motivation
  - Overview and fundamental concepts
- Challenges for programmers
- Challenges for computer architects

Paradigm Shift: Single-Core to Multi-Core

Microarchitecture: Before the Early/Mid-2000s
- Pushing for single-core performance
- Clock frequency scaling ("free" from technology scaling)
- Fast memory access: on-chip caches
- Exploiting instruction-level parallelism (ILP)
  - Pipelining (branch prediction, deep pipelines)
  - Superscalar
  - Out-of-order processing
  - SIMD

Microarchitecture: After the Early/Mid-2000s
- Focus on task-level parallelism
  - Multi-core era
  - Proliferation of CMPs (chip multiprocessors)
(Image source: Intel)

Why Single-Core to Multi-Core?
- Power wall: beyond what's allowed by technology scaling
  - More complexity → more transistors → more power
  - Higher clock rate → more switching → more power
  - What limits power? Cooling
- No more large benefits from ILP
  - Diminishing returns: the degree of ILP available is limited [Olukotun, ACM Queue '05]
  - Pollack's rule: the complexity of all the additional logic required to find parallel instructions dynamically is approximately proportional to the square of the number of instructions that can be issued simultaneously (so, for example, doubling the issue width roughly quadruples that logic)

Multi-Core Benefits
- Performance
  - Latency (execution time)
  - Throughput
- Power
- Others: complexity, yield, reliability
- What are the tradeoffs?

Power Benefits of Multi-Core
- N units at frequency F/N consume less power than 1 unit at frequency F
- Dynamic power is modeled as P = α · C · V² · F, where α is the switching activity, C the capacitance, V the voltage, and F the frequency
- Assume the same workload, uarch, and technology → α · C is constant
- Lower F → lower V (linear) → cubic reduction in power
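
A rough worked example under this idealized model (my numbers, assuming voltage really can scale linearly with frequency): replace one core running at frequency F and voltage V with N = 2 cores at F/2 and V/2. Each core then draws about (1/2)² · (1/2) = 1/8 of the original dynamic power, so the pair draws about 1/4 of it while providing the same nominal throughput. In practice voltage cannot be scaled that aggressively, so the real savings are smaller.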

Multi-Core Fundamentals

Task-Level Parallelism
- Different tasks/threads executed in parallel
  - Contrast with ILP or data parallelism (SIMD)
- How to create tasks?
  - Partition a single problem into multiple related tasks (threads)
    - Explicitly: parallel programming
  - Run many independent tasks (processes) together
    - Easy when there are many processes
      - Cloud computing workloads
    - Does not improve the performance of a single task

Computers to Exploit Task-Level Parallelism
- Two types: loosely coupled vs. tightly coupled
- Loosely coupled
  - No shared global memory address space
  - Multicomputer network (e.g., datacenters, HPC systems)
  - Data sharing is explicit, e.g., via message passing
- Tightly coupled
  - Shared global memory address space
  - E.g., multi-core processors, multithreaded processors
  - Data sharing is implicit (through memory)
  - Operations on shared data require synchronization

Tightly Coupled / Shared-Memory Processors
- Logical view
- Many possible physical implementations
  - Levels of caches; uniform memory access (UMA) vs. non-uniform memory access (NUMA)

Brief Introduction to Parallel Programming

How Do Programmers Leverage Multi-Core Benefits?
- Given a single problem, programmers can no longer rely on compilers or hardware alone to improve performance, as they could in the past
- Programmers must explicitly partition the problem into multiple related tasks (threads), as sketched below
  - Different programming models: Pthreads, OpenMP, ...
  - Some programs are easy to partition; others are more difficult
  - How to guarantee correctness?
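
A minimal Pthreads sketch of such explicit partitioning, here splitting a toy vector-scaling problem across a fixed number of threads (the array, sizes, and thread count are all illustrative):

#include <pthread.h>

#define N        (1 << 20)
#define NTHREADS 4

static float v[N];                       /* the shared problem data */

static void *scale_chunk(void *arg)
{
    long id = (long)arg;                 /* which chunk this thread owns */
    long chunk = N / NTHREADS;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        v[i] *= 2.0f;                    /* each thread works on its own slice */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, scale_chunk, (void *)id);
    for (long id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    return 0;
}

Because the chunks are disjoint, no synchronization is needed here; the examples that follow show what happens when threads do share data.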

Example
- Unpredictable results, called race conditions, can happen if we don't control access to shared variables
- This is a concurrency problem; it can occur on single processors as well
- E.g., x++ from multiple threads: assume x is initialized to 0. What is the value of x after the following execution?

  CPU 1            CPU 2
  Ld  r1, x        Ld  r1, x
  Add r1, r1, 1    Add r1, r1, 1
  St  r1, x        St  r1, x

  (If both loads happen before either store, as shown, both CPUs store 1 and x ends up as 1 instead of 2.)
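
A minimal C/Pthreads sketch of this race (the iteration count and function names are illustrative); because x++ is a non-atomic read-modify-write, the final value is usually less than the expected 2 · ITERS:

#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static long x = 0;                     /* shared variable, initialized to 0 */

static void *worker(void *arg)
{
    for (long i = 0; i < ITERS; i++)
        x++;                           /* read-modify-write on shared x: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %ld (expected %d)\n", x, 2 * ITERS);
    return 0;
}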

Coordinating Access to Shared Data (I)
- Locks: a simple primitive to ensure updates to shared variables occur within a critical section
  - Many variations (spinlocks, semaphores, ...)

  CPU 1            CPU 2
  LOCK x           LOCK x
  Ld  r1, x        (wait)
  Add r1, r1, 1    (wait)
  St  r1, x        (wait)
  UNLOCK x         lock acquired
                   Ld  r1, x
                   Add r1, r1, 1
                   ...
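
A sketch of the same counter protected by a Pthreads mutex playing the role of LOCK x / UNLOCK x (the mutex name and iteration count are illustrative):

#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static long x = 0;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&x_lock);   /* LOCK x: enter critical section */
        x++;                           /* load, add, store, now mutually exclusive */
        pthread_mutex_unlock(&x_lock); /* UNLOCK x */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %ld\n", x);            /* always 2 * ITERS */
    return 0;
}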

Locks: Performance vs. Correctness
- Few locks (coarse-grain locking), e.g., one lock for an entire shared array
  + Easy to write
  -- Poor performance (processors spend a lot of time stalled waiting to acquire locks)
- Many locks (fine-grain locking), e.g., one lock for each element of a shared array
  + Good performance (minimizes contention for locks)
  -- More difficult to write
  -- Higher chance of an incorrect program (e.g., deadlock)
- Need to consider the tradeoffs very carefully (see the sketch below)
- Privatize as much as possible to avoid locking!
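
A sketch contrasting the two granularities on a shared array (names and sizes are illustrative; real code would also have to worry about lock ordering when one operation touches several elements):

#include <pthread.h>

#define N 1024

static int A[N];

/* Coarse grain: one lock protects the entire array. */
static pthread_mutex_t array_lock = PTHREAD_MUTEX_INITIALIZER;

void coarse_update(int i, int v)
{
    pthread_mutex_lock(&array_lock);    /* all updates serialize here */
    A[i] += v;
    pthread_mutex_unlock(&array_lock);
}

/* Fine grain: one lock per element; less contention, but more locks to
   manage and more ways to get things wrong (e.g., deadlock). */
static pthread_mutex_t elem_lock[N];

void init_fine_locks(void)
{
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&elem_lock[i], NULL);
}

void fine_update(int i, int v)
{
    pthread_mutex_lock(&elem_lock[i]);  /* only updates to element i serialize */
    A[i] += v;
    pthread_mutex_unlock(&elem_lock[i]);
}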

Coordinating Access to Shared Data (II)
- Barriers: globally bring all processors to the same point in the program
  - Divides a program into easily understood phases

Barrier Example

Original loop:
  For i = 1 to N
      A[i] = (A[i] + B[i]) * C[i]
      sum = sum + A[i]

Split into two phases separated by a barrier:
  For i = 1 to N
      A[i] = (A[i] + B[i]) * C[i]    // independent operations
  BARRIER
  For i = 1 to N
      sum = sum + A[i]               // reduction
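
A Pthreads version of this split (a sketch assuming POSIX barriers are available and that NTHREADS divides N; thread 0 alone performs the reduction after the barrier, so sum needs no lock):

#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 4

static double A[N], B[N], C[N];
static double sum = 0.0;
static pthread_barrier_t phase_barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    long chunk = N / NTHREADS;
    long lo = id * chunk, hi = lo + chunk;

    /* Phase 1: independent element-wise operations on this thread's chunk. */
    for (long i = lo; i < hi; i++)
        A[i] = (A[i] + B[i]) * C[i];

    /* Everyone must finish phase 1 before the reduction reads A. */
    pthread_barrier_wait(&phase_barrier);

    /* Phase 2: thread 0 alone performs the reduction, so no lock is
       needed for sum. */
    if (id == 0) {
        for (long i = 0; i < N; i++)
            sum += A[i];
        printf("sum = %f\n", sum);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}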

Barriers: Pros and Cons
+ Generally easy to reason about → easy to debug
+ Reduces the need for locks (no lock needed for the variable sum)
-- Overhead: fast processors are stalled waiting at the barrier

Performance Analysis

Parallel Speedup
- Speedup with p cores = t1 / tp
  - t1 and tp: execution time using a single core and p cores, respectively

Parallel Speedup Example
- Evaluate the polynomial a4x⁴ + a3x³ + a2x² + a1x + a0
- Assume add/mul operations take 1 cycle each and there is no communication cost
- How fast is this with a single core?
  - Assume no pipelining or concurrent execution of instructions
- How fast is this with 3 cores?
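
One possible accounting (an illustration, not necessarily the scheme used in lecture): on a single core, Horner's rule evaluates the polynomial as (((a4·x + a3)·x + a2)·x + a1)·x + a0, i.e., 4 multiplies and 4 adds = 8 cycles. With 3 cores, the powers of x and the partial products can be overlapped (one possible schedule: x² and a1·x; then x⁴, a2·x², and a3·x; then a4·x⁴, a3·x³, and a1·x + a0; then two adds to combine; then the final add), finishing in about 5 cycles, for a speedup of roughly 8/5 = 1.6.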

Superlinear Speedup
- Can speedup be greater than N with N processing elements?
- Unfair comparisons
  - Comparing the best parallel algorithm against a wimpy serial algorithm
- Cache/memory effects
  - More processors → more caches → fewer misses
  - Sometimes, to eliminate cache effects, the dataset is also increased by a factor of N

Parallel Speedup Is Usually Sublinear. Why?

Limits of Parallel Speedup

I. Serial Bottleneck: Amdahl's Law
- α: parallelizable fraction of a program
- N: number of processors

  Speedup = 1 / ((1 - α) + α/N)

- Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
- As N goes to infinity, speedup → 1/(1 - α)
  - α = 99% → max speedup = 100
- Maximum speedup is limited by the serial portion: the serial bottleneck
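
A tiny C sketch (illustrative values only) that just evaluates this formula for a few (α, N) pairs, which is essentially what the curves on the next slide plot:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - alpha) + alpha / n) */
static double amdahl_speedup(double alpha, double n)
{
    return 1.0 / ((1.0 - alpha) + alpha / n);
}

int main(void)
{
    double alphas[] = { 0.5, 0.9, 0.99 };
    double ns[]     = { 10, 100, 1000 };
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("alpha = %.2f, N = %4.0f -> speedup = %6.2f\n",
                   alphas[i], ns[j], amdahl_speedup(alphas[i], ns[j]));
    return 0;
}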

Sequential Bottleneck
- Observations
  - Diminishing returns from adding more cores
  - Speedup remains small until α is large
[Plot: speedup vs. α (parallel fraction) for N = 10, 100, 1000]

Why the Sequential Bottleneck?
- Parallel machines have the sequential bottleneck
- Main cause: non-parallelizable operations on data (e.g., non-parallelizable loops)

  for (i = 1; i < N; i++)
      A[i] = (A[i] + A[i-1]) / 2;   // each iteration depends on the previous one

- Other causes
  - A single thread prepares data and spawns parallel tasks (usually sequential)
  - Repeated code

What Else Can Be a Bottleneck?
- In Amdahl's law, the parallelizable code is assumed to be perfect, i.e., it has no overhead

  InitPriorityQueue(PQ);                    // A
  SpawnThreads();
  ForEach Thread:                           // B
      while (problem not solved)
          Lock(X)
          SubProblem = PQ.remove();         // C1
          Unlock(X);
          Solve(SubProblem);                // D1
          If (problem solved) break;
          NewSubProblems = Partition(SubProblem);
          Lock(X)
          PQ.insert(NewSubProblems);        // C2
          Unlock(X)
          ...                               // D2
  PrintSolution();                          // E

  LEGEND
  A, E: Amdahl's serial part
  B: parallel portion
  C1, C2: critical sections
  D: outside critical section

II. Bottlenecks in the Parallel Portion
- Synchronization: operations that manipulate shared data cannot be parallelized
  - Locks, barrier synchronization
  - Communication: tasks may need values from each other
  - Causes thread serialization when shared data is contended
- Load imbalance: parallel tasks may have different lengths
  - E.g., due to imperfect parallelization (e.g., 103 elements, 10 cores)
  - Reduces speedup in the parallel portion
- Resource contention: parallel tasks share hardware resources, delaying each other
  - Additional latency not present when each task runs alone

Remember: Critical Sections
- Enforce mutually exclusive access to shared data
- Only one thread can be executing a critical section at a time
- Contended critical sections make threads wait → the threads causing the serialization can be on the critical path

  Each thread:
      loop {
          Compute               // N
          lock(a)
          Update shared data    // C
          unlock(a)
      }

Remember: Barriers
- Synchronization point
- Threads have to wait until all threads reach the barrier
- The last thread arriving at the barrier is on the critical path

  Each thread:
      loop1 { Compute }
      barrier
      loop2 { Compute }

Parallel Programming Is Challenging
- Getting parallel programs to work correctly AND optimizing performance in the presence of bottlenecks
- Much of parallel computer architecture is about
  - Making the programmer's job easier in writing correct and high-performance parallel programs
  - Designing techniques that overcome the sequential and parallel bottlenecks to achieve higher performance and efficiency