CMSC 22200 Computer Architecture
Lecture 12: Multi-Core
Prof. Yanjing Li, University of Chicago
Administrative Stuff
- Lab 4
  - Due: 11:49pm, Saturday
  - Two late days with penalty
- Exam I
  - Grades out on Thursday
Where We Are in the Lecture Schedule
- ISA
- Uarch
  - Datapath, control
  - Single cycle, multi cycle
- Pipelining: basic, dependency handling, branch prediction
- Advanced uarch: OoO, SIMD, VLIW, superscalar
- Caches
- Multi-core
- Virtual memory, DRAM
Lecture Outline
- Multi-core
  - Motivation
  - Overview and fundamental concepts
- Challenges for programmers
- Challenges for computer architects
Paradigm Shift: Single-Core to Multi-Core
Microarchitecture: Before the Early/Mid-2000s
- Pushing for single-core performance
- Clock frequency scaling ("free" from technology scaling)
- Fast memory access: on-chip caches
- Exploiting instruction-level parallelism (ILP)
  - Pipelining (branch prediction, deep pipelines)
  - Superscalar
  - Out-of-order processing
  - SIMD
Microarchitecture: After the Early/Mid-2000s
- Focus on task-level parallelism
  - Multi-core era
  - Proliferation of CMPs (chip multi-processors)
[Figure: multi-core processor die. Image source: Intel]
Why Single-Core to Multi-Core?
- Power wall
  - Beyond what's allowed by technology scaling
    - More complexity → more transistors → more power
    - Higher clock rate → more switching → more power
  - What limits power? Cooling
- No more large benefits from ILP
  - Diminishing returns; the degree of exploitable ILP is limited [Olukotun, ACM Queue 2005]
  - Pollack's rule: the complexity of all the additional logic required to find parallel instructions dynamically is approximately proportional to the square of the number of instructions that can be issued simultaneously
Multi-Core Benefits
- Performance
  - Latency (execution time)
  - Throughput
- Power
- Others
  - Complexity, yield, reliability
- What are the tradeoffs?
Power Benefits of Multi-Core
- N units at frequency F/N consume less power than 1 unit at frequency F
- Dynamic power is modeled as P = α · C · V^2 · F
  (α: switching activity, C: capacitance, V: voltage, F: frequency)
- Assume the same workload, uarch, and technology → α · C is constant
- Lower F → lower V (roughly linear) → cubic reduction in power
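A minimal sketch of the arithmetic, under the idealized assumption that voltage scales linearly with frequency (real voltage floors limit how far this goes):

    P_1 = \alpha C V^2 F
        % one unit at voltage V, frequency F
    P_{\text{unit}} = \alpha C \left(\tfrac{V}{N}\right)^2 \tfrac{F}{N} = \tfrac{P_1}{N^3}
        % each slowed-down unit: the cubic reduction on the slide
    P_N = N \cdot \tfrac{P_1}{N^3} = \tfrac{P_1}{N^2}
        % N such units in total

Aggregate throughput is preserved (N units × F/N = F operations per second), so under this idealization the same work is done at roughly 1/N^2 of the total power.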
Multi-Core Fundamentals
Task-Level Parallelism
- Different tasks/threads executed in parallel
  - Contrast with ILP or data parallelism (SIMD)
- How to create tasks?
  - Partition a single problem into multiple related tasks (threads)
    - Explicitly: parallel programming
  - Run many independent tasks (processes) together
    - Easy when there are many processes, e.g., cloud computing workloads
    - Does not improve the performance of a single task
Computers to Exploit Task-Level Parallelism
- Two types: loosely coupled vs. tightly coupled
- Loosely coupled
  - No shared global memory address space
  - Multicomputer network (e.g., datacenters, HPC clusters)
  - Data sharing is explicit, e.g., via message passing
- Tightly coupled
  - Shared global memory address space
  - E.g., multi-core processors, multithreaded processors
  - Data sharing is implicit (through memory)
    - Operations on shared data require synchronization
Tightly Coupled / Shared Memory Processors
- Logical view [figure omitted]
- Many possible physical implementations
  - Levels of caches; uniform memory access (UMA) vs. non-uniform memory access (NUMA)
Brief Introduction to Parallel Programming
How Do Programmers Leverage Multi-Core Benefits?
- Given a single problem, programmers can no longer rely solely on compilers or hardware to improve performance, as they could in the past
- Programmers must explicitly partition the problem into multiple related tasks (threads)
  - Different programming models: Pthreads, OpenMP, ...
  - Some programs are easy to partition; others are more difficult
  - How do we guarantee correctness?
Example
- Unpredictable results, called race conditions, can happen if we don't control access to shared variables
  - A concurrency problem; can occur on single processors too
- E.g., x++ from multiple threads; assume x is initialized to 0. What is the value of x after the following interleaving?

    CPU 1            CPU 2
    Ld  r1, x
                     Ld  r1, x
    Add r1, r1, 1
                     Add r1, r1, 1
    St  r1, x
                     St  r1, x

  Both CPUs load x = 0, so both store 1: one increment is lost.
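A minimal sketch of this race in C with Pthreads (the thread count and iteration count are illustrative, not from the slides; compile with cc -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static long x = 0;                 /* shared, unprotected */

    /* Each x++ compiles to a load / add / store sequence, exactly
     * the Ld / Add / St interleaving shown on the slide. */
    static void *incr(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            x++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, incr, NULL);
        pthread_create(&t2, NULL, incr, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x = %ld\n", x);   /* expected 2000000; typically prints less */
        return 0;
    }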
Coordinating Access to Shared Data (I)
- Locks: a simple primitive to ensure that updates to shared variables occur within a critical section
  - Many variations (spinlocks, semaphores, ...)

    CPU 1            CPU 2
    LOCK x           LOCK x
    Ld  r1, x        wait
    Add r1, r1, 1    wait
    St  r1, x        wait
    UNLOCK x         lock acquired
                     Ld  r1, x
                     Add r1, r1, 1
                     ...
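The same increment, now protected by a lock; a minimal Pthreads sketch (the mutex name and loop bound are illustrative):

    #include <pthread.h>

    static long x = 0;
    static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *incr(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&x_lock);    /* LOCK x: at most one thread inside */
            x++;                            /* critical section: Ld / Add / St */
            pthread_mutex_unlock(&x_lock);  /* UNLOCK x */
        }
        return NULL;
    }

With the same main as the previous sketch, x is now reliably 2000000, at the cost of serializing the updates.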
Locks: Performance vs. Correctness
- Few locks (coarse-grain locking)
  - E.g., use one lock for an entire shared array
  + Easy to write
  -- Poor performance (processors spend a lot of time stalled waiting to acquire locks)
- Many locks (fine-grain locking)
  - E.g., use one lock for each element in a shared array
  + Good performance (minimizes lock contention)
  -- More difficult to write
  -- Higher chance of an incorrect program (e.g., deadlock)
- Consider the tradeoffs very carefully; privatize as much as possible to avoid locking! (See the sketch below.)
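A hedged sketch contrasting the two granularities for a shared array (array size, names, and update operation are illustrative):

    #include <pthread.h>

    #define N 1024                 /* illustrative array size */
    static int a[N];

    /* Coarse grain: a single lock serializes every update to the array. */
    static pthread_mutex_t array_lock = PTHREAD_MUTEX_INITIALIZER;

    void update_coarse(int i, int v) {
        pthread_mutex_lock(&array_lock);
        a[i] += v;
        pthread_mutex_unlock(&array_lock);
    }

    /* Fine grain: one lock per element, so updates to different elements
     * proceed in parallel. Holding several element locks at once would
     * require a fixed acquisition order to avoid deadlock. */
    static pthread_mutex_t elem_lock[N];

    void init_elem_locks(void) {
        for (int i = 0; i < N; i++)
            pthread_mutex_init(&elem_lock[i], NULL);
    }

    void update_fine(int i, int v) {
        pthread_mutex_lock(&elem_lock[i]);
        a[i] += v;
        pthread_mutex_unlock(&elem_lock[i]);
    }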
Coordinating Access to Shared Data (II)
- Barriers: globally bring all processors to the same point in the program
  - Divide a program into easily understood phases
Barrier Example

    // Original loop: mixes independent operations with a reduction
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]
        sum = sum + A[i]

    // Split into two phases separated by a barrier
    for i = 1 to N
        A[i] = (A[i] + B[i]) * C[i]   // independent operations
    BARRIER
    for i = 1 to N
        sum = sum + A[i]              // reduction

(A Pthreads version is sketched below.)
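A minimal Pthreads sketch of this phase split; the thread count, block partitioning, and per-thread partial sums are illustrative choices, not prescribed by the slide:

    #include <pthread.h>

    #define N 1000
    #define T 4                       /* number of worker threads */

    static double A[N], B[N], C[N], sum = 0.0;
    static pthread_barrier_t phase_barrier;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long)arg;
        long lo = id * N / T, hi = (id + 1) * N / T;

        /* Phase 1: independent element-wise updates. */
        for (long i = lo; i < hi; i++)
            A[i] = (A[i] + B[i]) * C[i];

        /* All threads must finish phase 1 before the reduction begins
         * (necessary whenever phase 2 may read elements written by
         * other threads). */
        pthread_barrier_wait(&phase_barrier);

        /* Phase 2: reduction via per-thread partial sums. */
        double partial = 0.0;
        for (long i = lo; i < hi; i++)
            partial += A[i];
        pthread_mutex_lock(&sum_lock);
        sum += partial;
        pthread_mutex_unlock(&sum_lock);
        return NULL;
    }

    int main(void) {
        pthread_t tid[T];
        pthread_barrier_init(&phase_barrier, NULL, T);
        for (long i = 0; i < T; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < T; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&phase_barrier);
        return 0;
    }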
Barriers: Pros and Cons
+ Generally easy to reason about → easy to debug
+ Reduce the need for locks (e.g., no lock for the variable sum in phase 1)
-- Overhead: fast processors are stalled waiting at the barrier
Performance Analysis
Parallel Speedup
- Speedup with P cores = t_1 / t_P
  - t_1 and t_P: execution time using a single core and P cores, respectively
Parallel Speedup Example
- Evaluate a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0
- Assume add/mul operations take 1 cycle each and there is no communication cost
- How fast is this with a single core?
  - Assume no pipelining or concurrent execution of instructions
- How fast is this with 3 cores? (A worked sketch follows.)
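One possible accounting, hedged: the operation counts below assume naive evaluation (no Horner's rule) and one particular 3-core schedule; neither is prescribed by the slide.

    Single core: 3 muls to form x^2, x^3, x^4, then 4 muls by a_4..a_1, then 4 adds
    → t_1 = 11 cycles

    Three cores, one operation per core per cycle (one valid schedule):
      cycle 1:  core1: x^2 = x*x       core2: p1 = a_1*x
      cycle 2:  core1: x^4 = x^2*x^2   core2: x^3 = x^2*x    core3: p2 = a_2*x^2
      cycle 3:  core1: p4 = a_4*x^4    core2: p3 = a_3*x^3   core3: s1 = a_0 + p1
      cycle 4:  core1: s2 = s1 + p2    core2: s3 = p3 + p4
      cycle 5:  core1: s  = s2 + s3
    → t_3 = 5 cycles, speedup = t_1 / t_3 = 11/5 = 2.2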
Superlinear Speedup
- Can speedup be greater than N with N processing elements?
- Unfair comparisons
  - E.g., comparing the best parallel algorithm to a wimpy serial algorithm
- Cache/memory effects
  - More processors → more caches → fewer misses
  - Sometimes, to eliminate cache effects, the dataset is also increased by a factor of N
Parallel Speedup is Usually Sublinear. Why?
Limits of Parallel Speedup
I. Serial Bottleneck: Amdahl's Law
- α: parallelizable fraction of a program
- N: number of processors

    Speedup = 1 / ((1 − α) + α/N)

- Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967.
- As N goes to infinity, speedup → 1/(1 − α)
  - α = 99% → max speedup = 100
- Maximum speedup is limited by the serial portion: the serial bottleneck
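A quick sanity check of the α = 99% claim (plain arithmetic, not on the original slide): even 100 cores fall well short of the limit.

    \text{Speedup}(N{=}100,\ \alpha{=}0.99)
      = \frac{1}{(1-0.99) + 0.99/100}
      = \frac{1}{0.0199} \approx 50
    \qquad
    \lim_{N \to \infty} \text{Speedup} = \frac{1}{1-0.99} = 100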
Sequential Bottleneck
- Observations
  - Diminishing returns from adding more cores
  - Speedup remains small until α is large
[Figure: speedup vs. parallel fraction α, plotted for N = 10, 100, 1000]
Why the Sequential Bottleneck?
- All parallel machines have the sequential bottleneck
- Main cause: non-parallelizable operations on data (e.g., non-parallelizable loops)

    /* Each iteration reads the value the previous iteration wrote
       (a loop-carried dependence), so iterations cannot run in parallel. */
    for (i = 1; i < N; i++)
        A[i] = (A[i] + A[i-1]) / 2;

- Other causes:
  - A single thread prepares data and spawns parallel tasks (usually sequential)
  - Repeated code
What Else Can Be a Bottleneck?
- In Amdahl's law, the parallelizable code is perfect, i.e., has no overhead

    InitPriorityQueue(PQ);                      // A
    SpawnThreads();
    ForEach Thread:
        while (problem not solved)              // B: parallel portion
            Lock(X)
            SubProblem = PQ.remove();           // C1: critical section
            Unlock(X);
            Solve(SubProblem);                  // D1: outside critical section
            If (problem solved) break;
            NewSubProblems = Partition(SubProblem);
            Lock(X)
            PQ.insert(NewSubProblems);          // C2: critical section
            Unlock(X)
            ...                                 // D2: outside critical section
    PrintSolution();                            // E

  Legend: A, E: Amdahl's serial part; B: parallel portion; C1, C2: critical sections; D1, D2: outside critical sections
II. Bottlenecks in the Parallel Portion
- Synchronization: operations manipulating shared data cannot be parallelized
  - Locks / barrier synchronization
  - Communication: tasks may need values from each other
  - Causes thread serialization when shared data is contended
- Load imbalance: parallel tasks may have different lengths
  - E.g., due to imperfect parallelization (e.g., 103 elements on 10 cores; see the arithmetic below)
  - Reduces speedup in the parallel portion
- Resource contention: parallel tasks can share hardware resources, delaying each other
  - Additional latency not present when each task runs alone
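A hedged illustration of the 103-elements-on-10-cores example, assuming one cycle per element and a static block partition (this cost model is an assumption): three cores get 11 elements, seven get 10, and the slowest core sets the finish time.

    t_1 = 103, \qquad
    t_{10} = \left\lceil \tfrac{103}{10} \right\rceil = 11
    \quad\Rightarrow\quad
    \text{Speedup} = \frac{103}{11} \approx 9.4 < 10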
Remember: Critical Sections
- Enforce mutually exclusive access to shared data
- Only one thread can be executing a critical section at a time
- Contended critical sections make threads wait → threads causing serialization can be on the critical path

    Each thread:
    loop {
        Compute               // N
        lock(A)
        Update shared data    // C
        unlock(A)
    }
Remember: Barriers
- Synchronization point
- Threads have to wait until all threads reach the barrier
- The last thread to arrive at the barrier is on the critical path

    Each thread:
    loop1 {
        Compute
    }
    barrier
    loop2 {
        Compute
    }
Parallel Programming is Challenging
- Getting parallel programs to work correctly, AND
- Optimizing performance in the presence of bottlenecks
- Much of parallel computer architecture is about
  - Making the programmer's job easier in writing correct and high-performance parallel programs
  - Designing techniques that overcome the sequential and parallel bottlenecks to achieve higher performance and efficiency