CS3350B Computer Architecture


CS3350B Computer Architecture, Winter 2015
Lecture 7.2: Multicore TLP (1)
Marc Moreno Maza
www.csd.uwo.ca/courses/cs3350b
[Adapted from lectures on Computer Organization and Design, Patterson & Hennessy, 4th or 5th edition, 2011]

Review: Multiprocessor Systems (MIMD)
Multiprocessor (Multiple Instruction Multiple Data): a computer system with at least 2 processors.
[Figure: each processor has its own cache; an interconnection network links the processors to shared memory and I/O.]
Goals:
- Deliver high throughput for independent jobs via job-level parallelism, on top of ILP.
- Improve the run time of a single program that has been specially crafted to run on a multiprocessor - a parallel processing program.
We now use the term core for each processor ("Multicore"), because "multiprocessor microprocessor" is too redundant.

Review
- Sequential software is slow software: SIMD and MIMD are the only path to higher performance.
- A shared-memory multiprocessor (multicore, SMP) gives all cores a single address space.
- Cache coherency implements shared memory even with multiple copies of a block in multiple caches; false sharing is a concern.
- The MESI protocol ensures cache coherence and has optimizations for common cases.

Multiprocessors and You
- The only path to higher performance is parallelism: clock rates are flat or declining.
- SIMD: 2x width every 3-4 years (128 bits now, 256 bits in 2011, 512 bits in 2014?, 1024 bits in 2018?). Advanced Vector Extensions are 256 bits wide!
- MIMD: add 2 cores every 2 years: 2, 4, 6, 8, 10, ...
- A key challenge is to craft parallel programs that keep high performance on multiprocessors as the number of processors increases, i.e., that scale. The obstacles are scheduling, load balancing, time for synchronization, and overhead for communication.

Example: Sum Reduction
Sum 100,000 numbers on a 100-processor SMP. Each processor has an ID Pn, with 0 <= Pn <= 99.
Phase I: partition the data, 1000 numbers per processor, and compute an initial partial sum on each processor:
  sum[Pn] = 0;                                  /* 0 <= Pn <= 99 */
  for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
Phase II: add these partial sums by a divide-and-conquer reduction: half the processors add pairs, then a quarter, and so on. The processors need to synchronize between reduction steps.

Example: Sum Reduction (Second Phase)
After each processor has computed its local sum, this code runs simultaneously on every core:
  half = 100;
  repeat
    synch();                       /* Proc 0 sums the extra element if there is one */
    if (half % 2 != 0 && Pn == 0)
      sum[0] = sum[0] + sum[half-1];
    half = half / 2;               /* dividing line on who sums */
    if (Pn < half)
      sum[Pn] = sum[Pn] + sum[Pn+half];
  until (half == 1);
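The slide's pseudocode leaves synch() and the launch of the per-processor code abstract. Below is a minimal C sketch of both phases, assuming POSIX threads with barriers (pthread_barrier_wait plays the part of synch()); the thread count P, array size N, and the worker function name are illustrative choices, not part of the course material.

  /* sum_reduction.c - a minimal sketch, assuming POSIX threads and barriers
     (pthread_barrier_t is not available on every platform, e.g., macOS).
     Compile with: gcc -pthread sum_reduction.c */
  #include <pthread.h>
  #include <stdio.h>

  #define P 4                 /* number of worker threads ("processors") */
  #define N 100000            /* number of values to sum */

  static double A[N];
  static double sum[P];       /* one partial sum per thread */
  static pthread_barrier_t barrier;

  static void *worker(void *arg) {
      long Pn = (long)arg;
      long chunk = N / P;

      /* Phase I: each thread sums its own slice of A */
      sum[Pn] = 0.0;
      for (long i = chunk * Pn; i < chunk * (Pn + 1); i++)
          sum[Pn] += A[i];

      /* Phase II: tree reduction, mirroring the slide's loop */
      long half = P;
      do {
          pthread_barrier_wait(&barrier);       /* the slide's synch() */
          if (half % 2 != 0 && Pn == 0)         /* odd count: thread 0 takes the extra element */
              sum[0] += sum[half - 1];
          half = half / 2;
          if (Pn < half)
              sum[Pn] += sum[Pn + half];
      } while (half > 1);
      return NULL;
  }

  int main(void) {
      pthread_t tid[P];
      for (long i = 0; i < N; i++) A[i] = 1.0;  /* expected total: 100000 */
      pthread_barrier_init(&barrier, NULL, P);
      for (long p = 0; p < P; p++) pthread_create(&tid[p], NULL, worker, (void *)p);
      for (long p = 0; p < P; p++) pthread_join(tid[p], NULL);
      printf("total = %f\n", sum[0]);           /* sum[0] holds the grand total */
      return 0;
  }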

An Example with 10 Processors
[Figure: tree reduction over sum[P0]..sum[P9]. With half = 10, processors P0-P9 hold the ten partial sums; after one step half = 5 and P0-P4 hold pairwise sums; then half = 2 and only P0-P1 remain active (P0 also absorbing the leftover element of the odd half); finally half = 1 and P0 holds the total.]

Threads
- A thread of execution is the smallest unit of processing scheduled by the operating system.
- Threads have their own state or context: program counter, register file, stack pointer, ...
- Threads share a memory address space.
- Note: a process is a heavier-weight construct with its own address space; a process typically contains one or more threads. Neither is to be confused with a processor, which is a physical device (i.e., a core).
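A minimal sketch of these points (mine, not from the slides), assuming POSIX threads: each thread gets its own stack, so the local variable below lives at a different address in each thread, while the global is one object shared by both.

  /* threads_share_memory.c - compile with: gcc -pthread threads_share_memory.c */
  #include <pthread.h>
  #include <stdio.h>

  static int shared = 0;                  /* one copy, visible to every thread */

  static void *run(void *arg) {
      int local = (int)(long)arg;         /* one copy per thread, on that thread's own stack */
      shared = local;                     /* both threads write the same global */
      printf("thread %d: &local = %p, &shared = %p\n", local, (void *)&local, (void *)&shared);
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, run, (void *)1L);
      pthread_create(&t2, NULL, run, (void *)2L);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("final value of shared: %d\n", shared);   /* 1 or 2, depending on scheduling */
      return 0;
  }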

Memory Model for Multi-threading
[Figure: the memory model of a multi-threaded process.] Such a process can be specified in a language with MIMD support, such as OpenMP and CilkPlus.
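As a concrete illustration of this shared-memory model (a sketch of mine, not from the slides): in OpenMP the input array and the result are shared by all threads, while the loop index and the per-thread partial sums created by the reduction clause are private. The file name and sizes are illustrative.

  /* openmp_sum.c - compile with: gcc -fopenmp openmp_sum.c */
  #include <omp.h>
  #include <stdio.h>

  #define N 100000

  int main(void) {
      static double A[N];                 /* shared by all threads */
      double total = 0.0;                 /* shared; combined by the reduction */
      for (int i = 0; i < N; i++) A[i] = 1.0;

      #pragma omp parallel for reduction(+:total)   /* each thread sums a chunk privately,
                                                       then OpenMP adds the partial sums */
      for (int i = 0; i < N; i++)
          total += A[i];

      printf("total = %f (max threads: %d)\n", total, omp_get_max_threads());
      return 0;
  }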

Multithreading
- On a single processor, multithreading occurs by time-division multiplexing: the processor is switched between different threads (pre-emptively or non-pre-emptively). Context switching happens frequently enough that the user perceives the threads as running at the same time.
- On a multiprocessor, threads really do run at the same time, with each processor running a thread.

Multithreading vs. Multicore
- Basic idea: processor resources are expensive and should not be left idle.
- For example, on a cache miss with a long latency to memory, the hardware switches threads to bring in other useful work while waiting for the miss. The cost of a thread context switch must be much less than the cache miss latency.
- Put in redundant hardware (PC, registers, ...) so the context does not have to be saved on every thread switch.
- Attractive for applications with abundant TLP.

Data Races and Synchronization
- Two memory accesses form a data race if they come from different threads, target the same location, at least one of them is a write, and they occur one after another.
- If there is a data race, the result of the program can vary by chance (which thread ran first?).
- Avoid data races by synchronizing writing and reading to get deterministic behavior.
- Synchronization is done by user-level routines that rely on hardware synchronization instructions.
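A minimal sketch of a data race (illustrative, not from the slides), assuming POSIX threads: each "counter++" compiles to a load, an add, and a store, just like the lw/addi/sw sequence on the next slide, so interleaved updates can be lost.

  /* data_race.c - compile with: gcc -O2 -pthread data_race.c */
  #include <pthread.h>
  #include <stdio.h>

  #define ITERS 1000000

  static volatile long counter = 0;       /* shared, deliberately unsynchronized */

  static void *increment(void *arg) {
      (void)arg;
      for (long i = 0; i < ITERS; i++)
          counter++;                      /* load, add 1, store: a racy read-modify-write */
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, increment, NULL);
      pthread_create(&t2, NULL, increment, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter); /* 2000000 without the race; typically smaller */
      return 0;
  }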

Question: Consider the following code when executed concurrently by two threads. What possible values can result in *($s0)?
  # *($s0) = 100 initially
  lw   $t0,0($s0)
  addi $t0,$t0,1
  sw   $t0,0($s0)
(a) 101 or 102   (b) 100, 101, or 102   (c) 100 or 101

Lock and Unlock Synchronization
- A lock is used to create a region (critical section) where only one thread can operate at a time.
- Given shared memory, use a memory location as the synchronization point: a lock, semaphore, or mutex.
- A thread reads the lock to see whether it must wait or may enter the critical section (setting the lock to locked as it enters):
  0 => lock is free / open / unlocked / lock off
  1 => lock is set / closed / locked / lock on
- Usage pattern:
  Set the lock
  Critical section (only one thread gets to execute this section of code at a time), e.g., change shared variables
  Unset the lock
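In practice the lock is usually wrapped in a library primitive. The sketch below (mine, assuming POSIX threads) fixes the racy counter from the data-race example with a pthread_mutex_t, following exactly the set-lock / critical-section / unset-lock pattern.

  /* mutex_counter.c - compile with: gcc -pthread mutex_counter.c */
  #include <pthread.h>
  #include <stdio.h>

  #define ITERS 1000000

  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* the shared lock variable */

  static void *increment(void *arg) {
      (void)arg;
      for (long i = 0; i < ITERS; i++) {
          pthread_mutex_lock(&lock);      /* set the lock (wait if another thread holds it) */
          counter++;                      /* critical section: change the shared variable */
          pthread_mutex_unlock(&lock);    /* unset the lock */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, increment, NULL);
      pthread_create(&t2, NULL, increment, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter); /* always 2000000 */
      return 0;
  }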

Possible Lock Implementation
Lock (a.k.a. busy wait):
  Get_lock:                          # $s0 -> addr of lock
          addiu $t1,$zero,1          # $t1 = locked value
  Loop:   lw    $t0,0($s0)           # load lock
          bne   $t0,$zero,Loop       # loop if locked
  Lock:   sw    $t1,0($s0)           # unlocked, so lock
Unlock:
  Unlock: sw    $zero,0($s0)
Any problems with this?

Possible Lock Problem (time flows downward)
  Thread 1                         Thread 2
  addiu $t1,$zero,1
  Loop: lw $t0,0($s0)
                                   addiu $t1,$zero,1
                                   Loop: lw $t0,0($s0)
  bne   $t0,$zero,Loop
  Lock: sw $t1,0($s0)
                                   bne   $t0,$zero,Loop
                                   Lock: sw $t1,0($s0)
Both threads think they have set the lock! Exclusive access is not guaranteed!

Hardware-supported Synchronization
- Hardware support is required to prevent an interloper (either a thread on another core or a thread on the same core) from changing the value between the read and the write.
- This takes the form of an atomic read/write memory operation: no other access to the location is allowed between the read and the write.
- It could be a single instruction, e.g., an atomic swap of a register and a memory word, or an atomic pair of instructions.
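High-level languages expose the same primitives. As a sketch of mine (not the course's), C11's atomic_exchange performs exactly the register/memory atomic swap described above, and that alone is enough to build a spinlock:

  /* spinlock_swap.c - compile with: gcc -std=c11 spinlock_swap.c */
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int lock = 0;             /* 0 = free, 1 = locked */

  static void acquire(atomic_int *l) {
      /* Keep swapping 1 in; when the old value comes back 0, the lock was free and is now ours. */
      while (atomic_exchange(l, 1) != 0)
          ;                               /* busy wait (spin) */
  }

  static void release(atomic_int *l) {
      atomic_store(l, 0);                 /* unset the lock */
  }

  int main(void) {
      acquire(&lock);
      puts("in the critical section");    /* only one thread at a time would get here */
      release(&lock);
      return 0;
  }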

Synchronization in MIPS
- Load linked: ll rt, off(rs) loads rt with the contents of Mem(off+rs) and reserves the memory address off+rs by storing it in a special link register (Rlink).
- Store conditional: sc rt, off(rs) checks whether the reservation of the memory address is still valid in the link register. If so, the contents of rt are written to Mem(off+rs) and rt is set to 1; otherwise no memory store is performed and 0 is written into rt.
  - Returns 1 (success) if the location has not changed since the ll.
  - Returns 0 (failure) if the location has changed.
- Note that sc clobbers the register value being stored (rt)! Keep a copy elsewhere if you plan to retry on failure or to use the value later.

Synchronization in MIPS: Example
Atomic swap (to test/set a lock variable); exchange the contents of a register and memory: $s4 <-> Mem($s1)
  try: add $t0,$zero,$s4       # copy value to be exchanged
       ll  $t1,0($s1)          # load linked
                               # (sc would fail if another thread executes an sc here)
       sc  $t0,0($s1)          # store conditional
       beq $t0,$zero,try       # loop if sc fails
       add $s4,$zero,$t1       # put the loaded value in $s4

Test-and-Set
In a single atomic operation:
- Test whether a memory location is set (contains a 1).
- Set it (to 1) if it isn't (i.e., it contained a zero when tested).
- Otherwise indicate that the set failed, so the program can try again.
While it is being accessed, no other instruction can modify the memory location, including other test-and-set instructions. Useful for implementing lock operations.

Test-and-Set in MIPS
Example: MIPS sequence implementing a test-and-set of the lock at ($s1):
  Try:    addiu $t0,$zero,1        # $t0 = locked value
          ll    $t1,0($s1)         # load linked: read the lock
          bne   $t1,$zero,Try      # already locked: try again
          sc    $t0,0($s1)         # store conditional: try to set the lock
          beq   $t0,$zero,Try      # sc failed: try again
  Locked:
          ... critical section ...
          sw    $zero,0($s1)       # release the lock
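For comparison, here is a portable counterpart (a sketch of mine, not the course's): C11's atomic_flag_test_and_set performs the same test-and-set in one atomic operation, so the ll/sc retry loop above collapses to a single while.

  /* test_and_set.c - compile with: gcc -std=c11 test_and_set.c */
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear = unlocked */

  static void acquire(void) {
      while (atomic_flag_test_and_set(&lock))   /* returns true while another thread holds it */
          ;                                     /* busy wait */
  }

  static void release(void) {
      atomic_flag_clear(&lock);                 /* the sw $zero,0($s1) of the MIPS version */
  }

  int main(void) {
      acquire();
      puts("critical section");                 /* only one thread at a time would run this */
      release();
      return 0;
  }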

Summary
- Sequential software is slow software: SIMD and MIMD are the only path to higher performance.
- A multiprocessor (multicore) uses shared memory (a single address space).
- Cache coherency implements shared memory even with multiple copies in multiple caches; false sharing is a concern.
- Synchronization is built on hardware primitives: MIPS does it with Load Linked + Store Conditional.