Lecture 21 Thread Level Parallelism II!

inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 21 Thread Level Parallelism II! Dan Garcia, Miki Lustig. A lot of computing pioneers were women; for decades, the number of women studying computer science was growing faster than the number of men. But in 1984, something changed: early personal computers weren't much more than toys, and these toys were marketed almost entirely to men and boys. http://www.npr.org/blogs/money/2014/10/21/357629765/when-women-stopped-coding

Parallel Processing: Multiprocessor Systems (MIMD). MP = a computer system with at least 2 processors. [Diagram: several processors, each with its own cache, connected by an interconnection network to shared memory and I/O.] Q1: How do they share data? Q2: How do they coordinate? Q3: How many processors can be supported?

Shared Memory Multiprocessor (SMP). Q1: Single address space shared by all processors/cores. Q2: Processors coordinate/communicate through shared variables in memory (via loads and stores); use of shared data must be coordinated via synchronization primitives (locks) that allow access to the data by only one processor at a time. All multicore computers today are SMPs.
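As a concrete illustration of coordinating access to shared data with a lock, here is a minimal pthreads sketch (not from the slides; the variable names, thread count, and iteration count are illustrative). Without the lock, the final count would usually come out short because the increments race.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                  /* shared variable in the single address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* synchronization primitive */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);        /* only one core touches counter at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000; compile with cc -pthread */
    return 0;
}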

CS10 Review: Higher Order Functions (CS10: Sum Reduction). Useful HOFs (you can build your own!): map Reporter over List — report a new list, every element E of List becoming Reporter(E). keep items such that Predicate from List — report a new list, keeping only the elements E of List for which Predicate(E) is true. combine with Reporter over List — combine all the elements of List with Reporter; this is also known as reduce!
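For readers who prefer C to blocks, here is a small sketch (not from the slides) of the same three HOFs using function pointers over int arrays; all names are illustrative.

#include <stdio.h>

typedef int (*Reporter)(int);          /* "Reporter over List" */
typedef int (*Predicate)(int);
typedef int (*Combiner)(int, int);

/* map: report a new list, every element becoming f(element) */
void map(Reporter f, const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) out[i] = f(in[i]);
}

/* keep: report a new list with only the elements where p(element) is true */
int keep(Predicate p, const int *in, int *out, int n) {
    int k = 0;
    for (int i = 0; i < n; i++) if (p(in[i])) out[k++] = in[i];
    return k;                          /* number of elements kept */
}

/* combine (reduce): fold all elements together with f */
int combine(Combiner f, const int *in, int n) {
    int acc = in[0];
    for (int i = 1; i < n; i++) acc = f(acc, in[i]);
    return acc;
}

int square(int x)     { return x * x; }
int is_even(int x)    { return x % 2 == 0; }
int add(int x, int y) { return x + y; }

int main(void) {
    int a[] = {1, 2, 3, 4}, sq[4], ev[4];
    map(square, a, sq, 4);                    /* {1, 4, 9, 16} */
    int k = keep(is_even, sq, 4, ev);         /* {4, 16} */
    printf("%d\n", combine(add, ev, k));      /* prints 20 */
    return 0;
}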

Is this a good model when you have multiple cores to help you? [Diagram: combine with Reporter over the list a, b, c, d.]

CS61C Example: Sum Reduction. Sum 100,000 numbers on a 100-processor SMP. Each processor has an ID: 0 ≤ Pn ≤ 99. Partition the work: 1000 numbers per processor. Initial summation on each processor [Phase I], a.k.a. the map phase:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

Now we need to add these partial sums [Phase II]. Reduction: divide and conquer in the reduce phase. Half the processors add pairs, then a quarter, ... Need to synchronize between reduction steps.

Example: Sum Reduction, Second Phase. After each processor has computed its local sum, this code runs simultaneously on each core:

half = 100;
repeat
    synch();
    /* Proc 0 sums extra element if there is one */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
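The slide code is pseudocode (repeat/until and synch() are not C). Below is a minimal runnable sketch of both phases using POSIX threads, where a pthread barrier plays the role of synch(); the thread count, array contents, and names are illustrative assumptions, not from the lecture (the slides use 100 processors and 100,000 numbers).

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4       /* use at least 2 */
#define N 100000

static double A[N];
static double sum[NUM_THREADS];
static pthread_barrier_t barrier;     /* plays the role of synch() */

static void *worker(void *arg) {
    long Pn = (long)arg;              /* this thread's ID: 0 .. NUM_THREADS-1 */
    long chunk = N / NUM_THREADS;

    /* Phase I ("map"): each thread sums its own partition */
    sum[Pn] = 0;
    for (long i = chunk * Pn; i < chunk * (Pn + 1); i++)
        sum[Pn] += A[i];

    /* Phase II ("reduce"): divide-and-conquer pairwise addition */
    long half = NUM_THREADS;
    do {
        pthread_barrier_wait(&barrier);   /* synchronize between reduction steps */
        if (half % 2 != 0 && Pn == 0)     /* thread 0 picks up the odd element */
            sum[0] += sum[half - 1];
        half /= 2;                        /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < N; i++) A[i] = 1.0;          /* expected total: 100000 */
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    printf("total = %.0f\n", sum[0]);                 /* compile with cc -pthread */
    return 0;
}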

An Example with 10 Processors. [Diagram: processors P0–P9 each hold their own partial sum sum[P0]–sum[P9]; half = 10.]

An Example with 10 Processors, continued. [Diagram: with half = 10, P0–P4 each add a pair, leaving five partial sums (half = 5); because half is now odd, the line if (half % 2 != 0 && Pn == 0) sum[0] = sum[0] + sum[half-1]; folds in the extra element; then P0 and P1 reduce to two partial sums (half = 2), and finally P0 adds the last pair (half = 1).]

Memory Model for Multi-threading: can be specified in a language with MIMD support, such as OpenMP.
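OpenMP itself is next lecture's topic, but as a preview of how a program declares which variables are shared and which are private, here is a tiny sketch (not from the slides; the array size and thread count are arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum[8] = {0};                        /* shared: one copy visible to every thread */
    #pragma omp parallel num_threads(8) default(none) shared(sum)
    {
        int Pn = omp_get_thread_num();       /* private: declared inside the parallel region */
        sum[Pn] = Pn;                        /* each thread writes only its own slot */
    }
    printf("sum[3] = %d\n", sum[3]);         /* prints 3; compile with cc -fopenmp */
    return 0;
}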

Peer Instruction. Consider the reduction code again:

half = 100;
repeat
    synch();
    /* Proc 0 sums extra element if there is one */
    if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);

What goes in Shared? What goes in Private? For (half, sum, Pn):
(a) Private, Private, Private
(b) Private, Private, Shared
(c) Private, Shared, Private
(d) Shared, Shared, Private
(e) Shared, Shared, Shared

Peer Instruction answer: (c) Private, Shared, Private. sum must be shared, since it is the one array of partial sums that every thread reads and writes; Pn must be private, since it is each thread's own ID; and half must be private, because if it were shared, every thread executing half = half/2 would shrink it once per thread per step instead of once per step.

Three Key Questions about Multiprocessors. Q3: How many processors can be supported? The key bottleneck in an SMP is the memory system. Caches can effectively increase memory bandwidth and open up that bottleneck. But what happens to memory that is actively shared among the processors through the caches?

Shared Memory and Caches. What if Processors 1 and 2 read Memory[1000] (value 20)? [Diagram: Processor 1's and Processor 2's caches each now hold address 1000 with value 20; memory still holds 20.]

Shared Memory and Caches. What if Processors 1 and 2 read Memory[1000], and then Processor 0 writes Memory[1000] with 40? [Diagram: Processor 0's cache and memory now hold 40 for address 1000, while Processors 1 and 2 still cache the stale value 20.] Processor 0's write invalidates the other copies (easy! just turn their Valid bits off).

Keeping Multiple Caches Coherent. The architect's job: shared memory ⇒ keep cache values coherent. Idea: when any processor has a cache miss or writes, notify the other processors via the interconnection network. If processors are only reading, many of them can have copies. If a processor writes, invalidate all other copies. A shared, written result can ping-pong between caches.

How Does HW Keep $ Coherent? Each cache tracks the state of each block it holds. Shared: up-to-date data, not allowed to write; other caches may have a copy; the copy in memory is also up-to-date. Modified: up-to-date, changed (dirty), OK to write; no other cache has a copy; the copy in memory is out-of-date, so this cache must respond to read requests. Invalid: not really in the cache.

2 Optional Performance Optimizations of Cache Coherency via New States. Exclusive: up-to-date data, OK to write (change to Modified); no other cache has a copy; the copy in memory is up-to-date. Avoids writing to memory if the block is replaced, and supplies the data on a read instead of going to memory. Owned: up-to-date data, OK to write (invalidate shared copies first, then change to Modified); other caches may have a copy (they must be in the Shared state); the copy in memory is not up-to-date, so the owner must supply the data on a read instead of going to memory.

Common Cache Coherency Protocol: MOESI (a snoopy protocol). Each block in each cache is in one of the following states: Modified (in cache), Owned (in cache), Exclusive (in cache), Shared (in cache), Invalid (not in cache). Compatibility Matrix: the allowed states for a given cache block in any pair of caches.
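The matrix itself appears only as an image on the slide; filling it in from the state definitions above, the standard MOESI compatibility matrix looks like this (✓ means the two states may coexist for the same block in two different caches):

        M    O    E    S    I
   M    –    –    –    –    ✓
   O    –    –    –    ✓    ✓
   E    –    –    –    –    ✓
   S    –    ✓    –    ✓    ✓
   I    ✓    ✓    ✓    ✓    ✓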

Cache Coherency and Block Size. Suppose the block size is 32 bytes, Processor 0 is reading and writing variable X, Processor 1 is reading and writing variable Y, and X is at location 4000 while Y is at 4012. What will happen? The effect is called false sharing. How can you prevent it?
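Here is a minimal sketch (not from the slides) of false sharing and one common way to prevent it, padding each variable out to its own cache line; the struct names, iteration count, and 64-byte line size are illustrative assumptions.

#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* x and y sit a few bytes apart, so they share one cache block
   (like X at 4000 and Y at 4012 on the slide): every write by one
   core invalidates the other core's copy of the whole block. */
struct { long x; long y; } together;

/* The fix: pad so x and y land in different cache lines (64 bytes here),
   so the two cores never fight over the same block. */
struct { long x; char pad[64 - sizeof(long)]; long y; } padded;

static void *bump_x(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) padded.x++;   /* swap in together.x to see the slowdown */
    return NULL;
}
static void *bump_y(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) padded.y++;   /* swap in together.y to see the slowdown */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_x, NULL);
    pthread_create(&t2, NULL, bump_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %ld, y = %ld\n", padded.x, padded.y);
    return 0;
}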

Dan's/Miki's Laptop? Output of sysctl hw (be careful! you can *change* some of these values with the wrong flags!):

hw.ncpu: 2
hw.byteorder: 1234
hw.memsize: 8589934592
hw.activecpu: 2
hw.physicalcpu: 2
hw.physicalcpu_max: 2
hw.logicalcpu: 2
hw.logicalcpu_max: 2
hw.cputype: 7
hw.cpusubtype: 4
hw.cpu64bit_capable: 1
hw.cpufamily: 2028621756
hw.cacheconfig: 2 1 2 0 0 0 0 0 0 0
hw.cachesize: 8321499136 32768 6291456 0 0 0 0 0 0 0
hw.pagesize: 4096
hw.busfrequency: 1064000000
hw.busfrequency_min: 1064000000
hw.busfrequency_max: 1064000000
hw.cpufrequency: 3060000000
hw.cpufrequency_min: 3060000000
hw.cpufrequency_max: 3060000000
hw.cachelinesize: 64
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 6291456
hw.tbfrequency: 1000000000
hw.packages: 1
hw.optional.floatingpoint: 1
hw.optional.mmx: 1
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.supplementalsse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 0
hw.optional.x86_64: 1
hw.optional.aes: 0
hw.optional.avx1_0: 0
hw.optional.rdrand: 0
hw.optional.f16c: 0
hw.optional.enfstrg: 0
hw.machine = x86_64

And In Conclusion: Sequential software is slow software; SIMD and MIMD are the only path to higher performance. A multiprocessor (multicore) uses Shared Memory (a single address space). Cache coherency implements shared memory even with multiple copies in multiple caches; false sharing is a concern. Next time: OpenMP as a simple parallel extension to C.