CS252 Spring 2017 Graduate Computer Architecture. Lecture 14: Multithreading Part 2 Synchronization 1



1 CS252 Spring 2017 Graduate Computer Architecture
Lecture 14: Multithreading Part 2, Synchronization 1
Lisa Wu, Krste Asanovic
WU UCB CS252 SP17

2 Last Time in Lecture 13
Shared memory multiprocessor cache coherence
Directory protocol
Multithreading
Traditional multithreading
Simultaneous multithreading
Resource management: replication, sharing, dynamic partitioning, static partitioning

3 Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Threads 1 to 5, idle slot]
CS252, Fall 2015, Lecture 14

4 O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
OOO instruction window already has most of the circuitry required to schedule from multiple threads
Any single thread can utilize whole machine

5 Icount Choosing Policy
Fetch from the thread with the fewest instructions in flight.
Why does this enhance throughput?
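The selection rule on this slide can be sketched in a few lines of C. This is only an illustration: the function name `icount_pick`, the per-thread counter array, and the tie-break toward the lowest-numbered thread are assumptions, not details from the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the thread to fetch from under an ICOUNT-style policy:
   the one with the fewest instructions currently in flight.
   Ties go to the lowest-numbered thread (an arbitrary choice here). */
size_t icount_pick(const unsigned in_flight[], size_t num_threads) {
    assert(in_flight != NULL && num_threads > 0);
    size_t best = 0;
    for (size_t t = 1; t < num_threads; t++) {
        if (in_flight[t] < in_flight[best]) {
            best = t;
        }
    }
    return best;
}
```

One common reading of why this helps: a thread with many instructions in flight is probably stalled, so favoring the other threads tends to keep the shared issue queue from filling with instructions that cannot issue.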

6 Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to Tera MTA (Multithreaded Architecture)

7 A typical HEP configuration:
- 28 nodes
- 4 processors
- 4 data memory modules
- 1 I/O cache
- 1 I/O control processor
- 4 other I/O devices
Source: "Architecture and Applications of the HEP Multiprocessor Computer System," Burton J. Smith, Denelcor, Inc.


8 Tera MTA (1990-)
Up to 256 processors
Up to 128 active threads per processor
Processors and memory modules populate a sparse 3D torus interconnection fabric
Flat, shared main memory
- No data cache
- Sustains one main memory access per cycle per processor
GaAs logic in prototype, 260 MHz
- Second version CMOS, MTA-2, 50 W/processor
- Newer version, XMT, fits into AMD Opteron socket, runs at 500 MHz
- Newest version, XMT2, has higher memory bandwidth and capacity

9 The TERA Topology
[Figure: the TERA interconnection topology]

10 MTA Pipeline
[Figure: Inst Fetch and Issue Pool feeding the M, A, and C pipelines, each ending in a W stage; Write Pool, Memory Pool, Retry Pool, memory pipeline, and interconnection network]
Every cycle, one VLIW instruction from one active thread is launched into pipeline
Instruction pipeline is 21 cycles long
Memory operations incur ~150 cycles of latency
Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz, what is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS
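The arithmetic on this slide, written out as a small C sketch. The function names are illustrative. The aggregate figure rests on a simplifying assumption not made on the slide: every thread is ready to issue again exactly one pipeline length after its previous instruction, which ignores the ~150-cycle memory latency.

```c
#include <assert.h>

/* Issue rate of one thread, in millions of instructions per second,
   when it can launch one instruction every `issue_interval` cycles. */
double single_thread_mips(double clock_mhz, unsigned issue_interval) {
    assert(issue_interval > 0);
    return clock_mhz / (double)issue_interval;
}

/* Aggregate issue rate with `threads` active threads. At most one
   instruction launches per cycle, so threads beyond `issue_interval`
   add nothing under this simplified model. */
double aggregate_mips(double clock_mhz, unsigned issue_interval,
                      unsigned threads) {
    assert(issue_interval > 0);
    unsigned active = threads < issue_interval ? threads : issue_interval;
    return clock_mhz * (double)active / (double)issue_interval;
}
```

With a 260 MHz clock and a 21-cycle interval, `single_thread_mips` gives about 12.4, and the pipeline reaches one launch per cycle once 21 threads are active. Hiding memory latency as well would take more threads than this model accounts for.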

11 Coarse-Grain Multithreading
Tera MTA designed for supercomputing applications with large data sets and low locality
- No data cache
- Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
- Few pipeline bubbles if cache mostly has hits
- Just add a few threads to hide occasional cache miss latencies
- Swap threads on cache misses

12 Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).
Two classes of synchronization:
- Producer-Consumer: a consumer process must wait until the producer process has produced data
- Mutual Exclusion: ensure that only one process uses a resource at a given time
[Figure: producer P1, consumer P2; shared resource]

13 TERA's Lightweight Synchronization
It is based on the producer-consumer paradigm
Implemented using full/empty bits
- Full (bit = 1): the value has been produced and can be consumed
- Empty (bit = 0): the value has not been produced and cannot be consumed
A read stalls until the bit is Full; after the read completes, the bit is set to Empty
A write stalls until the bit is Empty; after the write completes, the bit is set to Full
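The full/empty rules above can be emulated in software. The sketch below uses C11 atomics and is only a stand-in for the hardware mechanism: the type `fe_word_t`, the function names, and the extra BUSY state are all assumptions of the emulation (hardware updates the bit and the data together, so it needs no such state). A "stall" is modeled as a retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Software stand-in for one memory word with a full/empty bit.
   FE_BUSY marks a transfer in progress and exists only in this emulation. */
enum { FE_EMPTY = 0, FE_FULL = 1, FE_BUSY = 2 };

typedef struct {
    atomic_int state;
    long value;
} fe_word_t;

void fe_init(fe_word_t *w) {
    assert(w != NULL);
    w->value = 0;
    atomic_init(&w->state, FE_EMPTY);
}

/* One write attempt: succeeds only if the word is Empty, and leaves it Full. */
bool fe_try_write(fe_word_t *w, long v) {
    int expected = FE_EMPTY;
    if (!atomic_compare_exchange_strong_explicit(
            &w->state, &expected, FE_BUSY,
            memory_order_acquire, memory_order_relaxed)) {
        return false;
    }
    w->value = v;
    atomic_store_explicit(&w->state, FE_FULL, memory_order_release);
    return true;
}

/* One read attempt: succeeds only if the word is Full, and leaves it Empty. */
bool fe_try_read(fe_word_t *w, long *out) {
    int expected = FE_FULL;
    if (!atomic_compare_exchange_strong_explicit(
            &w->state, &expected, FE_BUSY,
            memory_order_acquire, memory_order_relaxed)) {
        return false;
    }
    *out = w->value;
    atomic_store_explicit(&w->state, FE_EMPTY, memory_order_release);
    return true;
}

/* Blocking forms: the "stall" is a spin until the attempt succeeds. */
void fe_write(fe_word_t *w, long v) {
    while (!fe_try_write(w, v)) { /* word still Full */ }
}

long fe_read(fe_word_t *w) {
    long v = 0;
    while (!fe_try_read(w, &v)) { /* word still Empty */ }
    return v;
}
```

A producer thread would call `fe_write` and a consumer thread `fe_read` on the same word; each value written is consumed exactly once.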

14 Simple Producer-Consumer Example
[Figure: producer and consumer each hold pointers xflagp and xdatap to flag and data in memory]
Initially flag=0
Producer:
    sd xdata, (xdatap)
    li xflag, 1
    sd xflag, (xflagp)
Consumer:
    spin: ld xflag, (xflagp)
    beqz xflag, spin
    ld xdata, (xdatap)
Is this correct?

15 Memory Model
Sequential ISA only specifies that each processor sees its own memory operations in program order
Memory model describes what values can be returned by load instructions across multiple threads

16 Simple Producer-Consumer Example
Initially flag=0
Producer:
    sd xdata, (xdatap)
    li xflag, 1
    sd xflag, (xflagp)
Consumer:
    spin: ld xflag, (xflagp)
    beqz xflag, spin
    ld xdata, (xdatap)
Can consumer read flag=1 before data written by producer?

17 Sequential Consistency: A Memory Model
[Figure: six processors (P) sharing one memory (M)]
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" (Leslie Lamport)
Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

18 Simple Producer-Consumer Example
Initially flag=0
Producer:
    sd xdata, (xdatap)
    li xflag, 1
    sd xflag, (xflagp)
Consumer:
    spin: ld xflag, (xflagp)
    beqz xflag, spin
    ld xdata, (xdatap)
[Figure legend: dependencies from sequential ISA; dependencies added by sequentially consistent memory model]
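To see what sequential consistency buys in this example, the sketch below enumerates every order-preserving interleaving of the producer's two stores with one pass of the consumer's two loads, and counts the cases where the consumer sees flag=1 but stale data. It is a toy model under stated assumptions: the consumer makes a single pass instead of spinning, and the names and the payload value 42 are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { DATA_VALUE = 42 };

/* What the consumer's two loads returned in one interleaving. */
typedef struct { int flag_seen; int data_seen; } outcome_t;

/* Run one interleaving of four steps. Bit i of `mask` set means step i
   belongs to the producer, otherwise to the consumer. Each side executes
   its own two operations in program order, as SC requires. */
static outcome_t run_interleaving(unsigned mask) {
    int data = 0, flag = 0;   /* shared memory, initially zero */
    int p = 0, c = 0;         /* per-thread progress counters */
    outcome_t out = {0, 0};
    for (int step = 0; step < 4; step++) {
        if (mask & (1u << step)) {
            if (p == 0) data = DATA_VALUE; else flag = 1;
            p++;
        } else {
            if (c == 0) out.flag_seen = flag; else out.data_seen = data;
            c++;
        }
    }
    return out;
}

/* Number of set bits among the low four bits. */
static int popcount4(unsigned m) {
    int n = 0;
    for (int i = 0; i < 4; i++) n += (int)((m >> i) & 1u);
    return n;
}

/* Returns the number of legal interleavings; *bad receives how many of
   them let the consumer observe flag=1 together with stale data. */
int count_sc_interleavings(int *bad) {
    assert(bad != NULL);
    int total = 0;
    *bad = 0;
    for (unsigned mask = 0; mask < 16; mask++) {
        if (popcount4(mask) != 2) continue;   /* two steps per thread */
        outcome_t o = run_interleaving(mask);
        total++;
        if (o.flag_seen == 1 && o.data_seen != DATA_VALUE) (*bad)++;
    }
    return total;
}
```

There are six such interleavings and none of them is bad: once the flag store is visible, the data store that precedes it in program order is visible too. A relaxed model can permit orders outside this set.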

19 Implementing SC in hardware
Only a few commercial systems implemented SC
- Neither x86 nor ARM is SC
Requires either severe performance penalty
- Wait for stores to complete before issuing new store
Or, complex hardware
- Speculatively issue loads but squash if memory inconsistency with later-issued store discovered (MIPS R10K)

20 Hardware Primitives to Support Synchronization
Atomic operations: a set of hardware primitives with the ability to atomically read and modify a memory location
- The read-modify-write needs to be indivisible (no instructions can be executed in between)
- The primitives will be used by system programmers to build a synchronization library
Example atomic operations:
- Atomic exchange, compare-and-swap
- Test-and-Set
- Fetch-and-Add (as seen in TERA)
- A pair of instructions (next lecture): Load-Linked and Store-Conditional
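As a rough illustration of how such primitives become library building blocks, the sketch below wraps three of them using C11 `<stdatomic.h>`. The type `tas_lock_t` and all function names are hypothetical, and which hardware instruction each call compiles to depends on the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* A spinlock built on Test-and-Set. */
typedef struct { atomic_flag held; } tas_lock_t;
#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Atomically set the flag; we own the lock only if it was clear before. */
bool tas_trylock(tas_lock_t *l) {
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void tas_lock(tas_lock_t *l) {
    while (!tas_trylock(l)) { /* spin until the holder releases */ }
}

void tas_unlock(tas_lock_t *l) {
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Fetch-and-Add: returns the value before the increment, so concurrent
   callers each receive a distinct ticket number. */
unsigned take_ticket(atomic_uint *next) {
    return atomic_fetch_add_explicit(next, 1u, memory_order_relaxed);
}

/* Atomic exchange: store a new value and return the old one in one step. */
int swap_in(atomic_int *slot, int new_value) {
    return atomic_exchange_explicit(slot, new_value, memory_order_acq_rel);
}
```

A critical section is then bracketed by `tas_lock(&l)` and `tas_unlock(&l)`, which is the mutual-exclusion class of synchronization described earlier in the lecture.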

21 Software reorders too!
    //Producer code
    *datap = x/y;
    *flagp = 1;
    //Consumer code
    while (!*flagp)
        ;
    d = *datap;
Compiler can reorder/remove memory operations unless made aware of memory model
- Instruction scheduling: move loads before stores if to different address
- Register allocation: cache loaded value in register, don't check memory
Prohibiting these optimizations would result in very poor performance
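One way to make a C compiler aware of the intended ordering is to declare the flag atomic and give its accesses release/acquire semantics. The sketch below is a C11 rendering of the slide's producer and consumer; the `mailbox_t` type and function names are illustrative, not from the lecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    int data;          /* ordinary payload */
    atomic_int flag;   /* 0 = not ready, 1 = ready */
} mailbox_t;

void mailbox_init(mailbox_t *m) {
    assert(m != NULL);
    m->data = 0;
    atomic_init(&m->flag, 0);
}

/* Producer: the release store keeps the data write from being moved
   after the flag write, by the compiler or by the hardware. */
void mailbox_publish(mailbox_t *m, int value) {
    m->data = value;
    atomic_store_explicit(&m->flag, 1, memory_order_release);
}

/* Consumer: the atomic load is re-read from memory on every iteration
   (it cannot be cached in a register), and acquire keeps the data read
   from being moved before it. */
int mailbox_wait(mailbox_t *m) {
    while (!atomic_load_explicit(&m->flag, memory_order_acquire)) {
        /* spin */
    }
    return m->data;
}
```

Only the flag needs to be atomic here; the payload stays an ordinary variable, so the compiler remains free to optimize everything that does not cross the flag.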

22 Relaxed Memory Models
Not all dependencies assumed by SC are supported, and software has to explicitly insert additional dependencies where needed
Which dependencies are dropped depends on the particular memory model
- IBM370, TSO, PSO, WO, PC, Alpha, RMO, ...
How to introduce needed dependencies varies by system
- Explicit FENCE instructions (sometimes called sync or memory barrier instructions)
- Implicit effects of atomic memory instructions
How on earth are programmers supposed to work with this?

23 Fences in Producer-Consumer Example
Initially flag=0
Producer:
    sd xdata, (xdatap)
    li xflag, 1
    fence.w.w              // Write-write fence
    sd xflag, (xflagp)
Consumer:
    spin: ld xflag, (xflagp)
    beqz xflag, spin
    fence.r.r              // Read-read fence
    ld xdata, (xdatap)
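The same fence placement can be approximated in portable C with `atomic_thread_fence`. This is a sketch, not an exact translation: a C11 release fence orders earlier loads as well as earlier stores, so it is somewhat stronger than a pure write-write fence, and likewise for acquire versus read-read. Function names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Producer: data store, fence, then flag store. */
void produce_with_fence(int *datap, atomic_int *flagp, int value) {
    assert(datap != NULL && flagp != NULL);
    *datap = value;
    atomic_thread_fence(memory_order_release);   /* role of fence.w.w */
    atomic_store_explicit(flagp, 1, memory_order_relaxed);
}

/* Consumer: spin on the flag, fence, then data load. */
int consume_with_fence(const int *datap, atomic_int *flagp) {
    assert(datap != NULL && flagp != NULL);
    while (atomic_load_explicit(flagp, memory_order_relaxed) == 0) {
        /* spin until the producer sets the flag */
    }
    atomic_thread_fence(memory_order_acquire);   /* role of fence.r.r */
    return *datap;
}
```

The flag accesses themselves are relaxed; all of the ordering comes from the two fences, mirroring the structure of the assembly version.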

24 MIT Alewife (1990)
Modified SPARC chips
- Register windows hold different thread contexts
Up to four threads per node
Thread switch on local cache miss

25 IBM PowerPC RS64-IV (2000)
Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order five-stage pipeline
Each physical CPU supports two virtual CPUs
On L2 cache miss, pipeline is flushed and execution switches to second thread
- Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
- Flush pipeline to simplify exception handling

26 Oracle/Sun Niagara processors
Target is datacenters running web servers and databases, with many concurrent requests
Provide multiple simple cores each with multiple hardware threads, reduced energy/operation though much lower single thread performance
- Niagara-1 [2004], 8 cores, 4 threads/core
- Niagara-2 [2007], 8 cores, 8 threads/core
- Niagara-3 [2009], 16 cores, 8 threads/core
- T4 [2011], 8 cores, 8 threads/core
- T5 [2012], 16 cores, 8 threads/core
- M5 [2012], 6 cores, 8 threads/core
- M6 [2013], 12 cores, 8 threads/core

27 Oracle/Sun Niagara-3, Rainbow Falls

28 Oracle M

29 Oracle M

30 Oracle M

31 Pentium-4 Hyperthreading (2002)
First commercial SMT design (2-way SMT)
Logical processors share nearly all resources of the physical processor
- Caches, execution units, branch predictors
Die area overhead of hyperthreading ~5%
When one logical processor is stalled, the other can make progress
- No logical processor can use all entries in queues when two threads are active
Processor running only one active software thread runs at approximately same speed with or without hyperthreading
Hyperthreading dropped on OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines
Intel Atom (in-order x86 core) has two-way vertical multithreading
- Hyperthreading == (SMT for Intel OoO & Vertical for Intel InO)

32 IBM Power 4
Single-threaded predecessor to Power 5.
8 execution units in out-of-order engine, each may issue an instruction each cycle.

33 Power 4, Power 5
[Figure: Power 4 and Power 5 pipelines, with annotations:]
- 2 fetch (PC), 2 initial decodes
- 2 commits (architected register sets)

34 Power 5 data flow...
Why only 2 threads?
With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

35 Initial Performance of SMT
Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
- Pentium 4 is dual-threaded SMT
- SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20
Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
Power 5 running 2 copies of each app: speedup between 0.89 and
- Most gained some
- Fl.Pt. apps had most cache conflicts and least gains

36 Acknowledgements
This course is partly inspired by previous MIT and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Krste Asanovic (UCB)
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
