Multithreaded Programming

Size: px

Start display at page:

Download "Multithreaded Programming"

Harold Lucas
5 years ago
Views:

2 Overall Goal Acquire mental model of how parallel computers work Learn basic guidelines on designing parallel programs Know what you need to look up later Emphasis is on multithreaded shared-memory programs, but most of the high-level concepts apply to any parallel program. Copyright 2005 Intel Corporation 2

6 Functionality Sometimes parallel version is easier to write. Often the case in stream processing expand sed 's/ */ /' fold s Responsiveness of human interface Decouple human interface from computing portions Compute in background Decouple portions of user interface Not have window block because another window s dialog box is open. Copyright 2005 Intel Corporation 6

8 Turn-Around Solve problem faster Typically by applying more resources Parallel solution is sometimes better algorithm! Parallel branch-and-bound may tighten bound quicker, resulting in less wasted work. Examples Practical weather prediction 24-hour forecast that takes 24 hours to compute is not useful. Real time (e.g. video games, cybernetics) Must meet deadline. Copyright 2005 Intel Corporation 8

9 Capacity Big parallel processors have huge memory capacity Can solve problem in core without having to resort to slow out of core techniques RAM on Top 5 Supercomputers in June 2005 Blue Gene/L BGW Columbia Earth Simulator MareNostrum NASA/Ames IBM Thomas J. Watson Research Center NSA/Ames Japan Barcelona 32 Terabytes? 20 Terabytes? 20 Terabytes 10 Terabytes 9.6 Terabytes Copyright 2005 Intel Corporation 9

10 Basic Parallel Programming Models Single Instruction Multiple Data (SIMD) Single thread operating on long vector Multiple Instruction Multiple Data (MIMD) Message passing Each thread has its own address space Threads communicate by sending and receiving messages Shared memory Threads share a common address space Threads communicate by writing and reading shared memory locations Hybrid Networks of shared-memory machines Virtual shared memory across a cluster Totally implicit E.g., applicative programming languages Copyright 2005 Intel Corporation 10

11 Parallelism vs. Concurrency Concurrency = multiple threads in progress at same time Achievable on a single processor by interleaving Parallelism = multiple threads executing simultaneously Requires multiple execution units Copyright 2005 Intel Corporation 11

12 Reasons for Concurrency Throughput Hide I/O latency. Example: Let task A compute while task B waits to read disk. Functionality can improve functionality (e.g. listen to music file and surf Web) Copyright 2005 Intel Corporation 12

13 Reasons for Parallelism Functionality Multithreaded CPU: avoids annoying pauses when a task hogs a physical thread. Throughput Achieve good frame rate in video game: CPU simulates universe while graphics processor renders. Turn-Around Almost all champion chess programs. Capacity Most large-scale scientific problems Copyright 2005 Intel Corporation 13

14 Speedup Speedup measures improvement from parallelization speedup(p) = time for best serial version time for version with p physical threads Efficiency is relative to ideal speedup efficiency(p) = speedup p Copyright 2005 Intel Corporation 14

15 Typical Speedup Chart Speedup (Relative to Serial Version) Speedup of Parallel Foo Geomean of 10 runs [blocksize=16x16 pixels, HT-enabled: Xeon with 8 TPUs, 512K Cache ] Physical Threads IDEAL NEW AND IMPROVED! BRAND X Why does this chart violate our definition of speedup? Copyright 2005 Intel Corporation 15

16 Use Best Serial Code As Baseline Speedup(1)>1 in example indicates that the original serial code was not the best serial code. This commonly happens While tuning for parallelism, serial improvements are found. Example: no alias information useful for parallelizing can also help compiler generate better serial code Example: reblocking loops to help parallel decomposition sometimes helps serial code s reuse of cache. Reference: Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers Copyright 2005 Intel Corporation 16

17 Amdahl s ``Law Suppose program takes 1 unit of time to execute serially. Part of program is inherently serial (unparallelizable). s = fraction of time spent in serial portion 1-s = remaining parallelizable time time on single processor s 1-s s (1-s)/p time on parallel processor (in ideal world) Copyright 2005 Intel Corporation 17

18 Amdahl s Formulae speedup(p) 1 s p 1 + s serial execution time best possible parallel execution time Speedup limited by serial portion speedup( ) 1 s Situation is worse in real world, because parallel portion usually has inefficiencies. Copyright 2005 Intel Corporation 18

19 Diminishing Returns Suppose program has 20% serial portion speedup(2) speedup(4) = 2.5 Doubling the number of processors from 2 to 4 of processors got us only 50% improvement! speedup( ) = 5 This is depressing. Is there a way to evade Amdahl s law? Copyright 2005 Intel Corporation 19

20 Gustafson/Barsis s Argument Scale up problem size N as number of processors increases. Assume that work in parallel portion grows as N does E.g., finer mesh and more particles for gorier video game Parallelizable portion now requires time N(1-s) to execute serially Keep serial portion independent of problem size This is often achievable, though it requires care. E.g., must never loop over N in serial portion. E.g., if amount of I/O depends on N, I/O must be parallel too. Net effect is to shrink serial fraction as number of processors grows. Copyright 2005 Intel Corporation 20

21 Gustafson/Barsis s Formulae Revised speedup: speedup(p) Now set N=p: s+n(1-s) N(1 s) p + s serial execution time best possible parallel execution time speedup(p) s + p(1-s) Copyright 2005 Intel Corporation 21

22 Growing Returns Revisit program with 20% serial portion Assume that parallel work scales up with N speedup(2).2 + 2(1 0.2) = 1.8 speedup(4).2 + 4(1 0.2) = 3.4 Doubling the number of processors and doubling the problem size now almost doubles speedup. speedup( ).2+ (1 0.2) = Reference: John Gustafson, Reevaluatng Amdahl s Law, CACM May 1988, Volume 31. Copyright 2005 Intel Corporation 22

23 Summary of Episode I Reasons to Write Parallel Programs functionality throughput turn-around capacity Two flavors shared memory message passing Two laws Amdahl: speedup limited by serial portion Gustafson/Barsis: evade limit by scaling up problem Copyright 2005 Intel Corporation 23

Pentium 4 Processor Pentium 4 is a trademark or registered trademark of Intel Corporation

27 Latency vs. Throughput Latency = how long to execute an instruction. Throughput = instructions per unit time Example: Pentium 4 integer multiply latency = clock cycles throughput is one result per 5 cycles Reference: Appendix C of IA-32 Intel Archtecture Optimization Reference Manual Copyright 2005 Intel Corporation 27

28 Instruction Level Parallelism SIMD Instruction operates on vector of data 4-element vectors have become very popular Super scalar Execute more than one instruction at once No clear winner in in-order vs. out-of-order debate In-order Instructions executed in program order Smaller, simpler, less power, but suffers from cache misses Out-of-order Instructions executed out of order so that instructions waiting for data do no delay subsequent independent instructions Bigger, more complex, almost quadratic power demand, but more tolerant of cache misses Copyright 2005 Intel Corporation 28

29 Multithreaded CPUs Like expert chess player playing against multiple opponents Each board is state of a thread Move on to next board while waiting for response (hide latency) Susan Polgar set record of 326 concurrent games (309 wins!) Been around since 1950s Simultaneous Multithreading (SMT) Feed execution units from multiple instruction streams Good fit for out-of-order machine Switch On Event Multithreading (SOEMT) Switch to another instruction stream while waiting on an outerlevel cache miss. Good fit for in-order machine Copyright 2005 Intel Corporation 29

30 Multicore Chips Multiple CPUs in same socket or die. Multiple cores on same die has both advantages and disadvantages Enables communication at chip speed instead of slower board speed. Lower yield from wafer. Assuming random distribution of failures, if fraction x of CPUs are good, then only x 2 CPU-pairs are good. Copyright 2005 Intel Corporation 30

31 Future Speed Improvements Will Mostly Be From Multithreading Improvements in past have been for single thread Examples Higher clock rate Bigger functional units (e.g. full array multipliers) Pipelining and out-of-order execution. These all consume superlinear transistors or power Approximately quadratic Cooling capacity limited (cannot put toaster in lap) Improvements in future are for multiple threads Multiple cores and more threads per core As long as program scales significantly better than p, we are ahead. Copyright 2005 Intel Corporation 31

32 Memory Memory also has latency and throughput Both are slower than CPU speeds Latency is hundreds of clocks Bandwidth can be 20x slower than CPU See STREAM benchmark at Memory cannot be large, fast, and cheap Hence cache hierarchy Some small fast expensive memory at top Lots of big slow cheap memory at bottom Copyright 2005 Intel Corporation 32

33 Example: One Core of Pentium D Processor register set 16 KB Level 1 Data Cache 12 KB Level 1 Instruction Cache 1 MB Level 2 Cache Main Memory Bandwidth width of arrow (Would be 5x21 feet if 1 GB were drawn to paper scale) Copyright 2005 Intel Corporation 33

34 Off-Chip Memory Is Slow! HT0 HT1 L1 cache L2 cache Latency length of arrow Bandwidth width of arrow Design your parallel programs to minimize off-chip communication. Intrachip locality is good. Interchip locality is bad. memory Copyright 2005 Intel Corporation 34

35 Shared Bus CPU CPU CPU CPU memory memory memory Uniform Memory Access (UMA) Any processor can reach any memory in the same time All devices on bus share the bus bandwidth Uniformly slow for everyone Copyright 2005 Intel Corporation 35

36 Point to Point memory memory CPU CPU CPU CPU All sorts of topologies possible Mesh, toruses, hypercube, fat trees,... memory memory Non-Uniform Memory Access (NUMA) Processors can reach near memory faster than far memory. Bus bandwidth grows as nodes are added Do not have to share bandwidth Copyright 2005 Intel Corporation 36

37 Cluster Each node in cluster is single or multiprocessor box. Nodes interconnected Bus (e.g. antique Ethernet*) Point to point (e.g. Scalable Coherent Interconnect*) Switched fabric (e.g. Infiniband*) *Other names and brands may be claimed as the property of others. Copyright 2005 Intel Corporation 37

38 Some Interconnect Metrics Bisection bandwidth Adversary cuts machine in half. Measure bandwidth between halves. Latency Near vs. far reference For message passing, software adds order of magnitude (or worse) to hardware latency. Copyright 2005 Intel Corporation 38

39 Shared Memory Is Really Message Passing Messages are cache lines/sectors Typically around bytes per line Pentium 4 Processor has sectored lines 64-byte sector and 128-byte line Writes invalidate sectors Reads pull in lines For cache without sectors, sector is same as line When a processor writes to a cache sector, the other processors have to be notified of the change. Copyright 2005 Intel Corporation 39

40 Example: False Sharing Consider two processors without shared cache. If two processors write different locations that are on same cache sector, and at least one access is write, then cache sector must move between processors. Copyright 2005 Intel Corporation 40

42 8 Threads on 4-Socket HT Machine Thread i does "x[i*stride]++" Normalized Time (Log Scale) (Geomean of 3 Runs) Pop quiz: Cache sector is 64 bytes. What is sizeof(x[0])? Stride Copyright 2005 Intel Corporation 42

43 Summary of Episode II Approaches to single-chip parallelism SIMD Super-scalar Multithreaded CPU Multicore Most of future improvement in chip performance will come from parallelism Off-chip memory is slow Current machines depend on caches Uniform vs. Non-Uniform Memory Access Cache line/sector is unit of information interchange Copyright 2005 Intel Corporation 43

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the