Parallel Programming Concepts. Summary. Dr. Peter Tröger M.Sc. Frank Feinbube


2 Parallel Programming Concepts Summary Dr. Peter Tröger M.Sc. Frank Feinbube

3 Course Topics
- The Parallelization Problem: power wall, memory wall, Moore's law; terminology and metrics
- Shared Memory Parallelism: theory of concurrency, hardware today and in the past; programming models, optimization, profiling
- Shared Nothing Parallelism: theory of concurrency, hardware today and in the past; programming models, optimization, profiling
- Accelerators
- Patterns
- Future trends

4 Scaring Students with Word Clouds...

5 The Free Lunch Is Over
- Clock speed curve flattened in 2003: heat, power consumption, leakage
- 2-3 GHz since 2001 (!)
- Speeding up serial instruction execution through clock speed improvements no longer works
- "We stumbled into the Many-Core Era" [Herb Sutter, 2009]

6 The Power Wall
- Air cooling capabilities are limited: maximum chip temperature, hot spot problem
- Static and dynamic power consumption must be limited
- Power consumption increases with Moore's law, but growth in hardware performance is still expected
- Further reducing the voltage as compensation does not work endlessly: lower limit around 0.7 V, strange physical effects
- Next-generation processors need to use even less power: lower the frequencies and scale them dynamically, use only parts of the processor at a time ("dark silicon"), build energy-efficient special-purpose hardware
- No chance for faster processors through frequency increase

7 Memory Wall
- Caching: well-established optimization technique for performance
- Relies on data locality: some instructions are often used (e.g. loops), some data is often used (e.g. local variables)
- Hardware keeps a copy of the data in the faster cache: on read attempts, data is taken directly from the cache; on write, data is cached and eventually written to memory
- Similar to ILP, the potential is limited: larger caches do not help automatically; at some point, all data locality in the code is already exploited
- Manual vs. compiler-driven optimization [arstechnica.com]

8 Memory Wall
- If caching is limited, we simply need faster memory. The problem: shared memory is shared
- Interconnect contention, memory bandwidth; memory transfer speed and transfer size are limited by the power wall
- Transfer technology cannot keep up with GHz processors
- Memory is too slow, and the effects cannot be hidden completely through caching → memory wall [dell.com]

9 The Situation
- Hardware people: the number of transistors N is still increasing, but
  - building larger caches no longer helps (memory wall)
  - ILP is out of options (ILP wall)
  - voltage / power consumption is at the limit (power wall); some help with dynamic scaling approaches
  - frequency is stalled (power wall)
  - the only possible offer is to use the increasing N for more cores
- For faster software in the future:
  - Speedup must come from the utilization of an increasing core count, since F is now fixed
  - Software must participate in the power wall handling, to keep F fixed
  - Software must tackle the memory wall

10 Three Ways Of Doing Anything Faster [Pfister]
- Work harder (clock speed) → power wall problem, memory wall problem
- Work smarter (optimization, caching) → ILP wall problem, memory wall problem
- Get help (parallelization): more cores per single CPU, and software needs to exploit them in the right way → memory wall problem

11 Parallelism on Different Levels
- A processor chip (socket):
  - Chip multi-processing (CMP): multiple CPUs per chip, called cores (multi-core / many-core)
  - Simultaneous multi-threading (SMT): interleaved execution of tasks on one core (example: Intel Hyperthreading)
  - Chip multi-threading (CMT) = CMP + SMT
  - Instruction-level parallelism (ILP): parallel processing of single instructions per core
- Multiple processor chips in one machine (multi-processing): symmetric multi-processing (SMP)
- Multiple processor chips in many machines (multi-computer)

12 Parallelism on Different Levels (figure: CMP architecture with ILP and SMT inside each core) [arstechnica.com]

13 Parallelism on Different Levels: Blue Gene/Q (figure, 2011 IBM Corporation)
- 1. Chip: 16+2 cores; 2. Single chip module; 3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for water cooling
- 4. Node card: 32 compute cards, optical modules, link chips, 5D torus
- 5a. Midplane: 16 node cards; 5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
- 6. Rack: 2 midplanes; 7. System: 96 racks, 20 PF/s (~2 GF/W, Green 500 criteria)
- Sustained single-node performance 10x BG/P and 20x BG/L; software and hardware support for programming models that exploit node hardware concurrency

14 Memory on Different Levels (figure: memory hierarchy)
- Fast / expensive / small at the top, slow / cheap / large at the bottom: registers, processor caches, random access memory (RAM) — all volatile; flash / SSD memory, hard drives, tapes — non-volatile

15 A Wild Mixture (figure: heterogeneous systems connected by a network)

16 (figure: NVIDIA GF100 GPU architecture)

17 A Wild Mixture (figure: heterogeneous nodes combining multi-core CPUs with DDR3 memory and QPI links, GPUs and MIC accelerators with GDDR5 memory attached via 16x PCIe, and dual Gigabit LAN)

18 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

19 Hardware Abstraction: Flynn's Taxonomy
- Classifies parallel hardware architectures according to their capabilities in the instruction and data processing dimension
- Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); Multiple Instruction, Multiple Data (MIMD)

20 Hardware Abstraction: Tasks + Processing Elements (figure: programs decomposed into tasks and processes, mapped onto processing elements (PEs); each node has PEs and local memory, nodes are connected by a network)

21 Hardware Abstraction: PRAM
- RAM assumptions: constant memory access time, unlimited memory
- PRAM assumptions: non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
- Alternative models: BSP, LogP
- (figure: multiple CPUs on a shared bus between input, memory, and output)

22 Hardware Abstraction: BSP
- Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
- Success of the von Neumann model: a bridge between hardware and software; high-level languages can be efficiently compiled onto it, and hardware designers can optimize their realization of it
- A similar model is wanted for parallel machines: neutral about the number of processors; programs are written for v virtual processors mapped to p physical ones; when v >> p, the compiler has options
- A BSP computation consists of a series of supersteps (cost model sketched below):
  1. Concurrent computation on all processors
  2. Exchange of data between all processes
  3. Barrier synchronization
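For reference, the standard BSP cost model (not stated on the slide) charges each superstep for its longest local computation, its h-relation, and the barrier latency:

```latex
% BSP cost of one superstep and of a program with S supersteps.
% w_i = maximum local computation, h_i = maximum number of words any
% processor sends or receives, g = bandwidth parameter, l = barrier cost.
T_{\text{superstep}} = w + h \cdot g + l
\qquad\qquad
T_{\text{program}} = \sum_{i=1}^{S} \bigl( w_i + h_i \cdot g + l \bigr)
```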

23 Hardware Abstraction: CSP
- The behavior of real-world objects can be described through their interaction with other objects, leaving out internal implementation details
- The interface of a process is described as a set of atomic events; examples for an ATM: card (insertion of a credit card into the ATM card slot), money (extraction of money from the ATM dispenser); events for a printer: {accept, print}
- Alphabet: the set of relevant (!) events for an object description; an event may never happen in the interaction; the interaction is restricted to this set of events; αATM = {card, money}
- A CSP process is the behavior of an object, described with its alphabet

24 Hardware Abstraction: LogP
- Criticism of the oversimplification in PRAM-based approaches, which encourage the exploitation of "formal loopholes" (e.g. cost-free communication); trend towards multicomputer systems with large local memories
- Characterization of a parallel machine by:
  - P: number of processors
  - g (gap): minimum time between two consecutive transmissions; its reciprocal corresponds to the per-processor communication bandwidth
  - L (latency): upper bound on messaging time
  - o (overhead): exclusive processor time needed for a send / receive operation
- L, o, and g are measured in multiples of processor cycles
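From these parameters the cost of a single small point-to-point message follows directly (a standard consequence of the model, not spelled out on the slide):

```latex
% One small message: send overhead, network latency, receive overhead.
T_{\text{msg}} = o + L + o = L + 2o
% A processor can inject at most one message every g cycles,
% so its sustained injection rate is 1/g messages per cycle.
```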

25 Hardware Abstraction: OpenCL
- Private memory: per work-item
- Local memory: shared within a work-group
- Global / constant memory: visible to all work-groups
- Host memory: on the CPU

26 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

27 Software View: Concurrency vs. Parallelism
- Concurrency means dealing with several things at once: a programming concept for the developer; in shared-memory systems, implemented by time sharing
- Parallelism means doing several things at once: demands parallel hardware
- "Parallel programming" is a misnomer: it is concurrent programming aiming at parallel execution
- Any parallel software is concurrent software (note: some researchers disagree, most practitioners agree)
- Concurrent software is not always parallel software: many server applications achieve scalability by optimizing concurrency only (web server)

28 Server Example: No Concurrency, No Parallelism (figure)

29 Server Example: Concurrency for Throughput (figure)

30 Server Example: Parallelism for Throughput (figure)

31 Server Example: Parallelism for Speedup (figure)

32 Concurrent Execution
- A program is a sequence of atomic statements; "atomic" means executed without interruption
- Concurrent execution is the interleaving of atomic statements from multiple tasks
- Tasks may share resources (variables, operating system handles, ...)
- Operating system timing is not predictable, so the interleaving is not predictable; this may impact the result of the application
- Since parallel programs are concurrent programs, we need to deal with that!
- (figure: interleavings of one task running y=x; y=y-1; x=y with another running z=x; z=z+1; x=z on a shared x — depending on the interleaving, the final value of x differs, e.g. 0, 1, or 2; see the sketch below)
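A minimal C/pthreads sketch (my own illustration, not from the slides) that makes the interleaving problem concrete: two threads perform the same non-atomic read-modify-write on a shared variable, so updates can be lost depending on the schedule.

```c
#include <pthread.h>
#include <stdio.h>

static int x = 0;                   /* shared variable, no synchronization */

/* Each iteration performs "x = x + 1" as a non-atomic read-modify-write;
 * interleavings between the two threads can lose updates. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        int y = x;                  /* y = x     */
        y = y + 1;                  /* y = y + 1 */
        x = y;                      /* x = y     */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but typically prints less because of lost updates. */
    printf("x = %d\n", x);
    return 0;
}
```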

33 Critical Section
- N threads have some code - the critical section - with shared data access
- Mutual exclusion demand: only one thread at a time is allowed into its critical section, among all threads that have critical sections for the same resource
- Progress demand: if no other thread is in the critical section, the decision for entering should not be postponed indefinitely; only threads that wait for entering the critical section are allowed to participate in decisions
- Bounded waiting demand: it must not be possible for a thread requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem)

34 Critical Sections with Mutexes (figure: threads T1-T3 call m.lock() before and m.unlock() after their critical section; a thread that finds the mutex taken is placed in a waiting queue until it is released)
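One possible way to guard the update from the previous sketch with a mutex, here using pthreads for illustration (the slide itself is library-agnostic):

```c
#include <pthread.h>
#include <stdio.h>

static int x = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    /* enter critical section (blocks if taken) */
        x = x + 1;                 /* only one thread at a time executes this  */
        pthread_mutex_unlock(&m);  /* leave critical section, wake a waiter    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);         /* now reliably 2000000 */
    return 0;
}
```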

35 Critical Sections with High-Level Primitives
- Today: a multitude of high-level synchronization primitives
- Spinlock: performs busy waiting, lowest overhead for short locks
- Reader / writer lock: special case of mutual exclusion through semaphores; multiple reader processes can enter the critical section at the same time, but a writer process should gain exclusive access; different optimizations possible (minimum reader delay, minimum writer delay, throughput, ...)
- Mutex: semaphore that works amongst operating system processes
- Concurrent collections: blocking queues and key-value maps with concurrency support

36 Critical Sections with High-Level Primitives
- Reentrant lock: the lock can be obtained several times without locking on itself; useful for cyclic algorithms (e.g. graph traversal) and problems where lock bookkeeping is very expensive; a reentrant mutex needs to remember the locking thread(s), which increases the overhead
- Barriers: all concurrent activities stop there and continue together; participants are statically defined at compile or start time; newer dynamic barrier concepts allow late binding of participants (e.g. X10 clocks, Java phasers)
- Memory barriers (memory fences) enforce the separation of memory operations before and after the barrier; needed for low-level synchronization implementations

37 Nasty Stuff
- Deadlock: two or more processes / threads are unable to proceed; each is waiting for one of the others to do something
- Livelock: two or more processes / threads continuously change their states in response to changes in the other processes / threads; no global progress for the application
- Race condition: two or more processes / threads are executed concurrently, and the final result of the application depends on the relative timing of their execution

38 Coffman Conditions
- E. G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks.
- All conditions must be fulfilled to allow a deadlock to happen:
  - Mutual exclusion condition: individual resources are available or held by no more than one thread at a time
  - Hold and wait condition: threads already holding resources may attempt to hold new resources
  - No preemption condition: once a thread holds a resource, it must voluntarily release it on its own
  - Circular wait condition: it is possible for a thread to wait for a resource held by the next thread in the chain
- Avoiding circular wait turned out to be the easiest solution for deadlock avoidance (see the sketch below)
- Avoiding mutual exclusion leads to non-blocking synchronization; these algorithms no longer have a critical section
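An illustrative C sketch of breaking the circular-wait condition by lock ordering (my example, not from the slides): both code paths acquire the two locks in the same global order, so no cycle of waiting threads can form.

```c
#include <pthread.h>

/* Two resources protected by two locks. Acquiring them in a fixed
 * global order (always lock_a before lock_b) removes the circular-wait
 * condition, so these two functions cannot deadlock with each other. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void use_a_then_b(void) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... work on both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

void use_b_then_a(void) {
    /* Same acquisition order as above, even though resource b is
     * logically needed "first" here. */
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... work on both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}
```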

39 Terminology
- Starvation: a runnable process / thread is overlooked indefinitely; although it is able to proceed, it is never chosen to run (dispatching / scheduling)
- Atomic operation: a function or action implemented as a sequence of one or more instructions that appears to be indivisible - no other process / thread can see an intermediate state or interrupt the operation; executed as a group, or not executed at all
- Mutual exclusion: the requirement that when one process / thread is using a resource, no other shall be allowed to do so

40 Is it worth the pain?
- Parallelization metrics are application-dependent, but follow a common set of concepts
- Speedup: more resources lead to less time for solving the same task; linear speedup: n times more resources → n times speedup
- Scaleup: more resources solve a larger version of the same task in the same time; linear scaleup: n times more resources → an n times larger problem is solvable
- The most important goal depends on the application: transaction processing usually heads for throughput (scalability), decision support usually heads for response time (speedup)

41 Speedup (figure)
- Idealized assumptions: all tasks are equally sized, all code parts can run in parallel
- Tasks: v=12; processing elements: N=1; time needed: T1 = 12
- Tasks: v=12; processing elements: N=3; time needed: T3 = 4
- (Linear) speedup: T1/T3 = 12/4 = 3

42 Speedup with Load Imbalance (figure)
- Assumptions: tasks have different sizes, so the best possible speedup depends on optimized resource usage; all code parts can run in parallel
- Tasks: v=12; processing elements: N=1; time needed: T1 = 16
- Tasks: v=12; processing elements: N=3; time needed: T3 = 6
- Speedup: T1/T3 = 16/6 = 2.67

43 Speedup with Serial Parts
- Each application has inherently non-parallelizable serial parts: algorithmic limitations, shared resources acting as bottlenecks, overhead for program start, communication overhead in shared-nothing systems
- (figure: execution time split into alternating serial and parallel phases t_SER1, t_PAR1, t_SER2, t_PAR2, t_SER3)

44 Amdahl's Law
- Gene Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967
- Serial parts: T_SER = t_SER1 + t_SER2 + t_SER3 + ...
- Parallelizable parts: T_PAR = t_PAR1 + t_PAR2 + t_PAR3 + ...
- Execution time with one processing element: T_1 = T_SER + T_PAR
- Execution time with N parallel processing elements: T_N >= T_SER + T_PAR / N (equality only with perfect parallelization, i.e. no load imbalance)
- Amdahl's Law for the maximum speedup with N processing elements:
  $$ S = \frac{T_1}{T_N} = \frac{T_{SER} + T_{PAR}}{T_{SER} + T_{PAR}/N} $$
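A small worked example with numbers chosen purely for illustration: a serial fraction of 10% limits eight processing elements to a speedup of roughly 4.7, and no number of processing elements can push it beyond 10.

```latex
S_8 = \frac{T_{SER} + T_{PAR}}{T_{SER} + T_{PAR}/8}
    = \frac{0.1 + 0.9}{0.1 + 0.9/8}
    = \frac{1}{0.2125} \approx 4.7
```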

45 Amdahl's Law (figure)

46 Amdahl's Law
- Speedup through parallelism is hard to achieve
- For unlimited resources, the speedup is bound by the serial parts; assuming T_1 = 1:
  $$ S_{N \to \infty} = \frac{T_1}{T_{N \to \infty}} = \frac{1}{T_{SER}} $$
- The parallelization problem relates to all system layers: hardware offers some degree of parallel execution, but the speedup gained is bound by the serial parts:
  - Limitations of hardware components
  - Necessary serial activities in the operating system, virtual runtime system, middleware and the application
  - Overhead for the parallelization itself

47 Gustafson-Barsis Law (1988)
- Gustafson and Barsis pointed out that people are typically not interested in the shortest execution time, but rather want to solve the biggest problem in reasonable time
- The problem size can then scale with the number of processors, which leads to a larger parallelizable part with increasing N; typical goal in simulation problems
- The time spent in the sequential part is usually fixed or grows slower than the problem size → linear speedup possible
- Formally: P_N is the portion of the program that benefits from parallelization, depending on N (and implicitly the problem size)
- Maximum scaled speedup by N processors: see the formula below
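The formula itself is cut off in the transcription; with P_N defined as above, the scaled (Gustafson-Barsis) speedup is commonly written as

```latex
S(N) = (1 - P_N) + N \cdot P_N
```

so a program whose scaled workload is 99% parallelizable reaches S(100) = 0.01 + 100 * 0.99 ≈ 99 on 100 processors.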

48 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

49 Programming Model for Shared Memory
- Different programming models for concurrency in shared memory: concurrent processes with explicitly shared memory, concurrent threads within one process, and concurrent tasks mapped to a thread pool (figure)
- Processes and threads are mapped to processing elements (cores)
- Process- and thread-based programming is typically covered in operating system lectures

50 OpenMP
- Programming with the fork-join model: the master thread forks into declared tasks
- The runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
- Worker tasks hit a barrier before finalization (join) [Wikipedia]
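A minimal fork-join sketch in C with OpenMP (illustrative, not taken from the course material): the master thread forks a team, one thread declares tasks, the runtime maps them to threads from the pool, and the implicit barrier at the end of the parallel region is the join.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel           /* fork: create a team of threads        */
    {
        #pragma omp single         /* one thread declares the tasks         */
        {
            for (int i = 0; i < 8; i++) {
                #pragma omp task firstprivate(i)
                printf("task %d executed by thread %d\n",
                       i, omp_get_thread_num());
            }
            #pragma omp taskwait   /* wait until all declared tasks finish  */
        }
    }                              /* join: implicit barrier of the region  */
    return 0;
}
```

Compile, for example, with gcc -fopenmp.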

51 Task Scheduling
- Classical task scheduling with a central queue: all worker threads fetch tasks from the central queue; scalability issue with increasing thread (resp. core) count
- Work stealing in OpenMP (and other libraries): one task queue per thread; an idling thread steals tasks from another thread
- Independent from thread scheduling, only mutual synchronization, no central queue (figure)

52 PGAS Languages
- Non-uniform memory architectures (NUMA) became the default, but the understanding of memory in programming is flat: all variables are equal in access time, and considering the memory hierarchy is low-level coding (e.g. cache-aware programming)
- Partitioned global address space (PGAS) approach: driven by the high-performance computing community; modern approach for large-scale NUMA
- Explicit notion of a memory partition per processor; data is designated as local (near) or global (possibly far); the programmer is aware of NUMA nodes
- Performance optimization for deep memory hierarchies

53 Parallel Programming for Accelerators
- OpenCL exposes CPUs, GPUs, and other accelerators as devices
- Each device contains one or more compute units, i.e. cores, SMs, ...
- Each compute unit contains one or more SIMD processing elements

54 The BIG idea behind OpenCL
- OpenCL execution model: execute a kernel at each point in a problem domain
- E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions
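A sketch of what such a per-pixel kernel could look like in OpenCL C (hypothetical kernel name and pixel operation; the host code that builds the program and enqueues a 1024 x 1024 global range is omitted):

```c
/* OpenCL C kernel: one work-item ("kernel execution") per pixel.
 * The host would enqueue it with a 2D global work size of 1024 x 1024. */
__kernel void brighten(__global const uchar *in,
                       __global uchar *out,
                       const int width)
{
    int x = get_global_id(0);                     /* column of this work-item */
    int y = get_global_id(1);                     /* row of this work-item    */
    int idx = y * width + x;
    out[idx] = (uchar)min((int)in[idx] + 32, 255); /* per-pixel operation     */
}
```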

55 Message Passing
- Programming paradigm targeting shared-nothing infrastructures
- Implementations for shared memory are available, but typically not the best possible approach there
- Multiple instances of the same application run on a set of nodes (SPMD)
- (figure: a submission host starts instances 0-3 on the execution hosts)

56 Single Program Multiple Data (SPMD)
- A sequential program and its data distribution are translated into a sequential node program with message passing
- Identical copies of the program run as processes P0, P1, P2, P3 with different process identifications
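A minimal SPMD sketch with MPI (my illustration): every process runs the identical program and only its rank, i.e. its process identification, makes the behavior differ.

```c
#include <stdio.h>
#include <mpi.h>

/* SPMD: all processes execute this same program; they branch on the rank. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' identification */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */

    if (rank == 0)
        printf("coordinator among %d processes\n", size);
    else
        printf("worker %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```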

57 Actor Model
- Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973
- Another mathematical model for concurrent computation; no global system state concept (relationship to physics)
- The actor is the computation primitive: it makes local decisions, concurrently creates more actors, and concurrently sends / receives messages
- Asynchronous one-way messaging with changing topology (the CSP communication graph is fixed), no order guarantees
- The recipient is identified by a mailing address; everything is an actor

58 Actor Model 58 Interaction with asynchronous, unordered, distributed messaging Fundamental aspects Emphasis on local state, time and name space No central entity Actor A gets to know actor B only by direct creation, or by name transmission from another actor C Computation Not global state sequence, but partially ordered set of events Event: Receipt of a message by a target actor Each event is a transition from one local state to another Events may happen in parallel Messaging reliability declared as orthogonal aspect

59 Message Passing Interface (MPI): MPI_GATHER(IN sendbuf, IN sendcount, IN sendtype, OUT recvbuf, IN recvcount, IN recvtype, IN root, IN comm)
- Each process sends its buffer to the root process (including the root itself)
- Incoming messages are stored in rank order
- The receive buffer is ignored for all non-root processes
- MPI_GATHERV allows a varying count of data to be received per process
- The call returns when the send buffer is reusable; completion at the other processes is not promised
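A minimal usage sketch of MPI_Gather in C (illustrative values; the fixed-size receive buffer assumes at most 64 processes):

```c
#include <stdio.h>
#include <mpi.h>

/* Each process contributes one int; rank 0 receives them in rank order. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;            /* local value of this process        */
    int recv[64];                      /* only examined on the root;         */
                                       /* sketch assumes size <= 64          */
    MPI_Gather(&mine, 1, MPI_INT,      /* send buffer: 1 int per rank        */
               recv, 1, MPI_INT,       /* recvcount counts per sender        */
               0, MPI_COMM_WORLD);     /* root = rank 0                      */

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, recv[i]);

    MPI_Finalize();
    return 0;
}
```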

60 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

61 Execution Environment Mapping (figure: mapping onto Multiple Instruction, Multiple Data (MIMD) and Single Instruction, Multiple Data (SIMD) execution environments)

62 Patterns for Parallel Programming [Mattson] 62 Finding Concurrency Design Space task / data decomposition, task grouping and ordering due to data flow dependencies, design evaluation Algorithm Structure Design Space Task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination Mapping of concurrent design elements to execution units Supporting Structures Design Space SPMD, master / worker, loop parallelism, fork / join, shared data, shared queue, distributed array Program structures and data structures used for code creation Implementation Mechanisms Design Space

63 Designing Parallel Algorithms [Foster] 63 Map workload problem on an execution environment Concurrency for speedup Data locality for speedup Scalability Best parallel solution typically differs massively from the sequential version of an algorithm Foster defines four distinct stages of a methodological approach Example: Parallel Sum

64 Example: Parallel Reduction - reduce a set of elements into one, given an operation; example: sum (see the sketch below)
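One possible realization of the sum reduction with OpenMP (a sketch, not taken from the slides): each thread accumulates a private partial sum, and the runtime combines the partial results with the + operator.

```c
#include <stdio.h>

/* Parallel sum as a reduction: every thread works on a private copy of
 * 'sum', and OpenMP combines the partial sums after the loop. */
int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;            /* partial harmonic sum as example data */

    printf("H(%d) ~= %f\n", n, sum);
    return 0;
}
```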

65 Designing Parallel Algorithms [Foster] 65 A) Search for concurrency and scalability Partitioning Decompose computation and data into small tasks Communication Define necessary coordination of task execution B) Search for locality and other performance-related issues Agglomeration Consider performance and implementation costs Mapping Maximize processor utilization, minimize communication Might require backtracking or parallel investigation of steps

66 Partitioning
- Expose opportunities for parallel execution: fine-grained decomposition
- A good partition keeps computation and data together: data partitioning leads to data parallelism, computation partitioning leads to task parallelism; complementary approaches that can lead to different algorithms
- Reveal hidden structures of the algorithm that have potential; investigate complementary views on the problem
- Avoid replication of either computation or data; this can be revised later to reduce communication overhead
- The step results in multiple candidate solutions

67 Partitioning - Decomposition Types 67 Domain Decomposition Define small data fragments Specify computation for them Different phases of computation on the same data are handled separately Rule of thumb: First focus on large or frequently used data structures Functional Decomposition Split up computation into disjoint tasks, ignore the data accessed for the moment With significant data overlap, domain decomposition is more appropriate

68 Partitioning Strategies [Breshears]
- Produce at least as many tasks as there will be threads / cores; but it might be more effective to use only a fraction of the cores (granularity)
- The computation must pay off with respect to the overhead; avoid synchronization, since it adds up as overhead to the serial execution time
- Patterns for data decomposition: by element (one-dimensional); by row, by column group, by block (multi-dimensional); influenced by the ratio of computation and synchronization

69 Partitioning - Checklist
- Order of magnitude more tasks than processors? -> Keeps flexibility for the next steps
- Avoidance of redundant computation and storage requirements? -> Scalability for large problem sizes
- Tasks of comparable size? -> Goal is to allocate equal work to processors
- Does the number of tasks scale with the problem size? -> The algorithm should be able to solve larger problems with more processors
- Resolve bad partitioning by estimating performance behavior, and eventually reformulating the problem

70 Communication Step 70 Specify links between data consumers and data producers Specify kind and number of messages on these links Domain decomposition problems might have tricky communication infrastructures, due to data dependencies Communication in functional decomposition problems can easily be modeled from the data flow between the tasks Categorization of communication patterns Local communication (few neighbors) vs. global communication Structured communication (e.g. tree) vs. unstructured communication Static vs. dynamic communication structure Synchronous vs. asynchronous communication

71 Communication - Hints
- Distribute computation and communication, don't centralize the algorithm; bad example: a central manager for parallel summation
- Divide and conquer helps as a mental model to identify concurrency
- Unstructured communication is hard to agglomerate, so better avoid it
- Checklist for communication design: Do all tasks perform the same amount of communication? -> Distribute or replicate communication hot spots. Does each task perform only local communication? Can communication happen concurrently? Can computation happen concurrently?

72 Ghost Cells
- Domain decomposition might lead to chunks that demand data from each other for their computation
- Solution 1: copy the necessary portion of data ("ghost cells") if no synchronization is needed after an update; the data amount and the frequency of updates influence the resulting overhead and efficiency; additional memory consumption
- Solution 2: access the relevant data "remotely"; delays thread coordination until the data is really needed; correctness ("old" data vs. "new" data) must be considered on parallel progress

73 Agglomeration Step 73 Algorithm so far is correct, but not specialized for some execution environment Check again partitioning and communication decisions Agglomerate tasks for efficient execution on some machine Replicate data and / or computation for efficiency reasons Resulting number of tasks can still be greater than the number of processors Three conflicting guiding decisions Reduce communication costs by coarser granularity of computation and communication Preserve flexibility with respect to later mapping decisions Reduce software engineering costs (serial -> parallel version)

74 Agglomeration [Foster] (figure)

75 Agglomeration Granularity vs. Flexibility 75 Reduce communication costs by coarser granularity Sending less data Sending fewer messages (per-message initialization costs) Agglomerate, especially if tasks cannot run concurrently Reduces also task creation costs Replicate computation to avoid communication (helps also with reliability) Preserve flexibility Flexible large number of tasks still prerequisite for scalability Define granularity as compile-time or run-time parameter

76 Agglomeration - Checklist
- Are communication costs reduced by increasing locality?
- Does replicated computation outweigh its costs in all cases?
- Does data replication restrict the range of problem sizes / processor counts?
- Do the larger tasks still have similar computation / communication costs?
- Do the larger tasks still act with sufficient concurrency?
- Does the number of tasks still scale with the problem size?
- How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
- Is the transition to parallel code worth the engineering costs?

77 Mapping Step 77 Only relevant for shared-nothing systems, since shared memory systems typically perform automatic task scheduling Minimize execution time by Place concurrent tasks on different nodes Place tasks with heavy communication on the same node Conflicting strategies, additionally restricted by resource limits In general, NP-complete bin packing problem Set of sophisticated (dynamic) heuristics for load balancing Preference for local algorithms that do not need global scheduling state

78 Surface-To-Volume Effect [Foster, Breshears]
- Visualize the data to be processed (in parallel) as a sliced 3D cube
- The synchronization requirements of a task are proportional to the surface of the data slice it operates upon (the amount of "borders" of the slice)
- The computation work of a task is proportional to the volume of the data slice it operates upon; this represents the granularity of the decomposition
- Ratio of synchronization to computation: high synchronization with low computation (high ratio) → bad; low synchronization with high computation (low ratio) → good
- The ratio decreases for increasing data size per task: coarse granularity by agglomerating tasks in all dimensions; for a given volume, the surface then goes down → good
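For the simplest cubic case (illustrative arithmetic, not on the slide): a sub-block of edge length d contains on the order of d^3 grid points but exposes on the order of 6 d^2 boundary points, so

```latex
\frac{\text{synchronization}}{\text{computation}} \;\propto\; \frac{6\,d^{2}}{d^{3}} \;=\; \frac{6}{d}
```

i.e. the ratio shrinks as the blocks get coarser, which is exactly the agglomeration argument above.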

79 Surface-To-Volume Effect [Foster, Breshears] 79 (C) nicerweb.com

80 Surface-to-Volume Effect [Foster] 80 Computation on 8x8 grid (a): 64 tasks, one point each 64x4=256 synchronizations 256 data values are transferred (b): 4 tasks, 16 points each 4x4=16 synchronizations 16x4=64 data values are transferred

81 Designing Parallel Algorithms [Breshears]
- A parallel solution must keep the sequential consistency property: mentally simulate the execution of the parallel streams, check critical parts of the parallelized sequential application
- Amount of computation per parallel task: overhead is always introduced by moving from serial to parallel code; the speedup must offset the parallelization overhead (Amdahl)
- Granularity: amount of parallel computation done before synchronization is needed; fine-grained granularity overhead vs. coarse-grained granularity concurrency
- Iterative approach to finding the right granularity; the decision might only be correct for a chosen execution environment

82 OK?!?

83 Certificate for free


More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing Introduction to Parallel Computing with MPI and OpenMP P. Ramieri Segrate, November 2016 Course agenda Tuesday, 22 November 2016 9.30-11.00 01 - Introduction to parallel

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

Application Programming

Application Programming Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Overview. Processor organizations Types of parallel machines. Real machines

Overview. Processor organizations Types of parallel machines. Real machines Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters, DAS Programming methods, languages, and environments

More information

Claude TADONKI. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude TADONKI. MINES ParisTech PSL Research University Centre de Recherche Informatique Got 2 seconds Sequential 84 seconds Expected 84/84 = 1 second!?! Got 25 seconds MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Séminaire MATHEMATIQUES

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

ANURADHA DANDE SIMULATION OF MULTIPROCESSOR SYSTEM SCHEDULING

ANURADHA DANDE SIMULATION OF MULTIPROCESSOR SYSTEM SCHEDULING ANURADHA DANDE SIMULATION OF MULTIPROCESSOR SYSTEM SCHEDULING Master of Science Thesis Examiner(s): Professor Jari Nurmi, Doctor Sanna Määttä Examiner and topic approved by the Teaching and research Council

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information