Parallel Numerics, WT 2017/ Introduction. page 1 of 127

Size: px

Start display at page:

Download "Parallel Numerics, WT 2017/ Introduction. page 1 of 127"

Lauren Hart
5 years ago
Views:

1 Parallel Numerics, WT 2017/ Introduction page 1 of 127

2 Scope Revise standard numerical methods considering parallel computations! Change method or implementation! page 2 of 127

3 Scope Revise standard numerical methods considering parallel computations! Change method or implementation! Required knowledge Numerics Parallel Programming Graphs page 2 of 127

4 Scope Revise standard numerical methods considering parallel computations! Change method or implementation! Required knowledge Numerics Parallel Programming Graphs Literature Dongarra, Duff, Sorensen, van der Vorst: Numerical Linear Algebra for High-Performance Computers (MAT 657f) Pacheco: A User s Guide to MPI (web) Parallel Programming with MPI Schüle: Paralleles Rechnen Source for some images: comp/ Additionally: Parallel in time/matrix formats/cg on GPU/energy/resilience page 2 of 127

5 Main aspects to keep in mind: Parallel efficiency (single instructions, algorithms) Accuracy Stability - Resilience Energy consumption page 3 of 127

6 Main aspects to keep in mind: Parallel efficiency (single instructions, algorithms) Accuracy Stability - Resilience Energy consumption Be aware of certain effects when writing programs like data access, break down of processors, idle processors... page 3 of 127

7 Computer Science Aspects of Parallel Numerics Computer Science Aspects of Parallel Numerics Numerical Problems Numerical Problems Nume Why parallel computing? Huge problems like weather prediction, quantum simulation, data mining, uncertainty quantification, modelling the universe,... TOP500 SuperMUC page 4 of 127

8 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations 2 Elementary Linear Algebra Problems 2.1 BLAS 2.2 Matrix-Vector Operations 2.3 Matrix-Matrix-Product 3 Linear Systems of Equations with Dense Matrices 3.1 Gaussian Elimination 3.2 Parallelization 3.3 QR-Decomposition with Householder matrices 4 Sparse Matrices 4.1 General Properties, Storage 4.2 Sparse Matrices and Graphs 4.3 Reordering 4.4 Gaussian Elimination for Sparse Matrices 5 Iterative Methods for Sparse Matrices 5.1 Stationary Methods 5.2 Nonstationary Methods 5.3 Preconditioning 6 Domain Decomposition 6.1 Overlapping Domain Decomposition 6.2 Non-overlapping Domain Decomposition 6.3 Schur Complements 7 Eigenvalues, Fourier Transform, Quantum Computing page 5 of 127

9 1.1. Computer Science Aspects of Parallel Numerics Pipelining in the CPU Goal: Acceleration of single instructions To notice: All ops are created equal Elementary operations are carried out in pipelines: Divide task into smaller subtasks Each subtask is executed on piece of hardware that operates concurrently with other stages of the pipeline Compare Production line, car manufacturing! page 6 of 127

10 1.2. Computer Science Aspects of Parallel Numerics Pipelining in the CPU Goal: Acceleration of single instructions To notice: All ops are created equal Elementary operations are carried out in pipelines: Divide task into smaller subtasks Each subtask is executed on piece of hardware that operates concurrently with other stages of the pipeline Compare Production line, car manufacturing! Addition Pipeline: page 6 of 127

11 Visualization Pipelining page 7 of 127

12 Visualization Pipelining: Time step 1 page 8 of 127

13 Visualization Pipelining: Time step 2 page 9 of 127

14 Visualization Pipelining: Time step 3 page 10 of 127

15 Visualization Pipelining: Time step 4 page 11 of 127

16 Visualization Pipelining: Time step 5 page 12 of 127

17 Visualization Pipelining Startup time = k(= 4) clock units Lateron: per clock unit one result Total time: k u + n u Total time considering only one pipeline and one vector operation! page 13 of 127

18 Advantages of Pipelines Pipeline filled: per clock unit one result is achieved. All additions should be organized such that pipeline is always filled! Pipeline nearly empty: e.g., in the beginning of the computations, it is not efficient! Major task for CPU: Organize all operations such that operands are just in time at the right position to fill the pipeline and keep it full. page 14 of 127

19 Computer Science Aspects of Parallel Numerics Computer Science Aspects of Parallel Numerics Numerical Problems Numerical Problems Nume CPU Pipelining page 15 of 127

20 Computer Science Aspects of Parallel Numerics Computer Science Aspects of Parallel Numerics Numerical Problems Numerical Problems Nume CPU page 16 of 127

21 Pipelining: General Steps Instruction Fetch: Get the next command Decoding: Analyse instruction, compute addresses of operands Operand Fetch: Get the values of the next operands Execution step: Carry out command on operands Result Write: Write result in memory Pipelining of these steps, and also inside each step. Chaining Pipelines: αx + y combining multiplication αx and addition page 17 of 127

22 Special case: Vector scaling For set of data the same operation has to be executed on all components. y 1 x 1 y 2. = α x 2. y n x n for j = 1, 2,..., n : y j = αx j ; Total costs: Startup time + vector length clock time (pipeline length + vector length ) T If CPU is well organized, the startup time disappears! page 18 of 127

23 Example of a modern CPU architecture Source: A. Heinecke and C. Trinitis, Cache-oblivious Matrix Algorithms in the Age of Multi- and Many-Cores, Concurrency and Computation: Practice and Experience, pp. 1 20, 2012, DOI: /cpe page 19 of 127

24 Problem: Data Dependency Example: Fibonacci numbers x 0 = 0, x 1 = 1, x 2 = x 0 + x 1,..., x i = x i 2 + x i 1 After filling in x 0 and x 1 the next pair needs x 2 which is known only after the first computation is finished! Pipeline contains always only one pair is nearly empty all the time! Similar Problem for recursive subroutine calls. page 20 of 127

25 Memory Organization page 21 of 127

26 Cache Idea Cache as memory buffer between large, slow and small, fast memory. Considering data flow (last used data), try to predict (!) which data will be requested in next step: keep last used data in fast cache because it is likely that the same data will be used again keep the neighborhood of last used data in fast cache. Memory is organized in chunks. Together with current data, put surroundings into cache (cache line). Possible discrepancy between intention and reality! page 22 of 127

27 Cache Idea Cache hit: The data requested from the higher level is found in the cache: use the cached data and transfer it to the higher level. Done. Cache miss: The data requested from the higher level is not found in the cache: Look for data in the lower (slower) memory. Copy the related data to the cache (removing the oldest cache entry) and transfer it to the higher level. page 23 of 127

28 Cache Idea Cache hit: The data requested from the higher level is found in the cache: use the cached data and transfer it to the higher level. Done. Cache miss: The data requested from the higher level is not found in the cache: Look for data in the lower (slower) memory. Copy the related data to the cache (removing the oldest cache entry) and transfer it to the higher level. Temporal locality: Data referenced at some point in time will again be referenced in near future. If data needs to be reused, do it as soon as possible! Spatial locality: Data close to recently referenced data is probable to be referenced soon. Work blockwisely to ensure neighbouring! page 23 of 127

29 Mapping between Memory Direct mapping: Adress in cache, modulo Adress Adress Disadvantage: Immediately replacing of data in cache page 24 of 127

30 Mapping between Memory Direct mapping: Adress in cache, modulo Adress Adress Disadvantage: Immediately replacing of data in cache Associative mapping: Partition cache in blocks. Write data to direct mapped address in one of the blocks. Replace oldest data in block. page 24 of 127

31 Cyclic data distribution Main memory is often organized in banks connected by bus: Per cycle n operands can be fetched out of the n banks. Storing vectors! x 1 in bank1 x 2 in bank2,.... Allows one step access to x. page 25 of 127

1.2.2. Parallel Processors Classical von Neumann

32 Parallel Processors Classical von Neumann model: Code and data in memory! Control unit fetches instructions and data from memory and sequentially coordinates the operations. page 26 of 127

33 Flynn Taxonomie Classification of computer systems by number of instruction streams and data streams: SISD: Single instruction stream, single data stream. Conventional serial computers page 27 of 127

34 Flynn Taxonomie Classification of computer systems by number of instruction streams and data streams: SISD: Single instruction stream, single data stream. Conventional serial computers SIMD: Single instruction stream can handle multiple data streams. All PEs synchronously executing same instruction stream on different sets of data. E.g., SSE (Streaming SIMD Extensions), instruction set, e.g. for graphics processing MISD: Machine is capable to process single data stream using multiple instruction streams. Uncommon architecture. Heterogenous systems operate on same data stream and must agree on result e.g., fault tolerance. page 27 of 127

35 Flynn Taxonomie Classification of computer systems by number of instruction streams and data streams: SISD: Single instruction stream, single data stream. Conventional serial computers SIMD: Single instruction stream can handle multiple data streams. All PEs synchronously executing same instruction stream on different sets of data. E.g., SSE (Streaming SIMD Extensions), instruction set, e.g. for graphics processing MISD: Machine is capable to process single data stream using multiple instruction streams. Uncommon architecture. Heterogenous systems operate on same data stream and must agree on result e.g., fault tolerance. MIMD: Multiple instruction streams on multiple data streams. General purpose parallel computers. page 27 of 127

36 Parallel Computation MIMD architecture: multiple instructions multiple data (compare to SISD = single instructions single data, etc.) page 28 of 127

37 Shared Memory Parallelization (SMP) page 29 of 127

Cache Coherence parallel for(i=proc_number; i<n; i=i+2) x[i]=2.; 2 threads, proc_number 0 and 1: Thread 1 changes x(0,2,4,...) and thread2 changes x(1,3,5,...). Each changing step of thread 1 also changes data that is contained in the cache of thread 2 (and vice versa).

38 Cache Coherence parallel for(i=proc_number; i<n; i=i+2) x[i]=2.; 2 threads, proc_number 0 and 1: Thread 1 changes x(0,2,4,...) and thread2 changes x(1,3,5,...). Each changing step of thread 1 also changes data that is contained in the cache of thread 2 (and vice versa). Otherwise data in the two caches is not consistent anymore! To retain the right values in both caches after each changing step, also the value in the other cache has to be renewed! Leads to dramatic increase of computational time, ev. slower than sequential computation! page 30 of 127

39 Distributed Memory Parallelization (DMP) Virtual shared memory: Distributed data but organized as shared memory. page 31 of 127

40 Nonuniform Memory Access (NUMA) Cluster of multiple CPU processors Different types of communication shared memory and distributed memory! page 32 of 127

41 Topology of the processor/memory interconnection (DMP) Mesh (Array, Grid): p processors, longest path 2( p 1) Historical first approaches: Vectorization and Topology of Processors page 33 of 127

42 Topology of the processor/memory interconnection Time for sending data from one processor to another depends on the connection network topology: Topology worst case time Mesh 2( p 1) Vector p 1 Ring/Torus p 2 Tree 2(ld(p) 1) Hypercube ld(p) Definition: diameter = largest distance page 34 of 127

43 Communication Topology diameter = 2( p 1) Manhattan distance page 35 of 127

44 Hypercube Diameter=Dimension page 36 of 127

45 Hypercube Diameter=Dimension page 36 of 127

46 Hypercube Diameter=Dimension Example: vector x i = x ip...i 1 i 0 as Hypercube page 36 of 127

47 Different Topologies Topology G p diameter(g) degree(g) edges(g) G 1 (n), (vector) n n 1 2 n 1 T 1 (n), (ring) n n 2 2 n G 2 (n, n), (mesh) n 2 2n 2 4 2n 2 2n T 2 (n, n) n 2 2 n 2 4 2n2 BT (h) 2 h+1 1 2h 3 2 h+1 2 HC(k) 2 k k k 2 k 1 k G: Grid, T : Torus, BT : binary tree, HC: hypercube page 37 of 127

48 Network based on Switches - 3-level Omega network Blocking network. Simultaneous connection P0 P6 and P1 P7 is not possible! Turn-over of switches necessary! Always only one possible position for pair of outgoing/ingoing switches! page 38 of 127

49 Network based on Switches - Crossbar Direct, independent connection between all processors. Nonblocking! page 39 of 127

50 page 40 of 127

51 Performance Analysis Definitions: f : parallel fraction of program 1 f : sequential fraction of program t p : wall clock time to execute the job on p parallel processors page 41 of 127

52 Performance Analysis Definitions: f : parallel fraction of program 1 f : sequential fraction of program t p : wall clock time to execute the job on p parallel processors Using p processors in parallel we can achieve speedup. Speedup S p := t 1 tp is the ratio of execution time with 1 vs. p PEs In the ideal case we observe t 1 = p t p. Choice of algorithm for t 1! page 41 of 127

53 Performance Analysis Definitions: f : parallel fraction of program 1 f : sequential fraction of program t p : wall clock time to execute the job on p parallel processors Using p processors in parallel we can achieve speedup. Speedup S p := t 1 tp is the ratio of execution time with 1 vs. p PEs In the ideal case we observe t 1 = p t p. Choice of algorithm for t 1! Efficiency using p processors: E p = S p p, 0 E p 1 E p 1: very good parallelizable, because then S p p or t 1 p t p. (problem scales) E p 0: bad, because E p = Sp p = t 1 p t p and t 1 p t p. page 41 of 127

54 Amdahl s Law Assumption: for fixed problem size program can be separated into parallel fraction f and sequential fraction 1 f. t p = f t 1 f + (1 f )p + (1 f )t 1 = t 1 (1 f )t 1 p }{{} p }{{} strongly sequential ideally parallel page 42 of 127

55 Amdahl s Law Assumption: for fixed problem size program can be separated into parallel fraction f and sequential fraction 1 f. t p = f t 1 f + (1 f )p + (1 f )t 1 = t 1 (1 f )t 1 p }{{} p }{{} strongly sequential ideally parallel Amdahl s law: S p = t 1 t p = 1 f +(1 f )p p = p f + (1 f )p 1 1 f E p = S p p = 1 f + (1 f )p 1 p E p = 0 (1 f )p We always have a small portion of our algorithm that is not parallelizable the efficiency will always be zero in the limit! page 42 of 127

56 Amdahl s Law Speedup depending on p. Saturation 1 1 f. Stay away of saturation! page 43 of 127

57 Gustafson s Law Assume that problem can be solved in 1 unit of time on parallel machine with p processors. Fraction f is good parallelizable with speedup p Define f as execution time of the parallelizable part in t p = 1 Compared with this parallel implementation an uniprocessor would perform t 1 = (1 f ) + f p for the same job. page 44 of 127

58 Gustafson s Law Assume that problem can be solved in 1 unit of time on parallel machine with p processors. Fraction f is good parallelizable with speedup p Define f as execution time of the parallelizable part in t p = 1 Compared with this parallel implementation an uniprocessor would perform t 1 = (1 f ) + f p for the same job. Speedup: S pf = t 1 = 1 f + f p = p + (1 p)(1 f ) t p 1 page 44 of 127

59 Gustafson s Law Assume that problem can be solved in 1 unit of time on parallel machine with p processors. Fraction f is good parallelizable with speedup p Define f as execution time of the parallelizable part in t p = 1 Compared with this parallel implementation an uniprocessor would perform t 1 = (1 f ) + f p for the same job. Speedup: Efficiency: S pf = t 1 = 1 f + f p = p + (1 p)(1 f ) t p 1 E pf = S pf p = 1 f + f p f p page 44 of 127

60 Amdahl: 0 1-f f 1 f/p f p f E p f f S p ft t f t p p p ) (1 1 / 1 1 / ) 1 ( 1 1 Gustafson: 0 1-f f 1 ' ' 1 ' ' 1 ' ') (1 1 f p f E p f f S pt f t f t p p p p page 45 of 127

61 Comparison Amdahl vs. Gustafson law/speedup: f = f = 0 p f = f = 1 Amdahl f Gustafson 1 f O(p) p p law/efficiency: f = f = 0 p f = f = 1 Amdahl 1 p 0 1 Gustafson 1 p f 1 page 46 of 127

62 Who Needs Communication Communications between tasks depends upon problem: No need for communication: Some problems can be decomposed and executed in parallel with virtually no need for tasks to share data. Often called embarrassingly parallel as they are straight-forward. Very little inter-task communication is required. page 47 of 127

63 Who Needs Communication Communications between tasks depends upon problem: No need for communication: Some problems can be decomposed and executed in parallel with virtually no need for tasks to share data. Often called embarrassingly parallel as they are straight-forward. Very little inter-task communication is required. Need for communication: Most parallel applications not so simple require tasks to share data with each other. Changes to neighboring data has a direct effect on that task s data. Usually there always exists a sequential part, e.g. I/O. page 47 of 127

64 Further Keywords Load balancing: Job should be distributed such that all processors are busy all the time! avoid idle processors efficiency page 48 of 127

65 Further Keywords Load balancing: Job should be distributed such that all processors are busy all the time! avoid idle processors efficiency Dead lock: Two or more processors waiting indefinitely for event that can be caused only by one of themselves. page 48 of 127

66 Further Keywords Load balancing: Job should be distributed such that all processors are busy all the time! avoid idle processors efficiency Dead lock: Two or more processors waiting indefinitely for event that can be caused only by one of themselves. Each processor waiting for results of the other processor! page 48 of 127

67 Further Keywords Load balancing: Job should be distributed such that all processors are busy all the time! avoid idle processors efficiency Dead lock: Two or more processors waiting indefinitely for event that can be caused only by one of themselves. Each processor waiting for results of the other processor! Race condition: Non-deterministic outcome/results depending on chronology of parallel tasks (shared memory). page 48 of 127

Parallel Numerics, WT 2013/ Introduction

Parallel Numerics, WT 2013/ Introduction Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature