
1 Part One: Introduction to Parallel Computing 10th August 2007

2 Part 1 - Contents
- Reasons for parallel computing
- Goals and limitations
- Criteria for High Performance Computing
- Overview of parallel computer architectures
- Examples of problems demanding parallel processing: well and hard to parallelize
- Relations between algorithmic complexity and parallel computing
- Measures of parallel computing
- Reachable speedups (Amdahl's Law)
- Finding an optimal number of processors
- Typical parallel applications

3 Reasons for Parallel Computing
Parallel computing is often considered the main direction for high performance computing. Specific goals can be:
- Solve problems in a shorter, acceptable time
- Find solutions for big problems (large set of input variables, very high accuracy) in acceptable time
- Map a problem into the memory of a computer
Most of these aspects are supported by progress in technology and processor architecture, but there are limitations.

4 Limiting Factors
Traditionally, performance growth was driven by:
- Packing more and more functions into a processor chip
- Scaling up the processor's clock frequency
Limiting factors, i.e. aspects working against this:
- The area of a processor chip (die area) cannot be enlarged without increasing the time for signal propagation.
- Clock frequency is bounded: signals often must propagate across the chip area within a single clock cycle.
- A further increase of functional density leads to structures that measure only a few atoms.
- More functionality and higher clock frequencies cause more energy consumption and more heating of the processor.

5 How these limitations materialize
Example: 20 cm/ns is a typical signal velocity in copper (electrical wires). A future processor chip could be clocked at 100 GHz; a clock period then is 0.01 ns, so the signal distance within one period is only 0.2 cm.
Thus, the only ways for a performance increase are:
- Parallel utilization of smaller sub-components within a processor chip, not in a common clock domain
- Using multiple processors
- Using multiple computers
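The arithmetic of this example can be checked in a few lines; a minimal Python sketch (variable names are ours, not from the slides):

# Signal distance per clock cycle at a given frequency,
# assuming the slide's 20 cm/ns propagation velocity in copper.
velocity_cm_per_ns = 20.0
freq_ghz = 100.0
period_ns = 1.0 / freq_ghz                  # 100 GHz -> 0.01 ns per cycle
distance_cm = velocity_cm_per_ns * period_ns
print(period_ns, "ns,", distance_cm, "cm")  # 0.01 ns, 0.2 cm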

6 Criteria
(Diagram: better algorithm (fewer operations), degree of parallel processing, mapping onto processors and the memory hierarchy.)
- Algorithm with a minimal number of computation steps: often several algorithms exist, differing in the number of computation steps and in memory consumption.
- Mapping to the processor architecture: use of operations directly corresponding to instructions, memory locality.
- Parallelization: decomposition into independently executable operation streams / use of vector operations.

7 Overview: Parallel Computer Architectures (1)
Definition of a parallel computer by T. Ungerer: A parallel computer consists of multiple processing units that work in a coordinated way and (at least partly) simultaneously in order to solve a problem cooperatively.

8 Overview: Parallel Computer Architectures (2)
Classification (Flynn, 1966): a coarse classification based on the number of independent instruction streams and the number of data streams.

                                     Data streams: Single (SD)   Multiple (MD)
Instruction streams: Single (SI)     SISD                        SIMD
Instruction streams: Multiple (MI)   -                           MIMD

The von Neumann computer is SISD. SIMD and MIMD are extensions of the von Neumann architecture, and both are parallel computers.

9 Overview: Parallel Computer Architectures (3)
MIMD:
- Shared memory multiprocessor systems
  - Servers with many processors (as usual today), with different application components being executed on different processors, e.g. database and web server
  - Multi-core processors
- Distributed memory multiprocessor systems - distributed systems
  - Blade systems (networked computer blades)
  - Cluster computers
  - Networked workstations, used with a parallel runtime environment (e.g. MPI)
  - Parallel computers connected by wide area networks: GRID

10 Overview: Parallel Computer Architectures (4)
SIMD:
- Array computers: a high number of identically structured arithmetical units working synchronously under the control of a single control unit.
- Vector computers: one or multiple specialized arithmetical units that work in a pipeline mode for fast floating point calculations (arithmetical pipelining).

11 Classification
(Diagram: classification tree, annotated with architecture classes and technical traits.)
Parallel computers
- von Neumann
  - MIMD
    - Distributed memory: network topologies, routing
    - Shared memory (UMA, NUMA): cache coherency & memory consistency
  - SIMD
    - Array computers
    - Vector computers
- non-von Neumann
  - Dataflow computers
  - Systolic arrays

12 Example (1/3)
Polynomial: y = a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0
Algorithm A1: separate calculation of powers and products
(1) Powers x^1, x^2, x^3, x^4: 3 multiplications
(2) Products a_i x^i: 4 multiplications
(3) Summation: 4 additions
(4) Control (loop over i): 3 additions
A1 requires 7 multiplications and 7 additions.

13 Example (2/3)
Algorithm A2: stepwise calculation (Horner scheme)
(1) i := 1; n := 4;
(2) z := a[n];
(3) z1 := x * z + a[n-i];
(4) i := i + 1;
(5) if (i <= n) { z := z1; goto (3); }
(6) result := z1;
A2 requires 4 multiplications and 8 additions (4 in the evaluation itself, 4 for the loop counter). A2 is the better algorithm compared to A1, because it needs fewer operations.
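For reference, both algorithms as runnable Python; a sketch in which the coefficient list a = [a_0, ..., a_n] and the function names are ours, not from the slides:

def poly_a1(a, x):
    # A1: compute the powers first, then the products, then the sum
    n = len(a) - 1
    powers = [x ** i for i in range(n + 1)]       # x^0 .. x^n
    return sum(ai * p for ai, p in zip(a, powers))

def poly_a2(a, x):
    # A2 (Horner): y = (((a4*x + a3)*x + a2)*x + a1)*x + a0
    z = a[-1]
    for ai in reversed(a[:-1]):
        z = z * x + ai
    return z

coeffs = [5, 4, 3, 2, 1]                          # a_0 .. a_4, illustrative
assert poly_a1(coeffs, 3) == poly_a2(coeffs, 3)   # both give 179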

14 Example (3/3)
Question: can A1 and A2 be parallelized?
A1: y = a_4*x*x*x*x + a_3*x*x*x + a_2*x*x + a_1*x + a_0
A2: y = (((a_4 * x + a_3) * x + a_2) * x + a_1) * x + a_0
(Figure: dependency graphs of the operations of both algorithms.)
A1 delivers its result after 5 time steps, since independent multiplications and additions can run in parallel. A2 delivers its result after 8 time steps.
A1 is the better one in terms of parallel execution; A2 cannot be parallelized, due to data dependencies.

15 Complexity of Algorithms (1)
Definition: time complexity = number of computation steps related to the problem size.
- n ... size of the input data
- T(n) ... exact number of computation steps
- O(n): order of complexity (without constant factors; contains only the dominating functions of n)
Example: T(n) = n + 3n^2 is in O(n^2).

16 Complexity of Algorithms (2)
Hierarchy of complexities:
- Useful algorithms: O(1), O(log n)
- Still useful algorithms: O(n), O(n log n), polynomial
- Critical, practically useless algorithms: O(2^n), O(n!)
Parallel execution of an algorithm is beneficial if:
- its complexity is in the range between logarithmic and polynomial, and
- the algorithm contains a high degree of independent calculations.

17 Complexity and Parallel Computing
(Figure: scaling the problem size; left: single processor, right: linearly growing number of processors.)

18 Complexity and Parallel Computing
Examples:
- Scalar product, O(n): the number of processors needed corresponds directly to the scaled vector size: n_new = d * n_old  =>  p_new = d * p_old.
- Matrix multiplication, O(n^3): parallel matrix multiplication allows bigger problem sizes in constant time: n_new = d * n_old  =>  p_new = d^3 * p_old.
- Generate and test binary numbers of length n, O(2^n): practically not scalable: n_new = n_old + 1  =>  p_new = 2 * p_old.
- Traveling Salesman, O(n!): practically not scalable: n_new = n_old + 1  =>  p_new = n_new * p_old.
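These growth factors follow from requiring T(n_new)/p_new = T(n_old)/p_old; a small Python check (the helper name is ours):

import math

def proc_factor(T, n_old, n_new):
    # processors needed to keep the run time constant grow with the work T(n)
    return T(n_new) / T(n_old)

print(proc_factor(lambda n: n, 100, 200))                  # O(n):   2.0
print(proc_factor(lambda n: n ** 3, 100, 200))             # O(n^3): 8.0
print(proc_factor(lambda n: 2 ** n, 20, 21))               # O(2^n): 2.0 per +1
print(proc_factor(lambda n: math.factorial(n), 10, 11))    # O(n!): 11.0 per +1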

19 A Good Example (1/2) Matrix Multiplication C = A * B

for i:=0 to n-1
  for j:=0 to n-1
    c[i,j] := 0
    for k:=0 to n-1
      c[i,j] := c[i,j] + a[i,k] * b[k,j]
    endfor
  endfor
endfor

Complexity order: O(n^3)
Parallel algorithm: input partitioning - the outer two loops (i, j) are split, and different processes/threads cover these different areas (see the sketch below).
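A minimal sketch of this input partitioning in Python, splitting the i-loop across worker processes (all names are ours; a real implementation would use optimized kernels instead of pure Python):

from concurrent.futures import ProcessPoolExecutor

def rows_block(args):
    # one worker computes its assigned rows of C = A * B
    a, b, rows = args
    n = len(a)
    return [(i, [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)])
            for i in rows]

def parallel_matmul(a, b, p):
    n = len(a)
    blocks = [list(range(r, n, p)) for r in range(p)]   # split the i-loop
    c = [None] * n
    with ProcessPoolExecutor(max_workers=p) as ex:
        for part in ex.map(rows_block, [(a, b, blk) for blk in blocks]):
            for i, row in part:
                c[i] = row
    return c

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(parallel_matmul(A, B, 2))   # [[19, 22], [43, 50]]

Because of process start-up and data transfer overhead this only pays off for large matrices; the point here is the partitioning pattern.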

20 A Good Example (2/2) Matrix Multiplication
The table shows the number of steps, written as the product of steps per loop:

n    input size        T_1(n)              T_2(n)              T_4(n)              T_8(n)
10   2*10^2 = 200      10*10*10 = 1000     5*10*10 = 500       5*5*10 = 250        5*5*5 = 125
20   2*20^2 = 800      20*20*20 = 8000     10*20*20 = 4000     10*10*20 = 2000     5*10*20 = 1000
40   2*40^2 = 3200     40*40*40 = 64000    20*40*40 = 32000    20*20*40 = 16000    10*20*40 = 8000
80   2*80^2 = 12800    80*80*80 = 512000   40*80*80 = 256000   40*40*80 = 128000   20*40*80 = 64000

The problem size can be increased, but doubling the problem size requires the processor number to be increased by a factor of 8.

21 A Bad Example (1/2) Traveling Salesman (TSP)
Input: n objects; for each pair of objects i, j a distance cost d_{i,j} in {1, 2, ..., n}
Required result: a permutation p of the objects (p(i) = i-th element) such that
  sum_{i=1}^{n-1} d_{p(i),p(i+1)} + d_{p(n),p(1)}
is minimal.
Time complexity: T = (n-1)!; T = (n-1)!/2 for the symmetric TSP.
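A brute-force solver makes the (n-1)! behaviour concrete; a Python sketch (names are ours, dist assumed to be an n x n cost matrix):

from itertools import permutations

def tsp_bruteforce(dist):
    # fix the start object and enumerate all (n-1)! remaining orders
    n = len(dist)
    best_cost, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n - 1))
        cost += dist[tour[-1]][tour[0]]     # the return edge closes the tour
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return best_cost, best_tour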

22 A Bad Example (2/2)
Experiment: provide (n-1) processors for a problem of size n.

n    T_1(n)            T_{n-1}(n)
4    3! = 6            6/3 = 2
5    4! = 24           24/4 = 6
6    5! = 120          120/5 = 24
10   9! = 362880       362880/9 = 40320
11   10! = 3628800     3628800/10 = 362880

By using n processors we are able to process a problem of size n + 1 in the time a single-processor machine needs for a problem of size n.

23 Measures to Evaluate Parallel Computing: Speedup
Parameters:
- p ... number of processors used
- T_1 ... time steps needed for execution on a single processor
- T_p ... time steps for execution on a parallel computer with p processors
Speedup - how many times faster the program runs:
  S_p = T_1 / T_p
Speedup is normally in the range 1 ... p. If S_p > p, this is caused by additional effects, e.g. better memory utilization or a parallel operating system.

24 Measures: Efficiency
Efficiency - utilization of parallelism:
  E_p = S_p / p
Normally, E_p is in the range 0 ... 1. Ideal algorithms exhibit E_p = 1, independently of p. When E_p on a realistic machine does not sink with an increasing number of processors, we call the system scalable (scalability).

25 Measures: Scaleup
Scaleup - how much more data can be processed in a fixed period of time:
- m ... size of the small problem
- n ... size of the big problem, computed with p processors
  SC_p = n / m, where T_1(m) = T_p(n)
Scaleup depends directly on the time complexity of the algorithm.
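All three measures applied to the matrix-multiplication table of slide 20; a Python sketch (function names are ours):

def speedup(t1, tp):            # S_p = T_1 / T_p
    return t1 / tp

def efficiency(t1, tp, p):      # E_p = S_p / p
    return speedup(t1, tp) / p

t1, t8 = 1000, 125              # T_1(10) and T_8(10) from the table on slide 20
print(speedup(t1, t8))          # 8.0
print(efficiency(t1, t8, 8))    # 1.0 -> ideal utilization of the i/j loop split
# Scaleup: the table gives T_8(20) = 1000 = T_1(10), so SC_8 = 20/10 = 2.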

26 Measures: Reachable Speedup (1)
Ideally, with p processors we can gain a speedup of p - but not always, because most algorithms contain (small) sequential parts.
- a ... fraction of the work that can be parallelized on p processors
- b ... fraction that remains sequential (e.g. due to data dependencies)
a and b are fractions of time consumption related to the entire execution time on a single processor, thus a + b = 1.
Using the speedup formula and normalizing T_1(n) to 1, we obtain:
  S_p = T_1(n) / T_p(n) = (a + b) / (b + a/p) = 1 / ((1 - a) + a/p)

27 Measures: Reachable Speedup (2)
Amdahl's Law (1967):
  S_p = 1 / (b + a/p)
Maximal speedup, using an infinite number of processors:
  lim_{p -> infinity} S_p = 1/b
(Plot: reachable speedup for b varied from 0 to 1; x-axis is b.)
Even a low fraction of non-parallelizable operations may significantly limit the reachable speedup.
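Amdahl's Law in a few lines of Python (names are ours):

def amdahl_speedup(b, p):
    # S_p = 1 / (b + a/p) with a = 1 - b
    a = 1.0 - b
    return 1.0 / (b + a / p)

for b in (0.01, 0.05, 0.1, 0.5):
    print(b, round(amdahl_speedup(b, 1024), 1), "limit:", 1 / b)
# even b = 0.1 caps the speedup at 1/b = 10, however large p gets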

28 Measures: Reachable Speedup (3)
  lim_{p -> infinity} S_p = 1/b
Example: with b = 0.1, the maximum speedup is 10, independently of how many processors are used.

29 Measures: Reachable Speedup (4)
(Plot: speedup over the number of processors p, curves for several b values.)
Even a sequential fraction b > 0.05 means that a noticeable speedup increase is only reached up to a certain number of processors p_x. The bigger b gets, the smaller p_x is.

30 Measures: Optimal number of processors (1)
We use another measure:
  F_p = (S_p * E_p) / T_1
- F_p grows with increasing speedup,
- but F_p sinks with decreasing efficiency.
The division by T_1 only normalizes F_p and is not really necessary in our scope.
F_p reaches a maximum when the optimal number of processors is used.

31 Measures: Optimal number of processors (2)
Apply Amdahl's Law to S_p and calculate E_p and F_p.
(Plot: F_p for several b values - fractions of non-parallelizable operations.)
F_p reaches a maximum when the optimal number of processors is used, so search for the top points of the curves.

32 Measures: Optimal number of processors (3)
Analytical approach: with T_1 normalized to 1,
  F_p = S_p * E_p = (S_p)^2 * (1/p) = (1 / (b + a/p))^2 * (1/p)
set the derivative to zero:
  dF_p/dp = d/dp [ (1 / (b + a/p))^2 * (1/p) ] = 0

33 Measures: Optimal number of processors (4)
We obtain:
  p_opt = ( a^2 / (1 - 2a + a^2) )^(1/2) = a / (1 - a) = a / b
Examples for the optimal p using the analytical approach (b, a, optimal p): the values follow directly from p_opt = a/b, e.g. b = 0.1, a = 0.9 gives p_opt = 9.
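The analytic optimum can be cross-checked by maximizing F_p numerically; a Python sketch (names are ours):

def F(b, p):
    # F_p = (S_p)^2 / p, with T_1 normalized to 1
    a = 1.0 - b
    s = 1.0 / (b + a / p)
    return s * s / p

b = 0.1
p_analytic = (1 - b) / b                          # a/b = 9.0
p_numeric = max(range(1, 1000), key=lambda p: F(b, p))
print(p_analytic, p_numeric)                      # 9.0 9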

34 Typical Parallel Applications (1)
All common applications exhibit a very high fraction of parallelizable operations (b very small).
Linear algebra: operations with vectors and matrices
- Systems of linear equations: A x = b
- Solvers may work in a direct way, e.g. the Gaussian elimination algorithm
- Iterative solvers, e.g. Gauss-Seidel iteration; some very efficient solvers exist for sparse coefficient matrices A
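As an illustration of a parallelizable iterative solver, here is a minimal Jacobi iteration; we deliberately use Jacobi instead of the Gauss-Seidel iteration named above, because Jacobi's component updates are mutually independent and thus parallel-friendly (names and the convergence assumption of a diagonally dominant A are ours):

def jacobi(A, b, iters=100):
    # all n component updates read only the old x, so they can run in parallel
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

print(jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))   # ~[0.0909, 0.6364]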

35 Typical Parallel Applications (2)
Solution of differential equations: equations that contain x, a function y(x) and its derivatives y'(x).
- Numerical solution uses discrete differences instead of symbolic differentiation.
- Approximated values for different values of x can be calculated in parallel (Runge-Kutta algorithm).
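A minimal finite-difference sketch using the forward Euler method, the simplest (first-order) member of the Runge-Kutta family (names and example values are ours):

def euler(f, x0, y0, h, steps):
    # y(x+h) is approximated by y(x) + h * y'(x), the discrete difference
    x, y = x0, y0
    for _ in range(steps):
        y += h * f(x, y)
        x += h
    return y

print(euler(lambda x, y: y, 0.0, 1.0, 0.001, 1000))   # ~e = 2.7169...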

36 Typical Parallel Applications (3)
Image processing:
- Local operators, e.g. spectrum spreading or smoothing, can be executed on different image parts in parallel
- Object matching, e.g. detection of geometric forms
- Finding similar blocks in different images for the detection of object movements
- (Soft) real-time multimedia

37 Summary Part 1
- High performance computing with parallel computers
- Goals: solve a problem in a shorter time (speedup), or solve bigger problems in a specified/acceptable time (scaleup)
- Different parallel computer architectures: multiprocessors (shared memory), distributed systems, vector processors, array computers
- Scaleup depends directly on the time complexity of the algorithm; parallelization helps if the order of the time complexity is polynomial or lower
- Speedup is limited by the sequential fraction of operations
- Common parallel applications have a very small sequential fraction of operations
