PCS - Part 1: Introduction to Parallel Computing

1 PCS - Part 1: Introduction to Parallel Computing Institute of Computer Engineering University of Lübeck, Germany Baltic Summer School, Tartu 2009

2 Part 1 - Overview: reasons for parallel computing; goals and limitations; criteria for high performance computing; overview of parallel computer architectures; examples of problems demanding parallel processing; problems that are easy and hard to parallelize; influence of algorithmic complexity; measures and limitations; measures: speedup, efficiency, scaleup; reachable speedup (Amdahl's Law); optimal number of processors; typical parallel applications.

3 Reasons for Parallel Computing: Parallel computing is often considered the main direction for high performance computing. Specific goals can be: solving problems in a shorter, acceptable time; finding solutions for big problems (e.g. a large set of input variables or very high accuracy) in acceptable time; mapping a problem into the memory of a computer. Most of these aspects are addressed by technological progress and processor architecture improvements, but there are limitations.

4 Limiting Factors: Traditionally, performance growth was driven by packing more and more functions into a processor chip and by scaling up the processor clock frequency. Limiting factors that work against this: The area of the processor chip (die area) cannot be enlarged without increasing the signal propagation time. The clock frequency is bounded, because signals often must propagate across the chip area within a single clock cycle. A further increase of functional density would require structures measuring only a few atoms. More functionality and a higher clock frequency cause higher energy consumption and more heating of the processor.

5 How these limitations materialize. Example: assume 20 cm/ns as the typical signal propagation speed in copper (electrical wires). A future processor chip could be clocked at 100 GHz. A clock period is then 0.01 ns, and the signal distance in this time is 0.2 cm. A processor chip may therefore not contain wires longer than 0.2 cm, which limits the size and number of functional units on a processor. Different techniques for further performance growth: parallel utilization of smaller sub-components within a processor chip that are not in a common clock domain; using multiple processors, e.g. multicore or multiprocessor machines; using multiple computers, e.g. computer clusters, GRID.
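The wire-length estimate on this slide is a single multiplication, which a minimal sketch can make explicit. The function name `max_wire_length_cm` is an assumption for illustration; the inputs (20 cm/ns and 100 GHz) are the values from the slide.

```python
def max_wire_length_cm(signal_speed_cm_per_ns: float, clock_ghz: float) -> float:
    """Distance a signal can travel within one clock period."""
    period_ns = 1.0 / clock_ghz  # a 1 GHz clock has a period of 1 ns
    return signal_speed_cm_per_ns * period_ns


# Values from the slide: 20 cm/ns propagation speed, 100 GHz clock
print(max_wire_length_cm(20.0, 100.0))
```

Raising the clock frequency shrinks the reachable distance proportionally, which is why the slide treats chip area and clock rate as coupled limits.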

6 Criteria for high computation performance: a better algorithm (fewer operations); the degree of parallel processing; the mapping of the code onto processors and into the memory hierarchy. Algorithm with a minimal number of computation steps: often several algorithms exist that differ in the number of computation steps and in memory consumption. Mapping to the processor architecture: use operations that directly correspond to CPU instructions and exploit memory locality with proper data structures. Parallel execution: decompose the program into independently executable operation streams, or use vector operations.

7 Example of a parallel algorithm (1/3) Calculate the value of a polynomial for a given x: $y = a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0$
Algorithm A1: separate calculation of powers and products
1. Powers $x^1, x^2, x^3, x^4$: 3 multiplications
2. Products $a_i x^i$: 4 multiplications
3. Summation: 4 additions
4. Control (loop over i): 3 additions
A1 requires 7 multiplications and 7 additions.

8 Example of a parallel algorithm (2/3) Algorithm A2: stepwise calculation, with a[4] = $a_4$, a[3] = $a_3$, a[2] = $a_2$, a[1] = $a_1$, a[0] = $a_0$
(1) i := 4; n := 4;
(2) z := a[n];
(3) z1 := x * z + a[i-1];
(4) i := i - 1;
(5) if (i > 0) { z := z1; goto (3); }
(6) result := z1;
A2 requires 4 multiplications, 4 additions and 4 subtractions. Compared to A1, A2 is the better sequential algorithm because it needs fewer operations.
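The two algorithms can be compared side by side in a short sketch. The function names `eval_naive` (for A1) and `eval_horner` (for A2) and the coefficient layout `coeffs[i] = a_i` are assumptions made for this illustration, not part of the slides.

```python
def eval_naive(coeffs, x):
    """A1: compute the powers of x first, then the products, then the sum.

    coeffs[i] is the coefficient a_i of x**i.
    """
    powers = [1.0, x]
    for _ in range(2, len(coeffs)):
        powers.append(powers[-1] * x)  # x^2, x^3, ... one multiplication each
    products = [coeffs[0]] + [coeffs[i] * powers[i] for i in range(1, len(coeffs))]
    return sum(products)


def eval_horner(coeffs, x):
    """A2: stepwise evaluation, one multiplication and one addition per step."""
    z = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        z = x * z + a  # each step needs the previous z: a data dependency
    return z


coeffs = [1, 2, 3, 4, 5]  # a_0 .. a_4, hypothetical example values
print(eval_naive(coeffs, 2), eval_horner(coeffs, 2))
```

In `eval_naive` the products are independent of each other and could be computed at the same time, while every step of `eval_horner` waits for the result of the previous one.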

9 Example of a parallel algorithm (3/3) Question: Are A1 and A2 suited for parallel execution?
A1: $y = a_4 \cdot x \cdot x \cdot x \cdot x + a_3 \cdot x \cdot x \cdot x + a_2 \cdot x \cdot x + a_1 \cdot x + a_0$
A2: $y = (((a_4 \cdot x + a_3) \cdot x + a_2) \cdot x + a_1) \cdot x + a_0$
[Figure: operation trees of both algorithms. A1: result after 5 time steps. A2: result after 8 time steps.]
A1 is the better choice for parallel execution. A2 cannot exploit parallelism because of its data dependencies.

10 Overview: Parallel Computer Architectures (1) Definition of a parallel computer by T. Ungerer: A parallel computer consists of multiple processing units that work in a coordinated way and (at least partly) simultaneously in order to solve a problem cooperatively. High-end computers make massive use of parallel computation units. Even low-end computers make use of parallel computation units (e.g. shader units of graphics cards).

11 Overview: Parallel Computer Architectures (2) Classification (Flynn, 1966): a coarse classification based on the number of independent instruction streams and the number of data streams.

                                    Single data stream (SD)   Multiple data streams (MD)
Single instruction stream (SI)      SISD                      SIMD
Multiple instruction streams (MI)   MISD                      MIMD

The von Neumann computer is SISD. SIMD and MIMD are extensions of the von Neumann architecture, and both are parallel computers.

12 Overview: Parallel Computer Architectures (3) MIMD (Multiple Instruction / Multiple Data) computers. Shared memory multiprocessor systems: servers with many processors (as usual today), with different application components executed on different processors, e.g. database and web server; multicore processors. Distributed memory multiprocessor systems (distributed systems): blade systems (networked computer blades); cluster computers; networked workstations, long used with a parallel run time environment (e.g. MPI); parallel computers connected by wide area networks (GRID).

13 Overview: Parallel Computer Architectures (4) SIMD (Single Instruction / Multiple Data) computers. Array computer: a high number of identically structured arithmetic units that work synchronously under the control of a single control unit. Vector computer: one or more specialized arithmetic units that work in pipeline mode for fast floating point calculations (arithmetic pipelining).

14 Classification
Parallel Computer
  von Neumann
    MIMD
      Distributed Memory (technical traits: network topologies, routing)
      Shared Memory: NUMA, UMA (technical traits: cache coherency & memory consistency)
    SIMD
      Array Computer
      Vector Computer
  Non von Neumann
    Dataflow Computer
    Systolic array
Legend: architecture class, technical traits

15 The world's biggest: TOP500 List. A ranking of supercomputers, released every six months, now with the 33rd TOP500 list (June 2009):
1. IBM Roadrunner: BladeCenter QS22/LS21 Cluster, processors: PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, network: Voltaire Infiniband, installed at DOE/NNSA/LANL, ... cores, R_max = ... TFlops, R_peak = ... TFlops
2. Cray XT5 Jaguar: Cray XT5 QC 2.3 GHz, Oak Ridge National Laboratory, U.S., ... cores, R_max = ... TFlops, R_peak = ... TFlops
3. IBM JUGENE: Blue Gene/P, installed at Forschungszentrum Juelich, Germany, ... cores, R_max = ... TFlops, R_peak = ... TFlops
9. Sun Microsystems Ranger: SunBlade x6420, Opteron QC 2.3 GHz, Infiniband, Texas Advanced Computing Center/Univ. of Texas, ... cores, R_max = ... TFlops, R_peak = ... TFlops
...

16 Top500: 1st rank: Roadrunner. Data from 2008: in total 6,562 dual-core AMD Opteron chips and 12,240 Cell chips; 98 terabytes of memory; housed in 278 refrigerator-sized racks; its 10,000 connections are both Infiniband and Gigabit Ethernet. Hybrid computing system: standard processing (e.g. file system I/O) is handled by the Opteron processors, while mathematically and CPU-intensive elements are directed to the Cell processors.

17 Influence of algorithmic complexity. Definition: time complexity is the number of computation steps in relation to the problem size. n: size of the input data. T(n): exact number of computation steps. O(·): order of complexity (without constant factors, contains only the dominant functions of n). Example: $T(n) = n + 3n^2 \Rightarrow O(n^2)$

18 Hierarchy of complexities. Useful algorithms: $O(1)$, $O(\log n)$. Still useful: $O(n)$, $O(n \log n)$, polynomial. Critical, useless algorithms: $O(2^n)$, $O(n!)$. Parallel execution of an algorithm is beneficial if its complexity lies between logarithmic and polynomial ($O(\log n)$ to $O(n^x)$) and the algorithm contains a high degree of independent calculations.

19 Complexity and Parallel Computing: Scaling problem size. [Figure: two plots of computation time versus problem size n, each with curves for linear, n log n, polynomial of degree 2, polynomial of degree 3 and factorial (n!) complexity; left: single processor, right: linearly growing number of processors.]

20 Complexity and Parallel Computing. Examples:
Scalar product, $O(n)$: the number of needed processors directly corresponds to the scaled vector size, $n_{new} = d \cdot n_{old} \Rightarrow p_{new} = d \cdot p_{old}$.
Matrix multiplication, $O(n^3)$: parallel matrix multiplication allows bigger problem sizes in constant time, $n_{new} = d \cdot n_{old} \Rightarrow p_{new} = d^3 \cdot p_{old}$.
Generate and test binary numbers of length n, $O(2^n)$: practically not scalable, $n_{new} = n_{old} + 1 \Rightarrow p_{new} = 2 \cdot p_{old}$.
Traveling Salesman Problem, $O(n!)$: practically not scalable, $n_{new} = n_{old} + 1 \Rightarrow p_{new} = p_{old} \cdot n_{new}$.

21 A Good Example (1/3) Matrix Multiplication $C = A \cdot B$
for i := 0 to n-1
  for j := 0 to n-1
    c[i,j] := 0
    for k := 0 to n-1
      c[i,j] := c[i,j] + a[k,j] * b[i,k]
    endfor
  endfor
endfor
Complexity order: $O(n^3)$. Parallel algorithm: input partitioning. The two outer loops (i, j) are split, and different processes/threads cover the different areas.
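The input partitioning described on this slide can be sketched with a thread pool. This is a simplified illustration under stated assumptions: only the outer loop over result rows is split (the slide splits both i and j), the matrices use the common [row][column] indexing, and the names `multiply_rows` and `parallel_matmul` are hypothetical. In CPython, threads mainly show the decomposition; a real speedup would need processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, rows):
    """Compute the rows of C = A * B listed in `rows` ([row][column] indexing)."""
    inner, width = len(b), len(b[0])
    result = {}
    for i in rows:
        result[i] = [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(width)]
    return result


def parallel_matmul(a, b, workers=4):
    n = len(a)
    # Input partitioning: worker w covers rows w, w + workers, w + 2 * workers, ...
    chunks = [range(w, n, workers) for w in range(workers)]
    c = [None] * n
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda rows: multiply_rows(a, b, rows), chunks):
            for i, row in partial.items():
                c[i] = row  # each row is written by exactly one worker
    return c


print(parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], workers=2))
```

Because every element of C depends only on A and B, the workers need no communication until the partial results are collected.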

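22 A Good Example (2/3) Matrix multiplication principle. [Figure: element $c_{i,j}$ of C is formed from the vector of A selected by j and the vector of B selected by i, both running over k.] Each element of C is the scalar product of two vectors, with the scalar product $s = \sum_{k=0}^{n-1} a[k] \cdot b[k]$.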

23 A Good Example (3/3) Matrix Multiplication. The table shows the number of steps, divided into steps per loop (i, j, k).

n    input size        T_1(n)              T_2(n)              T_4(n)              T_8(n)
10   2*10*10 = 200     10*10*10 = 1000     5*10*10 = 500       5*5*10 = 250        5*5*5 = 125
20   2*20*20 = 800     20*20*20 = 8000     10*20*20 = 4000     10*10*20 = 2000     5*10*20 = 1000
40   2*40*40 = 3200    40*40*40 = 64000    20*40*40 = 32000    20*20*40 = 16000    10*20*40 = 8000
80   2*80*80 = 12800   80*80*80 = 512000   40*80*80 = 256000   40*40*80 = 128000   20*40*80 = 64000

The problem size can be increased, but doubling the problem size requires the number of processors to be increased by a factor of 8.

24 A Bad Example (1/2) Traveling Salesman Problem (TSP). Input: n objects and, for each pair of objects $i, j \in \{1, 2, \dots, n\}$, a distance cost $d_{i,j}$. Required result: a permutation p of the objects, with p(i) the i-th element, such that $\left( \sum_{i=0}^{n-1} d_{p(i),p(i+1)} \right) + d_{p(n),p(0)}$ is minimal. [Figure: round trip with START / STOP at p(0), first edge $d_{p(0),p(1)}$ leading to p(1).] Time complexity: $T = (n-1)!$, or $T = (n-1)!/2$ for the symmetric TSP.

25 A Bad Example (2/2) Experiment: provide (n - 1) processors for a traveling salesman problem of size n.

n    T_1(n)           T_{n-1}(n)
4    3! = 6           6/3 = 2
5    4! = 24          24/4 = 6
6    5! = 120         120/5 = 24
10   9! = 362880      362880/9 = 40320
11   10! = 3628800    3628800/10 = 362880

By using n processors we can process a problem of size n + 1 in the time a single-processor machine needs for a problem of size n.
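The experiment can be reproduced with a few lines. This sketch assumes the exhaustive search with $T = (n-1)!$ steps from the previous slide and an even division of the tours among the processors; the function name `tsp_steps` is hypothetical.

```python
from math import factorial


def tsp_steps(n, processors=1):
    """Steps for an exhaustive TSP search of size n, assuming the (n-1)!
    tours are divided evenly among the processors."""
    return factorial(n - 1) // processors


# Columns: n, T_1(n), T_{n-1}(n)
for n in (4, 5, 6, 10, 11):
    print(n, tsp_steps(n), tsp_steps(n, n - 1))
```

With n - 1 processors the parallel run of size n takes exactly as many steps as the sequential run of size n - 1, so each additional processor buys only one additional object.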

26 Measures and Limitations: Speedup. Parameters: p: number of processors used; $T_1$: time steps needed for execution on a single processor; $T_p$: time steps for execution on a parallel computer with p processors. Speedup (how many times faster does the program run): $S_p = \frac{T_1}{T_p}$. The speedup is normally in the range 1 ... p. If $S_p > p$, this is caused by additional effects, e.g. better memory utilization or a parallel operating system.

27 Measures: Efficiency. Efficiency (utilization of parallelism): $E_p = \frac{S_p}{p}$. Normally, $E_p$ is in the range 0 ... 1. Ideal algorithms exhibit $E_p = 1$ independently of p, the number of processors. When $E_p$ on a realistic machine does not decrease with an increasing number of processors, we call the algorithm scalable (scalability).
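Speedup and efficiency follow directly from measured step counts. A minimal sketch, with hypothetical function names; the first example uses the step counts 1000 and 250 from the matrix multiplication table (n = 10, 4 processors), the second uses an invented, less favorable count of 400 steps.

```python
def speedup(t1, tp):
    """S_p = T_1 / T_p"""
    return t1 / tp


def efficiency(t1, tp, p):
    """E_p = S_p / p"""
    return speedup(t1, tp) / p


# Matrix multiplication, n = 10, on 4 processors: ideal case
print(speedup(1000, 250), efficiency(1000, 250, 4))

# Hypothetical run that needs 400 steps on 4 processors
print(speedup(1000, 400), efficiency(1000, 400, 4))
```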

28 Measures: Scaleup. Scaleup (how much more data can we process in a fixed period of time): m: size of the small problem; n: size of the big problem, computed with p processors. $SC_p = \frac{n}{m}$, where $T_1(m) = T_p(n)$. Scaleup depends directly on the time complexity of the algorithm.
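For polynomial complexity the scaleup has a closed form. Assuming $T_1(n) = n^k$ and ideal parallelization $T_p(n) = T_1(n) / p$ (both assumptions of this sketch, not statements from the slide), the condition $T_1(m) = T_p(n)$ gives $n = m \cdot p^{1/k}$. The function name `scaleup_polynomial` is hypothetical.

```python
def scaleup_polynomial(p, degree):
    """Scaleup n/m for T(n) = n**degree, assuming T_p(n) = T_1(n) / p."""
    return p ** (1.0 / degree)


# Linear algorithm: 4 processors handle 4 times the data in the same time
print(scaleup_polynomial(4, 1))

# Cubic algorithm (matrix multiplication): 8 processors only double the size
print(scaleup_polynomial(8, 3))
```

The higher the degree, the less additional data each new processor buys, which is the sense in which scaleup depends on time complexity.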

29 Limitations: Reachable Speedup (1) Ideally, with p processors we can gain a speedup of p. But this is not always achievable, because most algorithms contain (small) sequential parts. Known parameters: a: fraction of instructions that can be parallelized on p processors; b: fraction of instructions that remain sequential, e.g. due to data dependencies. a and b express fractions of the entire execution time on a single processor; thus a + b = 1. By using the speedup formula and normalizing $T_1(n)$ to 1, we obtain: $S_p = \frac{T_1(n)}{T_p(n)} = \frac{a + b}{b + \frac{a}{p}} = \frac{1}{(1 - a) + \frac{a}{p}}$

30 Limitations: Reachable Speedup (2) Amdahl's Law (1967): $S_p = \frac{1}{b + \frac{a}{p}}$. Maximal speedup with an infinite number of processors: $\lim_{p \to \infty} S_p = \frac{1}{b}$. [Figure: speedup versus fraction of sequential operations (b), varied from 0 to 1, with curves for several values of p, including p = 8 and p = 16.] Even a low fraction of non-parallel operations may significantly limit the reachable speedup.

31 Limitations: Reachable Speedup (3) $\lim_{p \to \infty} S_p = \frac{1}{b}$. Example: with b = 0.1, the maximum speedup is 10, regardless of how many processors are used (p >= 10). In general, $S_p \le \frac{1}{b}$.
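The bound can be checked numerically. A minimal sketch of Amdahl's Law with a = 1 - b; the function name `amdahl_speedup` is an assumption for illustration.

```python
def amdahl_speedup(b, p):
    """S_p = 1 / (b + a / p) with a = 1 - b (Amdahl's Law)."""
    a = 1.0 - b
    return 1.0 / (b + a / p)


# b = 0.1: the speedup approaches, but never reaches, 1 / b = 10
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.1, p), 3))
```

Even ten processors yield only a little more than half of the limit, since the sequential 10 % already takes as long as the parallel part at that point.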

32 Measures: Reachable Speedup (4) Vary the number of processors used, with curves for several b-values. [Figure: speedup versus number of processors (p) for b = 0.00, 0.01, 0.05, 0.10, 0.20, ...] For b > 0.05, a noticeable speedup increase can only be reached up to a certain number of processors $p_x$. The bigger b gets, the smaller $p_x$ is.

33 Measures: Optimal number of processors (1) We use another measure: $F_p = \frac{S_p \cdot E_p}{T_1}$. $F_p$ grows with increasing speedup, but $F_p$ decreases with decreasing efficiency. The division by $T_1$ normalizes $F_p$; it is not really necessary in our scope. $F_p$ reaches a maximum when the optimal number of processors is used.

34 Measures: Optimal number of processors (2) Apply Amdahl's Law to $S_p$ and calculate $E_p$ and $F_p$. [Figure: $F_p = S_p \cdot E_p$ versus number of processors (p), plotted for several b-values (fractions of non-parallel operations): b = 0.00, 0.01, 0.05, 0.10, 0.20, ...] $F_p$ reaches a maximum when the optimal number of processors is used: look for the top points of the curves.

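35 Measures: Optimal number of processors (3) Analytical approach: with $F_p = S_p \cdot E_p = (S_p)^2 \cdot \frac{1}{p} = \left( \frac{1}{b + \frac{a}{p}} \right)^2 \cdot \frac{1}{p}$, we require $\frac{dF_p}{dp} = \frac{d}{dp} \left[ \left( \frac{1}{b + \frac{a}{p}} \right)^2 \cdot \frac{1}{p} \right] = 0$.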

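36 Measures: Optimal number of processors (4) We obtain: $p = \left( \frac{a^2}{1 - 2a + a^2} \right)^{\frac{1}{2}}$, which simplifies to $p = \frac{a}{b}$ because $1 - 2a + a^2 = (1 - a)^2 = b^2$. Examples for p using the analytical approach (table columns: b, a, optimal p).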
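The analytical optimum can be cross-checked against a brute-force search over $F_p = (S_p)^2 / p$ under Amdahl's Law, for which setting $dF_p/dp = 0$ gives $p = a / (1 - a) = a / b$. In this sketch the b-values are taken from the plot legends of the earlier slides (an assumption, since the slide's own example table is not available), and the function names are hypothetical.

```python
def f_p(b, p):
    """F_p = S_p * E_p = S_p**2 / p, with S_p from Amdahl's Law."""
    a = 1.0 - b
    s = 1.0 / (b + a / p)
    return s * s / p


def optimal_processors(b):
    """Analytical optimum p = a / b with a = 1 - b (requires b > 0)."""
    return (1.0 - b) / b


def optimal_processors_search(b, p_max=1000):
    """Brute-force cross-check: the integer p in 1..p_max that maximizes F_p."""
    return max(range(1, p_max + 1), key=lambda p: f_p(b, p))


# Columns: b, a, optimal p (analytical), optimal p (search)
for b in (0.01, 0.05, 0.10, 0.20):
    print(b, 1.0 - b, round(optimal_processors(b)), optimal_processors_search(b))
```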

37 Typical Parallel Applications (1) All common applications exhibit a very high fraction of parallel operations (b is very small). Linear algebra: operations with vectors and matrices. Systems of linear equations, $A x = b$: solvers may work in a direct way, e.g. Gaussian elimination, or iteratively, e.g. Gauss-Seidel iteration; some very efficient solvers exist for sparse coefficient matrices A.

38 Typical Parallel Applications (2) Solution of differential equations: equations that contain x, a function y(x) and derivatives y'(x). The numerical solution uses discrete differences instead of symbolic differentiation. Approximated values for different values of x are calculated in parallel (Runge-Kutta algorithm).

39 Typical Parallel Applications (3) Image processing: local operators, e.g. spreading of the spectrum or smoothing, can be executed on different image parts in parallel; object matching, e.g. detection of geometric forms; finding similar blocks in different images for the detection of object movements; (soft) real-time multimedia.

40 Example: Partial Differential Equation (1) Laplace partial differential equation (Laplace PDE): $\Delta U(x, y) = \frac{\partial^2}{\partial x^2} U(x, y) + \frac{\partial^2}{\partial y^2} U(x, y) = 0$. The values of U(x, y) express, for instance, the spatial distribution of: electrical potential fields; temperature on a surface; the level of ground water (e.g. for planning building constructions). Boundary values must be known for a solution of $\Delta U(x, y) = 0$.

41 Example: Partial Differential Equation (2) Discretization: express U(x, y) by a two-dimensional array of values at discrete grid points, $U(x, y) \to U(i, j)$ with $x_i = i \cdot h$, $y_j = j \cdot h$, where h is the distance between neighboring points in the x and y directions. Discrete approximation of the differential operator: common practice is a substitution for the first-order derivative, according to $\frac{d}{dx} f(x) = f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$, so that $f'(x) = \frac{f(x + h) - f(x)}{h} + O(h)$. We still need a discretization of $\frac{d^2}{dx^2} f(x) = f''(x)$.

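42 Example: Partial Differential Equation (3) Using Taylor series:
$f(x + h) = f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 + \frac{1}{6} f'''(x) h^3 + \frac{1}{24} f^{(4)}(x) h^4 + \dots$
$f(x - h) = f(x) - f'(x) h + \frac{1}{2} f''(x) h^2 - \frac{1}{6} f'''(x) h^3 + \frac{1}{24} f^{(4)}(x) h^4 - \dots$
$f(x + h) + f(x - h) = 2 f(x) + f''(x) h^2 + \frac{1}{12} f^{(4)}(x) h^4 + O(h^6)$
This can be written as $f''(x) = \frac{f(x + h) + f(x - h) - 2 f(x)}{h^2} + O(h^2)$, with $O(h^2) = -\frac{h^2}{12} f^{(4)}(x) + \dots$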
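The second-order accuracy of the central difference can be observed numerically: halving h should reduce the error by roughly a factor of 4. This sketch uses $f(x) = \sin(x)$ at $x = 1$ as a test function, which is a choice made here for illustration; the function name `second_difference` is hypothetical.

```python
import math


def second_difference(f, x, h):
    """Central difference approximation of f''(x)."""
    return (f(x + h) + f(x - h) - 2.0 * f(x)) / (h * h)


def error(h, x=1.0):
    """Absolute error for f = sin, whose exact second derivative is -sin."""
    return abs(second_difference(math.sin, x, h) - (-math.sin(x)))


# Columns: h, error, ratio error(h) / error(h / 2)
for h in (0.1, 0.05, 0.025):
    print(h, error(h), error(h) / error(h / 2))
```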

43 Example: Partial Differential Equation (4) The differential operators can be written as differences: $\frac{U(i+1,j) + U(i-1,j) - 2U(i,j)}{h^2} + \frac{U(i,j+1) + U(i,j-1) - 2U(i,j)}{h^2} = 0$. Finally, an iterative formula for U(i, j) is obtained: $U(i,j) = \frac{1}{4} \left( U(i+1,j) + U(i-1,j) + U(i,j+1) + U(i,j-1) \right)$, where the left-hand side belongs to iteration x+1 and the values on the right-hand side are taken from iteration x.
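The iterative formula translates into a few lines of code. This is a minimal sequential sketch, assuming a square grid stored as a list of lists whose outermost rows and columns hold the fixed boundary values; the names `jacobi_step` and `solve_laplace` and the example boundary U = j are choices made for this illustration.

```python
def jacobi_step(u):
    """One iteration: every interior point becomes the mean of its four
    neighbors from the previous iteration; boundary values stay fixed."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])
    return new


def solve_laplace(u, iterations):
    for _ in range(iterations):
        u = jacobi_step(u)
    return u


# Example: the boundary follows U = j (a solution of the Laplace equation),
# the interior starts at 0 and should converge towards U = j as well.
n = 6
grid = [[float(j) if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
        for i in range(n)]
result = solve_laplace(grid, 200)
print([round(v, 6) for v in result[2]])
```

Writing into a separate `new` grid is what keeps the values of iteration x intact while iteration x+1 is being computed.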

44 Example: Partial Differential Equation (5) The iteration can be transformed into a parallel iteration on separate areas, one for each processor. [Figure: grid partitioned into areas, iteration x and iteration x+1.] Access to neighbor areas: multiprocessor: via access to shared memory and synchronization; multicomputer: by exchanging the U(i,j) values that lie on the boundaries of the locally processed area (using messages).
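The shared memory variant can be sketched with threads that each update one horizontal strip of the grid. This is an illustration under stated assumptions: the strip layout, the names `update_strip` and `parallel_jacobi`, and the example boundary U = j are choices made here, and CPython threads demonstrate the structure rather than a real speedup. On a multicomputer, the reads of the neighboring strip's rows would be replaced by message exchange.

```python
from concurrent.futures import ThreadPoolExecutor


def update_strip(u, new, row_start, row_end):
    """Update rows row_start .. row_end - 1 of `new` from the old grid `u`.
    Rows row_start - 1 and row_end of `u` belong to the neighbor areas."""
    cols = len(u[0])
    for i in range(row_start, row_end):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])


def parallel_jacobi(u, iterations, workers=2):
    rows = len(u)
    size = max(1, -(-(rows - 2) // workers))  # interior rows per strip, rounded up
    strips = [(start, min(start + size, rows - 1)) for start in range(1, rows - 1, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            new = [row[:] for row in u]
            # list(...) waits for all strips: the synchronization per iteration
            list(pool.map(lambda s: update_strip(u, new, s[0], s[1]), strips))
            u = new
    return u


n = 6
grid = [[float(j) if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
        for i in range(n)]
result = parallel_jacobi(grid, 200, workers=2)
print([round(v, 6) for v in result[2]])
```

All workers read only the old grid and write disjoint rows of the new one, so the only coordination needed is the barrier at the end of each iteration.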

45 Summary Part 1. High performance computing with parallel computers. Goals: solve a problem in a shorter time (speedup), or solve bigger problems in a specified/acceptable time (scaleup). Different parallel computer architectures: multiprocessors (shared memory), distributed systems (distributed memory), vector processors, and combinations of them. Scaleup depends directly on the time complexity of the algorithm; parallelization helps if the time complexity order is polynomial or less. Speedup is limited by the sequential fraction of operations b: $S_p \le \frac{1}{b}$; most parallel applications have a very small sequential fraction of operations.
