A Study on the Performance of Cholesky-Factorization using MPI

Han S. Kim    Scott B. Baden
Department of Computer Science and Engineering
University of California, San Diego
{hskim, baden}@cs.ucsd.edu

Abstract

Cholesky factorization is a well-known method for solving a linear system of equations. In this paper, various parallel algorithms for Cholesky factorization using MPI are designed, analyzed, and implemented. The parallel algorithms depend on the layout of the workload; the column strip cyclic layout and the column block strip cyclic layout are the two main layouts discussed in this paper. We show that the performance of the parallel algorithm scales up relatively well, and that communication overhead erodes most of the performance gained from parallelism. Also, to find the optimal block size in the column block strip cyclic layout, a decomposition of the execution time was performed. The result is that the optimal block size is around 3 to 6, and the reason the size varies is that it results from the sum of the communication overhead distribution and the post-computation cost distribution.

1. Introduction

Cholesky factorization [1] is a well-known method for solving a linear system of equations. A given matrix with a special property (a positive definite matrix) can be decomposed into two matrices, and with these two matrices it becomes much easier to solve the linear system of equations. In this project (the CSE 260 Parallel Computation research project, Fall 2005, at the University of California, San Diego), we present parallel algorithms implemented with MPI [3]. To implement a real scientific application component, careful analysis has been done since the very early stages. Choosing the better algorithm out of various candidates is based on cost analysis, and during implementation and experiments we did additional investigation into the bottlenecks of the implementation.

We organize this paper as follows. In Section 2, we describe the background of the problem dealt with in this paper. In Section 3, as a naive approach, we present the sequential algorithm. In Section 4, we show two parallel algorithms and their cost analysis. In Section 5, we show the performance of the parallel algorithms and analytical results. We conclude the paper in Sections 6 and 7.

2. Background

2.1. Cholesky Factorization

Cholesky factorization is used in solving linear systems of equations. A linear system of equations in n unknowns x_1, x_2, ..., x_n is defined as a set of equations E_1, E_2, ..., E_n of the form [1]:

    E_1: a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
    E_2: a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
    ...
    E_n: a_n1 x_1 + a_n2 x_2 + ... + a_nn x_n = b_n

The system of equations above can be expressed in matrix form as Ax = b, where A = (a_ij) is the n-by-n matrix of coefficients, x = (x_1, ..., x_n)^T, and b = (b_1, ..., b_n)^T.

The Cholesky factorization of a given symmetric, positive definite square matrix A is of the form A = U^T U, where U is upper triangular and U^T is the transpose of U.
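For concreteness, here is a small worked instance of this definition (the numbers are our own illustration, not from the paper):

    % A 2-by-2 positive definite matrix and its Cholesky factor:
    A = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}
      = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}
        \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
      = U^{T} U, \qquad U = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}.

Once U is known, Ax = b reduces to two triangular solves: first U^T y = b by forward substitution, then U x = y by back substitution, which is why the factorization makes the system much easier to solve.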

2.2. Positive Definite Matrix

A matrix A is positive definite if and only if A = A^T and x^T A x > 0 for all x != 0.

2.2.1. Positive Definite Matrix Generation

Positive definite matrices of arbitrary size are required as input to the Cholesky factorization program. The approach to generating a positive definite matrix is quite simple. First, generate a real-valued upper triangular matrix. Then transpose the matrix. Now we have U and U^T. Simply multiplying the two matrices generates a positive definite matrix. The following theorem guarantees the result to be a proper positive definite matrix:

Theorem 1. (Ayers, 1962) A real symmetric matrix A is positive definite if and only if there exists a real nonsingular matrix M such that A = M M^T, where M^T is the transpose of M.

3. Initial Approach

3.1. Serial Algorithm

The serial algorithm helps in understanding how the algorithm works and how we can improve it through parallelism. The textbook serial algorithm is used. The running time of this algorithm is O(n^3): the outermost loop iterates n times, and the two inner loops iterate (n-k+1)(n-k)/2 times for each k. Summing k from 0 to n-1 yields an n^3 term, which means the running time is O(n^3).

Algorithm 1. SERIAL_CHOLESKY

procedure SERIAL_CHOLESKY(A)
    for k := 0 to n-1 do
        A[k,k] := sqrt(A[k,k]);
        for j := k+1 to n-1 do
            A[k,j] := A[k,j] / A[k,k];
        endfor
        for i := k+1 to n-1 do
            for j := i to n-1 do
                A[i,j] := A[i,j] - A[k,i] * A[k,j];
            endfor
        endfor
    endfor
end SERIAL_CHOLESKY
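Sections 2.2.1 and 3.1 combine into a short, self-contained C sketch (our own illustration, not the authors' implementation): it builds A = M^T M from a random upper triangular M and factors it in place with Algorithm 1. Since the Cholesky factor with a positive diagonal is unique, the computed U should reproduce M up to rounding.

/* Sketch only: generate a positive definite matrix (Section 2.2.1) and
 * factor it with the serial algorithm (Algorithm 1). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 4

int main(void) {
    double M[N][N] = {{0}}, A[N][N] = {{0}};

    /* Random nonsingular upper triangular M with a positive diagonal. */
    srand(1);
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            M[i][j] = (i == j) ? 1.0 + rand() / (double)RAND_MAX
                               : rand() / (double)RAND_MAX;

    /* A = M^T M is symmetric positive definite (cf. Theorem 1). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                A[i][j] += M[k][i] * M[k][j];

    /* Algorithm 1: overwrite the upper triangle of A with U, A = U^T U. */
    for (int k = 0; k < N; k++) {
        A[k][k] = sqrt(A[k][k]);
        for (int j = k + 1; j < N; j++)
            A[k][j] /= A[k][k];
        for (int i = k + 1; i < N; i++)
            for (int j = i; j < N; j++)
                A[i][j] -= A[k][i] * A[k][j];
    }

    /* U should match M entry by entry, up to rounding. */
    for (int i = 0; i < N; i++, puts(""))
        for (int j = i; j < N; j++)
            printf("U=% .4f (M=% .4f)  ", A[i][j], M[i][j]);
    return 0;
}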

4. Parallel Algorithm

4.1. Partition Layout

To decide which partitioning layout is adequate for this problem, we first need to analyze how Cholesky factorization works.

Figure 1. To get the value located at the star mark, (a) multiply the row (bold horizontal line) of the leftmost matrix by the column (bold vertical line above the star mark) of the second leftmost matrix. However, by using the symmetry property of Cholesky factorization, this multiplication equals (b) multiplying two columns (bold vertical lines) of the rightmost matrix.

As the figure above shows, to get the value at the star mark, two columns must have been previously calculated. This information, namely A[0:i, i] and A[0:i, j], must be broadcast to the processors that wish to compute A[i,j].

We then have three choices: (1) the column strip partition layout, (2) the column strip cyclic partition layout, and (3) the block cyclic partition layout. It is trivial to argue that the column strip cyclic partition performs better than the column strip partition, because cyclic placement equalizes the computational load and also creates a cyclic dependency of computation. The real issue is the column strip cyclic layout versus the block cyclic layout. We argue that the column strip cyclic partition layout performs better than the block cyclic partition layout. Before diving into this argument, we need to state some assumptions and definitions.

Assumption 1. The communication cost is defined by the alpha-beta model, namely

    cost = (number of communications) x (α + β x (length of data per communication)).

Assumption 2. The communication channel is full-duplex, so we do not need to worry about simultaneous transfers of data between two processors.
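Assumption 1, together with the tree broadcast used in the proofs below, can be written as a small helper; a recursive-doubling broadcast takes log2(p) steps, each costing α plus β per byte. A sketch (function names are ours):

/* Sketch of the alpha-beta model from Assumption 1.
 * alpha = per-message startup time, beta = per-byte transfer time;
 * Section 5.1.1 reports measured constants for FWGrid. */
#include <math.h>

double comm_cost(long n_messages, long bytes_per_message,
                 double alpha, double beta) {
    return n_messages * (alpha + beta * bytes_per_message);
}

/* Cost of a recursive-doubling broadcast among p processors:
 * log2(p) sequential steps, each an alpha-beta message. */
double bcast_cost(long bytes, int p, double alpha, double beta) {
    return log2((double)p) * (alpha + beta * bytes);
}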

The formal definitions of the two partitioning techniques follow.

Definition 1. Column Strip Cyclic Layout. The column strip cyclic layout for a two-dimensional problem is defined as follows: given p processors and an N by N matrix, the i-th column is allocated to processor i mod p. For example, in a 4 by 4 matrix with two processors, the columns belong, in order, to processors

    0 1 0 1

Figure 2. Column strip cyclic layout with two processors in a 4 by 4 matrix.

Definition 2. Block Cyclic Layout. The block cyclic layout for a two-dimensional problem is defined as follows: given an N by N matrix A and p processors, where p is a perfect square and q = sqrt(p), the element A[i,j] with i <= j is allocated to processor (i mod q)*q + (j mod q). For example, in a 4 by 4 matrix with four processors, the upper triangle is assigned as

    0 1 0 1
      3 2 3
        0 1
          3

Figure 3. Block cyclic layout with four processors in a 4 by 4 matrix.
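Both definitions translate directly into owner functions; a small sketch (the names are ours) reproduces Figures 2 and 3 for p = 2 and p = 4 respectively:

/* Owner functions for Definitions 1 and 2 (illustrative). */

/* Definition 1: column j lives on processor j mod p. */
int owner_column_strip_cyclic(int j, int p) {
    return j % p;
}

/* Definition 2: element A[i][j] (i <= j) lives on processor
 * (i mod q)*q + (j mod q), where q = sqrt(p). With q = 2 on a
 * 4x4 matrix this reproduces the assignment in Figure 3. */
int owner_block_cyclic(int i, int j, int q) {
    return (i % q) * q + (j % q);
}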

4.2. Cost Analysis

To decide which algorithm is better in terms of communication overhead, some analysis had to be done before implementation.

Theorem 2. The column strip cyclic partitioning performs better than the block cyclic partitioning in Cholesky factorization.

To prove this theorem, two lemmas are introduced.

Lemma 1. The communication overhead of the column strip cyclic partitioning is O(N^2), where N denotes the number of columns in the matrix.

Proof) For processor X to get the value of A[i,j], the processor needs two columns of information. However, because of the column strip cyclic partitioning, the only information that resides outside the processor is A[0:i, j]. Therefore, for each update, the communication cost of sending one column (A[0:i, j]) of doubles according to the alpha-beta model is

    α + 8βi.

If we assume that the topology of the network is fully connected and that a recursive-doubling algorithm is used for broadcasting, the upper bound for the broadcast cost of the i-th column is

    (α + 8βi) log p,

where p denotes the number of processors. This kind of broadcast takes place N times. Therefore, the total communication cost is

    Σ_{i=0}^{N-1} (α + 8βi) log p = αN log p + 4βN(N-1) log p = O(N^2).    (1)

Q.E.D.

Lemma 2. The communication overhead of the block cyclic partitioning is also O(N^2), where N denotes the number of columns in the matrix.

Proof) For processor X to get the value of A[i,j], the processor again needs two columns of information. Because the layout is block cyclic, √p processors share one column, where p denotes the number of processors available in the parallel computation. Therefore, √p processors must each broadcast their part of the column. For one processor, the amount of information to be sent is 8i/√p bytes, so the broadcast cost for each processor is

    (α + 8βi/√p) log √p.

However, for one update, 2√p processors must participate in communication (√p for each of the two columns). Therefore, the total cost for one update is

    2√p (α + 8βi/√p) log √p = (2√p α + 16βi) log √p.

Summing i from 0 to N-1 yields

    Σ_{i=0}^{N-1} (2√p α + 16βi) log √p = αN√p log p + 4βN(N-1) log p = O(N^2).    (2)

Q.E.D.
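To make the gap concrete, we can plug the in-rack α measured later in Section 5.1.1 into the startup terms of (1) and (2), for N = 2160 and p = 16 (the β terms of the two equations are identical, so only the startup terms differ):

    \alpha N \log_2 p = 38.1\,\mu\mathrm{s} \times 2160 \times 4 \approx 0.33\ \mathrm{s}
        \quad\text{(column strip cyclic)}
    \alpha N \sqrt{p}\,\log_2 p = \sqrt{16} \times 0.33\ \mathrm{s} \approx 1.32\ \mathrm{s}
        \quad\text{(block cyclic)}

The block cyclic layout pays exactly a factor of √p more in message startups.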

Proof of Theorem 2) The claim follows from Lemmas 1 and 2 by comparing the cost equations (1) and (2): the data terms are identical, but the startup term of the column strip cyclic layout is αN log p, whereas in the case of the block cyclic layout it is αN√p log p, larger by a factor of √p. Q.E.D.

4.3. Parallel Algorithm

Now that we have shown the column strip cyclic layout to be more efficient, the parallel algorithm can be implemented as follows.

Algorithm 2. PARALLEL_CHOLESKY

procedure PARALLEL_CHOLESKY(A)
    for k := 0 to N-1 do
        if (k % pes == rank)
            // do this calculation prior to any other processor
            for i := 0 to k-1 do
                A[k,k] := A[k,k] - A[i,k] * A[i,k];
            endfor
            A[k,k] := sqrt(A[k,k]);
        endif
        // the other processors need to wait until this loop ends
        Broadcast the result to the other processors;
        Update the matrix with the data received from the broadcast;
        for i := start to N-1 by pes do
            for j := 0 to k-1 do
                A[k,i] := A[k,i] - A[j,k] * A[j,i];
            endfor
            A[k,i] := A[k,i] / A[k,k];
        endfor
    endfor
end PARALLEL_CHOLESKY
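Algorithm 2 maps onto MPI almost directly. Below is a minimal, self-contained sketch (ours, not the authors' code): for brevity it computes the lower triangular factor L with A = L L^T, the column-wise mirror of the paper's row-oriented U formulation, and every rank redundantly stores the full matrix while owning only the columns assigned by Definition 1.

/* Sketch of the column strip cyclic parallel Cholesky (cf. Algorithm 2).
 * A is n-by-n, row-major; rank owns columns j with j % nprocs == rank. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

void parallel_cholesky(double *A, int n, int rank, int nprocs) {
    double *col = malloc(n * sizeof(double));   /* broadcast buffer */

    for (int k = 0; k < n; k++) {
        int owner = k % nprocs;                 /* Definition 1 */
        if (rank == owner) {
            /* Finish column k: sqrt the pivot, scale the rest. */
            A[k*n + k] = sqrt(A[k*n + k]);
            for (int i = k + 1; i < n; i++)
                A[i*n + k] /= A[k*n + k];
            for (int i = k; i < n; i++)
                col[i] = A[i*n + k];
        }
        /* All ranks make the same call; this is the synchronization
         * point noted in the pseudocode. */
        MPI_Bcast(col + k, n - k, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        /* Update only the columns this rank owns. */
        for (int j = k + 1; j < n; j++) {
            if (j % nprocs != rank) continue;
            for (int i = j; i < n; i++)
                A[i*n + j] -= col[i] * col[j];
        }
    }
    free(col);
}

MPI_Init/MPI_Finalize and matrix generation (Section 2.2.1) are omitted; the owner fills col before the broadcast, and every other rank receives it through the same MPI_Bcast call.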

4.4. Column Block Strip Layout

Rather than allocating one column to one processor, a group of columns can be allocated to one processor. For example, the 0th and 1st columns are allocated to the 0th processor, the 2nd and 3rd columns are allocated to the 1st processor, and so on. The optimal width of the block is to be found. One extreme is a width of one, which is the original algorithm. The other extreme is dividing the matrix evenly among the processors, which means N/p columns are allocated to each processor; this layout has disadvantages in load balancing and execution schedule.

The advantages of this approach are mainly two: (1) the cache effect and (2) communication overhead reduction. By calculating a block of data, we can exploit the cache mechanism. Because the memory layout is row-major, the original algorithm cannot exploit the cache effect; however, if a block of data can be read and calculated together, we can reduce the number of memory reads. The second advantage is that the total number of communications can be reduced: rather than broadcasting one element at a time to every processor, collecting the information in one processor and broadcasting it all together reduces the number of communications by a factor of the width of the block, as sketched after Algorithm 3 below.

4.4.1. Block Strip Parallel Algorithm

The basic structure of the block strip parallel algorithm is very similar to the column strip algorithm. The major difference is that the other processors must wait until a whole block of columns has been computed. For example, when the width of the strip is b, one processor must first compute (1/2) b (b+1) entries. After this initial computation, the processor designated to compute those entries broadcasts the values to the other processors. The other processors update their matrices and execute their own computation with the newly computed values. In this stage, each processor also computes a blocked strip of width b in one execution step. This algorithm terminates because each computation is executed in lock-step and the number of steps is bounded by the size of the matrix N.

Algorithm 3. BLOCK_PARALLEL_CHOLESKY

procedure BLOCK_PARALLEL_CHOLESKY(A)
    for k := 0 to N-1 by b do
        if ((k/b) % pes == rank)
            Compute the leftmost column block prior to broadcasting;
        endif
        // the other processors need to wait until this loop ends
        Broadcast the column block to the other processors;
        Update the matrix with the data received from the broadcast;
        while needMoreComputation do
            The other processors proceed with the updated information;
            // for each processor, (N/b/pes) column blocks are computed
        endwhile
    endfor
end BLOCK_PARALLEL_CHOLESKY
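The communication saving claimed above can be seen in miniature: the owner packs b finished columns into one contiguous buffer and pays the startup cost α once per panel instead of once per column. A sketch under the same assumptions as before (buffer layout and names are our own choices; it assumes k + b <= n):

/* Pack columns [k, k+b) of an n-by-n row-major matrix into a contiguous
 * panel, then broadcast the panel in a single message, so the alpha term
 * is paid N/b times overall rather than N times. */
#include <mpi.h>

void pack_and_bcast(const double *A, double *panel, int n, int k, int b,
                    int owner, int rank, MPI_Comm comm) {
    if (rank == owner)
        for (int j = 0; j < b; j++)
            for (int i = 0; i < n; i++)
                panel[j*n + i] = A[i*n + (k + j)];
    MPI_Bcast(panel, n * b, MPI_DOUBLE, owner, comm);
}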

4.4.2. Communication Cost

The communication cost of this algorithm follows directly from (1). Because we aggregate b updates into one message while the overall amount of data transferred remains the same, the communication cost is

    (αN log p) / b + 4βN(N-1) log p,

i.e., the startup term shrinks by a factor of b while the data term is unchanged.

5. Experiment

5.1. Scalability Study

5.1.1. Experiment Plan

We measured the execution time of the column strip cyclic algorithm as the number of processors increases, with a fixed workload (a fixed matrix size of 2160 by 2160), on the FWGrid [5] machine. Only 24 processors in one rack were used, because cross-rack communication overhead is far larger than in-rack communication overhead.

                    α (µs)    1/β (10^-9)
    Cross-rack      44.9      9.748
    In-rack         38.1      9.668
    Difference (%)  17.85     0.829

Table 1. Comparison between cross-rack and in-rack (alpha-beta model constants)

As we can see, the startup overhead of cross-rack communication is about 17% higher than in-rack. Moreover, when a program runs across racks, the performance gets much worse. An experiment was performed with 10 processors: one run used processors only in rack 5, and the other used processors spread across the FWGrid machine. The column strip cyclic algorithm was run on both configurations.

                        10 processors in rack    cross rack
    running time (s)    14.93                    143.46

Table 2. Comparison between cross-rack and in-rack (running time)

These figures clearly show why we used only 24 processors in one rack: the difference is a factor of 10.

5.1.2. Result

    # procs            1      2      3      5      9      12     16     18     20     24
    Running time (s)   83.62  59.65  40.23  26.42  16.67  13.12  10.27  9.90   9.30   8.36
    Speedup            1      1.40   2.07   3.16   5.01   6.37   8.14   8.43   8.98   10.0
    Efficiency         1      0.70   0.69   0.63   0.55   0.53   0.50   0.46   0.44   0.41

Table 3. Scalability of the column strip cyclic layout
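The speedup and efficiency rows of Table 3 follow from the standard definitions; for example, the 24-processor entries:

    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad
    S(24) = \frac{83.62}{8.36} \approx 10.0, \qquad
    E(24) = \frac{10.0}{24} \approx 0.41.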

Figure 4. Running time vs. number of processors.    Figure 5. Efficiency vs. number of processors.

5.1.3. Discussion

The running time shows a very smooth, monotonically decreasing curve. However, the efficiency is between 0.4 and 0.6, which is neither good nor bad. To find out what causes the low efficiency, another experiment must be performed. The easiest way is to suppress communication. Without communication, the efficiency improves markedly: most of the efficiency values are above 0.9. So we can conclude that communication overhead lowered the efficiency of the overall performance.

    # procs            1      2      3      5      9      12     16     18     20     24
    Efficiency         1      0.94   0.97   0.93   0.90   0.87   0.87   0.90   0.90   0.87
    Running time (s)   83.62  44.42  28.55  17.96  10.31  7.95   5.96   5.10   4.62   3.99

Table 4. Running time and efficiency when communication is suppressed

Figure 6. Running time without communication.    Figure 7. Efficiency without communication.

5.2. Optimal Block Size

5.2.1. Experiment Plan

For the column block strip cyclic algorithm, we need to find the optimal block width. The experiment was performed with gradual increments of the block size. First, the running time was measured with a 2160 by 2160 matrix and 18 processors. However, because of the unsatisfactory result described below, extra experiments were added with 12 processors, 9 processors, and 4 processors.

5.2.2. Result

Figure 8. Running time with 18 processors and a 2160 by 2160 matrix.

    blk size           1     2     3     4     5     6     8     10    12    15    20    24
    running time (s)   9.81  9.22  8.87  9.62  9.71  9.42  9.28  9.41  9.45  9.50  9.96  10.23
    stdev              0.10  0.03  0.06  0.24  0.23  0.20  0.17  0.14  0.17  0.19  0.21  0.28

Table 5. Running time with 18 processors and a 2160 by 2160 matrix

Figure 9. Running time with 12 processors and a 2160 by 2160 matrix.

    blk size           1     2     3     4     5     6     8     10    12    15    20    24
    running time (s)   12.57 12.12 11.91 11.96 11.73 11.35 12.10 12.10 11.97 12.48 13.29 14.16
    stdev              0.17  0.06  0.07  0.04  0.10  0.17  0.11  0.15  0.22  0.21  0.17  0.13

Table 6. Running time with 12 processors and a 2160 by 2160 matrix

Figure 10. Running time with 9 processors and a 2160 by 2160 matrix.

    blk size           1     2     3     4     5     6     8     10    12    15    20    24
    running time (s)   16.44 16.39 16.13 17.18 17.59 17.80 17.92 17.68 17.81 17.98 18.07 18.62
    stdev              0.05  0.18  0.19  0.29  0.27  0.20  0.14  0.21  0.21  0.15  0.19  0.24

Table 7. Running time with 9 processors and a 2160 by 2160 matrix

5.2.3. Discussion

Unfortunately, with this set of experiments we could not determine a single optimal value. First of all, the shape of the curve changes with the number of processors, and the fluctuation shows too much irregularity. The one common pattern is the increase in running time once the block size exceeds 15. Even though we cannot pick one optimal size, each experiment does have a minimum. With 18 processors, when the block size is 3, the running time is 8.87 s and the standard deviation is only 0.06. If we assume that the measurements follow a Gaussian distribution, then even compared with the closest running time (at block size 2) this configuration is statistically faster. The same argument applies to another case: block size 6 with 12 processors. However, block size 3 with 9 processors wins by too small a margin relative to the variance, so we cannot argue that 3 performs better there.

5.3. Bottleneck Analysis

5.3.1. Experiment Plan

Although communication overhead accounts for most of the performance degradation in Section 5.1, further investigation is needed to analyze the behavior of the program more precisely, especially for Section 5.2. The execution of the Cholesky factorization algorithms consists of three parts: (1) initial computation, which computes the entries to broadcast while the other processors wait, (2) communication, and (3) update and computation, where parallelism plays its role. Our method for finding the bottleneck is to suppress each part in turn and measure the execution time.
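The paper obtains the decomposition by suppressing parts of the computation; an equivalent way to measure it directly is to bracket each phase with MPI_Wtime and reduce the accumulated times across ranks. A sketch (names and structure are ours, not the authors'):

/* Per-phase timing for the decomposition of Section 5.3: accumulate the
 * time spent in each of the three parts separately, then take the
 * maximum across ranks. Illustrative only; the phase bodies are elided. */
#include <mpi.h>

static double t_init = 0.0, t_comm = 0.0, t_post = 0.0;

void timed_step(void) {
    double t;

    t = MPI_Wtime();
    /* (1) initial computation of the leading column block goes here */
    t_init += MPI_Wtime() - t;

    t = MPI_Wtime();
    /* (2) broadcast of the column block goes here */
    t_comm += MPI_Wtime() - t;

    t = MPI_Wtime();
    /* (3) update and per-processor computation goes here */
    t_post += MPI_Wtime() - t;
}

void report(void) {
    double local[3] = { t_init, t_comm, t_post }, global[3];
    MPI_Reduce(local, global, 3, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    /* rank 0 now holds the per-phase maxima in global[0..2] */
}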

5.3.2. Result

Figure 11. Decomposition of execution time (percentage spent in initial computation, communication, and post-communication, per block size).

    block size                1     2     3     4     5     6     8     10    12    15    20    24
    initial computation (s)   0.17  0.23  0.21  0.37  0.31  0.26  0.12  0.53  0.66  0.47  0.97  0.80
    communication (s)         4.51  4.21  3.73  3.08  3.16  3.82  4.06  4.09  3.77  3.23  2.55  2.31
    post computation (s)      5.13  4.78  4.93  6.17  6.24  5.34  5.10  4.79  5.02  5.80  6.45  7.12

Table 8. The execution time of each component

5.3.3. Analysis

The minimum total in this setting occurs at block size 3. However, we cannot find a pattern in the graph that singles out 3 as the minimum: its post-computation time is small but not the smallest (block size 2 is the smallest), and its communication overhead is also not the smallest (block size 4 is smaller). Block size 3 yields the smallest total because it combines a small communication overhead with a small post-computation cost, even though neither is individually minimal. The only thing we can infer from this graph is that the communication cost follows one distribution as the block size increases, the post-computation cost follows another distribution of its own, and the sum of the two distributions determines the overall performance. That is probably why the graphs in Section 5.2 show such irregularity.

6. Future Work

The bottleneck analysis is not complete at this time; however, the basic idea of decomposing execution time is illustrated in this paper. A further step would be per-process measurement: because there is some imbalance in execution time between processors, more information about the bottleneck can be found with per-processor measurements. We also did not take the cache effect into account. Caching can drastically change the overall performance of a program, but in this paper the caching problem has not been discussed much. Further investigation should examine how well the cache works in the parallel algorithms.

7. Conclusion

In this project, we learned how a parallel algorithm can be analyzed, designed, and optimized. Cost analysis and proofs, algorithm description, and the decomposition of a program into its components all helped in understanding how a parallel computation works. Also, through this project, we had a chance to use the FWGrid machine, which has many processors, up to almost 100, and using this machine we were able to check the scalability of a program.

8. Bibliography

[1] Advanced Engineering Mathematics, 8th Ed., by Erwin Kreyszig, John Wiley & Sons, Inc., 1999.
[2] Introduction to Parallel Computing, 2nd Ed., by Grama, Gupta, Karypis, and Kumar, Addison-Wesley, 2003.
[3] MPI (Message Passing Interface) Home Page: http://www.mcs.anl.gov/mpi
[4] All the ideas related to layout are taken from lecture notes by Prof. Scott Baden: http://www.cse.ucsd.edu/classes/fa05/cse260/lectures/lec14/lec14.pdf
[5] FWGrid Home Page: http://fwgrid.ucsd.edu/