COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION


SIAM J. SCI. COMPUT. Vol. 32, No. 6, pp. 3495-3523. (c) 2010 Society for Industrial and Applied Mathematics

COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION

GREY BALLARD, JAMES DEMMEL, OLGA HOLTZ, AND ODED SCHWARTZ

Abstract. Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report UCB/EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.

Key words. Cholesky decomposition, bandwidth, latency, communication avoiding, algorithm, lower bound

AMS subject classifications. 68Q25, 68W10, 68W15, 68W40, 65Y05, 65Y10, 65Y20, 65F30

1. Introduction. Let A be a real symmetric and positive definite matrix. Then there exists a real lower triangular matrix L so that A = L·L^T (L is unique if we restrict its diagonal elements to be positive). This is called the Cholesky decomposition. We are interested in finding efficient parallel and sequential algorithms for the Cholesky decomposition. Efficiency is measured both by the number of arithmetic operations and by the amount of communication, either between levels of a memory hierarchy on a sequential machine or between processors communicating over a network on a parallel machine. Since the time to move one word of data typically exceeds

Received by the editors June 2, 2009; accepted for publication (in revised form) September 21, 2010; published electronically December 2, 2010. A preliminary version of this paper was accepted to the 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'09), 2009.

Computer Science Department, University of California, Berkeley, CA (ballard@eecs.berkeley.edu). Research supported by Microsoft and Intel funding and by matching funding by U.C. Discovery.

Mathematics Department and CS Division, University of California, Berkeley, CA (demmel@cs.berkeley.edu). Research supported by Microsoft and Intel funding and by matching funding by U.C. Discovery.

Departments of Mathematics, University of California, Berkeley and Technische Universität Berlin (holtz@math.berkeley.edu). The third author acknowledges support of the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation.

The Weizmann Institute of Science, Rehovot 76100, Israel (oded.schwartz@weizmann.ac.il).
This work was done while at the Department of Mathematics, Technische Universität Berlin, and while visiting the University of California, Berkeley.

the time to perform one arithmetic operation by a large and growing factor [18], our goal will be to minimize communication.

1.1. Communication model. We model communication costs in more detail as follows. In the sequential case, with two levels of memory hierarchy (fast and slow), communication means reading data items (words) from slow memory to fast memory and writing data from fast memory to slow memory. Words that are contiguous can be read or written in a bundle, which we will call a message. We assume that a message of m words can be communicated between fast and slow memory in time α + βm, where α is the latency (seconds per message) and β is the inverse bandwidth (seconds per word). We define the bandwidth cost of an algorithm to be the total number of words communicated and the latency cost of an algorithm to be the total number of messages communicated. We assume that the matrix being factored initially resides in slow memory and is too large to fit in the smaller fast memory. Our goal then is to minimize the total number of words and the total number of messages communicated between fast and slow memory.¹

In the parallel case, we are interested in the communication among the processors. As in the sequential case, we assume that a message of m consecutively stored words can be communicated in time α + βm. This cost includes the time required to pack noncontiguous words into a single message if necessary. We assume that the matrix is initially evenly distributed among all P processors and that there is only enough memory to store about 1/P-th of a matrix per processor. As before, our goal is to minimize the number of words and messages communicated. In order to measure the communication cost of a parallel algorithm, we count the number of words and messages communicated along the critical path as defined in [30], as this metric is closely related to the total running time of the algorithm.

We assume that (1) computation and communication are not overlapped (all communication is "blocking"), (2) the cost per flop is the same on each processor, and the communication costs (α and β) are the same between each pair of processors, and (3) there is no communication resource contention among processors. For example, if processor 0 sends a message of size m to processor 1 at time 0 and processor 2 sends a message of size m to processor 3 also at time 0, the cost along the critical path is α + βm. However, if both processors 0 and 1 try to send a message to processor 2 at the same time, the cost along the critical path will be the sum of the costs of each message.

1.2. Communication lower bounds. We consider classical algorithms for Cholesky decomposition, i.e., those that perform the usual O(n^3) arithmetic operations, possibly reordered by associativity and commutativity of addition. That is, our results do not apply when using distributivity to reorganize the algorithm (such as Strassen-like algorithms); we also assume no pivoting is performed. We define "classical" more carefully later. We show that the communication complexity of any such Cholesky algorithm shares essentially the same lower bound as does classical matrix multiplication.

¹The sequential communication model used here is sometimes called the two-level input/output (I/O) model or disk access machine (DAM) model (see [2, 9, 12]). Our bandwidth cost model follows that of [23] and [25] in that it assumes the block-transfer size is one word of data (B = 1 in the common notation).
However, our model allows message sizes to vary from one word up to the maximum number of words that can fit in fast memory.
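The two-level α-β model is easy to experiment with numerically. The following Python sketch (ours, not the paper's; the function and the sample constants are purely illustrative) charges α per message and β per word, and shows why bundling the same n^2 words into larger messages lowers the modeled transfer time:

    import math

    def modeled_time(words, message_size, alpha, beta):
        """Time to move `words` words in messages of `message_size` words each,
        charging latency alpha per message and inverse bandwidth beta per word."""
        messages = math.ceil(words / message_size)
        return alpha * messages + beta * words

    # Moving an n-by-n matrix word by word vs. in column-sized bundles:
    n, alpha, beta = 1000, 1e-6, 1e-9
    print(modeled_time(n * n, 1, alpha, beta))  # n^2 messages: about 1.0 seconds
    print(modeled_time(n * n, n, alpha, beta))  # only n messages: about 0.002 seconds

The bandwidth term is identical in both cases; only the number of messages, and hence the latency term, changes.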

Theorem 1.1 (main theorem). Any sequential or parallel classical algorithm for the Cholesky decomposition of n-by-n matrices can be transformed into a classical algorithm for n-by-n matrix multiplication in such a way that the bandwidth cost of the matrix multiplication algorithm is at most a constant times the bandwidth cost of the Cholesky algorithm.

Therefore, any bandwidth cost lower bound for classical matrix multiplication applies to classical Cholesky decomposition, asymptotically. In particular, since a sequential classical n-by-n matrix multiplication algorithm has a bandwidth cost lower bound of Ω(n^3/M^{1/2}), where M is the fast memory size [23, 25], classical Cholesky decomposition has the same lower bound (we discuss the parallel case later). To get a latency cost lower bound, we use the simple observation [1] that the number of messages is at least the bandwidth cost lower bound divided by the maximum message size and that the maximum message size is at most the fast memory size in the sequential case (or the local memory size in the parallel case). So for sequential matrix multiplication this means the latency cost lower bound is Ω(n^3/M^{3/2}).

1.3. Communication upper bounds.

1.3.1. Two-level memory model. In the case of matrix multiplication on a sequential machine, a well-known algorithm attains the bandwidth cost lower bound [23]. It computes C = A·B by performing blocked matrix multiplication that multiplies and accumulates Θ(M^{1/2})-by-Θ(M^{1/2}) submatrices of A, B, and C. If each matrix is stored so that the entries of each of its submatrices are contiguous (not the case with columnwise or rowwise storage), the latency cost lower bound is also attained; we call such a data structure block-contiguous storage and describe it in more detail below. Alternatively, one could try to copy A and B from their input format (say columnwise) to contiguous block storage doing (asymptotically) no more communication than the subsequent matrix multiplication; we will see this is possible, provided n = Ω(M). There will be analogous requirements for Cholesky to attain its latency cost lower bound.

In particular, we draw the following conclusions about the communication costs of sequential classical Cholesky decomposition in the two-level memory model, as summarized in Table 1.1 and what follows:

1. Naïve sequential variants of Cholesky that operate on single rows and columns (be they left-looking, right-looking, etc.) attain neither the bandwidth nor the latency cost lower bound.

2. A sequential blocked algorithm used in LAPACK (with the correct block size) attains the bandwidth cost lower bound, as do the recursive algorithms in [3, 4, 20, 21, 27]. A recursive algorithm analogous to Toledo's LU algorithm [28] attains the bandwidth cost lower bound in nearly all cases, except possibly for an O(log n) factor in the narrow range n^2/log^2(n) < M < n^2.

3. Whether the LAPACK algorithm also attains the latency cost lower bound depends on the matrix layout: If the input matrix is given in rowwise or columnwise format and this is not changed by the algorithm, then the latency cost lower bound cannot be attained. But if the input matrix is given in contiguous block storage, or n = Ω(M) so that it can be copied quickly to contiguous block format, then the latency cost lower bound can be attained by the LAPACK algorithm. Toledo's algorithm cannot minimize latency (at least when M > n^{2/3}); see also conclusion 5 below.

1.3.2. Hierarchical memory model. Since most computers have multiple levels of memory (e.g., L1, L2, L3 caches, main memory, and disk), we also consider the communication costs in the hierarchical memory model.

Table 1.1
Sequential bandwidth and latency costs: lower bound vs. algorithms. M denotes the size of the fast memory. We assume n^2 > M in this table. The FLOP count of all of the algorithms is O(n^3). Refer to section 3 for definitions of terms and details.

                                          Bandwidth                    Latency                   Cache-oblivious
  Lower bound                             Ω(n^3/M^{1/2})               Ω(n^3/M^{3/2})            --
  Naïve: left/right looking
    Column-major                          Θ(n^3)                       Θ(n^3/M + n^2)            yes
  LAPACK [5]
    Column-major                          O(n^3/M^{1/2})               O(n^3/M)                  no
    Block-contiguous                      O(n^3/M^{1/2})               O(n^3/M^{3/2})            no
  Rectangular recursive [28]
    Column-major                          Θ(n^3/M^{1/2} + n^2 log n)   Θ(n^3/M + n^2 log n)      yes
    Block-contiguous                      Θ(n^3/M^{1/2} + n^2 log n)   Ω(n^3/M)                  yes
  Square recursive [3, 4, 20, 21, 27]
    Recursive packed format [4]           O(n^3/M^{1/2})               Ω(n^2)                    yes
    Column-major [20]                     O(n^3/M^{1/2})               Ω(n^3/M)                  yes
    Block-contiguous [3]                  O(n^3/M^{1/2})               O(n^3/M^{3/2})            yes

In this case, an optimal algorithm should simultaneously minimize communication between all pairs of adjacent levels of memory hierarchy (e.g., minimize bandwidth and latency costs between L1 and L2, between L2 and L3, etc.). In the case of sequential matrix multiplication, bandwidth cost is minimized in this sense by simply applying the usual blocked algorithm recursively, where each level of recursion multiplies matrices that fit at a particular level of the memory hierarchy, by using the blocked algorithm to multiply submatrices that fit in the next smaller level. This is easy since matrix multiplication naturally breaks into smaller matrix multiplications.

For matrix multiplication to minimize latency across all memory hierarchy levels, it is necessary for all submatrices of all sizes to be stored contiguously. This leads to a data structure variously referred to as recursive block storage, Morton ordering, or storage using space-filling curves, and is described in [3, 6, 16, 17, 29].

Finally, sequential matrix multiplication can achieve communication optimality as just described in one of two ways: (1) We can choose the number of recursion levels and sizes of the subblocks with prior knowledge of the number of levels and sizes of the levels of memory hierarchy, a cache-aware process called tuning. (2) We carry out the recursion down to 1-by-1 blocks (or some other small constant size), repeatedly dividing block sizes by 2 (perhaps padding submatrices to have even dimensions as needed). Such an algorithm is called cache-oblivious [17] and has the advantage of simplicity and portability compared to a cache-aware algorithm, though it might also have more overhead in practice.

It is indeed possible for sequential Cholesky to be organized to be optimal across multiple memory hierarchy levels in all the senses just described, assuming we use recursive block storage:

Table 1.2
Parallel bandwidth and latency costs: lower bound vs. algorithms. M denotes the size of the memory of each processor. P is the number of processors. b is the block size. Refer to section 3 for definitions of terms and details.

                                     Bandwidth                        Latency               FLOPs
  Lower bound: General               Ω(n^3/(P·M^{1/2}))               Ω(n^3/(P·M^{3/2}))    Ω(n^3/P)
  Lower bound: 2D layout,            Ω(n^2/P^{1/2})                   Ω(P^{1/2})            Ω(n^3/P)
    M = O(n^2/P)
  ScaLAPACK [11]: General            O((n^2/P^{1/2} + n·b) log P)     O((n/b) log P)        O(n^3/P)
    Choosing b = Θ(n/P^{1/2})        O((n^2/P^{1/2}) log P)           O(P^{1/2} log P)      O(n^3/P)

4. The recursive algorithm modeled on Toledo's LU can be implemented in a cache-oblivious way so as to minimize the bandwidth cost, but not the latency cost.²

5. The cache-oblivious recursive Cholesky algorithm, which first appeared in [20] (and can also be found in [3, 4, 21, 27]), minimizes both bandwidth and latency costs (assuming recursive block storage) for all matrices across all memory hierarchy levels. None of the other algorithms have this property.

1.3.3. Parallel model. Finally, we address the case of parallel Cholesky, where there are P processors connected by a network with latency α and reciprocal bandwidth β. We consider here only the memory-scalable case, where each processor's local memory is of size M = O(n^2/P), so that only O(1) copies of the matrix are stored overall (the so-called two-dimensional (2D) case). See [25] for the general lower bound for matrix multiplication, which also holds for three-dimensional (3D) algorithms. A 3D algorithm for LU decomposition without pivoting (along with an algorithm for triangular system solution) is presented with communication analysis in [24]. While those algorithms minimize the total volume of communication, they do not minimize the communication costs along the critical path as measured in this paper. For the 2D case, the consequence of our main theorem is again a bandwidth cost lower bound of the form Ω(n^3/(P·M^{1/2})) = Ω(n^2/P^{1/2}), and a latency cost lower bound of the form Ω(n^3/(P·M^{3/2})) = Ω(P^{1/2}).

6. The Cholesky algorithm in ScaLAPACK [11] attains a matching upper bound. It does so by partitioning the matrix into submatrices and distributing them to the processors in a block cyclic manner. With the right choice of block size b, namely, b = Θ(n/P^{1/2}), it attains the above bandwidth and latency cost lower bounds to within a factor of log P. This is summarized in Table 1.2. See section 3.3.1 for details.

The rest of this paper is organized as follows. In section 2 we show the reduction from matrix multiplication to Cholesky decomposition, thus extending the bandwidth cost lower bounds of [23] and [25] to a bandwidth cost lower bound for the sequential and parallel implementations of Cholesky decomposition. We also discuss latency cost lower bounds. In section 3 we present known Cholesky decomposition algorithms and compare their bandwidth and latency costs with the lower bounds.

²Toledo's algorithm is designed to retain numerical stability for the LU case. The algorithm of [20] deals with the Cholesky case only and therefore requires no pivoting for numerical stability. Thus a simpler recursion suffices, and the latency cost diminishes.
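The 2D block cyclic layout underlying conclusion 6 is simple to state in code. The following Python sketch (ours; illustrative only, and not ScaLAPACK's actual interface) maps a global matrix entry (i, j) to the processor that owns it on a Pr-by-Pc process grid with b-by-b blocks:

    def block_cyclic_owner(i, j, b, Pr, Pc):
        """Process grid coordinates owning global entry (i, j) under a 2D block
        cyclic layout with b-by-b blocks on a Pr-by-Pc process grid."""
        return ((i // b) % Pr, (j // b) % Pc)

    # With P = Pr*Pc processors and b = Theta(n/P^(1/2)), each processor owns
    # O(n^2/P) words in total, matching the 2D memory assumption M = O(n^2/P).
    print(block_cyclic_owner(5, 12, b=4, Pr=2, Pc=2))  # -> (1, 1)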

A shorter version of this paper (an extended abstract) appeared in the proceedings of the SPAA 2009 conference; the shorter version includes the reduction proof of the lower bounds and Tables 1.1 and 1.2. This paper proves the claims made in the tables by presenting the algorithms and their analyses in section 3.

2. Communication lower bounds. Consider an algorithm for a parallel computer with P processors that multiplies matrices in the "classical" way (the usual O(n^3) arithmetic operations possibly reordered using associativity and commutativity of addition), where each of the processors has memory of size M. Irony, Toledo, and Tiskin [25] showed that at least one of the processors has to send or receive the following minimal number of words.

Theorem 2.1 (see [25]). Any classical implementation of matrix multiplication of n-by-n matrices A and B on a P-processor machine, each equipped with memory of size M, requires that one of the processors sends or receives at least n^3/(2·√2·P·M^{1/2}) - M words. If A and B are of size n-by-m and m-by-r, respectively, then the corresponding bound is n·m·r/(2·√2·P·M^{1/2}) - M.

As any processor has memory of size M, any message it sends or receives may deliver at most M words. Therefore we deduce the following.

Corollary 2.2. Any classical implementation of matrix multiplication of n-by-n matrices A and B on a P-processor machine, each equipped with memory of size M, requires that one of the processors sends or receives at least n^3/(2·√2·P·M^{3/2}) - 1 messages. If A and B are of size n-by-m and m-by-r, respectively, then the corresponding bound is n·m·r/(2·√2·P·M^{3/2}) - 1.

For the case of P = 1 these coincide with the lower bounds for the bandwidth and the latency costs of the sequential case. The lower bound on bandwidth cost of sequential matrix multiplication was previously shown (up to some multiplicative constant factor) by Hong and Kung [23].

One means of extending these lower bounds to more algorithms is by reduction. It is easy to reduce matrix multiplication to LU decomposition of a slightly larger order, as the following identity shows:

    [ I  0  -B ]   [ I  0  0 ]   [ I  0   -B  ]
    [ A  I   0 ] = [ A  I  0 ] · [ 0  I  A·B  ].
    [ 0  0   I ]   [ 0  0  I ]   [ 0  0    I  ]

This identity means that LU factorization without pivoting can be used to perform matrix multiplication; to accommodate pivoting, A and/or B can be scaled down to be too small to be chosen as pivots, and A·B can be scaled up accordingly. Thus an O(n^3) implementation of LU decomposition that uses only associativity and commutativity of addition to reorganize its operations (thus eliminating Strassen-like algorithms) must perform at least as much communication as a correspondingly reorganized implementation of O(n^3) matrix multiplication.

We wish to mimic this lower bound construction for Cholesky. Consider the following reduction from matrix multiplication to Cholesky decomposition. Let T be the matrix defined below, composed of nine square blocks each of dimension n; then the Cholesky decomposition of T is

        [ I    A^T        B ]   [ I     0         0 ]   [ I  A^T     B   ]
    T = [ A    I + A·A^T  0 ] = [ A     I         0 ] · [ 0   I    -A·B  ] =: L·L^T,
        [ B^T  0          D ]   [ B^T  -(A·B)^T   X ]   [ 0   0     X^T  ]

where X is the Cholesky factor of D' = D - B^T·B - B^T·A^T·A·B, and D can be any symmetric matrix such that D' is positive definite.
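This identity is easy to check numerically. The following NumPy sketch (ours, not the paper's) builds T for random A and B, takes its Cholesky factor, and reads A·B off the (3,2) block up to sign; the particular D used here is one convenient way of making D' positive definite:

    import numpy as np

    n = 4
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

    # D must be symmetric with D' = D - B^T B - B^T A^T A B positive definite;
    # choosing D = B^T B + (A B)^T (A B) + I gives D' = I.
    AB = A @ B
    D = B.T @ B + AB.T @ AB + np.eye(n)

    I, Z = np.eye(n), np.zeros((n, n))
    T = np.block([[I,   A.T,          B],
                  [A,   I + A @ A.T,  Z],
                  [B.T, Z,            D]])

    L = np.linalg.cholesky(T)
    L32 = L[2*n:3*n, n:2*n]          # this block equals -(A B)^T
    assert np.allclose(-L32.T, AB)   # the factorization recovers the product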

Thus A·B is computed via this Cholesky decomposition. Intuitively this seems to show that the communication complexity needed for computing matrix multiplication is a lower bound to that of computing the Cholesky decomposition (of matrices three times larger), as A·B appears in L^T, the decomposition of T. Note, however, that A·A^T also appears in T.³ Conceivably, computing A·B from A and B incurs less communication cost if we are also given A·A^T, so the above is not a sufficient reduction from matrix multiplication to Cholesky decomposition.

Let us instead consider the following approach to prove the lower bound. In addition to the real numbers R, consider new starred numerical quantities, called 1* and 0*, with arithmetic properties detailed in Table 2.1: 1* and 0* mask any real value in addition/subtraction operations but behave similarly to 1 ∈ R and 0 ∈ R in multiplication and division operations.

Consider this set of values and arithmetic operations. The set is commutative with respect to addition and to multiplication (by the symmetries of the corresponding tables). The set is associative with respect to addition: regardless of the ordering of summation, the sum is 1* if one of the summands is 1*; otherwise it is 0* if one of the summands is 0*. The set is also associative with respect to multiplication: (a·b)·c = a·(b·c). This is trivial if all factors are in R. As 1* is a multiplicative identity, it is also immediate if some of the factors equal 1*. Otherwise, at least one of the factors is 0*, and the product is 0* (or the real value 0). Distributivity, however, does not hold: 1·(1* + 1*) = 1·1* = 1 ≠ 2 = (1·1*) + (1·1*).

Let us return to the construction. We set T' to be

    T' = [ I    A^T  B ]
         [ A    C    0 ]
         [ B^T  0    C ],

where C is the n-by-n matrix that has 1* on the main diagonal and 0* everywhere else:

    C = [ 1*  0*  ...  0* ]
        [ 0*  1*  ...  0* ]
        [ ...             ]
        [ 0*  0*  ...  1* ].

One can verify that the (unique) Cholesky decomposition of C is

(2.1)    C = Ĉ·Ĉ^T,

where Ĉ denotes the lower triangular matrix with 1* on the main diagonal and 0* everywhere below it.

³Note that computing A·A^T is asymptotically as hard as matrix multiplication: take A = [X, 0; Y^T, 0]. Then A·A^T = [∗, X·Y; ∗, ∗].⁴

⁴By writing X·Y we mean the resulting matrix, assuming the straightforward matrix multiplication algorithm. This has to be stated clearly, as distributivity does not hold for the starred values.
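The following Python sketch (ours) encodes one consistent reading of this starred arithmetic. The conventions follow Table 2.1 below as we reconstruct it, so the handling of 0* times a real value and of square roots should be taken as our assumption rather than as definitive:

    import math

    STAR0, STAR1 = "0*", "1*"   # starred quantities; plain floats are real values

    def add(a, b):
        # 1* and 0* mask any real value in addition/subtraction; 1* wins over 0*.
        if STAR1 in (a, b):
            return STAR1
        if STAR0 in (a, b):
            return STAR0
        return a + b

    def mul(a, b):
        # Starred values act like 1 and 0 on real operands; for consistency,
        # 1*·1* = 1*, 0*·0* = 0*, and 1*·0* = 0*.
        if a == STAR1 and b == STAR1:
            return STAR1
        if STAR0 in (a, b):
            both_starred = a in (STAR0, STAR1) and b in (STAR0, STAR1)
            return STAR0 if both_starred else 0.0
        if a == STAR1:
            return b
        if b == STAR1:
            return a
        return a * b

    def sqrt(a):
        # Taking the square root of a starred value leaves it unchanged.
        return a if a in (STAR0, STAR1) else math.sqrt(a)

    print(add(3.5, STAR1))   # -> '1*'  (masking: the real summand disappears)
    print(mul(STAR1, 3.5))   # -> 3.5   (1* is a multiplicative identity)
    print(mul(STAR0, 3.5))   # -> 0.0   (a real result: no masking in products)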

Note that if a matrix X does not contain any starred values 0* and 1*, then X = Ĉ·X = X·Ĉ = Ĉ^T·X = X·Ĉ^T and C + X = C. Therefore, one can confirm that the Cholesky decomposition of T' is

(2.2)    T' = [ I    A^T  B ]   [ I     0         0 ]   [ I  A^T      B   ]
              [ A    C    0 ] = [ A     Ĉ         0 ] · [ 0  Ĉ^T    -A·B  ] =: L·L^T.
              [ B^T  0    C ]   [ B^T  -(A·B)^T   Ĉ ]   [ 0   0      Ĉ^T  ]

One can think of C as masking the A·A^T previously appearing in the central block of T, therefore allowing the lower bound of computing A·B to be accounted for by the Cholesky decomposition and not by the computation of A·A^T.

More formally, let Alg be any classical algorithm for Cholesky factorization. We convert it to a matrix multiplication algorithm as follows in Algorithm 1.

Algorithm 1 Matrix Multiplication by Cholesky Decomposition.
Input: Two n-by-n matrices, A and B.
1: Let Alg' be Alg updated to correctly handle the new 0*, 1* values. {Note that Alg' can be constructed off-line.}
2: Construct T' as in (2.2).
3: L = Alg'(T').
4: return -(L32)^T.

The simplest conceptual way to do step 1 is to attach an extra bit to every numerical value, indicating whether it is starred or not, and modify every arithmetic operation to check this bit before performing an operation. This increases the bandwidth cost by at most a constant factor. Alternatively, we can use signaling NaNs as defined in the IEEE floating point standard [1] to encode 1* and 0* with no extra bits.

If the instructions implementing Cholesky are scheduled deterministically, there is another alternative. One can run the algorithm symbolically, propagating 0* and 1* arguments from the inputs forward, simplifying or eliminating arithmetic operations whose inputs contain 0* or 1*. One can also eliminate operations for which there is no path in the directed acyclic graph (describing how outputs of each operation propagate to inputs of other operations) to the desired output A·B. The resulting Alg' performs a strict subset of the arithmetic and memory operations of the original Cholesky algorithm. We note that updating Alg to form Alg' is done off-line, so that step 1 does not actually take any time to perform when Algorithm 1 is called.

Classical Cholesky decomposition algorithms. We next verify the correctness of this reduction: that the output of this procedure on input A, B is indeed the product A·B, as long as Alg is a classical algorithm, in a sense we now define carefully. Let T = L·L^T be the Cholesky decomposition of T. Then we have the following formulas:

(2.3)    L(i,i) = ( T(i,i) - Σ_{k ∈ [i-1]} (L(i,k))^2 )^{1/2},

(2.4)    L(i,j) = ( T(i,j) - Σ_{k ∈ [j-1]} L(i,k)·L(j,k) ) / L(j,j),    i > j,

where [t] = {1, ..., t}.
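Formulas (2.3) and (2.4) translate directly into code. The following Python sketch (ours) computes L one entry at a time in an order respecting the dependencies described next; any classical algorithm reorders exactly these operations:

    import math

    def cholesky(T):
        """Dense Cholesky via (2.3)/(2.4): returns lower triangular L, T = L L^T."""
        n = len(T)
        L = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1):
                s = T[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
        return L

    print(cholesky([[4.0, 2.0], [2.0, 5.0]]))  # -> [[2.0, 0.0], [1.0, 2.0]]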

Fig. 2.1. Dependencies of L(i,j) for diagonal entries (left) and other entries (right). Dark grey represents the sets S_{i,i} (left) and S_{i,j} (right). Light grey represents indirect dependencies.

A classical Cholesky decomposition algorithm computes each of these O(n^3) FLOPs, which may be reordered using only commutativity and associativity of addition. By the no-pivoting and no-distributivity restrictions on Alg, when an entry of L is computed, all the entries on which it depends have already been computed and combined by the above formulas, with the sums occurring in any order. These dependencies form a dependency graph on the entries of L and impose a partial ordering on the computation of the entries of L (see Figure 2.1). That is, when an entry L(i,i) is computed, by (2.3), all the entries {L(i,k)}_{k ∈ [i-1]} have already been computed.⁵ Denote this set of entries by S_{i,i}, namely,

(2.5)    S_{i,i} = {L(i,k)}_{k ∈ [i-1]}.

Similarly, when an entry L(i,j) (for i > j) is computed, by (2.4), all the entries {L(i,k)}_{k ∈ [j-1]} and all the entries {L(j,k)}_{k ∈ [j]} have already been computed. Denote this set by S_{i,j}, namely,

(2.6)    S_{i,j} = {L(i,k)}_{k ∈ [j-1]} ∪ {L(j,k)}_{k ∈ [j]}.

Lemma 2.3. Any ordering of the computation of the elements of L that respects the partial ordering induced by the above mentioned directed acyclic graph results in a correct computation of A·B.

Proof. We need to confirm that the starred entries 1* and 0* of T' do not somehow contaminate the desired entries of -(L32)^T. The proof is by induction on the partial order on pairs (i,j) implied by (2.5) and (2.6). The base case, the correctness of computing L(1,1), is immediate. Assume by induction that all elements of S_{i,j} are correctly computed and consider the computation of L(i,j) according to the block in which it resides:

If L(i,j) resides in block L11, L21, or L31, then S_{i,j} contains only real values, and no arithmetic operations with 0* or 1* occur (recall Figure 2.1 or (2.2), (2.5), and (2.6)). Therefore, the correctness follows from the correctness of the original Cholesky decomposition algorithm.

If L(i,j) resides in L22 or L33, then S_{i,j} may contain starred values (elements of Ĉ). We treat separately the case where L(i,j) is on the main diagonal and the case where it is not. If i = j, then by (2.3) L(i,i) is determined to be 1* since T(i,i) = 1* and since adding to, subtracting from, and taking the square root of 1* all result in 1* (recall Table 2.1 and (2.3)).

⁵While this partial ordering constrains the scheduling of FLOPs, it does not uniquely identify a computation directed acyclic graph (DAG), for the additions within one summation can be in arbitrary order (forming arbitrary subtrees in the computation DAG).

Table 2.1
Arithmetic operations: x, y stand for any real values. For consistency, we set sqrt(0*) = 0* and sqrt(1*) = 1*. (Cells marked "--" never arise in the algorithm.)

     ±  | 1*   0*   y            ·  | 1*   0*   y            /  | 1*   y
    ----+----------------       ----+----------------       ----+-----------
     1* | 1*   1*   1*           1* | 1*   0*   y            1* | 1*   --
     0* | 1*   0*   0*           0* | 0*   0*   0            0* | 0*   0*
     x  | 1*   0*   x±y          x  | x    0    x·y          x  | x    x/y

If i > j, then by the inductive assumption the divisor L(j,j) of (2.4) is correctly computed to be 1* (recall Figure 2.1 and the definition of Ĉ in (2.1)). Therefore, no division by 0* is performed. Moreover, T(i,j) is 0*. Then L(i,j) is determined to be the correct value 0*, unless 1* is subtracted (recall (2.4)). However, every subtracted product (recall (2.4)) is composed of two factors of the same column but of different rows. Therefore, by the structure of Ĉ, none of them is 1*, so their product is not 1*, and the value is computed correctly.

If L(i,j) resides in L32, then S_{i,j} may contain starred values (see Figure 2.1, right-hand side, row j). However, every subtraction performed (recall (2.4)) is composed of a product of two factors, of which one is on the ith row (and on a column k < j). Hence, by induction (on i, j), the (i,k) element has been computed correctly to be a real value, and by the multiplication properties so is the product. Therefore no masking occurs. This completes the proof of Lemma 2.3.

Communication analysis. We now know that Algorithm 1 correctly multiplies matrices classically and so has known communication lower bounds given by Theorem 2.1 and Corollary 2.2. It remains to confirm that step 2 (setting up T') and step 4 (returning -(L32)^T) do not require much communication, so that these lower bounds apply to step 3, running Alg' (recall that step 1 may be performed off-line and so doesn't count). Since Alg' is either a small modification of Cholesky to add star labels to all data items (at most doubling the bandwidth cost) or a subset of Cholesky with some operations omitted (those with starred arguments or not leading to the desired output L32), a lower bound on communication for Alg' is also a lower bound for Cholesky.

Theorem 2.4 (main theorem). Any sequential or parallel classical algorithm for the Cholesky decomposition of n-by-n matrices can be transformed into a classical algorithm for n-by-n matrix multiplication, in such a way that the bandwidth cost of the matrix multiplication algorithm is at most a constant times the bandwidth cost of the Cholesky algorithm. Therefore any bandwidth or latency cost lower bound for classical matrix multiplication applies to classical Cholesky, asymptotically speaking.

Corollary 2.5. In the sequential case, with a fast memory of size M, the bandwidth cost lower bound for Cholesky decomposition is Ω(n^3/M^{1/2}), and the latency cost lower bound is Ω(n^3/M^{3/2}).

Proof. Constructing T' (in any data format) requires bandwidth of at most 18n^2 (copying a 3n-by-3n matrix, with another factor of 2 if each entry has a flag indicating whether it is starred or not), and extracting -(L32)^T requires another n^2 of bandwidth. Furthermore, we can assume 19n^2 < n^3/M^{1/2}, i.e., that M < (n/19)^2, i.e., that the matrix is too large to fit entirely in fast memory (the only case of interest).

Thus the bandwidth cost lower bound Ω(n^3/M^{1/2}) of Algorithm 1 dominates the bandwidth costs of steps 2 and 4 and so must apply to step 3 (Cholesky). Finally, as each message delivers at most M words, the latency cost lower bound for step 3 is by a factor of M smaller than its bandwidth cost lower bound, as desired.

Corollary 2.6. In the parallel case, with a 2D layout on P processors, the bandwidth cost lower bound for Cholesky decomposition is Ω(n^2/P^{1/2}), and the latency cost lower bound is Ω(P^{1/2}).

Proof. The argument in the parallel case is analogous to that of Corollary 2.5. The construction of the input and retrieval of the output at steps 2 and 4 of Algorithm 1 contribute bandwidth of O(n^2). Therefore the lower bound on the bandwidth, Ω(n^3/(P·M^{1/2})), is determined by step 3, the Cholesky decomposition. The lower bound on the latency of step 3 is therefore Ω(n^3/(P·M^{3/2})), as each message delivers at most M words. Plugging in M = Θ(n^2/P) yields a bandwidth lower bound of Ω(n^2/P^{1/2}) and a latency lower bound of Ω(P^{1/2}).

3. Upper bounds. In this section we discuss known algorithms for Cholesky decomposition and their bandwidth and latency cost analyses in the sequential two-level memory model, in the sequential hierarchical memory model, and in the parallel computing model. Throughout this section, we will use the notation B(n) and L(n) for bandwidth and latency costs, respectively, where n is the dimension of the square matrix. These functions may also depend on the parameters M, the size of the fast memory, and P, the number of processors, but we omit this reference when it is clear from context. See also [10] for latency analysis of two of the Cholesky decomposition algorithms in the two-level (out-of-core) memory model.

Note that all the Cholesky decomposition algorithms in this section perform almost exactly the same computation: they all perform the computations defined in (2.3) and (2.4) and differ only in the order of summation and the timing of computation. Therefore, none of them use distributivity. Without this latter fact, it would not make sense to compare their bandwidth and latency costs to the lower bound. In section 3 of [22], standard error analyses of the Cholesky decomposition algorithm are given. These hold for any ordering of the summation of (2.3) and (2.4) and therefore apply to all Cholesky decomposition algorithms below.

3.1. Sequential algorithms.

3.1.1. Data storage. Before reviewing the various sequential algorithms let us first consider the underlying data structure which is used for storing the matrix. Although it does not affect the bandwidth cost, it may have a significant influence on the latency cost of the algorithm: it may or may not allow retrieving many relevant words using only a single message. As a simple example, consider computing the sum of M numbers in slow memory, which obviously requires reading these M words. If they are in consecutive memory locations, this can be done in one read operation, the minimum possible latency cost. But if they are not consecutive, say, they are separated by at least M-1 words, this may require M read operations, the maximum possible latency cost.

The various methods for data storage (only some of which are implemented in LAPACK) appear in Figure 3.1. We can partition these data structures into two classes: column-major and block-contiguous.

Column-major storage. The column-major data structures store the matrix entries in columnwise order. This means that they are most fit for algorithms that access a column at a time (e.g., the left- and right-looking naïve algorithms).
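The sum-of-M-numbers example above is a useful mental model. A minimal Python sketch (ours; the address encoding is illustrative) that counts a message for each maximal run of consecutive addresses makes the contiguous/strided contrast concrete:

    def count_messages(addresses):
        """Messages needed to read the given sorted addresses, where one message
        may deliver one maximal run of consecutive memory locations."""
        messages, prev = 0, None
        for a in addresses:
            if prev is None or a != prev + 1:
                messages += 1      # a gap forces a new message
            prev = a
        return messages

    M = 100
    print(count_messages(range(M)))           # contiguous: 1 message
    print(count_messages(range(0, M * M, M))) # stride M: M messages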

Fig. 3.1. Underlying data structures. Top (column-major): full, old packed, rectangular full packed. Bottom (block-contiguous): blocked, block-recursive format, recursive full packed.

However, the latency cost of algorithms that access a block at a time (e.g., LAPACK's implementation) will fail to achieve optimality, as retrieving a block of size b-by-b requires at least b messages, even if a single message can deliver b^2 words. This means a possible increase in the latency cost by a factor of b (where b is typically of the order of M^{1/2}). As the matrices of interest for Cholesky decomposition are symmetric, storing the entire matrix (known as full storage) wastes space. About half of the space can be saved by using the old packed or the rectangular full packed storage. The latter has the advantage of a more uniform indexing, which allows faster addressing.

Block-contiguous storage. The block-contiguous data structures store the matrix entries in a way that allows a read or write of a block using a single message. This may reduce the latency cost of a Cholesky decomposition algorithm that accesses a b-by-b block at a time by a factor of b, compared with using column-major storage. We distinguish between two types of block-contiguous storage formats. The blocked data structure stores each block of the matrix in a contiguous space of the memory. For this, one has to know in advance the size of blocks to be used by the algorithm (which is a machine-specific parameter). The elements of each block may be stored as contiguous subblocks, where each subblock is of a predefined size. The data structure may include several such layers of subblocks. This layered data structure may fit the hierarchical memory model (see section 3.2 for further discussion of this model).

The next data structure allows the benefits of block-contiguous storage without knowledge of machine-specific parameters. The block-recursive format [6, 16, 17, 29] (also known as the bit interleaved layout, space-filling curve storage, or Morton ordering format) stores each of the four n/2-by-n/2 submatrices contiguously, and the elements of each submatrix are ordered so that the smaller submatrices are each stored contiguously, and so on recursively. This format is cache-oblivious. This means that it allows access to a single block using a single message (or a constant number of messages), without knowing in advance the size of the blocks (see Figure 3.1).

Both the blocked and block-recursive formats have packed versions, where about half of the space is saved by using the symmetry of the matrices (see [16] for recursive full packed data structures). The algorithm in [4] uses a hybrid data structure in which only half the matrix is stored, and recursive ordering is used on triangular submatrices and column-major ordering is used on square submatrices.
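For power-of-two dimensions, the block-recursive address of entry (i, j) is obtained by interleaving the bits of i and j, which is why the format is also called bit interleaved. A small Python sketch (ours, assuming n is a power of two):

    def morton_index(i, j, n):
        """Offset of entry (i, j) of an n-by-n matrix (n a power of two) stored
        in block-recursive (Morton / bit-interleaved) order."""
        index, bit = 0, 0
        while n > 1:
            index |= ((i & 1) << (2 * bit + 1)) | ((j & 1) << (2 * bit))
            i, j, n = i >> 1, j >> 1, n >> 1
            bit += 1
        return index

    # Each 2x2 block, 4x4 block, etc. occupies a contiguous range of offsets:
    print([morton_index(i, j, 4) for i in range(2) for j in range(2)])  # [0, 1, 2, 3]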

Conversion on the fly. Since block-contiguous storage yields better latency cost than column-major storage for some algorithms, it can be worthwhile to convert a matrix stored in column-major order to the blocked data structure (with block size b = Θ(M^{1/2})) before running the algorithm. This can be done by reading Θ(M) elements at a time, in a columnwise order (which requires one message), and then writing each of these elements to the right location of the new matrix. We write these words using Θ(M^{1/2}) messages (one per each relevant block). Thus, the total number of messages is O(n^2/M^{1/2}), which is asymptotically dominated by O(n^3/M^{3/2}) for n = Ω(M). Converting back to column-major storage can be done in a similar way (with the same asymptotic cost).

Other variants of these data structures. Similar to the column-major data structures, there are row-major data structures that store the matrix entries rowwise. All the above-mentioned packed data structures have versions that are indexed to efficiently store the lower (as in Figure 3.1) and upper triangular parts of a matrix. We consider the application of each of the algorithms below to column-major or block-contiguous data structures only. Adapting each of these algorithms to other compatible data structures (row-major vs. column-major, upper triangular vs. lower triangular, full storage vs. packed storage) is straightforward.

3.1.2. Arithmetic count of sequential algorithms. The arithmetic operations of all the sequential algorithms considered below are exactly the same, up to reordering. The arithmetic count of all these algorithms is therefore the same and is given only once. By (2.3) and (2.4) the total arithmetic count is

    A(n) = Σ_{i=1}^{n} Σ_{j=1}^{i} (2j - 1) = n^3/3 + Θ(n^2).

3.1.3. Upper bound for matrix multiplication. In this section we establish the bandwidth and latency costs of matrix multiplication, which is used as a subroutine in some of the Cholesky decomposition algorithms. We recall the recursive matrix multiplication algorithm of Frigo et al. [17]. Their algorithm, given as Algorithm 2, works by a divide-and-conquer approach, where at each step the algorithm splits the largest of the three dimensions.

Lemma 3.1 (see [17]). The bandwidth cost B_MM(n, m, r) of multiplying two matrices of dimensions n-by-m and m-by-r is

    B_MM(n, m, r) = Θ( nmr/M^{1/2} + nm + mr + nr ).

This result is stated slightly differently in [17]. We obtain the form above by setting the cache-line length (or transfer block size) equal to 1. The latency cost of this algorithm varies according to the data structure used and the dimensions of the input and output matrices.

Lemma 3.2. When using a recursive contiguous-block data structure, the latency cost of matrix multiplication is

    L_MM(n, m, r) = Θ( nmr/M^{3/2} + (nm + mr + nr)/M ),

and when using column-major or row-major layout, the latency cost is

    L_MM(n, m, r) = Θ( nmr/M + (nm + mr + nr)/M^{1/2} ).

Algorithm 2 C = RMatMul(A, B): A Recursive Matrix Multiplication Algorithm.
Input: A, B, two matrices of dimensions n-by-m and m-by-r.
Output: C, a matrix of dimension n-by-r, so that C = A·B
\\ Partition A, B, C by dividing each dimension in half, e.g., A = [A11 A12; A21 A22]
1: if n = m = r = 1 then
2:   C11 = A11·B11
3: else if n = max{n, m, r} then
4:   (C11 C12) = RMatMul((A11 A12), B)
5:   (C21 C22) = RMatMul((A21 A22), B)
6: else if m = max{n, m, r} then
7:   C = RMatMul((A11; A21), (B11 B12))
8:   C = C + RMatMul((A12; A22), (B21 B22))
9: else
10:  (C11; C21) = RMatMul(A, (B11; B21))
11:  (C12; C22) = RMatMul(A, (B12; B22))
12: end if
13: return C

Proof. Let m', n', r' be the dimensions of the problem the first time all three matrices fit into the fast memory. Then m', n', r' differ by at most a factor of 2 and thus are all Θ(M^{1/2}). Therefore, assuming the recursive contiguous-block data structure, each of the three matrices resides in O(1) square blocks, and so reading and/or writing each matrix incurs a latency cost of Θ(1). Thus, the total latency cost is L_MM(n) = Θ(B_MM(n)/M). If the column-major or row-major data structures are used, then each such block of size Θ(M^{1/2})-by-Θ(M^{1/2}) is read or written using Θ(M^{1/2}) messages (one for each of its rows or columns), and therefore the total latency cost is L_MM(n) = Θ(B_MM(n)/M^{1/2}).

We note that Algorithm 2 is cache-oblivious. That is, the algorithm achieves the bandwidth and latency costs given above with no knowledge of the size of the fast memory. The iterative blocked matrix multiplication algorithm can also achieve this performance, but the block size must be chosen to be Θ(M^{1/2}).
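For reference, here is a runnable NumPy rendering of Algorithm 2 (ours; it splits the largest dimension in half exactly as the pseudocode does):

    import numpy as np

    def rmatmul(A, B):
        """Recursive matrix multiplication: split the largest of n, m, r in half."""
        n, m = A.shape
        r = B.shape[1]
        if n == m == r == 1:
            return A * B
        if n == max(n, m, r):      # split the rows of A (and of C)
            return np.vstack([rmatmul(A[:n//2, :], B), rmatmul(A[n//2:, :], B)])
        if m == max(n, m, r):      # split the shared dimension and accumulate
            return (rmatmul(A[:, :m//2], B[:m//2, :])
                    + rmatmul(A[:, m//2:], B[m//2:, :]))
        return np.hstack([rmatmul(A, B[:, :r//2]), rmatmul(A, B[:, r//2:])])

    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((3, 5)), rng.standard_normal((5, 2))
    assert np.allclose(rmatmul(A, B), A @ B)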

3.1.4. Upper bound for triangular system solution. In this section we establish the bandwidth and latency costs of another subroutine used in some of the Cholesky decomposition algorithms, that of a triangular system solution (also known as triangular solve) with multiple right-hand sides (TRSM). Algorithm 3 is a recursive version of TRSM (a variation of this algorithm appears in [4], designed to utilize a non-recursive, tuned implementation of matrix multiplication). Below, we assume each matrix multiplication is implemented with the recursive algorithm (Algorithm 2).

Algorithm 3 X = RTRSM(A, U): Recursive Triangular Solver.
Input: A, U, two n-by-n matrices, U is upper triangular.
Output: X, so that X = A·U^{-1}
\\ Partition A, U, X by dividing each dimension in half, e.g., A = [A11 A12; A21 A22]
1: if n = 1 then
2:   X = A/U
3: else
4:   X11 = RTRSM(A11, U11)
5:   X12 = RTRSM(A12 - X11·U12, U22)
6:   X21 = RTRSM(A21, U11)
7:   X22 = RTRSM(A22 - X21·U12, U22)
8: end if
9: return X

Bandwidth cost. As no communication is needed for sufficiently small matrices (other than reading the entire input and writing the output), the bandwidth cost of this algorithm is given by the recurrence

    B_TRSM(n) <= 4·B_TRSM(n/2) + 2·B_MM(n/2)   if 3n^2 > M,
    B_TRSM(n) <= 3n^2                           otherwise,

where B_MM(n) = B_MM(n, n, n) is the bandwidth cost of multiplying two n-by-n matrices. Assuming the matrix multiplication is done using Algorithm 2, we have B_MM(n) = O(n^3/M^{1/2} + n^2). The solution to this recurrence (and the total bandwidth cost of recursive TRSM) is then

    B_TRSM(n) = O( n^3/M^{1/2} + n^2 ).

Latency cost. Assuming a block-contiguous data structure (recursive, or with the correct block size picked), the latency cost is given by the recurrence

    L_TRSM(n) <= 4·L_TRSM(n/2) + 2·L_MM(n/2)   if 3n^2 > M,
    L_TRSM(n) <= O(1)                           otherwise,

where L_MM(n) = L_MM(n, n, n) = O(n^3/M^{3/2} + n^2/M) is the latency cost of recursive matrix multiplication. The solution to this recurrence is

    L_TRSM(n) = O( n^3/M^{3/2} ).

Again, we note that Algorithm 3 is cache-oblivious. The iterative blocked TRSM algorithm can also achieve this performance, but the block size must be chosen to be Θ(M^{1/2}).

Equipped with the communication costs of matrix multiplication and triangular system solving, we proceed to the following subsections, where we present several Cholesky decomposition algorithms and provide the communication cost analyses which yield the first five enumerated conclusions in the introduction as well as the results shown in Table 1.1.
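As with Algorithm 2, Algorithm 3 translates almost line for line into NumPy. The following sketch (ours; for simplicity it assumes n is a power of two, and NumPy's @ stands in for the calls to Algorithm 2):

    import numpy as np

    def rtrsm(A, U):
        """Recursive triangular solve: X with X U = A, U upper triangular.
        Assumes the common dimension n is a power of two."""
        n = A.shape[0]
        if n == 1:
            return A / U
        h = n // 2
        X11 = rtrsm(A[:h, :h], U[:h, :h])
        X12 = rtrsm(A[:h, h:] - X11 @ U[:h, h:], U[h:, h:])
        X21 = rtrsm(A[h:, :h], U[:h, :h])
        X22 = rtrsm(A[h:, h:] - X21 @ U[:h, h:], U[h:, h:])
        return np.block([[X11, X12], [X21, X22]])

    n = 8
    rng = np.random.default_rng(2)
    U = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)  # well conditioned
    A = rng.standard_normal((n, n))
    assert np.allclose(rtrsm(A, U) @ U, A)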

3.1.5. The naïve left-looking Cholesky algorithm. We start with revisiting the two naïve algorithms (left-looking and right-looking), and we see that both have non-optimal bandwidth and latency, as stated in conclusion 1 of the introduction. The naïve left-looking Cholesky decomposition algorithm is given in Algorithm 4 and Figure 3.2. Note that Algorithm 4 assumes that two columns (k and j) of the matrix can fit into fast memory simultaneously (i.e., M > 2n).

Algorithm 4 Naïve left-looking Cholesky algorithm.
1: for j = 1 to n do
2:   read A(j:n, j) from slow memory
3:   for k = 1 to j-1 do
4:     read A(j:n, k) from slow memory
5:     update diagonal element: A(j,j) <- A(j,j) - A(j,k)^2
6:     for i = j+1 to n do
7:       update jth column element: A(i,j) <- A(i,j) - A(i,k)·A(j,k)
8:     end for
9:   end for
10:  calculate final value of diagonal element: A(j,j) <- sqrt(A(j,j))
11:  for i = j+1 to n do
12:    calculate final value of jth column element: A(i,j) <- A(i,j)/A(j,j)
13:  end for
14:  write A(j:n, j) to slow memory
15: end for

Bandwidth cost. Assuming M > 2n, we follow Algorithm 4, and communication occurs at lines 2, 4, and 14, so the total number of words transferred between fast and slow memory while executing the entire algorithm is given by

    Σ_{j=1}^{n} [ 2(n-j+1) + Σ_{k=1}^{j-1} (n-j+1) ] = n^3/6 + Θ(n^2).

In the case when M < 2n, each column j is read into fast memory in segments of size M/2. For each segment of column j, the corresponding segments of previous columns k are read into fast memory in sequence to update the current segment. In this way, the total number of words transferred between fast and slow memory does not change.

Latency cost. Assuming M > 2n and the matrix is stored in column-major order, each column is contiguous in memory, so the total number of messages is given by

    Σ_{j=1}^{n} [ 2 + Σ_{k=1}^{j-1} 1 ] = n^2/2 + Θ(n).

In the case when M < 2n, an entire column cannot be read/written in one message, so the column must be read/written in multiple messages of size O(M). Again, assuming the matrix is stored in column-major order, segments of columns will be contiguous in memory, and the latency is given by the bandwidth divided by the message size: O(n^3/M).

The algorithm can be adjusted to work one row at a time (up-looking) if the matrix is stored in row-major order (with no change in bandwidth or latency), but a block-contiguous storage format will increase the latency (columns would not be contiguous in memory in this case). This analysis, along with that of section 3.1.6, yields conclusion 1 in the introduction.
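In code, Algorithm 4 is the familiar in-place column-by-column factorization. The following NumPy sketch (ours) mirrors the pseudocode, though of course it does not model the fast/slow memory traffic:

    import numpy as np

    def left_looking_cholesky(A):
        """In-place naive left-looking Cholesky of a symmetric positive definite
        matrix; on return the lower triangle of A holds L."""
        n = A.shape[0]
        for j in range(n):
            for k in range(j):                 # apply updates from prior columns
                A[j:, j] -= A[j:, k] * A[j, k]
            A[j, j] = np.sqrt(A[j, j])         # finalize the diagonal entry
            A[j+1:, j] /= A[j, j]              # finalize the rest of column j
        return np.tril(A)

    X = np.array([[4.0, 2.0], [2.0, 5.0]])
    print(left_looking_cholesky(X))  # -> [[2., 0.], [1., 2.]]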

Fig. 3.2. Naïve algorithms. Left: left-looking. Right: right-looking. Column j is currently being computed (both algorithms). Light grey is the area already computed. Column k is the column being read (left-looking) or written (right-looking).

3.1.6. The naïve right-looking Cholesky algorithm. The naïve right-looking Cholesky decomposition algorithm is given in Algorithm 5 and Figure 3.2. Note that Algorithm 5 assumes that two columns (k and j) of the matrix can fit into fast memory simultaneously (i.e., M > 2n).

Algorithm 5 Naïve right-looking Cholesky algorithm.
1: for j = 1 to n do
2:   read A(j:n, j) from slow memory
3:   calculate final value for diagonal element: A(j,j) <- sqrt(A(j,j))
4:   for i = j+1 to n do
5:     calculate final value for jth column element: A(i,j) <- A(i,j)/A(j,j)
6:   end for
7:   for k = j+1 to n do
8:     read A(k:n, k) from slow memory
9:     for i = k to n do
10:      update kth column element: A(i,k) <- A(i,k) - A(i,j)·A(k,j)
11:    end for
12:    write A(k:n, k) to slow memory
13:  end for
14:  write A(j:n, j) to slow memory
15: end for

Bandwidth cost. Assuming M > 2n, we follow Algorithm 5, and communication occurs at lines 2, 8, 12, and 14, so the total number of words transferred between fast and slow memory while executing the entire algorithm is given by

    Σ_{j=1}^{n} [ 2(n-j+1) + Σ_{k=j+1}^{n} 2(n-k+1) ] = n^3/3 + Θ(n^2).

In the case when M < 2n, more communication is required. In order to perform the column factorization, the column is read into fast memory in segments of size M-1, updated with the main diagonal element (which must remain in fast memory), and written back to slow memory.

In order to update the trailing kth column, the kth column is read into fast memory in segments of size (M-1)/2 along with the corresponding segments of the jth column. In order to compute the update for the entire column, the kth element of the jth column must remain in fast memory. After updating the segment of the kth column, it is written back to slow memory. Thus, the naïve right-looking algorithm requires more reads from slow memory when two columns of the matrix cannot fit into fast memory, but this increases the bandwidth by only a constant factor.

Latency cost. Assuming M > 2n and the matrix is stored in column-major order, the total number of messages is given by

    Σ_{j=1}^{n} [ 2 + Σ_{k=j+1}^{n} 2 ] = n^2 + Θ(n).

As in the case of the naïve left-looking algorithm, when M < 2n, the size of a message is no longer the entire column. Assuming the matrix is stored in column-major order, message sizes are of the order equal to the size of fast memory. Thus, for O(n^3) bandwidth, the latency is O(n^3/M).

This right-looking algorithm can be easily transformed into a down-looking rowwise algorithm in order to handle row-major data storage, with no change in bandwidth or latency. This analysis, along with that of section 3.1.5, yields conclusion 1 in the introduction.

3.1.7. LAPACK's POTRF. We next consider an implementation available in LAPACK (see [5]) and show that it is bandwidth optimal when using the right block size (as stated in conclusion 2 of the introduction) and can also be made latency-optimal, assuming the correct data structure is used (as stated in conclusion 3 of the introduction). LAPACK's Cholesky algorithm, POTRF (Algorithm 6), is a blocked left-looking algorithm. For simplicity, we assume the block size b evenly divides the matrix size n. See Figure 3.3 for a diagram of the algorithm. Note that by the partitioning of line 2, A21 is b-by-(j-1)b, A22 is b-by-b, A31 is (n/b - j)b-by-(j-1)b, and A32 is (n/b - j)b-by-b.

Algorithm 6 LAPACK POTRF.
1: for j = 1 to n/b do
2:   partition matrix so that diagonal block (j,j) is A22:
         [ A11           ]
         [ A21  A22      ]
         [ A31  A32  A33 ]
3:   update diagonal block (j,j) (SYRK): A22 <- A22 - A21·A21^T
4:   factor diagonal block (j,j) (POTF2): A22 <- Chol(A22)
5:   update column panel (GEMM): A32 <- A32 - A31·A21^T
6:   triangular system solution for column panel (TRSM): A32 <- A32·A22^{-T}
7: end for

The communication costs of Algorithm 6 depend on the subroutines which perform the symmetric rank-b update, matrix multiply, and triangular system solve, respectively. For simplicity, we will assume that the symmetric rank-b update is computed with the same bandwidth and latency as a general matrix multiply. Although general matrix multiply requires more communication, this assumption will not affect the asymptotic count of the communication required by the algorithm. We will also assume that the block size b is chosen so that three blocks can fit into fast memory simultaneously; that is, we assume that b <= (M/3)^{1/2}.
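Algorithm 6 is compact enough to render directly. The following NumPy/SciPy sketch (ours; generic library routines stand in for LAPACK's SYRK, POTF2, GEMM, and TRSM kernels, and b is assumed to divide n) overwrites the lower triangle with L:

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_potrf(A, b):
        """Blocked left-looking Cholesky with the structure of LAPACK's POTRF."""
        n = A.shape[0]
        for j in range(0, n, b):
            A21 = A[j:j+b, :j]
            A[j:j+b, j:j+b] -= A21 @ A21.T                         # SYRK
            A[j:j+b, j:j+b] = np.linalg.cholesky(A[j:j+b, j:j+b])  # POTF2
            if j + b < n:
                A[j+b:, j:j+b] -= A[j+b:, :j] @ A21.T              # GEMM
                A[j+b:, j:j+b] = solve_triangular(                 # TRSM
                    A[j:j+b, j:j+b], A[j+b:, j:j+b].T, lower=True).T
        return np.tril(A)

    n, b = 8, 2
    rng = np.random.default_rng(3)
    G = rng.standard_normal((n, n))
    A = G @ G.T + n * np.eye(n)
    L = blocked_potrf(A.copy(), b)
    assert np.allclose(L @ L.T, A)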

Fig. 3.3. LAPACK's POTRF Cholesky decomposition algorithm.

In this case, the factorization of the diagonal block (line 4) requires only Θ(b^2) words to be transferred since the entire block fits into fast memory. Also, in the triangular system solution for the column panel A32 (line 6), the triangular matrix A22 is only one block, so the computation can be performed by reading the blocks of A32 in sequence and updating them individually in fast memory. Thus, the amount of communication required by that subroutine is Θ(b^2) times the number of blocks in the column panel. Finally, the dimensions of the matrices in the matrix multiply of line 3 during the jth iteration of the loop are b-by-b, b-by-(j-1)b, and (j-1)b-by-b, and the dimensions of the matrices in the matrix multiply of line 5 during the jth iteration are (n/b - j)b-by-b, (n/b - j)b-by-(j-1)b, and (j-1)b-by-b.

Bandwidth cost. Under the assumptions above, an upper bound on the number of words transferred between slow and fast memory while executing Algorithm 6 is given by

    B(n) <= Σ_{j=1}^{n/b} [ B_MM(b, (j-1)b, b) + Θ(b^2) + B_MM((n/b - j)b, (j-1)b, b) + (n/b - j)·Θ(b^2) ],

where B_MM(m, n, r) is the bandwidth cost required to execute a matrix multiplication of matrices of size m-by-n and n-by-r in order to update a matrix of size m-by-r. Since B_MM is nondecreasing in each of its variables, we have

    B(n) <= (n/b)·[ B_MM(b, n, b) + Θ(b^2) + B_MM(n, n, b) + (n/b)·Θ(b^2) ]
         <= (n/b)·[ 2·B_MM(n, n, b) + Θ(b^2) + (n/b)·Θ(b^2) ].

Assuming the matrix multiply algorithm used by LAPACK achieves the same bandwidth cost as the one given in section 3.1.3 (the iterative blocked algorithm with the correct block size suffices), we have

    B_MM(n, n, b) = Θ( n^2·b/M^{1/2} + n^2 + n·b ).

Since b <= M^{1/2} and b <= n, B_MM(n, n, b) = O(n^2). Thus,

    B(n) = O( n^3/b + n^2 ),


More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

1.2 Binomial Coefficients and Subsets

1.2 Binomial Coefficients and Subsets 1.2. BINOMIAL COEFFICIENTS AND SUBSETS 13 1.2 Biomial Coefficiets ad Subsets 1.2-1 The loop below is part of a program to determie the umber of triagles formed by poits i the plae. for i =1 to for j =

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Graphs. Minimum Spanning Trees. Slides by Rose Hoberman (CMU)

Graphs. Minimum Spanning Trees. Slides by Rose Hoberman (CMU) Graphs Miimum Spaig Trees Slides by Rose Hoberma (CMU) Problem: Layig Telephoe Wire Cetral office 2 Wirig: Naïve Approach Cetral office Expesive! 3 Wirig: Better Approach Cetral office Miimize the total

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS)

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS) CSC165H1, Witer 018 Learig Objectives By the ed of this worksheet, you will: Aalyse the ruig time of fuctios cotaiig ested loops. 1. Nested loop variatios. Each of the followig fuctios takes as iput a

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Counting the Number of Minimum Roman Dominating Functions of a Graph

Counting the Number of Minimum Roman Dominating Functions of a Graph Coutig the Number of Miimum Roma Domiatig Fuctios of a Graph SHI ZHENG ad KOH KHEE MENG, Natioal Uiversity of Sigapore We provide two algorithms coutig the umber of miimum Roma domiatig fuctios of a graph

More information

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets

More information

Algorithm. Counting Sort Analysis of Algorithms

Algorithm. Counting Sort Analysis of Algorithms Algorithm Coutig Sort Aalysis of Algorithms Assumptios: records Coutig sort Each record cotais keys ad data All keys are i the rage of 1 to k Space The usorted list is stored i A, the sorted list will

More information

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 )

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 ) EE26: Digital Desig, Sprig 28 3/6/8 EE 26: Itroductio to Digital Desig Combiatioal Datapath Yao Zheg Departmet of Electrical Egieerig Uiversity of Hawaiʻi at Māoa Combiatioal Logic Blocks Multiplexer Ecoders/Decoders

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

Appendix A. Use of Operators in ARPS

Appendix A. Use of Operators in ARPS A Appedix A. Use of Operators i ARPS The methodology for solvig the equatios of hydrodyamics i either differetial or itegral form usig grid-poit techiques (fiite differece, fiite volume, fiite elemet)

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

Σ P(i) ( depth T (K i ) + 1),

Σ P(i) ( depth T (K i ) + 1), EECS 3101 York Uiversity Istructor: Ady Mirzaia DYNAMIC PROGRAMMING: OPIMAL SAIC BINARY SEARCH REES his lecture ote describes a applicatio of the dyamic programmig paradigm o computig the optimal static

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

BOOLEAN MATHEMATICS: GENERAL THEORY

BOOLEAN MATHEMATICS: GENERAL THEORY CHAPTER 3 BOOLEAN MATHEMATICS: GENERAL THEORY 3.1 ISOMORPHIC PROPERTIES The ame Boolea Arithmetic was chose because it was discovered that literal Boolea Algebra could have a isomorphic umerical aspect.

More information

Evaluation scheme for Tracking in AMI

Evaluation scheme for Tracking in AMI A M I C o m m u i c a t i o A U G M E N T E D M U L T I - P A R T Y I N T E R A C T I O N http://www.amiproject.org/ Evaluatio scheme for Trackig i AMI S. Schreiber a D. Gatica-Perez b AMI WP4 Trackig:

More information

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana The Closest Lie to a Data Set i the Plae David Gurey Southeaster Louisiaa Uiversity Hammod, Louisiaa ABSTRACT This paper looks at three differet measures of distace betwee a lie ad a data set i the plae:

More information

A Recursive Blocked Schur Algorithm for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

A Recursive Blocked Schur Algorithm for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012. A Recursive Blocked Schur Algorithm for Computig the Matrix Square Root Deadma, Edvi ad Higham, Nicholas J. ad Ralha, Rui 212 MIMS EPrit: 212.26 Machester Istitute for Mathematical Scieces School of Mathematics

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions Proceedigs of the 10th WSEAS Iteratioal Coferece o APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006 316 A Geeralized Set Theoretic Approach for Time ad Space Complexity Aalysis of Algorithms

More information

Speeding-up dynamic programming in sequence alignment

Speeding-up dynamic programming in sequence alignment Departmet of Computer Sciece Aarhus Uiversity Demark Speedig-up dyamic programmig i sequece aligmet Master s Thesis Dug My Hoa - 443 December, Supervisor: Christia Nørgaard Storm Pederse Implemetatio code

More information

Combination Labelings Of Graphs

Combination Labelings Of Graphs Applied Mathematics E-Notes, (0), - c ISSN 0-0 Available free at mirror sites of http://wwwmaththuedutw/ame/ Combiatio Labeligs Of Graphs Pak Chig Li y Received February 0 Abstract Suppose G = (V; E) is

More information

The isoperimetric problem on the hypercube

The isoperimetric problem on the hypercube The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

Perhaps the method will give that for every e > U f() > p - 3/+e There is o o-trivial upper boud for f() ad ot eve f() < Z - e. seems to be kow, where

Perhaps the method will give that for every e > U f() > p - 3/+e There is o o-trivial upper boud for f() ad ot eve f() < Z - e. seems to be kow, where ON MAXIMUM CHORDAL SUBGRAPH * Paul Erdos Mathematical Istitute of the Hugaria Academy of Scieces ad Reu Laskar Clemso Uiversity 1. Let G() deote a udirected graph, with vertices ad V(G) deote the vertex

More information

Lecture 2: Spectra of Graphs

Lecture 2: Spectra of Graphs Spectral Graph Theory ad Applicatios WS 20/202 Lecture 2: Spectra of Graphs Lecturer: Thomas Sauerwald & He Su Our goal is to use the properties of the adjacecy/laplacia matrix of graphs to first uderstad

More information

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions: CS 604 Data Structures Midterm Sprig, 00 VIRG INIA POLYTECHNIC INSTITUTE AND STATE U T PROSI M UNI VERSI TY Istructios: Prit your ame i the space provided below. This examiatio is closed book ad closed

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

Computational Geometry

Computational Geometry Computatioal Geometry Chapter 4 Liear programmig Duality Smallest eclosig disk O the Ageda Liear Programmig Slides courtesy of Craig Gotsma 4. 4. Liear Programmig - Example Defie: (amout amout cosumed

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 5 Fuctios for All Subtasks Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 5.1 void Fuctios 5.2 Call-By-Referece Parameters 5.3 Usig Procedural Abstractio 5.4 Testig ad Debuggig

More information

prerequisites: 6.046, 6.041/2, ability to do proofs Randomized algorithms: make random choices during run. Main benefits:

prerequisites: 6.046, 6.041/2, ability to do proofs Randomized algorithms: make random choices during run. Main benefits: Itro Admiistrivia. Sigup sheet. prerequisites: 6.046, 6.041/2, ability to do proofs homework weekly (first ext week) collaboratio idepedet homeworks gradig requiremet term project books. questio: scribig?

More information

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1 Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

Sorting 9/15/2009. Sorting Problem. Insertion Sort: Soundness. Insertion Sort. Insertion Sort: Running Time. Insertion Sort: Soundness

Sorting 9/15/2009. Sorting Problem. Insertion Sort: Soundness. Insertion Sort. Insertion Sort: Running Time. Insertion Sort: Soundness 9/5/009 Algorithms Sortig 3- Sortig Sortig Problem The Sortig Problem Istace: A sequece of umbers Objective: A permutatio (reorderig) such that a ' K a' a, K,a a ', K, a' of the iput sequece The umbers

More information

A graphical view of big-o notation. c*g(n) f(n) f(n) = O(g(n))

A graphical view of big-o notation. c*g(n) f(n) f(n) = O(g(n)) ca see that time required to search/sort grows with size of We How do space/time eeds of program grow with iput size? iput. time: cout umber of operatios as fuctio of iput Executio size operatio Assigmet:

More information

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs What are we goig to lear? CSC316-003 Data Structures Aalysis of Algorithms Computer Sciece North Carolia State Uiversity Need to say that some algorithms are better tha others Criteria for evaluatio Structure

More information

Numerical Methods Lecture 6 - Curve Fitting Techniques

Numerical Methods Lecture 6 - Curve Fitting Techniques Numerical Methods Lecture 6 - Curve Fittig Techiques Topics motivatio iterpolatio liear regressio higher order polyomial form expoetial form Curve fittig - motivatio For root fidig, we used a give fuctio

More information

Analysis of Algorithms

Analysis of Algorithms Presetatio for use with the textbook, Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Aalysis of Algorithms Iput 2015 Goodrich ad Tamassia Algorithm Aalysis of Algorithms

More information

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS Prosejit Bose Evagelos Kraakis Pat Mori Yihui Tag School of Computer Sciece, Carleto Uiversity {jit,kraakis,mori,y

More information

CIS 121. Introduction to Trees

CIS 121. Introduction to Trees CIS 121 Itroductio to Trees 1 Tree ADT Tree defiitio q A tree is a set of odes which may be empty q If ot empty, the there is a distiguished ode r, called root ad zero or more o-empty subtrees T 1, T 2,

More information

Heaps. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015

Heaps. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015 Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 201 Heaps 201 Goodrich ad Tamassia xkcd. http://xkcd.com/83/. Tree. Used with permissio uder

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

A Parallel DFA Minimization Algorithm

A Parallel DFA Minimization Algorithm A Parallel DFA Miimizatio Algorithm Ambuj Tewari, Utkarsh Srivastava, ad P. Gupta Departmet of Computer Sciece & Egieerig Idia Istitute of Techology Kapur Kapur 208 016,INDIA pg@iitk.ac.i Abstract. I this

More information

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem Exact Miimum Lower Boud Algorithm for Travelig Salesma Problem Mohamed Eleiche GeoTiba Systems mohamed.eleiche@gmail.com Abstract The miimum-travel-cost algorithm is a dyamic programmig algorithm to compute

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

AN OPTIMIZATION NETWORK FOR MATRIX INVERSION

AN OPTIMIZATION NETWORK FOR MATRIX INVERSION 397 AN OPTIMIZATION NETWORK FOR MATRIX INVERSION Ju-Seog Jag, S~ Youg Lee, ad Sag-Yug Shi Korea Advaced Istitute of Sciece ad Techology, P.O. Box 150, Cheogryag, Seoul, Korea ABSTRACT Iverse matrix calculatio

More information

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 9 Poiters ad Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 9.1 Poiters 9.2 Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Slide 9-3

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13 CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis

More information

Protected points in ordered trees

Protected points in ordered trees Applied Mathematics Letters 008 56 50 www.elsevier.com/locate/aml Protected poits i ordered trees Gi-Sag Cheo a, Louis W. Shapiro b, a Departmet of Mathematics, Sugkyukwa Uiversity, Suwo 440-746, Republic

More information

Thompson s Group F (p + 1) is not Minimally Almost Convex

Thompson s Group F (p + 1) is not Minimally Almost Convex Thompso s Group F (p + ) is ot Miimally Almost Covex Claire Wladis Thompso s Group F (p + ). A Descriptio of F (p + ) Thompso s group F (p + ) ca be defied as the group of piecewiseliear orietatio-preservig

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information