COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION


SIAM J. SCI. COMPUT. Vol. 32, No. 6, pp. 3495-3523. (c) 2010 Society for Industrial and Applied Mathematics

COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION

GREY BALLARD, JAMES DEMMEL, OLGA HOLTZ, AND ODED SCHWARTZ

Abstract. Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report UCB/EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.

Key words. Cholesky decomposition, bandwidth, latency, communication avoiding, algorithm, lower bound

AMS subject classifications. 68Q25, 68W10, 68W15, 68W40, 65Y05, 65Y10, 65Y20, 65F30

1. Introduction. Let A be a real symmetric and positive definite matrix. Then there exists a real lower triangular matrix L so that A = L·L^T (L is unique if we restrict its diagonal elements to be positive). This is called the Cholesky decomposition. We are interested in finding efficient parallel and sequential algorithms for the Cholesky decomposition. Efficiency is measured both by the number of arithmetic operations and by the amount of communication, either between levels of a memory hierarchy on a sequential machine or between processors communicating over a network on a parallel machine. Since the time to move one word of data typically exceeds

Received by the editors June 2, 2009; accepted for publication (in revised form) September 21, 2010; published electronically December 2, 2010. A preliminary version of this paper was accepted to the 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'09), 2009.

Computer Science Department, University of California, Berkeley, CA (ballard@eecs.berkeley.edu). Research supported by Microsoft and Intel funding and by matching funding by U.C. Discovery.

Mathematics Department and CS Division, University of California, Berkeley, CA (demmel@cs.berkeley.edu). Research supported by Microsoft and Intel funding and by matching funding by U.C. Discovery.

Departments of Mathematics, University of California, Berkeley and Technische Universität Berlin (holtz@math.berkeley.edu). The third author acknowledges support of the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation.

The Weizmann Institute of Science, Rehovot 76100, Israel (oded.schwartz@weizmann.ac.il).
This work was done while at the Department of Mathematics, Technische Universität Berlin, and while visiting the University of California, Berkeley.

the time to perform one arithmetic operation by a large and growing factor [18], our goal will be to minimize communication.

1.1. Communication model. We model communication costs in more detail as follows. In the sequential case, with two levels of memory hierarchy (fast and slow), communication means reading data items (words) from slow memory to fast memory and writing data from fast memory to slow memory. Words that are contiguous can be read or written in a bundle, which we will call a message. We assume that a message of m words can be communicated between fast and slow memory in time α + βm, where α is the latency (seconds per message) and β is the inverse bandwidth (seconds per word). We define the bandwidth cost of an algorithm to be the total number of words communicated and the latency cost of an algorithm to be the total number of messages communicated. We assume that the matrix being factored initially resides in slow memory and is too large to fit in the smaller fast memory. Our goal then is to minimize the total number of words and the total number of messages communicated between fast and slow memory.¹

In the parallel case, we are interested in the communication among the processors. As in the sequential case, we assume that a message of m consecutively stored words can be communicated in time α + βm. This cost includes the time required to pack noncontiguous words into a single message if necessary. We assume that the matrix is initially evenly distributed among all P processors and that there is only enough memory to store about 1/P-th of a matrix per processor. As before, our goal is to minimize the number of words and messages communicated. In order to measure the communication cost of a parallel algorithm, we count the number of words and messages communicated along the critical path as defined in [30], as this metric is closely related to the total running time of the algorithm.

We assume that (1) computation and communication are not overlapped (all communication is "blocking"), (2) the cost per flop is the same on each processor, and the communication costs (α and β) are the same between each pair of processors, and (3) there is no communication resource contention among processors. For example, if processor 0 sends a message of size m to processor 1 at time 0 and processor 2 sends a message of size m to processor 3 also at time 0, the cost along the critical path is α + βm. However, if both processors 0 and 1 try to send a message to processor 2 at the same time, the cost along the critical path will be the sum of the costs of each message.

1.2. Communication lower bounds. We consider classical algorithms for Cholesky decomposition, i.e., those that perform the usual O(n^3) arithmetic operations, possibly reordered by associativity and commutativity of addition. That is, our results do not apply when using distributivity to reorganize the algorithm (such as Strassen-like algorithms); we also assume no pivoting is performed. We define "classical" more carefully later. We show that the communication complexity of any such Cholesky algorithm shares essentially the same lower bound as does classical matrix multiplication.

¹The sequential communication model used here is sometimes called the two-level input/output (I/O) model or disk access machine (DAM) model (see [2, 9, 12]). Our bandwidth cost model follows that of [23] and [25] in that it assumes the block-transfer size is one word of data (B = 1 in the common notation).
However, our model allows message sizes to vary from one word up to the maximum number of words that can fit in fast memory.
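The two-level α-β model is easy to experiment with numerically. The following Python sketch (ours, not the paper's; the function and the sample constants are purely illustrative) charges α per message and β per word, and shows why bundling the same n^2 words into larger messages lowers the modeled transfer time:

    import math

    def modeled_time(words, message_size, alpha, beta):
        """Time to move `words` words in messages of `message_size` words each,
        charging latency alpha per message and inverse bandwidth beta per word."""
        messages = math.ceil(words / message_size)
        return alpha * messages + beta * words

    # Moving an n-by-n matrix word by word vs. in column-sized bundles:
    n, alpha, beta = 1000, 1e-6, 1e-9
    print(modeled_time(n * n, 1, alpha, beta))  # n^2 messages: about 1.0 seconds
    print(modeled_time(n * n, n, alpha, beta))  # only n messages: about 0.002 seconds

The bandwidth term is identical in both cases; only the number of messages, and hence the latency term, changes.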

Theorem 1.1 (main theorem). Any sequential or parallel classical algorithm for the Cholesky decomposition of n-by-n matrices can be transformed into a classical algorithm for n-by-n matrix multiplication in such a way that the bandwidth cost of the matrix multiplication algorithm is at most a constant times the bandwidth cost of the Cholesky algorithm.

Therefore, any bandwidth cost lower bound for classical matrix multiplication applies to classical Cholesky decomposition, asymptotically. In particular, since a sequential classical n-by-n matrix multiplication algorithm has a bandwidth cost lower bound of Ω(n^3/M^{1/2}), where M is the fast memory size [23, 25], classical Cholesky decomposition has the same lower bound (we discuss the parallel case later). To get a latency cost lower bound, we use the simple observation [1] that the number of messages is at least the bandwidth cost lower bound divided by the maximum message size and that the maximum message size is at most the fast memory size in the sequential case (or the local memory size in the parallel case). So for sequential matrix multiplication this means the latency cost lower bound is Ω(n^3/M^{3/2}).

1.3. Communication upper bounds.

1.3.1. Two-level memory model. In the case of matrix multiplication on a sequential machine, a well-known algorithm attains the bandwidth cost lower bound [23]. It computes C = A·B by performing blocked matrix multiplication that multiplies and accumulates Θ(M^{1/2})-by-Θ(M^{1/2}) submatrices of A, B, and C. If each matrix is stored so that the entries of each of its submatrices are contiguous (not the case with columnwise or rowwise storage), the latency cost lower bound is also attained; we call such a data structure block-contiguous storage and describe it in more detail below. Alternatively, one could try to copy A and B from their input format (say columnwise) to contiguous block storage doing (asymptotically) no more communication than the subsequent matrix multiplication; we will see this is possible, provided n = Ω(M). There will be analogous requirements for Cholesky to attain its latency cost lower bound.

In particular, we draw the following conclusions about the communication costs of sequential classical Cholesky decomposition in the two-level memory model, as summarized in Table 1.1 and what follows:

1. Naïve sequential variants of Cholesky that operate on single rows and columns (be they left-looking, right-looking, etc.) attain neither the bandwidth nor the latency cost lower bound.

2. A sequential blocked algorithm used in LAPACK (with the correct block size) attains the bandwidth cost lower bound, as do the recursive algorithms in [3, 4, 20, 21, 27]. A recursive algorithm analogous to Toledo's LU algorithm [28] attains the bandwidth cost lower bound in nearly all cases, except possibly for an O(log n) factor in the narrow range n^2/log^2(n) < M < n^2.

3. Whether the LAPACK algorithm also attains the latency cost lower bound depends on the matrix layout: If the input matrix is given in rowwise or columnwise format and this is not changed by the algorithm, then the latency cost lower bound cannot be attained. But if the input matrix is given in contiguous block storage, or n = Ω(M) so that it can be copied quickly to contiguous block format, then the latency cost lower bound can be attained by the LAPACK algorithm. Toledo's algorithm cannot minimize latency (at least when M > n^{2/3}); see also conclusion 5 below.

1.3.2. Hierarchical memory model. Since most computers have multiple levels of memory (e.g., L1, L2, L3 caches, main memory, and disk), we also consider the communication costs in the hierarchical memory model.

Table 1.1
Sequential bandwidth and latency costs: lower bound vs. algorithms. M denotes the size of the fast memory. We assume n^2 > M in this table. The FLOP count of all of the algorithms is O(n^3). Refer to section 3 for definitions of terms and details.

                                          Bandwidth                    Latency                   Cache-oblivious
  Lower bound                             Ω(n^3/M^{1/2})               Ω(n^3/M^{3/2})            --
  Naïve: left/right looking
    Column-major                          Θ(n^3)                       Θ(n^3/M + n^2)            yes
  LAPACK [5]
    Column-major                          O(n^3/M^{1/2})               O(n^3/M)                  no
    Block-contiguous                      O(n^3/M^{1/2})               O(n^3/M^{3/2})            no
  Rectangular recursive [28]
    Column-major                          Θ(n^3/M^{1/2} + n^2 log n)   Θ(n^3/M + n^2 log n)      yes
    Block-contiguous                      Θ(n^3/M^{1/2} + n^2 log n)   Ω(n^3/M)                  yes
  Square recursive [3, 4, 20, 21, 27]
    Recursive packed format [4]           O(n^3/M^{1/2})               Ω(n^2)                    yes
    Column-major [20]                     O(n^3/M^{1/2})               Ω(n^3/M)                  yes
    Block-contiguous [3]                  O(n^3/M^{1/2})               O(n^3/M^{3/2})            yes

In this case, an optimal algorithm should simultaneously minimize communication between all pairs of adjacent levels of memory hierarchy (e.g., minimize bandwidth and latency costs between L1 and L2, between L2 and L3, etc.). In the case of sequential matrix multiplication, bandwidth cost is minimized in this sense by simply applying the usual blocked algorithm recursively, where each level of recursion multiplies matrices that fit at a particular level of the memory hierarchy, by using the blocked algorithm to multiply submatrices that fit in the next smaller level. This is easy since matrix multiplication naturally breaks into smaller matrix multiplications.

For matrix multiplication to minimize latency across all memory hierarchy levels, it is necessary for all submatrices of all sizes to be stored contiguously. This leads to a data structure variously referred to as recursive block storage, Morton ordering, or storage using space-filling curves, and is described in [3, 6, 16, 17, 29].

Finally, sequential matrix multiplication can achieve communication optimality as just described in one of two ways: (1) We can choose the number of recursion levels and sizes of the subblocks with prior knowledge of the number of levels and sizes of the levels of memory hierarchy, a cache-aware process called tuning. (2) We carry out the recursion down to 1-by-1 blocks (or some other small constant size), repeatedly dividing block sizes by 2 (perhaps padding submatrices to have even dimensions as needed). Such an algorithm is called cache-oblivious [17] and has the advantage of simplicity and portability compared to a cache-aware algorithm, though it might also have more overhead in practice.

It is indeed possible for sequential Cholesky to be organized to be optimal across multiple memory hierarchy levels in all the senses just described, assuming we use recursive block storage:

Table 1.2
Parallel bandwidth and latency costs: lower bound vs. algorithms. M denotes the size of the memory of each processor. P is the number of processors. b is the block size. Refer to section 3 for definitions of terms and details.

                                     Bandwidth                        Latency               FLOPs
  Lower bound: General               Ω(n^3/(P·M^{1/2}))               Ω(n^3/(P·M^{3/2}))    Ω(n^3/P)
  Lower bound: 2D layout,            Ω(n^2/P^{1/2})                   Ω(P^{1/2})            Ω(n^3/P)
    M = O(n^2/P)
  ScaLAPACK [11]: General            O((n^2/P^{1/2} + n·b) log P)     O((n/b) log P)        O(n^3/P)
    Choosing b = Θ(n/P^{1/2})        O((n^2/P^{1/2}) log P)           O(P^{1/2} log P)      O(n^3/P)

4. The recursive algorithm modeled on Toledo's LU can be implemented in a cache-oblivious way so as to minimize the bandwidth cost, but not the latency cost.²

5. The cache-oblivious recursive Cholesky algorithm, which first appeared in [20] (and can also be found in [3, 4, 21, 27]), minimizes both bandwidth and latency costs (assuming recursive block storage) for all matrices across all memory hierarchy levels. None of the other algorithms have this property.

1.3.3. Parallel model. Finally, we address the case of parallel Cholesky, where there are P processors connected by a network with latency α and reciprocal bandwidth β. We consider here only the memory-scalable case, where each processor's local memory is of size M = O(n^2/P), so that only O(1) copies of the matrix are stored overall (the so-called two-dimensional (2D) case). See [25] for the general lower bound for matrix multiplication, which also holds for three-dimensional (3D) algorithms. A 3D algorithm for LU decomposition without pivoting (along with an algorithm for triangular system solution) is presented with communication analysis in [24]. While those algorithms minimize the total volume of communication, they do not minimize the communication costs along the critical path as measured in this paper. For the 2D case, the consequence of our main theorem is again a bandwidth cost lower bound of the form Ω(n^3/(P·M^{1/2})) = Ω(n^2/P^{1/2}), and a latency cost lower bound of the form Ω(n^3/(P·M^{3/2})) = Ω(P^{1/2}).

6. The Cholesky algorithm in ScaLAPACK [11] attains a matching upper bound. It does so by partitioning the matrix into submatrices and distributing them to the processors in a block cyclic manner. With the right choice of block size b, namely, b = Θ(n/P^{1/2}), it attains the above bandwidth and latency cost lower bounds to within a factor of log P. This is summarized in Table 1.2. See section 3.3.1 for details.

The rest of this paper is organized as follows. In section 2 we show the reduction from matrix multiplication to Cholesky decomposition, thus extending the bandwidth cost lower bounds of [23] and [25] to a bandwidth cost lower bound for the sequential and parallel implementations of Cholesky decomposition. We also discuss latency cost lower bounds. In section 3 we present known Cholesky decomposition algorithms and compare their bandwidth and latency costs with the lower bounds.

²Toledo's algorithm is designed to retain numerical stability for the LU case. The algorithm of [20] deals with the Cholesky case only and therefore requires no pivoting for numerical stability. Thus a simpler recursion suffices, and the latency cost diminishes.
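The 2D block cyclic layout underlying conclusion 6 is simple to state in code. The following Python sketch (ours; illustrative only, and not ScaLAPACK's actual interface) maps a global matrix entry (i, j) to the processor that owns it on a Pr-by-Pc process grid with b-by-b blocks:

    def block_cyclic_owner(i, j, b, Pr, Pc):
        """Process grid coordinates owning global entry (i, j) under a 2D block
        cyclic layout with b-by-b blocks on a Pr-by-Pc process grid."""
        return ((i // b) % Pr, (j // b) % Pc)

    # With P = Pr*Pc processors and b = Theta(n/P^(1/2)), each processor owns
    # O(n^2/P) words in total, matching the 2D memory assumption M = O(n^2/P).
    print(block_cyclic_owner(5, 12, b=4, Pr=2, Pc=2))  # -> (1, 1)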

A shorter version of this paper (an extended abstract) appeared in the proceedings of the SPAA 2009 conference; the shorter version includes the reduction proof of the lower bounds and Tables 1.1 and 1.2. This paper proves the claims made in the tables by presenting the algorithms and their analyses in section 3.

2. Communication lower bounds. Consider an algorithm for a parallel computer with P processors that multiplies matrices in the "classical" way (the usual O(n^3) arithmetic operations possibly reordered using associativity and commutativity of addition), where each of the processors has memory of size M. Irony, Toledo, and Tiskin [25] showed that at least one of the processors has to send or receive the following minimal number of words.

Theorem 2.1 (see [25]). Any classical implementation of matrix multiplication of n-by-n matrices A and B on a P-processor machine, each equipped with memory of size M, requires that one of the processors sends or receives at least n^3/(2·√2·P·M^{1/2}) - M words. If A and B are of size n-by-m and m-by-r, respectively, then the corresponding bound is n·m·r/(2·√2·P·M^{1/2}) - M.

As any processor has memory of size M, any message it sends or receives may deliver at most M words. Therefore we deduce the following.

Corollary 2.2. Any classical implementation of matrix multiplication of n-by-n matrices A and B on a P-processor machine, each equipped with memory of size M, requires that one of the processors sends or receives at least n^3/(2·√2·P·M^{3/2}) - 1 messages. If A and B are of size n-by-m and m-by-r, respectively, then the corresponding bound is n·m·r/(2·√2·P·M^{3/2}) - 1.

For the case of P = 1 these coincide with the lower bounds for the bandwidth and the latency costs of the sequential case. The lower bound on bandwidth cost of sequential matrix multiplication was previously shown (up to some multiplicative constant factor) by Hong and Kung [23].

One means of extending these lower bounds to more algorithms is by reduction. It is easy to reduce matrix multiplication to LU decomposition of a slightly larger order, as the following identity shows:

    [ I  0  -B ]   [ I  0  0 ]   [ I  0   -B  ]
    [ A  I   0 ] = [ A  I  0 ] · [ 0  I  A·B  ].
    [ 0  0   I ]   [ 0  0  I ]   [ 0  0    I  ]

This identity means that LU factorization without pivoting can be used to perform matrix multiplication; to accommodate pivoting, A and/or B can be scaled down to be too small to be chosen as pivots, and A·B can be scaled up accordingly. Thus an O(n^3) implementation of LU decomposition that uses only associativity and commutativity of addition to reorganize its operations (thus eliminating Strassen-like algorithms) must perform at least as much communication as a correspondingly reorganized implementation of O(n^3) matrix multiplication.

We wish to mimic this lower bound construction for Cholesky. Consider the following reduction from matrix multiplication to Cholesky decomposition. Let T be the matrix defined below, composed of nine square blocks each of dimension n; then the Cholesky decomposition of T is

        [ I    A^T        B ]   [ I     0         0 ]   [ I  A^T     B   ]
    T = [ A    I + A·A^T  0 ] = [ A     I         0 ] · [ 0   I    -A·B  ] =: L·L^T,
        [ B^T  0          D ]   [ B^T  -(A·B)^T   X ]   [ 0   0     X^T  ]

where X is the Cholesky factor of D' = D - B^T·B - B^T·A^T·A·B, and D can be any symmetric matrix such that D' is positive definite.
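This identity is easy to check numerically. The following NumPy sketch (ours, not the paper's) builds T for random A and B, takes its Cholesky factor, and reads A·B off the (3,2) block up to sign; the particular D used here is one convenient way of making D' positive definite:

    import numpy as np

    n = 4
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

    # D must be symmetric with D' = D - B^T B - B^T A^T A B positive definite;
    # choosing D = B^T B + (A B)^T (A B) + I gives D' = I.
    AB = A @ B
    D = B.T @ B + AB.T @ AB + np.eye(n)

    I, Z = np.eye(n), np.zeros((n, n))
    T = np.block([[I,   A.T,          B],
                  [A,   I + A @ A.T,  Z],
                  [B.T, Z,            D]])

    L = np.linalg.cholesky(T)
    L32 = L[2*n:3*n, n:2*n]          # this block equals -(A B)^T
    assert np.allclose(-L32.T, AB)   # the factorization recovers the product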

Thus A·B is computed via this Cholesky decomposition. Intuitively this seems to show that the communication complexity needed for computing matrix multiplication is a lower bound to that of computing the Cholesky decomposition (of matrices three times larger), as A·B appears in L^T, the decomposition of T. Note, however, that A·A^T also appears in T.³ Conceivably, computing A·B from A and B incurs less communication cost if we are also given A·A^T, so the above is not a sufficient reduction from matrix multiplication to Cholesky decomposition.

Let us instead consider the following approach to prove the lower bound. In addition to the real numbers R, consider new starred numerical quantities, called 1* and 0*, with arithmetic properties detailed in Table 2.1: 1* and 0* mask any real value in addition/subtraction operations but behave similarly to 1 ∈ R and 0 ∈ R in multiplication and division operations.

Consider this set of values and arithmetic operations. The set is commutative with respect to addition and to multiplication (by the symmetries of the corresponding tables). The set is associative with respect to addition: regardless of the ordering of summation, the sum is 1* if one of the summands is 1*; otherwise it is 0* if one of the summands is 0*. The set is also associative with respect to multiplication: (a·b)·c = a·(b·c). This is trivial if all factors are in R. As 1* is a multiplicative identity, it is also immediate if some of the factors equal 1*. Otherwise, at least one of the factors is 0*, and the product is 0* (or the real value 0). Distributivity, however, does not hold: 1·(1* + 1*) = 1·1* = 1 ≠ 2 = (1·1*) + (1·1*).

Let us return to the construction. We set T' to be

    T' = [ I    A^T  B ]
         [ A    C    0 ]
         [ B^T  0    C ],

where C is the n-by-n matrix that has 1* on the main diagonal and 0* everywhere else:

    C = [ 1*  0*  ...  0* ]
        [ 0*  1*  ...  0* ]
        [ ...             ]
        [ 0*  0*  ...  1* ].

One can verify that the (unique) Cholesky decomposition of C is

(2.1)    C = Ĉ·Ĉ^T,

where Ĉ denotes the lower triangular matrix with 1* on the main diagonal and 0* everywhere below it.

³Note that computing A·A^T is asymptotically as hard as matrix multiplication: take A = [X, 0; Y^T, 0]. Then A·A^T = [∗, X·Y; ∗, ∗].⁴

⁴By writing X·Y we mean the resulting matrix, assuming the straightforward matrix multiplication algorithm. This has to be stated clearly, as distributivity does not hold for the starred values.
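The following Python sketch (ours) encodes one consistent reading of this starred arithmetic. The conventions follow Table 2.1 below as we reconstruct it, so the handling of 0* times a real value and of square roots should be taken as our assumption rather than as definitive:

    import math

    STAR0, STAR1 = "0*", "1*"   # starred quantities; plain floats are real values

    def add(a, b):
        # 1* and 0* mask any real value in addition/subtraction; 1* wins over 0*.
        if STAR1 in (a, b):
            return STAR1
        if STAR0 in (a, b):
            return STAR0
        return a + b

    def mul(a, b):
        # Starred values act like 1 and 0 on real operands; for consistency,
        # 1*·1* = 1*, 0*·0* = 0*, and 1*·0* = 0*.
        if a == STAR1 and b == STAR1:
            return STAR1
        if STAR0 in (a, b):
            both_starred = a in (STAR0, STAR1) and b in (STAR0, STAR1)
            return STAR0 if both_starred else 0.0
        if a == STAR1:
            return b
        if b == STAR1:
            return a
        return a * b

    def sqrt(a):
        # Taking the square root of a starred value leaves it unchanged.
        return a if a in (STAR0, STAR1) else math.sqrt(a)

    print(add(3.5, STAR1))   # -> '1*'  (masking: the real summand disappears)
    print(mul(STAR1, 3.5))   # -> 3.5   (1* is a multiplicative identity)
    print(mul(STAR0, 3.5))   # -> 0.0   (a real result: no masking in products)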

Note that if a matrix X does not contain any starred values 0* and 1*, then X = Ĉ·X = X·Ĉ = Ĉ^T·X = X·Ĉ^T and C + X = C. Therefore, one can confirm that the Cholesky decomposition of T' is

(2.2)    T' = [ I    A^T  B ]   [ I     0         0 ]   [ I  A^T      B   ]
              [ A    C    0 ] = [ A     Ĉ         0 ] · [ 0  Ĉ^T    -A·B  ] =: L·L^T.
              [ B^T  0    C ]   [ B^T  -(A·B)^T   Ĉ ]   [ 0   0      Ĉ^T  ]

One can think of C as masking the A·A^T previously appearing in the central block of T, therefore allowing the lower bound of computing A·B to be accounted for by the Cholesky decomposition and not by the computation of A·A^T.

More formally, let Alg be any classical algorithm for Cholesky factorization. We convert it to a matrix multiplication algorithm as follows in Algorithm 1.

Algorithm 1 Matrix Multiplication by Cholesky Decomposition.
Input: Two n-by-n matrices, A and B.
1: Let Alg' be Alg updated to correctly handle the new 0*, 1* values. {Note that Alg' can be constructed off-line.}
2: Construct T' as in (2.2).
3: L = Alg'(T').
4: return -(L32)^T.

The simplest conceptual way to do step 1 is to attach an extra bit to every numerical value, indicating whether it is starred or not, and modify every arithmetic operation to check this bit before performing an operation. This increases the bandwidth cost by at most a constant factor. Alternatively, we can use signaling NaNs as defined in the IEEE floating point standard [1] to encode 1* and 0* with no extra bits.

If the instructions implementing Cholesky are scheduled deterministically, there is another alternative. One can run the algorithm symbolically, propagating 0* and 1* arguments from the inputs forward, simplifying or eliminating arithmetic operations whose inputs contain 0* or 1*. One can also eliminate operations for which there is no path in the directed acyclic graph (describing how outputs of each operation propagate to inputs of other operations) to the desired output A·B. The resulting Alg' performs a strict subset of the arithmetic and memory operations of the original Cholesky algorithm. We note that updating Alg to form Alg' is done off-line, so that step 1 does not actually take any time to perform when Algorithm 1 is called.

Classical Cholesky decomposition algorithms. We next verify the correctness of this reduction: that the output of this procedure on input A, B is indeed the product A·B, as long as Alg is a classical algorithm, in a sense we now define carefully. Let T = L·L^T be the Cholesky decomposition of T. Then we have the following formulas:

(2.3)    L(i,i) = ( T(i,i) - Σ_{k ∈ [i-1]} (L(i,k))^2 )^{1/2},

(2.4)    L(i,j) = ( T(i,j) - Σ_{k ∈ [j-1]} L(i,k)·L(j,k) ) / L(j,j),    i > j,

where [t] = {1, ..., t}.
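Formulas (2.3) and (2.4) translate directly into code. The following Python sketch (ours) computes L one entry at a time in an order respecting the dependencies described next; any classical algorithm reorders exactly these operations:

    import math

    def cholesky(T):
        """Dense Cholesky via (2.3)/(2.4): returns lower triangular L, T = L L^T."""
        n = len(T)
        L = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1):
                s = T[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
        return L

    print(cholesky([[4.0, 2.0], [2.0, 5.0]]))  # -> [[2.0, 0.0], [1.0, 2.0]]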

Fig. 2.1. Dependencies of L(i,j) for diagonal entries (left) and other entries (right). Dark grey represents the sets S_{i,i} (left) and S_{i,j} (right). Light grey represents indirect dependencies.

A classical Cholesky decomposition algorithm computes each of these O(n^3) FLOPs, which may be reordered using only commutativity and associativity of addition. By the no-pivoting and no-distributivity restrictions on Alg, when an entry of L is computed, all the entries on which it depends have already been computed and combined by the above formulas, with the sums occurring in any order. These dependencies form a dependency graph on the entries of L and impose a partial ordering on the computation of the entries of L (see Figure 2.1). That is, when an entry L(i,i) is computed, by (2.3), all the entries {L(i,k)}_{k ∈ [i-1]} have already been computed.⁵ Denote this set of entries by S_{i,i}, namely,

(2.5)    S_{i,i} = {L(i,k)}_{k ∈ [i-1]}.

Similarly, when an entry L(i,j) (for i > j) is computed, by (2.4), all the entries {L(i,k)}_{k ∈ [j-1]} and all the entries {L(j,k)}_{k ∈ [j]} have already been computed. Denote this set by S_{i,j}, namely,

(2.6)    S_{i,j} = {L(i,k)}_{k ∈ [j-1]} ∪ {L(j,k)}_{k ∈ [j]}.

Lemma 2.3. Any ordering of the computation of the elements of L that respects the partial ordering induced by the above mentioned directed acyclic graph results in a correct computation of A·B.

Proof. We need to confirm that the starred entries 1* and 0* of T' do not somehow contaminate the desired entries of -(L32)^T. The proof is by induction on the partial order on pairs (i,j) implied by (2.5) and (2.6). The base case, the correctness of computing L(1,1), is immediate. Assume by induction that all elements of S_{i,j} are correctly computed and consider the computation of L(i,j) according to the block in which it resides:

If L(i,j) resides in block L11, L21, or L31, then S_{i,j} contains only real values, and no arithmetic operations with 0* or 1* occur (recall Figure 2.1 or (2.2), (2.5), and (2.6)). Therefore, the correctness follows from the correctness of the original Cholesky decomposition algorithm.

If L(i,j) resides in L22 or L33, then S_{i,j} may contain starred values (elements of Ĉ). We treat separately the case where L(i,j) is on the main diagonal and the case where it is not. If i = j, then by (2.3) L(i,i) is determined to be 1* since T(i,i) = 1* and since adding to, subtracting from, and taking the square root of 1* all result in 1* (recall Table 2.1 and (2.3)).

⁵While this partial ordering constrains the scheduling of FLOPs, it does not uniquely identify a computation directed acyclic graph (DAG), for the additions within one summation can be in arbitrary order (forming arbitrary subtrees in the computation DAG).

Table 2.1
Arithmetic operations: x, y stand for any real values. For consistency, we set sqrt(0*) = 0* and sqrt(1*) = 1*. (Cells marked "--" never arise in the algorithm.)

     ±  | 1*   0*   y            ·  | 1*   0*   y            /  | 1*   y
    ----+----------------       ----+----------------       ----+-----------
     1* | 1*   1*   1*           1* | 1*   0*   y            1* | 1*   --
     0* | 1*   0*   0*           0* | 0*   0*   0            0* | 0*   0*
     x  | 1*   0*   x±y          x  | x    0    x·y          x  | x    x/y

If i > j, then by the inductive assumption the divisor L(j,j) of (2.4) is correctly computed to be 1* (recall Figure 2.1 and the definition of Ĉ in (2.1)). Therefore, no division by 0* is performed. Moreover, T(i,j) is 0*. Then L(i,j) is determined to be the correct value 0*, unless 1* is subtracted (recall (2.4)). However, every subtracted product (recall (2.4)) is composed of two factors of the same column but of different rows. Therefore, by the structure of Ĉ, none of them is 1*, so their product is not 1*, and the value is computed correctly.

If L(i,j) resides in L32, then S_{i,j} may contain starred values (see Figure 2.1, right-hand side, row j). However, every subtraction performed (recall (2.4)) is composed of a product of two factors, of which one is on the ith row (and on a column k < j). Hence, by induction (on i, j), the (i,k) element has been computed correctly to be a real value, and by the multiplication properties so is the product. Therefore no masking occurs. This completes the proof of Lemma 2.3.

Communication analysis. We now know that Algorithm 1 correctly multiplies matrices classically and so has known communication lower bounds given by Theorem 2.1 and Corollary 2.2. It remains to confirm that step 2 (setting up T') and step 4 (returning -(L32)^T) do not require much communication, so that these lower bounds apply to step 3, running Alg' (recall that step 1 may be performed off-line and so doesn't count). Since Alg' is either a small modification of Cholesky to add star labels to all data items (at most doubling the bandwidth cost) or a subset of Cholesky with some operations omitted (those with starred arguments or not leading to the desired output L32), a lower bound on communication for Alg' is also a lower bound for Cholesky.

Theorem 2.4 (main theorem). Any sequential or parallel classical algorithm for the Cholesky decomposition of n-by-n matrices can be transformed into a classical algorithm for n-by-n matrix multiplication, in such a way that the bandwidth cost of the matrix multiplication algorithm is at most a constant times the bandwidth cost of the Cholesky algorithm. Therefore any bandwidth or latency cost lower bound for classical matrix multiplication applies to classical Cholesky, asymptotically speaking.

Corollary 2.5. In the sequential case, with a fast memory of size M, the bandwidth cost lower bound for Cholesky decomposition is Ω(n^3/M^{1/2}), and the latency cost lower bound is Ω(n^3/M^{3/2}).

Proof. Constructing T' (in any data format) requires bandwidth of at most 18n^2 (copying a 3n-by-3n matrix, with another factor of 2 if each entry has a flag indicating whether it is starred or not), and extracting -(L32)^T requires another n^2 of bandwidth. Furthermore, we can assume 19n^2 < n^3/M^{1/2}, i.e., that M < (n/19)^2, i.e., that the matrix is too large to fit entirely in fast memory (the only case of interest).

Thus the bandwidth cost lower bound Ω(n^3/M^{1/2}) of Algorithm 1 dominates the bandwidth costs of steps 2 and 4 and so must apply to step 3 (Cholesky). Finally, as each message delivers at most M words, the latency cost lower bound for step 3 is by a factor of M smaller than its bandwidth cost lower bound, as desired.

Corollary 2.6. In the parallel case, with a 2D layout on P processors, the bandwidth cost lower bound for Cholesky decomposition is Ω(n^2/P^{1/2}), and the latency cost lower bound is Ω(P^{1/2}).

Proof. The argument in the parallel case is analogous to that of Corollary 2.5. The construction of the input and retrieval of the output at steps 2 and 4 of Algorithm 1 contribute bandwidth of O(n^2). Therefore the lower bound on the bandwidth, Ω(n^3/(P·M^{1/2})), is determined by step 3, the Cholesky decomposition. The lower bound on the latency of step 3 is therefore Ω(n^3/(P·M^{3/2})), as each message delivers at most M words. Plugging in M = Θ(n^2/P) yields a bandwidth lower bound of Ω(n^2/P^{1/2}) and a latency lower bound of Ω(P^{1/2}).

3. Upper bounds. In this section we discuss known algorithms for Cholesky decomposition and their bandwidth and latency cost analyses in the sequential two-level memory model, in the sequential hierarchical memory model, and in the parallel computing model. Throughout this section, we will use the notation B(n) and L(n) for bandwidth and latency costs, respectively, where n is the dimension of the square matrix. These functions may also depend on the parameters M, the size of the fast memory, and P, the number of processors, but we omit this reference when it is clear from context. See also [10] for latency analysis of two of the Cholesky decomposition algorithms in the two-level (out-of-core) memory model.

Note that all the Cholesky decomposition algorithms in this section perform almost exactly the same computation: they all perform the computations defined in (2.3) and (2.4) and differ only in the order of summation and the timing of computation. Therefore, none of them use distributivity. Without this latter fact, it would not make sense to compare their bandwidth and latency costs to the lower bound. In section 3 of [22], standard error analyses of the Cholesky decomposition algorithm are given. These hold for any ordering of the summation of (2.3) and (2.4) and therefore apply to all Cholesky decomposition algorithms below.

3.1. Sequential algorithms.

3.1.1. Data storage. Before reviewing the various sequential algorithms let us first consider the underlying data structure which is used for storing the matrix. Although it does not affect the bandwidth cost, it may have a significant influence on the latency cost of the algorithm: it may or may not allow retrieving many relevant words using only a single message. As a simple example, consider computing the sum of M numbers in slow memory, which obviously requires reading these M words. If they are in consecutive memory locations, this can be done in one read operation, the minimum possible latency cost. But if they are not consecutive, say, they are separated by at least M-1 words, this may require M read operations, the maximum possible latency cost.

The various methods for data storage (only some of which are implemented in LAPACK) appear in Figure 3.1. We can partition these data structures into two classes: column-major and block-contiguous.

Column-major storage. The column-major data structures store the matrix entries in columnwise order. This means that they are most fit for algorithms that access a column at a time (e.g., the left- and right-looking naïve algorithms).
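The sum-of-M-numbers example above is a useful mental model. A minimal Python sketch (ours; the address encoding is illustrative) that counts a message for each maximal run of consecutive addresses makes the contiguous/strided contrast concrete:

    def count_messages(addresses):
        """Messages needed to read the given sorted addresses, where one message
        may deliver one maximal run of consecutive memory locations."""
        messages, prev = 0, None
        for a in addresses:
            if prev is None or a != prev + 1:
                messages += 1      # a gap forces a new message
            prev = a
        return messages

    M = 100
    print(count_messages(range(M)))           # contiguous: 1 message
    print(count_messages(range(0, M * M, M))) # stride M: M messages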

Fig. 3.1. Underlying data structures. Top (column-major): full, old packed, rectangular full packed. Bottom (block-contiguous): blocked, block-recursive format, recursive full packed.

However, the latency cost of algorithms that access a block at a time (e.g., LAPACK's implementation) will fail to achieve optimality, as retrieving a block of size b-by-b requires at least b messages, even if a single message can deliver b^2 words. This means a possible increase in the latency cost by a factor of b (where b is typically of the order of M^{1/2}). As the matrices of interest for Cholesky decomposition are symmetric, storing the entire matrix (known as full storage) wastes space. About half of the space can be saved by using the old packed or the rectangular full packed storage. The latter has the advantage of a more uniform indexing, which allows faster addressing.

Block-contiguous storage. The block-contiguous data structures store the matrix entries in a way that allows a read or write of a block using a single message. This may reduce the latency cost of a Cholesky decomposition algorithm that accesses a b-by-b block at a time by a factor of b, compared with using column-major storage. We distinguish between two types of block-contiguous storage formats. The blocked data structure stores each block of the matrix in a contiguous space of the memory. For this, one has to know in advance the size of blocks to be used by the algorithm (which is a machine-specific parameter). The elements of each block may be stored as contiguous subblocks, where each subblock is of a predefined size. The data structure may include several such layers of subblocks. This layered data structure may fit the hierarchical memory model (see section 3.2 for further discussion of this model).

The next data structure allows the benefits of block-contiguous storage without knowledge of machine-specific parameters. The block-recursive format [6, 16, 17, 29] (also known as the bit interleaved layout, space-filling curve storage, or Morton ordering format) stores each of the four n/2-by-n/2 submatrices contiguously, and the elements of each submatrix are ordered so that the smaller submatrices are each stored contiguously, and so on recursively. This format is cache-oblivious. This means that it allows access to a single block using a single message (or a constant number of messages), without knowing in advance the size of the blocks (see Figure 3.1).

Both the blocked and block-recursive formats have packed versions, where about half of the space is saved by using the symmetry of the matrices (see [16] for recursive full packed data structures). The algorithm in [4] uses a hybrid data structure in which only half the matrix is stored, and recursive ordering is used on triangular submatrices and column-major ordering is used on square submatrices.
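For power-of-two dimensions, the block-recursive address of entry (i, j) is obtained by interleaving the bits of i and j, which is why the format is also called bit interleaved. A small Python sketch (ours, assuming n is a power of two):

    def morton_index(i, j, n):
        """Offset of entry (i, j) of an n-by-n matrix (n a power of two) stored
        in block-recursive (Morton / bit-interleaved) order."""
        index, bit = 0, 0
        while n > 1:
            index |= ((i & 1) << (2 * bit + 1)) | ((j & 1) << (2 * bit))
            i, j, n = i >> 1, j >> 1, n >> 1
            bit += 1
        return index

    # Each 2x2 block, 4x4 block, etc. occupies a contiguous range of offsets:
    print([morton_index(i, j, 4) for i in range(2) for j in range(2)])  # [0, 1, 2, 3]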

Conversion on the fly. Since block-contiguous storage yields better latency cost than column-major storage for some algorithms, it can be worthwhile to convert a matrix stored in column-major order to the blocked data structure (with block size b = Θ(M^{1/2})) before running the algorithm. This can be done by reading Θ(M) elements at a time, in a columnwise order (which requires one message), and then writing each of these elements to the right location of the new matrix. We write these words using Θ(M^{1/2}) messages (one per each relevant block). Thus, the total number of messages is O(n^2/M^{1/2}), which is asymptotically dominated by O(n^3/M^{3/2}) for n = Ω(M). Converting back to column-major storage can be done in a similar way (with the same asymptotic cost).

Other variants of these data structures. Similar to the column-major data structures, there are row-major data structures that store the matrix entries rowwise. All the above-mentioned packed data structures have versions that are indexed to efficiently store the lower (as in Figure 3.1) and upper triangular parts of a matrix. We consider the application of each of the algorithms below to column-major or block-contiguous data structures only. Adapting each of these algorithms to other compatible data structures (row-major vs. column-major, upper triangular vs. lower triangular, full storage vs. packed storage) is straightforward.

3.1.2. Arithmetic count of sequential algorithms. The arithmetic operations of all the sequential algorithms considered below are exactly the same, up to reordering. The arithmetic count of all these algorithms is therefore the same and is given only once. By (2.3) and (2.4) the total arithmetic count is

    A(n) = Σ_{i=1}^{n} Σ_{j=1}^{i} (2j - 1) = n^3/3 + Θ(n^2).

3.1.3. Upper bound for matrix multiplication. In this section we establish the bandwidth and latency costs of matrix multiplication, which is used as a subroutine in some of the Cholesky decomposition algorithms. We recall the recursive matrix multiplication algorithm of Frigo et al. [17]. Their algorithm, given as Algorithm 2, works by a divide-and-conquer approach, where at each step the algorithm splits the largest of the three dimensions.

Lemma 3.1 (see [17]). The bandwidth cost B_MM(n, m, r) of multiplying two matrices of dimensions n-by-m and m-by-r is

    B_MM(n, m, r) = Θ( nmr/M^{1/2} + nm + mr + nr ).

This result is stated slightly differently in [17]. We obtain the form above by setting the cache-line length (or transfer block size) equal to 1. The latency cost of this algorithm varies according to the data structure used and the dimensions of the input and output matrices.

Lemma 3.2. When using a recursive contiguous-block data structure, the latency cost of matrix multiplication is

    L_MM(n, m, r) = Θ( nmr/M^{3/2} + (nm + mr + nr)/M ),

and when using column-major or row-major layout, the latency cost is

    L_MM(n, m, r) = Θ( nmr/M + (nm + mr + nr)/M^{1/2} ).

Algorithm 2 C = RMatMul(A, B): A Recursive Matrix Multiplication Algorithm.
Input: A, B, two matrices of dimensions n-by-m and m-by-r.
Output: C, a matrix of dimension n-by-r, so that C = A·B
\\ Partition A, B, C by dividing each dimension in half, e.g., A = [A11 A12; A21 A22]
1: if n = m = r = 1 then
2:   C11 = A11·B11
3: else if n = max{n, m, r} then
4:   (C11 C12) = RMatMul((A11 A12), B)
5:   (C21 C22) = RMatMul((A21 A22), B)
6: else if m = max{n, m, r} then
7:   C = RMatMul((A11; A21), (B11 B12))
8:   C = C + RMatMul((A12; A22), (B21 B22))
9: else
10:  (C11; C21) = RMatMul(A, (B11; B21))
11:  (C12; C22) = RMatMul(A, (B12; B22))
12: end if
13: return C

Proof. Let m', n', r' be the dimensions of the problem the first time all three matrices fit into the fast memory. Then m', n', r' differ by at most a factor of 2 and thus are all Θ(M^{1/2}). Therefore, assuming the recursive contiguous-block data structure, each of the three matrices resides in O(1) square blocks, and so reading and/or writing each matrix incurs a latency cost of Θ(1). Thus, the total latency cost is L_MM(n) = Θ(B_MM(n)/M). If the column-major or row-major data structures are used, then each such block of size Θ(M^{1/2})-by-Θ(M^{1/2}) is read or written using Θ(M^{1/2}) messages (one for each of its rows or columns), and therefore the total latency cost is L_MM(n) = Θ(B_MM(n)/M^{1/2}).

We note that Algorithm 2 is cache-oblivious. That is, the algorithm achieves the bandwidth and latency costs given above with no knowledge of the size of the fast memory. The iterative blocked matrix multiplication algorithm can also achieve this performance, but the block size must be chosen to be Θ(M^{1/2}).
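For reference, here is a runnable NumPy rendering of Algorithm 2 (ours; it splits the largest dimension in half exactly as the pseudocode does):

    import numpy as np

    def rmatmul(A, B):
        """Recursive matrix multiplication: split the largest of n, m, r in half."""
        n, m = A.shape
        r = B.shape[1]
        if n == m == r == 1:
            return A * B
        if n == max(n, m, r):      # split the rows of A (and of C)
            return np.vstack([rmatmul(A[:n//2, :], B), rmatmul(A[n//2:, :], B)])
        if m == max(n, m, r):      # split the shared dimension and accumulate
            return (rmatmul(A[:, :m//2], B[:m//2, :])
                    + rmatmul(A[:, m//2:], B[m//2:, :]))
        return np.hstack([rmatmul(A, B[:, :r//2]), rmatmul(A, B[:, r//2:])])

    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((3, 5)), rng.standard_normal((5, 2))
    assert np.allclose(rmatmul(A, B), A @ B)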

3.1.4. Upper bound for triangular system solution. In this section we establish the bandwidth and latency costs of another subroutine used in some of the Cholesky decomposition algorithms, that of a triangular system solution (also known as triangular solve) with multiple right-hand sides (TRSM). Algorithm 3 is a recursive version of TRSM (a variation of this algorithm appears in [4], designed to utilize a non-recursive, tuned implementation of matrix multiplication). Below, we assume each matrix multiplication is implemented with the recursive algorithm (Algorithm 2).

Algorithm 3 X = RTRSM(A, U): Recursive Triangular Solver.
Input: A, U, two n-by-n matrices, U is upper triangular.
Output: X, so that X = A·U^{-1}
\\ Partition A, U, X by dividing each dimension in half, e.g., A = [A11 A12; A21 A22]
1: if n = 1 then
2:   X = A/U
3: else
4:   X11 = RTRSM(A11, U11)
5:   X12 = RTRSM(A12 - X11·U12, U22)
6:   X21 = RTRSM(A21, U11)
7:   X22 = RTRSM(A22 - X21·U12, U22)
8: end if
9: return X

Bandwidth cost. As no communication is needed for sufficiently small matrices (other than reading the entire input and writing the output), the bandwidth cost of this algorithm is given by the recurrence

    B_TRSM(n) <= 4·B_TRSM(n/2) + 2·B_MM(n/2)   if 3n^2 > M,
    B_TRSM(n) <= 3n^2                           otherwise,

where B_MM(n) = B_MM(n, n, n) is the bandwidth cost of multiplying two n-by-n matrices. Assuming the matrix multiplication is done using Algorithm 2, we have B_MM(n) = O(n^3/M^{1/2} + n^2). The solution to this recurrence (and the total bandwidth cost of recursive TRSM) is then

    B_TRSM(n) = O( n^3/M^{1/2} + n^2 ).

Latency cost. Assuming a block-contiguous data structure (recursive, or with the correct block size picked), the latency cost is given by the recurrence

    L_TRSM(n) <= 4·L_TRSM(n/2) + 2·L_MM(n/2)   if 3n^2 > M,
    L_TRSM(n) <= O(1)                           otherwise,

where L_MM(n) = L_MM(n, n, n) = O(n^3/M^{3/2} + n^2/M) is the latency cost of recursive matrix multiplication. The solution to this recurrence is

    L_TRSM(n) = O( n^3/M^{3/2} ).

Again, we note that Algorithm 3 is cache-oblivious. The iterative blocked TRSM algorithm can also achieve this performance, but the block size must be chosen to be Θ(M^{1/2}).

Equipped with the communication costs of matrix multiplication and triangular system solving, we proceed to the following subsections, where we present several Cholesky decomposition algorithms and provide the communication cost analyses which yield the first five enumerated conclusions in the introduction as well as the results shown in Table 1.1.
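As with Algorithm 2, Algorithm 3 translates almost line for line into NumPy. The following sketch (ours; for simplicity it assumes n is a power of two, and NumPy's @ stands in for the calls to Algorithm 2):

    import numpy as np

    def rtrsm(A, U):
        """Recursive triangular solve: X with X U = A, U upper triangular.
        Assumes the common dimension n is a power of two."""
        n = A.shape[0]
        if n == 1:
            return A / U
        h = n // 2
        X11 = rtrsm(A[:h, :h], U[:h, :h])
        X12 = rtrsm(A[:h, h:] - X11 @ U[:h, h:], U[h:, h:])
        X21 = rtrsm(A[h:, :h], U[:h, :h])
        X22 = rtrsm(A[h:, h:] - X21 @ U[:h, h:], U[h:, h:])
        return np.block([[X11, X12], [X21, X22]])

    n = 8
    rng = np.random.default_rng(2)
    U = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)  # well conditioned
    A = rng.standard_normal((n, n))
    assert np.allclose(rtrsm(A, U) @ U, A)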

3.1.5. The naïve left-looking Cholesky algorithm. We start with revisiting the two naïve algorithms (left-looking and right-looking), and we see that both have non-optimal bandwidth and latency, as stated in conclusion 1 of the introduction. The naïve left-looking Cholesky decomposition algorithm is given in Algorithm 4 and Figure 3.2. Note that Algorithm 4 assumes that two columns (k and j) of the matrix can fit into fast memory simultaneously (i.e., M > 2n).

Algorithm 4 Naïve left-looking Cholesky algorithm.
1: for j = 1 to n do
2:   read A(j:n, j) from slow memory
3:   for k = 1 to j-1 do
4:     read A(j:n, k) from slow memory
5:     update diagonal element: A(j,j) <- A(j,j) - A(j,k)^2
6:     for i = j+1 to n do
7:       update jth column element: A(i,j) <- A(i,j) - A(i,k)·A(j,k)
8:     end for
9:   end for
10:  calculate final value of diagonal element: A(j,j) <- sqrt(A(j,j))
11:  for i = j+1 to n do
12:    calculate final value of jth column element: A(i,j) <- A(i,j)/A(j,j)
13:  end for
14:  write A(j:n, j) to slow memory
15: end for

Bandwidth cost. Assuming M > 2n, we follow Algorithm 4, and communication occurs at lines 2, 4, and 14, so the total number of words transferred between fast and slow memory while executing the entire algorithm is given by

    Σ_{j=1}^{n} [ 2(n-j+1) + Σ_{k=1}^{j-1} (n-j+1) ] = n^3/6 + Θ(n^2).

In the case when M < 2n, each column j is read into fast memory in segments of size M/2. For each segment of column j, the corresponding segments of previous columns k are read into fast memory in sequence to update the current segment. In this way, the total number of words transferred between fast and slow memory does not change.

Latency cost. Assuming M > 2n and the matrix is stored in column-major order, each column is contiguous in memory, so the total number of messages is given by

    Σ_{j=1}^{n} [ 2 + Σ_{k=1}^{j-1} 1 ] = n^2/2 + Θ(n).

In the case when M < 2n, an entire column cannot be read/written in one message, so the column must be read/written in multiple messages of size O(M). Again, assuming the matrix is stored in column-major order, segments of columns will be contiguous in memory, and the latency is given by the bandwidth divided by the message size: O(n^3/M).

The algorithm can be adjusted to work one row at a time (up-looking) if the matrix is stored in row-major order (with no change in bandwidth or latency), but a block-contiguous storage format will increase the latency (columns would not be contiguous in memory in this case). This analysis, along with that of section 3.1.6, yields conclusion 1 in the introduction.
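In code, Algorithm 4 is the familiar in-place column-by-column factorization. The following NumPy sketch (ours) mirrors the pseudocode, though of course it does not model the fast/slow memory traffic:

    import numpy as np

    def left_looking_cholesky(A):
        """In-place naive left-looking Cholesky of a symmetric positive definite
        matrix; on return the lower triangle of A holds L."""
        n = A.shape[0]
        for j in range(n):
            for k in range(j):                 # apply updates from prior columns
                A[j:, j] -= A[j:, k] * A[j, k]
            A[j, j] = np.sqrt(A[j, j])         # finalize the diagonal entry
            A[j+1:, j] /= A[j, j]              # finalize the rest of column j
        return np.tril(A)

    X = np.array([[4.0, 2.0], [2.0, 5.0]])
    print(left_looking_cholesky(X))  # -> [[2., 0.], [1., 2.]]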

Fig. 3.2. Naïve algorithms. Left: left-looking. Right: right-looking. Column j is currently being computed (both algorithms). Light grey is the area already computed. Column k is the column being read (left-looking) or written (right-looking).

3.1.6. The naïve right-looking Cholesky algorithm. The naïve right-looking Cholesky decomposition algorithm is given in Algorithm 5 and Figure 3.2. Note that Algorithm 5 assumes that two columns (k and j) of the matrix can fit into fast memory simultaneously (i.e., M > 2n).

Algorithm 5 Naïve right-looking Cholesky algorithm.
1: for j = 1 to n do
2:   read A(j:n, j) from slow memory
3:   calculate final value for diagonal element: A(j,j) <- sqrt(A(j,j))
4:   for i = j+1 to n do
5:     calculate final value for jth column element: A(i,j) <- A(i,j)/A(j,j)
6:   end for
7:   for k = j+1 to n do
8:     read A(k:n, k) from slow memory
9:     for i = k to n do
10:      update kth column element: A(i,k) <- A(i,k) - A(i,j)·A(k,j)
11:    end for
12:    write A(k:n, k) to slow memory
13:  end for
14:  write A(j:n, j) to slow memory
15: end for

Bandwidth cost. Assuming M > 2n, we follow Algorithm 5, and communication occurs at lines 2, 8, 12, and 14, so the total number of words transferred between fast and slow memory while executing the entire algorithm is given by

    Σ_{j=1}^{n} [ 2(n-j+1) + Σ_{k=j+1}^{n} 2(n-k+1) ] = n^3/3 + Θ(n^2).

In the case when M < 2n, more communication is required. In order to perform the column factorization, the column is read into fast memory in segments of size M-1, updated with the main diagonal element (which must remain in fast memory), and written back to slow memory.

In order to update the trailing kth column, the kth column is read into fast memory in segments of size (M-1)/2 along with the corresponding segments of the jth column. In order to compute the update for the entire column, the kth element of the jth column must remain in fast memory. After updating the segment of the kth column, it is written back to slow memory. Thus, the naïve right-looking algorithm requires more reads from slow memory when two columns of the matrix cannot fit into fast memory, but this increases the bandwidth by only a constant factor.

Latency cost. Assuming M > 2n and the matrix is stored in column-major order, the total number of messages is given by

    Σ_{j=1}^{n} [ 2 + Σ_{k=j+1}^{n} 2 ] = n^2 + Θ(n).

As in the case of the naïve left-looking algorithm, when M < 2n, the size of a message is no longer the entire column. Assuming the matrix is stored in column-major order, message sizes are of the order equal to the size of fast memory. Thus, for O(n^3) bandwidth, the latency is O(n^3/M).

This right-looking algorithm can be easily transformed into a down-looking rowwise algorithm in order to handle row-major data storage, with no change in bandwidth or latency. This analysis, along with that of section 3.1.5, yields conclusion 1 in the introduction.

3.1.7. LAPACK's POTRF. We next consider an implementation available in LAPACK (see [5]) and show that it is bandwidth optimal when using the right block size (as stated in conclusion 2 of the introduction) and can also be made latency-optimal, assuming the correct data structure is used (as stated in conclusion 3 of the introduction). LAPACK's Cholesky algorithm, POTRF (Algorithm 6), is a blocked left-looking algorithm. For simplicity, we assume the block size b evenly divides the matrix size n. See Figure 3.3 for a diagram of the algorithm. Note that by the partitioning of line 2, A21 is b-by-(j-1)b, A22 is b-by-b, A31 is (n/b - j)b-by-(j-1)b, and A32 is (n/b - j)b-by-b.

Algorithm 6 LAPACK POTRF.
1: for j = 1 to n/b do
2:   partition matrix so that diagonal block (j,j) is A22:
         [ A11           ]
         [ A21  A22      ]
         [ A31  A32  A33 ]
3:   update diagonal block (j,j) (SYRK): A22 <- A22 - A21·A21^T
4:   factor diagonal block (j,j) (POTF2): A22 <- Chol(A22)
5:   update column panel (GEMM): A32 <- A32 - A31·A21^T
6:   triangular system solution for column panel (TRSM): A32 <- A32·A22^{-T}
7: end for

The communication costs of Algorithm 6 depend on the subroutines which perform the symmetric rank-b update, matrix multiply, and triangular system solve, respectively. For simplicity, we will assume that the symmetric rank-b update is computed with the same bandwidth and latency as a general matrix multiply. Although general matrix multiply requires more communication, this assumption will not affect the asymptotic count of the communication required by the algorithm. We will also assume that the block size b is chosen so that three blocks can fit into fast memory simultaneously; that is, we assume that b <= (M/3)^{1/2}.
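Algorithm 6 is compact enough to render directly. The following NumPy/SciPy sketch (ours; generic library routines stand in for LAPACK's SYRK, POTF2, GEMM, and TRSM kernels, and b is assumed to divide n) overwrites the lower triangle with L:

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_potrf(A, b):
        """Blocked left-looking Cholesky with the structure of LAPACK's POTRF."""
        n = A.shape[0]
        for j in range(0, n, b):
            A21 = A[j:j+b, :j]
            A[j:j+b, j:j+b] -= A21 @ A21.T                         # SYRK
            A[j:j+b, j:j+b] = np.linalg.cholesky(A[j:j+b, j:j+b])  # POTF2
            if j + b < n:
                A[j+b:, j:j+b] -= A[j+b:, :j] @ A21.T              # GEMM
                A[j+b:, j:j+b] = solve_triangular(                 # TRSM
                    A[j:j+b, j:j+b], A[j+b:, j:j+b].T, lower=True).T
        return np.tril(A)

    n, b = 8, 2
    rng = np.random.default_rng(3)
    G = rng.standard_normal((n, n))
    A = G @ G.T + n * np.eye(n)
    L = blocked_potrf(A.copy(), b)
    assert np.allclose(L @ L.T, A)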

Fig. 3.3. LAPACK's POTRF Cholesky decomposition algorithm.

In this case, the factorization of the diagonal block (line 4) requires only Θ(b^2) words to be transferred since the entire block fits into fast memory. Also, in the triangular system solution for the column panel A32 (line 6), the triangular matrix A22 is only one block, so the computation can be performed by reading the blocks of A32 in sequence and updating them individually in fast memory. Thus, the amount of communication required by that subroutine is Θ(b^2) times the number of blocks in the column panel. Finally, the dimensions of the matrices in the matrix multiply of line 3 during the jth iteration of the loop are b-by-b, b-by-(j-1)b, and (j-1)b-by-b, and the dimensions of the matrices in the matrix multiply of line 5 during the jth iteration are (n/b - j)b-by-b, (n/b - j)b-by-(j-1)b, and (j-1)b-by-b.

Bandwidth cost. Under the assumptions above, an upper bound on the number of words transferred between slow and fast memory while executing Algorithm 6 is given by

    B(n) <= Σ_{j=1}^{n/b} [ B_MM(b, (j-1)b, b) + Θ(b^2) + B_MM((n/b - j)b, (j-1)b, b) + (n/b - j)·Θ(b^2) ],

where B_MM(m, n, r) is the bandwidth cost required to execute a matrix multiplication of matrices of size m-by-n and n-by-r in order to update a matrix of size m-by-r. Since B_MM is nondecreasing in each of its variables, we have

    B(n) <= (n/b)·[ B_MM(b, n, b) + Θ(b^2) + B_MM(n, n, b) + (n/b)·Θ(b^2) ]
         <= (n/b)·[ 2·B_MM(n, n, b) + Θ(b^2) + (n/b)·Θ(b^2) ].

Assuming the matrix multiply algorithm used by LAPACK achieves the same bandwidth cost as the one given in section 3.1.3 (the iterative blocked algorithm with the correct block size suffices), we have

    B_MM(n, n, b) = Θ( n^2·b/M^{1/2} + n^2 + n·b ).

Since b <= M^{1/2} and b <= n, B_MM(n, n, b) = O(n^2). Thus,

    B(n) = O( n^3/b + n^2 ),


More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

1.2 Binomial Coefficients and Subsets

1.2 Binomial Coefficients and Subsets 1.2. BINOMIAL COEFFICIENTS AND SUBSETS 13 1.2 Biomial Coefficiets ad Subsets 1.2-1 The loop below is part of a program to determie the umber of triagles formed by poits i the plae. for i =1 to for j =

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Graphs. Minimum Spanning Trees. Slides by Rose Hoberman (CMU)

Graphs. Minimum Spanning Trees. Slides by Rose Hoberman (CMU) Graphs Miimum Spaig Trees Slides by Rose Hoberma (CMU) Problem: Layig Telephoe Wire Cetral office 2 Wirig: Naïve Approach Cetral office Expesive! 3 Wirig: Better Approach Cetral office Miimize the total

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS)

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS) CSC165H1, Witer 018 Learig Objectives By the ed of this worksheet, you will: Aalyse the ruig time of fuctios cotaiig ested loops. 1. Nested loop variatios. Each of the followig fuctios takes as iput a

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Counting the Number of Minimum Roman Dominating Functions of a Graph

Counting the Number of Minimum Roman Dominating Functions of a Graph Coutig the Number of Miimum Roma Domiatig Fuctios of a Graph SHI ZHENG ad KOH KHEE MENG, Natioal Uiversity of Sigapore We provide two algorithms coutig the umber of miimum Roma domiatig fuctios of a graph

More information

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets

More information

Algorithm. Counting Sort Analysis of Algorithms

Algorithm. Counting Sort Analysis of Algorithms Algorithm Coutig Sort Aalysis of Algorithms Assumptios: records Coutig sort Each record cotais keys ad data All keys are i the rage of 1 to k Space The usorted list is stored i A, the sorted list will

More information

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 )

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 ) EE26: Digital Desig, Sprig 28 3/6/8 EE 26: Itroductio to Digital Desig Combiatioal Datapath Yao Zheg Departmet of Electrical Egieerig Uiversity of Hawaiʻi at Māoa Combiatioal Logic Blocks Multiplexer Ecoders/Decoders

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

Appendix A. Use of Operators in ARPS

Appendix A. Use of Operators in ARPS A Appedix A. Use of Operators i ARPS The methodology for solvig the equatios of hydrodyamics i either differetial or itegral form usig grid-poit techiques (fiite differece, fiite volume, fiite elemet)

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

Σ P(i) ( depth T (K i ) + 1),

Σ P(i) ( depth T (K i ) + 1), EECS 3101 York Uiversity Istructor: Ady Mirzaia DYNAMIC PROGRAMMING: OPIMAL SAIC BINARY SEARCH REES his lecture ote describes a applicatio of the dyamic programmig paradigm o computig the optimal static

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

BOOLEAN MATHEMATICS: GENERAL THEORY

BOOLEAN MATHEMATICS: GENERAL THEORY CHAPTER 3 BOOLEAN MATHEMATICS: GENERAL THEORY 3.1 ISOMORPHIC PROPERTIES The ame Boolea Arithmetic was chose because it was discovered that literal Boolea Algebra could have a isomorphic umerical aspect.

More information

Evaluation scheme for Tracking in AMI

Evaluation scheme for Tracking in AMI A M I C o m m u i c a t i o A U G M E N T E D M U L T I - P A R T Y I N T E R A C T I O N http://www.amiproject.org/ Evaluatio scheme for Trackig i AMI S. Schreiber a D. Gatica-Perez b AMI WP4 Trackig:

More information

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana The Closest Lie to a Data Set i the Plae David Gurey Southeaster Louisiaa Uiversity Hammod, Louisiaa ABSTRACT This paper looks at three differet measures of distace betwee a lie ad a data set i the plae:

More information

A Recursive Blocked Schur Algorithm for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

A Recursive Blocked Schur Algorithm for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012. A Recursive Blocked Schur Algorithm for Computig the Matrix Square Root Deadma, Edvi ad Higham, Nicholas J. ad Ralha, Rui 212 MIMS EPrit: 212.26 Machester Istitute for Mathematical Scieces School of Mathematics

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions Proceedigs of the 10th WSEAS Iteratioal Coferece o APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006 316 A Geeralized Set Theoretic Approach for Time ad Space Complexity Aalysis of Algorithms

More information

Speeding-up dynamic programming in sequence alignment

Speeding-up dynamic programming in sequence alignment Departmet of Computer Sciece Aarhus Uiversity Demark Speedig-up dyamic programmig i sequece aligmet Master s Thesis Dug My Hoa - 443 December, Supervisor: Christia Nørgaard Storm Pederse Implemetatio code

More information

Combination Labelings Of Graphs

Combination Labelings Of Graphs Applied Mathematics E-Notes, (0), - c ISSN 0-0 Available free at mirror sites of http://wwwmaththuedutw/ame/ Combiatio Labeligs Of Graphs Pak Chig Li y Received February 0 Abstract Suppose G = (V; E) is

More information

The isoperimetric problem on the hypercube

The isoperimetric problem on the hypercube The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

Perhaps the method will give that for every e > U f() > p - 3/+e There is o o-trivial upper boud for f() ad ot eve f() < Z - e. seems to be kow, where

Perhaps the method will give that for every e > U f() > p - 3/+e There is o o-trivial upper boud for f() ad ot eve f() < Z - e. seems to be kow, where ON MAXIMUM CHORDAL SUBGRAPH * Paul Erdos Mathematical Istitute of the Hugaria Academy of Scieces ad Reu Laskar Clemso Uiversity 1. Let G() deote a udirected graph, with vertices ad V(G) deote the vertex

More information

Lecture 2: Spectra of Graphs

Lecture 2: Spectra of Graphs Spectral Graph Theory ad Applicatios WS 20/202 Lecture 2: Spectra of Graphs Lecturer: Thomas Sauerwald & He Su Our goal is to use the properties of the adjacecy/laplacia matrix of graphs to first uderstad

More information

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions: CS 604 Data Structures Midterm Sprig, 00 VIRG INIA POLYTECHNIC INSTITUTE AND STATE U T PROSI M UNI VERSI TY Istructios: Prit your ame i the space provided below. This examiatio is closed book ad closed

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

Computational Geometry

Computational Geometry Computatioal Geometry Chapter 4 Liear programmig Duality Smallest eclosig disk O the Ageda Liear Programmig Slides courtesy of Craig Gotsma 4. 4. Liear Programmig - Example Defie: (amout amout cosumed

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 5 Fuctios for All Subtasks Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 5.1 void Fuctios 5.2 Call-By-Referece Parameters 5.3 Usig Procedural Abstractio 5.4 Testig ad Debuggig

More information

prerequisites: 6.046, 6.041/2, ability to do proofs Randomized algorithms: make random choices during run. Main benefits:

prerequisites: 6.046, 6.041/2, ability to do proofs Randomized algorithms: make random choices during run. Main benefits: Itro Admiistrivia. Sigup sheet. prerequisites: 6.046, 6.041/2, ability to do proofs homework weekly (first ext week) collaboratio idepedet homeworks gradig requiremet term project books. questio: scribig?

More information

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1 Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

Sorting 9/15/2009. Sorting Problem. Insertion Sort: Soundness. Insertion Sort. Insertion Sort: Running Time. Insertion Sort: Soundness

Sorting 9/15/2009. Sorting Problem. Insertion Sort: Soundness. Insertion Sort. Insertion Sort: Running Time. Insertion Sort: Soundness 9/5/009 Algorithms Sortig 3- Sortig Sortig Problem The Sortig Problem Istace: A sequece of umbers Objective: A permutatio (reorderig) such that a ' K a' a, K,a a ', K, a' of the iput sequece The umbers

More information

A graphical view of big-o notation. c*g(n) f(n) f(n) = O(g(n))

A graphical view of big-o notation. c*g(n) f(n) f(n) = O(g(n)) ca see that time required to search/sort grows with size of We How do space/time eeds of program grow with iput size? iput. time: cout umber of operatios as fuctio of iput Executio size operatio Assigmet:

More information

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs What are we goig to lear? CSC316-003 Data Structures Aalysis of Algorithms Computer Sciece North Carolia State Uiversity Need to say that some algorithms are better tha others Criteria for evaluatio Structure

More information

Numerical Methods Lecture 6 - Curve Fitting Techniques

Numerical Methods Lecture 6 - Curve Fitting Techniques Numerical Methods Lecture 6 - Curve Fittig Techiques Topics motivatio iterpolatio liear regressio higher order polyomial form expoetial form Curve fittig - motivatio For root fidig, we used a give fuctio

More information

Analysis of Algorithms

Analysis of Algorithms Presetatio for use with the textbook, Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Aalysis of Algorithms Iput 2015 Goodrich ad Tamassia Algorithm Aalysis of Algorithms

More information

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS Prosejit Bose Evagelos Kraakis Pat Mori Yihui Tag School of Computer Sciece, Carleto Uiversity {jit,kraakis,mori,y

More information

CIS 121. Introduction to Trees

CIS 121. Introduction to Trees CIS 121 Itroductio to Trees 1 Tree ADT Tree defiitio q A tree is a set of odes which may be empty q If ot empty, the there is a distiguished ode r, called root ad zero or more o-empty subtrees T 1, T 2,

More information

Heaps. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015

Heaps. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015 Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 201 Heaps 201 Goodrich ad Tamassia xkcd. http://xkcd.com/83/. Tree. Used with permissio uder

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

A Parallel DFA Minimization Algorithm

A Parallel DFA Minimization Algorithm A Parallel DFA Miimizatio Algorithm Ambuj Tewari, Utkarsh Srivastava, ad P. Gupta Departmet of Computer Sciece & Egieerig Idia Istitute of Techology Kapur Kapur 208 016,INDIA pg@iitk.ac.i Abstract. I this

More information

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem Exact Miimum Lower Boud Algorithm for Travelig Salesma Problem Mohamed Eleiche GeoTiba Systems mohamed.eleiche@gmail.com Abstract The miimum-travel-cost algorithm is a dyamic programmig algorithm to compute

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

AN OPTIMIZATION NETWORK FOR MATRIX INVERSION

AN OPTIMIZATION NETWORK FOR MATRIX INVERSION 397 AN OPTIMIZATION NETWORK FOR MATRIX INVERSION Ju-Seog Jag, S~ Youg Lee, ad Sag-Yug Shi Korea Advaced Istitute of Sciece ad Techology, P.O. Box 150, Cheogryag, Seoul, Korea ABSTRACT Iverse matrix calculatio

More information

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 9 Poiters ad Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 9.1 Poiters 9.2 Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Slide 9-3

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13 CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis

More information

Protected points in ordered trees

Protected points in ordered trees Applied Mathematics Letters 008 56 50 www.elsevier.com/locate/aml Protected poits i ordered trees Gi-Sag Cheo a, Louis W. Shapiro b, a Departmet of Mathematics, Sugkyukwa Uiversity, Suwo 440-746, Republic

More information

Thompson s Group F (p + 1) is not Minimally Almost Convex

Thompson s Group F (p + 1) is not Minimally Almost Convex Thompso s Group F (p + ) is ot Miimally Almost Covex Claire Wladis Thompso s Group F (p + ). A Descriptio of F (p + ) Thompso s group F (p + ) ca be defied as the group of piecewiseliear orietatio-preservig

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information