Efficient Distributed Linear Classification Algorithms via the Alternating Direction Method of Multipliers


Caoxie Zhang, Honglak Lee, Kang G. Shin
Department of EECS, University of Michigan, Ann Arbor, MI 48109, USA

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

Abstract

Linear classification has demonstrated success in many areas of application. Modern algorithms for linear classification can train reasonably good models while going through the data in only tens of rounds. However, large data often does not fit in the memory of a single machine, which makes the bottleneck in large-scale learning the disk I/O, not the CPU. Following this observation, Yu et al. (2010) made significant progress in reducing disk usage, and their algorithms now outperform LIBLINEAR. In this paper, rather than optimizing algorithms on a single machine, we propose and implement distributed algorithms that achieve parallel disk loading and access the disk only once. Our large-scale learning algorithms are based on the framework of the alternating direction method of multipliers. The framework derives a subproblem that remains to be solved efficiently, for which we propose using dual coordinate descent and the trust region Newton method. Our experimental evaluations on large datasets demonstrate that the proposed algorithms achieve significant speedup over the classifier proposed by Yu et al. running on a single machine. Our algorithms are also faster than existing distributed solvers, such as Zinkevich et al. (2010)'s parallel stochastic gradient descent and Vowpal Wabbit.

1 Introduction

Large-scale linear classification has proven successful in many areas, such as machine learning, data mining, computer vision, and security. With the burgeoning of social media, which provides an unprecedented amount of user-provided supervised information, there will likely be more extreme-scale data requiring classification. Training on such large data can be done on a single machine or on a distributed system. Many algorithms for a single machine, such as LIBLINEAR [6] and PEGASOS [16], have been developed and extensively used in both academia and industry. These algorithms can sequentially train on the entire data in a few tens of rounds to obtain a good model, and they exploit the sparsity of the data. However, training with large-scale data on a single machine becomes slow because of the disk I/O, not the CPU. Since disk bandwidth is 2-3 orders of magnitude slower than memory bandwidth and CPU speed, much of the training phase is spent on loading the data rather than processing it. This scenario worsens when the data cannot fit in main memory, causing severe disk swaps. Yu et al. [19] addressed this problem by reading blocks of data and processing them in batches. Their algorithm reduced the disk I/O time via pre-fetching and achieved a significant speedup over the original LIBLINEAR, which is one of the state-of-the-art libraries for linear classification [6]. However, since all single-machine algorithms go through the entire data many times, they have to load the same data from the disk multiple times when the data cannot fit in memory. If the data requires more than tens of gigabytes of storage (greater than the typical RAM capacity), those algorithms are still inefficient, since the disk needs to be accessed more than once and the training takes hours to complete. On the other hand, distributed systems composed of commercial servers (nodes) are becoming more prevalent both in research labs and in medium-scale companies.
A large Internet company may have thousands of nodes. Algorithms running on distributed systems can load the data in parallel into the distributed memory and train without using the disk any more, thereby significantly reducing the cumbersome disk loading time. One important challenge in the field is to design new algorithms for the parallel setting, since current linear classification algorithms are inherently sequential. In this paper, we propose distributed algorithms for linear classification. The system architecture is illustrated in Fig. 1. We use the alternating direction method of multipliers (ADMM) as the framework for solving the distributed convex optimization; see Boyd et al. [2] for a review of ADMM. This framework introduces additional variables to regularize the difference among the models solved by the distributed machines.

Figure 1: An illustration of the distributed linear classification algorithm. The data is split by instances across machines. Each machine loads its local data in parallel and keeps it in main memory. Each machine then uses efficient algorithms (e.g., Algorithm 1 in Sec. 3.1) to solve a subproblem whose input is the local data and a shared vector z, producing as output a weight vector and an auxiliary vector u. After solving the subproblems, all machines aggregate these two vectors by averaging to generate z and broadcast it for the next round of iterations.

The ADMM framework provides the freedom to propose efficient methods for solving the subproblems on the distributed machines. We propose a dual coordinate descent method that achieves linear run-time complexity in the number of samples and takes advantage of the sparsity of the feature vectors. We also use the trust region Newton method to handle dense data matrices. Early work on distributed algorithms for kernel SVM was done by Chang et al. [3], who used an approximate matrix factorization to solve the convex optimization. However, the run-time complexity of kernel SVMs is at least quadratic in the number of samples. The authors of [2, 7] used the ADMM framework to solve an SVM problem on a single machine. However, they used a general convex optimization method for the subproblem, which has superlinear complexity in the number of samples; moreover, they did not evaluate performance on large data in realistic distributed environments. Our contributions are as follows. First, we propose an efficient distributed linear classification algorithm that achieves parallel disk loading and linear run-time and space complexity. Second, our experimental evaluations on large datasets in distributed environments show that the proposed approach is significantly faster than other existing distributed approaches, such as (i) parallel stochastic gradient descent using averaging to aggregate solutions, as proposed by Zinkevich et al. [20], and (ii) Vowpal Wabbit (VW) [11]. Our proposed algorithm is also significantly faster than single-machine algorithms, such as Block LIBLINEAR [19].

2 A Distributed Framework for Linear Classification

In this section, we apply ADMM to linear classification to obtain the distributed algorithm. ADMM is a general framework for solving distributed optimization problems. The first work on ADMM dates back to the 1970s [8], and most of the theoretical results were established in the 1990s [5]. Until recently, however, ADMM was not widely known in the field. For completeness, we provide a review of its application to linear classification. Given a dataset \{(x_i, y_i)\}_{i=1}^{l} with x_i \in R^n and y_i \in \{-1, +1\}, we consider the L2-regularized L2-loss (squared hinge loss) SVM as the linear classification model. Our algorithms also apply to the hinge loss and the squared loss, as we show in Sec. 5. Here, we focus on the L2 loss for the sake of presentation:

\min_w f_1(w) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{l} \max(1 - y_i w^T x_i, 0)^2,    (1)

where C is a hyperparameter. For simplicity, we ignore the bias term, although one can append a constant to the feature vector. To make the problem amenable to decomposition, we first let \{B_1, \ldots, B_m\} be a partition of all data indices \{1, \ldots, l\}. Then, we write down an equivalent problem as follows:

\min_{w_1, \ldots, w_m, z} \frac{1}{2}\|z\|_2^2 + C \sum_{j=1}^{m} \sum_{i \in B_j} \max(1 - y_i w_j^T x_i, 0)^2 + \sum_{j=1}^{m} \frac{\rho}{2}\|w_j - z\|_2^2
subject to w_j - z = 0,  j = 1, \ldots, m,    (2)

where ρ is a constant step size for the later iterations.
Here we introduce a new weight vector w_j associated with the data block B_j, and a regularization vector z. The term \sum_{j=1}^{m} \frac{\rho}{2}\|w_j - z\|_2^2 helps the algorithm converge more robustly. Let us denote w := \{w_1, \ldots, w_m\} and λ := \{λ_1, \ldots, λ_m\}, with λ_j \in R^n for j = 1, \ldots, m. The Lagrangian of problem (2) is

L(w, z, λ) = \frac{1}{2}\|z\|_2^2 + C \sum_{j=1}^{m} \sum_{i \in B_j} \max(1 - y_i w_j^T x_i, 0)^2 + \sum_{j=1}^{m} \left( \frac{\rho}{2}\|w_j - z\|_2^2 + λ_j^T (w_j - z) \right),    (3)

where the λ_j are the dual variables. ADMM consists of the following iterations:

w^{k+1} = \arg\min_w L(w, z^k, λ^k)    (4)
z^{k+1} = \arg\min_z L(w^{k+1}, z, λ^k)    (5)
λ_j^{k+1} = λ_j^k + ρ (w_j^{k+1} - z^{k+1}),  j = 1, \ldots, m.    (6)

Since the Lagrangian L is separable in w_j, we can solve

problem (4) in parallel:

w_j^{k+1} = \arg\min_w C \sum_{i \in B_j} \max(1 - y_i w^T x_i, 0)^2 + \frac{\rho}{2}\|w - z^k\|_2^2 + (λ_j^k)^T (w - z^k),  j = 1, \ldots, m.    (7)

Also, z^{k+1} has a closed-form solution:

z^{k+1} = \frac{\rho \sum_{j=1}^{m} w_j^{k+1} + \sum_{j=1}^{m} λ_j^k}{m\rho + 1}.    (8)

A simple change of variables, letting λ_j = ρ u_j, makes the quadratic form of w_j in Eq. (7) more compact. We now have the new ADMM iterations:

w_j^{k+1} = \arg\min_w C \sum_{i \in B_j} \max(1 - y_i w^T x_i, 0)^2 + \frac{\rho}{2}\|w - z^k + u_j^k\|_2^2,  j = 1, \ldots, m.    (9)

z^{k+1} = \frac{\sum_{j=1}^{m} (w_j^{k+1} + u_j^k)}{m + 1/\rho}    (10)

u_j^{k+1} = u_j^k + w_j^{k+1} - z^{k+1},  j = 1, \ldots, m.    (11)

Here, each machine j solves the subproblem (9) in parallel, which involves only the data x_{B_j} := \{x_i : i \in B_j\}. Machine j loads the data x_{B_j} from the disk only once and keeps it in memory across the ADMM iterations. Each machine only needs to communicate w_j and u_j, without passing the data. The ADMM iterations also come with the following theoretical guarantees. For any ρ > 0 we have:

1. If we define the residual variable r^k = [w_1^k - z^k, \ldots, w_m^k - z^k], then r^k → 0 as k → ∞.
2. The objective function in problem (2) converges to the optimal objective value of problem (1).

To see this, note that \frac{1}{2}\|z\|_2^2 and \max(1 - y w^T x, 0)^2 are both closed and proper convex functions and problem (1) possesses strong duality, so the conditions that guarantee these convergence results are met [2]. These results imply that the w_j^k eventually agree on a consensus vector z^k, which converges to the solution of the original SVM problem (1). The auxiliary variable u_j^k conveys to machine j how different its w_j^k is from z^k, and it serves as the signal that pulls w_j^{k+1} toward consensus when machine j solves the subproblem (9). ADMM is essentially a subgradient-based method that solves the dual problem of (2). However, it has better convergence behavior than other distributed approaches, such as the dual decomposition method [2]. ADMM is a first-order method, so one might expect it to take many iterations to achieve high accuracy; however, our empirical experience suggests that ADMM usually takes only tens of iterations to achieve good accuracy. This small number of iterations can be enough for large-scale classification tasks, since the SVM (or logistic regression) uses loss functions that approximate the misclassification error, and it may not be necessary to minimize these objectives exactly [1, 17]. In our experiments, we show that our algorithm achieves fast convergence in training optimality and test accuracy within only a few tens of ADMM iterations on large-scale datasets.

3 Efficient Algorithms for Solving the Subproblem

Although ADMM provides a parallel framework, the issue of solving the subproblem (9) efficiently still remains. In this subproblem, each machine contains a portion of the data, which can still be quite large. In this section, we present both a dual method and a primal method. The dual method is a coordinate descent method similar to the one in [9]. To obtain an ε-optimal dual solution, our method has a computational complexity of O(n_nz log(1/ε)), where n_nz is the total number of non-zero features in the data. For well-scaled data, the term log(1/ε) (including the constant) usually amounts to a few tens; that is, the algorithm only needs to pass sequentially through the data a few tens of rounds to achieve good accuracy. The primal method is a trust region Newton method similar to the one in [12]. This method obtains an approximate Hessian using the conjugate gradient method. The computational complexity is O(n_nz × number of conjugate gradient iterations × number of Newton iterations). The primal method can be more efficient if the data matrix is dense.
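To make the data flow of iterations (9)-(11) concrete, here is a minimal sketch of the ADMM outer loop in Python with mpi4py. It is our illustration, not the authors' implementation (which is in C/C++ with OpenMPI, per Sec. 5); the function solve_subproblem stands in for Algorithm 1 of Sec. 3.1 (a sketch of it appears at the end of Sec. 3), and data loading is elided.

```python
# A minimal sketch of the ADMM iterations (9)-(11) across machines.
# Our illustration with mpi4py; solve_subproblem() is a placeholder for
# Algorithm 1 (Sec. 3.1) or the trust region Newton method (Sec. 3.2).
import numpy as np
from mpi4py import MPI

def admm_consensus(local_X, local_y, C, rho, n_features, n_iters=50):
    comm = MPI.COMM_WORLD
    m = comm.Get_size()                       # number of machines
    z = np.zeros(n_features)                  # consensus vector z
    w = np.zeros(n_features)                  # this machine's model w_j
    u = np.zeros(n_features)                  # scaled dual variable u_j
    for k in range(n_iters):
        # w-update (9): solve the local subproblem on in-memory data only.
        w = solve_subproblem(local_X, local_y, C, rho, v=z - u)
        # z-update (10): all machines exchange w_j + u_j and average.
        total = np.empty(n_features)
        comm.Allreduce(w + u, total, op=MPI.SUM)
        z = total / (m + 1.0 / rho)
        # u-update (11): purely local.
        u += w - z
    return z
```

The Allreduce call doubles as the synchronization barrier: every machine blocks until the sum of all w_j + u_j vectors is available, which mirrors the aggregation-and-broadcast step in Fig. 1.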
We first detail the dual coordinate descent method and then briefly describe the trust region Newton method.

3.1 A Dual Coordinate Descent Method

We rewrite the subproblem (9) in a more readable form:

\min_w f_2(w) = \frac{\rho}{2}\|w - v\|_2^2 + C \sum_{i=1}^{s} \max(1 - y_i w^T x_i, 0)^2,    (12)

where v = z^k - u_j^k at the k-th ADMM iteration and \{x_1, \ldots, x_s\} denotes the data in x_{B_j} for machine j. The problem above differs from the traditional SVM in its regularization term, which also accounts for the consensus with the other machines' solutions. Therefore, we still need an efficient specialized solver for this problem. The dual of (12) can be written as a quadratic program:

\min_{\alpha} f_3(\alpha) = \frac{1}{2\rho} \alpha^T \bar{Q} \alpha - b^T \alpha  subject to  \alpha_i \ge 0, \forall i,    (13)

where \bar{Q} = Q + D, Q_{ij} = y_i y_j x_i^T x_j, D is a diagonal matrix with D_{ii} = \rho/(2C), and b = [1 - y_1 v^T x_1, \ldots, 1 - y_s v^T x_s]^T.
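To make the step from (12) to (13) explicit, the following short derivation is our reconstruction (not text from the paper) via the KKT stationarity conditions; it also explains the form of the intermediate vector w^{(t)} maintained below.

```latex
% Our reconstruction of the derivation of the dual (13) from (12).
% Introduce slacks \xi_i \ge 1 - y_i w^T x_i and multipliers \alpha_i \ge 0:
\begin{align*}
L(w,\xi,\alpha) = \frac{\rho}{2}\|w-v\|_2^2 + C\sum_{i=1}^{s}\xi_i^2
  + \sum_{i=1}^{s}\alpha_i\left(1 - y_i w^T x_i - \xi_i\right).
\end{align*}
% Stationarity in w and in each \xi_i gives
\begin{align*}
w = v + \frac{1}{\rho}\sum_{i=1}^{s}\alpha_i y_i x_i,
\qquad \xi_i = \frac{\alpha_i}{2C}.
\end{align*}
% Substituting back and dropping additive constants yields
% f_3(\alpha) = \frac{1}{2\rho}\,\alpha^T \bar{Q} \alpha - b^T \alpha,
% with \bar{Q} = Q + D, Q_{ij} = y_i y_j x_i^T x_j, D_{ii} = \rho/(2C),
% and b_i = 1 - y_i v^T x_i, exactly as in (13).
```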

We solve this problem using dual coordinate descent, which optimizes one variable of α at a time and then moves cyclically to the next variable, and so on. For problem (13), the one-variable optimization has a closed-form solution, since it is a quadratic minimization. In other words, for any i, we can optimize α_i while fixing the other variables. Let \nabla_i^{(t)} be the partial derivative of f_3 with respect to α_i at the t-th iteration:

\nabla_i^{(t)} = \frac{1}{\rho} \sum_{j=1}^{s} \bar{Q}_{ij} \alpha_j^{(t)} - b_i.    (14)

The optimal α_i is the root of this derivative projected onto [0, ∞), so we can update α_i in closed form:

\alpha_i^{(t+1)} = \max\left( \alpha_i^{(t)} - \rho \nabla_i^{(t)} / \bar{Q}_{ii}, \, 0 \right).    (15)

We can also use the projected partial derivative, denoted \nabla_i^P, to determine the stopping criterion of coordinate descent:

\nabla_i^P = \min(0, \nabla_i^{(t)}) if \alpha_i^{(t)} = 0, and \nabla_i^P = \nabla_i^{(t)} otherwise.    (16)

If \nabla_i^P = 0, we do not need to update \alpha_i^{(t)}. Computing \nabla_i^{(t)} directly from Eq. (14) costs O(s). However, due to the special structure of the problem, we can reduce this to O(n̄), where n̄ is the average number of non-zero features per example, making the cost of one update independent of the data size s. The key idea is to maintain an intermediate vector w^{(t)} at each iteration:

w^{(t)} = \frac{1}{\rho} \sum_{j=1}^{s} y_j \alpha_j^{(t)} x_j + v.    (17)

Then we can express \nabla_i^{(t)} as

\nabla_i^{(t)} = y_i (w^{(t)})^T x_i - 1 + \frac{1}{2C} \alpha_i^{(t)}.    (18)

The main computation is then the dot product of w^{(t)} and x_i, which is O(n̄) if we store x_i in sparse form. To maintain w^{(t)} after \alpha_i^{(t)} changes to \alpha_i^{(t+1)}, we update it as

w^{(t+1)} = w^{(t)} + \frac{1}{\rho}\left( \alpha_i^{(t+1)} - \alpha_i^{(t)} \right) y_i x_i.    (19)

This operation also takes O(n̄). The overall procedure is given in Algorithm 1.

Algorithm 1: A dual coordinate descent method for solving problem (12)
  Initialize \alpha^{(0)}, w^{(0)} = \frac{1}{\rho} \sum_{i=1}^{s} y_i \alpha_i^{(0)} x_i + v, and t = 0.
  while \alpha^{(t)} is not optimal do
    for i = 1, \ldots, s do
      t = t + 1
      \nabla_i = y_i (w^{(t)})^T x_i - 1 + \frac{1}{2C} \alpha_i^{(t)}
      \nabla_i^P = \min(0, \nabla_i) if \alpha_i^{(t)} = 0, otherwise \nabla_i^P = \nabla_i
      if \nabla_i^P \ne 0 then
        \alpha_i^{(t+1)} = \max(\alpha_i^{(t)} - \rho \nabla_i / \bar{Q}_{ii}, 0)
        w^{(t+1)} = w^{(t)} + \frac{1}{\rho}(\alpha_i^{(t+1)} - \alpha_i^{(t)}) y_i x_i
      else
        \alpha_i^{(t+1)} = \alpha_i^{(t)}, w^{(t+1)} = w^{(t)}
      end if
    end for
  end while

For theoretical results, we can directly apply the results in [13] to show that Algorithm 1 converges and takes O(log(1/ε)) iterations of the while-loop to achieve an ε-accurate solution, i.e., f_3(\alpha) \le f_3(\alpha^*) + ε. Therefore, the total computation is O(s n̄ log(1/ε)) for an ε-accurate solution, where s n̄ is the total number of non-zero features in the data x_{B_j}.

3.2 A Trust Region Newton Method

We use the trust region Newton method of [12] to solve problem (12). Since the Hessian of the L2-loss SVM does not exist, the authors of [12] used a generalized Hessian as an approximation. With the generalized Hessian, they use the conjugate gradient method to find a Newton-like direction and iterate with a trust region. The method is general and can be directly applied to our problem (12). The only important difference is the gradient of the objective function:

\nabla f_2(w) = \rho (w - v) + 2C \sum_{i \in I} \left( x_i x_i^T w - y_i x_i \right),    (20)

where I = \{i : 1 - y_i w^T x_i > 0\}.
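As a concrete rendering of Algorithm 1, the sketch below implements the updates (14)-(19) with NumPy/SciPy sparse rows and plugs into the earlier ADMM loop as solve_subproblem. It is our illustration under the reconstruction above, not the authors' C/C++ solver; the optional alpha0 argument anticipates the warm-start technique of Sec. 4, and max_passes=1 reproduces the one-pass, inexact-minimization variant used later in the experiments.

```python
# A sketch of Algorithm 1: dual coordinate descent for problem (12).
# Our NumPy/SciPy illustration, not the authors' C/C++ implementation.
import numpy as np
import scipy.sparse as sp

def solve_subproblem(X, y, C, rho, v, alpha0=None, max_passes=10, eps=1e-3):
    """X: s-by-n CSR matrix, y: labels in {-1,+1}, v = z^k - u_j^k."""
    s = X.shape[0]
    alpha = np.zeros(s) if alpha0 is None else alpha0.copy()
    # Maintain w as in Eq. (17): w = (1/rho) * sum_i y_i alpha_i x_i + v.
    w = v + X.T.dot(y * alpha) / rho
    # Diagonal of Q_bar: x_i^T x_i + rho/(2C).
    qbar_diag = np.asarray(X.multiply(X).sum(axis=1)).ravel() + rho / (2.0 * C)
    for _ in range(max_passes):
        max_pg = 0.0
        for i in range(s):
            xi = X.getrow(i)
            # Partial derivative, Eq. (18).
            grad = y[i] * xi.dot(w)[0] - 1.0 + alpha[i] / (2.0 * C)
            # Projected gradient, Eq. (16).
            pg = min(0.0, grad) if alpha[i] == 0.0 else grad
            max_pg = max(max_pg, abs(pg))
            if pg != 0.0:
                # Closed-form coordinate update, Eq. (15).
                new_alpha = max(alpha[i] - rho * grad / qbar_diag[i], 0.0)
                # Maintain w, Eq. (19).
                w = w + ((new_alpha - alpha[i]) / rho) * y[i] * xi.toarray().ravel()
                alpha[i] = new_alpha
        if max_pg < eps:   # stop when all projected gradients are small
            break
    return w
```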
4 Improving the Distributed Algorithms

We have discussed two efficient methods for solving the subproblem (9) under the framework of ADMM. However, there is still room for further improvement of the distributed algorithms. In this section, we describe several simple but important techniques that can significantly affect efficiency.

Random permutation of data. Since each machine processes only a portion of the data and communicates its solution to the other machines, if the local data contains mostly the same label, it may take a large number of ADMM iterations to achieve consensus. A useful technique is to randomly shuffle the data across machines to ensure that the class labels in each machine's data are balanced. This technique reduces ADMM's total number of iterations.

Warm start in solving subproblem (9). In solving subproblem (9), it is not necessary to start from some fixed (e.g., zero) vector. In particular, w_j^k may not change much near the end of the ADMM iterations.

So, we use w_j^{k-1} as the starting point for obtaining w_j^k. Specifically, in Algorithm 1 we save the previous α used for obtaining w_j^{k-1} and reuse it as \alpha^{(0)}. For the primal method, we can directly use w_j^{k-1} as the starting point for the Newton iteration. This technique shortens the time spent within each ADMM iteration.

Inexact minimization (early stopping) of subproblem (9). Solving (9) exactly, especially in the initial ADMM iterations, may not be worthwhile, since exact solutions usually require a large amount of time and might not be in the direction of good consensus. In fact, problem (9) can be solved approximately. In Algorithm 1, we can limit the while-loop's maximum number of iterations to, for example, M. This corresponds to going through the local data only M rounds. In the trust region Newton method, we can limit the number of Newton iterations. This technique can dramatically speed up the ADMM iterations.

Over-relaxation. w_j^{k+1} is used for updating z and u in (10)-(11). In fact, w_j^{k+1} can be combined with the previous value of z^k to improve convergence. The new \hat{w}_j^{k+1} used in (10)-(11) can be written as

\hat{w}_j^{k+1} = \beta w_j^{k+1} + (1 - \beta) z^k.    (21)

We used β ∈ [1.5, 1.8], as reported in [4]. This technique can reduce the total number of ADMM iterations needed to achieve the same accuracy.

5 Implementations

Our algorithms are simple and easy to implement on distributed systems. First, we consider the implementation from a data perspective. The data is split uniformly by instances for processing on different machines. At the start, each machine loads its own data in parallel into its RAM. Data in memory is represented as a sparse matrix, so the space complexity is O(|B_j| n̄) for machine j. Each machine j maintains its own w_j and u_j in memory and performs the w-update and u-update in (9) and (11) in parallel with the other machines. Machine j broadcasts w_j^{k+1} + u_j^k and waits to collect \sum_j (w_j^{k+1} + u_j^k) for the z-update in (10). For the communication of w_j^{k+1} + u_j^k and the synchronization (i.e., waiting to collect \sum_j (w_j^{k+1} + u_j^k)), we use the Message-Passing Interface (MPI) [14] for inter-machine coordination, one of the most popular parallel programming frameworks for scientific applications. MPI supports high-level message communication and eliminates most of the programming burden of low-level synchronization.

Second, we also implement our algorithms for different loss functions, such as the hinge loss and the squared loss. These loss functions yield different forms of the subproblem (9). However, we can apply the same idea of dual coordinate descent to solving the subproblem, since it is still a quadratic program and allows us to use the sparsity trick of Eq. (17). We implemented all the algorithms, e.g., ADMM and the subproblem solvers, in C/C++. Specifically, we use OpenMPI for the communication in ADMM, and we modify the LIBLINEAR library to solve problem (9) with the dual and primal methods. To further improve the implementation, we also used the following additional techniques.

Distributed normalization and evaluation of test accuracy. Our empirical findings suggest that normalizing the feature vectors to unit length can be very helpful to the convergence of classification algorithms. Therefore, when each machine loads its data, it can normalize the data in parallel with the other machines. Moreover, we can also evaluate the test data in a distributed fashion if the test data is too big to load on one machine. These simple ideas allow the experiments to be done more efficiently.
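As an illustration of these two ideas, the sketch below normalizes each machine's rows locally and aggregates the test accuracy with an all-reduce. Again, this is our mpi4py sketch, not the paper's C/C++ code.

```python
# Sketch of distributed unit-norm scaling and distributed test evaluation.
# Our mpi4py illustration, not the authors' implementation.
import numpy as np
import scipy.sparse as sp
from mpi4py import MPI

def normalize_rows(X):
    # Scale each row of a CSR matrix to unit L2 norm (zero rows unchanged).
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0
    return sp.diags(1.0 / norms).dot(X)

def distributed_accuracy(comm, X_test_local, y_test_local, z):
    # Each machine scores its local test shard; counts are summed globally.
    correct = np.sum(np.sign(X_test_local.dot(z)) == y_test_local)
    local = np.array([correct, len(y_test_local)], dtype=np.float64)
    total = np.empty(2)
    comm.Allreduce(local, total, op=MPI.SUM)
    return total[0] / total[1]
```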
Cross-validation and multiclass classification. Cross-validation (hyperparameter selection) is easily carried out in our implementation. Each machine can separate its own data into training and validation sets and perform both training and validation in a distributed fashion. For multiclass classification, we implement a one-versus-the-rest (OVR) method, since its accuracy is comparable to other surrogate-loss multiclass classifiers [15, 10]. OVR essentially consists of N binary classifications, where N is the number of classes. Note that, in our algorithms, the N binary classifications need to load the data only once.

6 Experiments

In this section, we first show that our proposed algorithms are faster than other existing distributed approaches on four large datasets. We then demonstrate a significant improvement over a single-machine solver and provide an analysis of these gains. We consider four large datasets: a document dataset (webspam), an education dataset (kddcup10), and two synthetic datasets (epsilon and kappa).[1] The dataset kappa is generated as follows: x_i and w are uniformly sampled from [-1, 1]^n; y_i = sgn(w^T x_i), and y_i flips its sign with probability 0.1. Finally, we normalize x_i such that \|x_i\|_2^2 = 1. All datasets are split into training and test sets with an 8:2 ratio, except the kddcup10 dataset, which has already been separated into training and test sets. We use five-fold cross-validation to choose the best hyperparameter C, since we need to compare accuracy with VW, which solves a different optimization problem. The details of the datasets are summarized in Table 1.

[1] The first three datasets are available at http://. For kddcup10, we used a pre-processed version of bridge to algebra in KDD Cup 2010 such that each feature vector has unit norm.
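The kappa generation procedure is easy to reproduce; the sketch below is our reading of the description above (the default sizes match Table 1, but the full 500,000 by 8,000 dense matrix needs roughly 30 GB, so smaller sizes are advisable for a quick try; the seed and function name are ours).

```python
# Sketch of the kappa synthetic dataset as described above: x_i and w are
# uniform on [-1, 1]^n, y_i = sgn(w^T x_i) with its sign flipped with
# probability 0.1, and each x_i is then scaled to unit L2 norm.
# Our reading of the text; the seed is arbitrary.
import numpy as np

def make_kappa(l=500_000, n=8_000, flip_prob=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=n)
    X = rng.uniform(-1.0, 1.0, size=(l, n))
    y = np.sign(X @ w)
    flip = rng.random(l) < flip_prob
    y[flip] = -y[flip]
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 = 1
    return X, y
```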

Our distributed algorithms are evaluated on 8 machines. Each machine has an Intel Core processor (3.06 GHz) and 12 GB of RAM. The training/testing data is split and distributed evenly across these nodes. The disk throughput is 140 MB/sec. The machines are connected in a star network through 1 Gigabit Ethernet, with a TCP throughput of 111 MB/sec between any two machines.

Table 1: Summary of the datasets. l is the number of examples and n is the number of features; we also show the total number of non-zero features in each dataset. The data size is the actual size of the data: each element is represented with 4 bytes for the index and 8 bytes for the value, and a 64-bit machine aligns the data structure so that each element takes 16 bytes. Split is the initial time to split and compress the datasets into files; Spread is the initial time for our algorithms to disseminate the files to the corresponding machines.

Dataset  | l          | n          | # nonzeros    | Data size (GB) | C | Split (s) | Spread (s)
webspam  | 350,000    | 16,609,143 | 1,304,697,... |                |   |           |
kddcup10 | 20,012,498 | 29,890,... | ...,310,...   |                |   |           |
epsilon  | 500,000    | 2,000      | 1,000,000,000 |                |   |           |
kappa    | 500,000    | 8,000      | 4,000,000,000 |                |   |           |

6.1 Comparison with Other Distributed Solvers

First, we compare our algorithms with three distributed solvers that can load the data in parallel and access the disk only once. The first solver is an extended version of Block LIBLINEAR [19]. We set up the serialized version of linear classification to run on multiple machines. All machines load their local data in parallel. Then only one machine runs at a time using Block LIBLINEAR, passing the processed model to the next machine as the initial model in a round-robin manner. This method saves a large amount of disk loading time but leaves all machines except one idle during training. The second solver uses parallel stochastic gradient descent (SGD), similar to Zinkevich et al. [20]. The original algorithm sequentially passes through the data only once using SGD and then aggregates an averaged model. Here, we extend the algorithm by repeating this procedure: the averaged weight vector is used as the initial point for the next round of SGD. The third solver is the most recent version of Vowpal Wabbit (VW). VW implements a fast online learning algorithm using SGD without regularization. The current version 6.0 supports running on clusters, using ideas similar to parallel SGD. VW directly uses sockets for the communication and synchronization of nodes. We compare our method to VW since VW mixes disk loading and training, and therefore might be faster than the second solver.

Now we describe the algorithm settings. Unless explicitly indicated, the loss functions are all the squared hinge loss. We used four settings for our ADMM-based algorithms: (i) dual coordinate descent, which uses Algorithm 1 of Sec. 3.1 to solve the subproblem (9), with the stopping criterion that the maximal projected gradient \nabla^P within one for-loop pass of Algorithm 1 is below 0.001; (ii) the same solver, except that it goes through the data only once when solving (9); (iii) the trust region Newton method described in Sec. 3.2, whose stopping criterion is that the norm of the gradient falls below a fixed tolerance; and (iv) the same as (iii), but using only one Newton step. For the ADMM iterations, we use over-relaxation with β = 1.6 and step size ρ = 1 in all cases. The experimental settings for the other distributed solvers are as follows. D-B-LIBLINEAR: uses the distributed system to run the serialized Block LIBLINEAR. D-B-LIBLINEAR.M1: similar to D-B-LIBLINEAR, except that each node passes through its data only once in each run. Parallel SGD: parallel stochastic gradient descent as in Zinkevich et al. [20]; we compute the aggregated model every time the algorithm passes through the whole data, and the learning rate is set as in [20]. Parallel SGD.D[x]: rather than using a constant learning rate in the SGD updates, it uses a decaying learning rate η(t) = x/(t+1); we tried x ∈ {10^{-4}, ..., 0.1, 0.5, 1, 2, 5} and plot the best x.
VW-Cluster (squared): we use the squared loss for VW, since we empirically found that it achieves better accuracy than other loss functions such as the hinge loss. We also use VW to compress the datasets and use the cached file for training. We set the number of bits for each feature to 24. VW-Cluster.A (squared): similar to VW-Cluster, but we add the flags --adaptive and --exact_adaptive_norm, so that the gradient norms are accumulated across nodes as well; these are used to perform a non-uniform averaging of the weights across nodes for better convergence.

We first show the initial time for splitting the data and spreading it to the different machines in Table 1.[3] We then measure the training performance with two metrics: training time vs. (relative training) optimality and training time vs. test accuracy. We define the training time as starting from the disk loading. The relative optimality is defined as the following relative difference between the primal function value and the minimum function value found across all algorithms:

(f_1 - f_1^{best}) / f_1^{best}.    (22)

[3] We note that it is possible to avoid even this one-time cost by designing a distributed learning system that accumulates the data in a distributed way.

We also compare the difference between the current test accuracy and the best accuracy (acc* %) over time, using

acc* % - acc %.    (23)

All distributed solvers enjoy the benefit of loading the data in parallel from disk only once, which alleviates the cumbersome disk loading. Here, we are interested in which approach converges quickly both in primal objective and in test accuracy. As shown in the left column of Fig. 2, the dual coordinate descent setting has the fastest convergence in reducing the primal function value. Interestingly, even though its one-pass variant goes through the data only once when solving the subproblem (9), ADMM still improves in most of the iterations despite this inexact minimization. Using exact minimization, with either the dual or the primal solver, does not yield a much lower primal function value and takes a longer time. The parallel SGD solvers, despite our significant efforts in tuning the learning rate, cannot reach 1% relative optimality in a reasonable amount of time on the sparse datasets webspam and kddcup10, and they also converge more slowly on the dense datasets epsilon and kappa. We conjecture that parallel SGD is slower because it has no auxiliary variables (such as u in ADMM) to convey the differences among local models and pull them more strongly toward consensus. The distributed Block LIBLINEAR variants are slower because they do not train in parallel, leaving all but one machine unutilized.

As the right column of Fig. 2 shows, our method with the squared hinge loss is the fastest to reach the best accuracy on webspam, kddcup10, and epsilon, while our method with the hinge loss outperforms the others on kappa. We show the hinge loss on kappa because it yields better accuracy there than the squared hinge loss, though not on the other datasets. Note that VW does not support the squared hinge loss, but it does support other loss functions, such as the squared loss and the hinge loss. We found that VW with the squared loss is much better than with the hinge loss, and we therefore show the best results of VW in the figures. Since its objective function is different, VW is not directly comparable to our solvers or to the D-B-LIBLINEAR variants in relative optimality. VW-Cluster.A is faster than VW-Cluster (except on kddcup10) because it uses non-uniform averaging to improve convergence. However, even with non-uniform averaging, VW still converges more slowly than our algorithms. In summary, these results suggest that our proposed optimization methods converge faster in test accuracy with a proper choice of loss function.

6.2 Comparison to a Single-machine Solver

We now study how our distributed algorithms compare against Block LIBLINEAR running on a single machine in terms of training time. We denote the single-machine solver using Block LIBLINEAR as B-LIBLINEAR, and its variant that passes through the data only once in each block as B-LIBLINEAR.M1. See Fig. 3 for the break-down of the training time. Although Block LIBLINEAR attempts to reduce the time spent on disk I/O, disk loading still occupies a significant portion of the training time, since Block LIBLINEAR loads the same samples from disk multiple times. We also observe that both our one-pass variant and B-LIBLINEAR.M1 spend much less time processing the data than loading it, and that they are more efficient than their counterparts that use exact minimization. These findings reveal that, in large-scale classification, the main component of the training time is data loading, which motivated us to improve it by using a distributed system for parallel loading. In addition to performing parallel disk loading only once, our algorithms introduce coordination between machines, namely communication and synchronization. Communication involves passing w_j + u_j at each ADMM iteration, and synchronization corresponds to the waiting time to collect these messages.
For features of relatively modest dimensionality, as in epsilon and kappa, the coordination overhead is quite small. For datasets with high-dimensional features, such as webspam and kddcup10, the coordination time turns out to be greater than the processing time. Overall, our algorithm achieves a 7- to 60-fold speedup over B-LIBLINEAR or B-LIBLINEAR.M1.

7 Discussions and Conclusion

Extremely high-dimensional data, e.g., with more than billions of features, would create significant communication overheads for our distributed algorithms due to the network bandwidth constraint. One solution is to use the hash trick [18] to randomly group different features, reducing the dimension and thereby alleviating the communication overhead. We evaluated and performed experiments on 8 machines, a typical scale in academia or research labs. It would be interesting to evaluate at kilo-node scale, as in a data center. In such settings, the coordination between nodes would need to be carefully designed when computing averages; for example, nodes in the same rack can aggregate their sums first and then communicate across racks. Our current implementation requires that the data fit in the distributed memory to ensure fast training. If the data is larger than the distributed memory, it is straightforward to apply the idea of Block LIBLINEAR to load and train on a portion of the data in batches when solving the subproblems. In this paper, we proposed simple and efficient distributed algorithms for large-scale linear classification. Our algorithms provide a significant performance gain over state-of-the-art linear classifiers running on a single machine and over existing state-of-the-art distributed solvers, which shows the promise of our approach for large-scale learning problems. Our code is available at:

Acknowledgments

This work was supported in part by a Google Faculty Research Award and AFOSR Grant FA

Figure 2: Performance comparisons between our algorithms, Block LIBLINEAR, and VW on (a, b) webspam, (c, d) kddcup10, (e, f) epsilon, and (g, h) kappa. The left column plots the relative function value difference vs. training time (s); the right column plots the difference to the best accuracy (%) vs. training time (s). The best accuracies are 99.58% (webspam), 89.99% (kddcup10), 89.85% (epsilon), and 85.90% (kappa). Each marker indicates an iteration. We only show the hinge loss variant of our method on kappa, since the hinge loss is better than the squared hinge loss only on this dataset. One of the solvers is omitted on the kddcup10 dataset because it takes too long to converge. We do not show VW in the relative-optimality plots since VW uses a different objective function.

Figure 3: The break-down of training time for our distributed algorithms and Block LIBLINEAR (B-LIBLINEAR) running on a single machine on (a) webspam, (b) kddcup10, (c) epsilon, and (d) kappa. We measure the data loading time, processing time, and coordination (communication and synchronization) time when both algorithms achieve 1% relative optimality with C = 1. Indeed, our methods spend much less time on data loading than B-LIBLINEAR.

References

[1] L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. 3(1):1-122.
[3] E. Y. Chang, K. Zhu, H. Wang, and H. Bai. PSVM: Parallelizing support vector machines on distributed computers. In Advances in Neural Information Processing Systems.
[4] J. Eckstein. Parallel alternating direction multiplier decomposition of convex programs. Journal of Optimization Theory and Applications, 80(1):39-62.
[5] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1).
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9.
[7] P. A. Forero, A. Cano, and G. B. Giannakis. Consensus-based distributed support vector machines. Journal of Machine Learning Research, 11.
[8] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1):17-40.
[9] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning.
[10] S. S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A sequential dual method for large scale multi-class linear SVMs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[11] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project. Technical report.
[12] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9.
[13] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72:7-35.
[14] MPI Forum. MPI: A Message-Passing Interface Standard, version 2.2.
[15] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5.
[16] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning.
[17] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning.
[18] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10.
[19] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[20] M. Zinkevich, A. J. Smola, M. Weimer, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2010.


More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Cost-efficient deployment of distributed software services

Cost-efficient deployment of distributed software services 1/30 Cost-effcent deployment of dstrbuted software servces csorba@tem.ntnu.no 2/30 Short ntroducton & contents Cost-effcent deployment of dstrbuted software servces Cost functons Bo-nspred decentralzed

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005 Exercses (Part 4) Introducton to R UCLA/CCPR John Fox, February 2005 1. A challengng problem: Iterated weghted least squares (IWLS) s a standard method of fttng generalzed lnear models to data. As descrbed

More information

Categories and Subject Descriptors B.7.2 [Integrated Circuits]: Design Aids Verification. General Terms Algorithms

Categories and Subject Descriptors B.7.2 [Integrated Circuits]: Design Aids Verification. General Terms Algorithms 3. Fndng Determnstc Soluton from Underdetermned Equaton: Large-Scale Performance Modelng by Least Angle Regresson Xn L ECE Department, Carnege Mellon Unversty Forbs Avenue, Pttsburgh, PA 3 xnl@ece.cmu.edu

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Decentralized Collaborative Learning of Personalized Models over Networks

Decentralized Collaborative Learning of Personalized Models over Networks Decentralzed Collaboratve Learnng of Personalzed Models over Networks Paul Vanhaesebrouck Aurélen Bellet Marc Tommas INRIA INRIA Unversté de Llle Abstract We consder a set of learnng agents n a collaboratve

More information

Analysis of Collaborative Distributed Admission Control in x Networks

Analysis of Collaborative Distributed Admission Control in x Networks 1 Analyss of Collaboratve Dstrbuted Admsson Control n 82.11x Networks Thnh Nguyen, Member, IEEE, Ken Nguyen, Member, IEEE, Lnha He, Member, IEEE, Abstract Wth the recent surge of wreless home networks,

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

Efficient Text Classification by Weighted Proximal SVM *

Efficient Text Classification by Weighted Proximal SVM * Effcent ext Classfcaton by Weghted Proxmal SVM * Dong Zhuang 1, Benyu Zhang, Qang Yang 3, Jun Yan 4, Zheng Chen, Yng Chen 1 1 Computer Scence and Engneerng, Bejng Insttute of echnology, Bejng 100081, Chna

More information

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors

Online Detection and Classification of Moving Objects Using Progressively Improving Detectors Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

A Robust LS-SVM Regression

A Robust LS-SVM Regression PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information