Classification algorithms on the cell processor


Rochester Institute of Technology
RIT Scholar Works — Theses, Thesis/Dissertation Collections

Classification algorithms on the cell processor
Mateusz Wyganowski

Follow this and additional works at:

Recommended Citation
Wyganowski, Mateusz, "Classification algorithms on the cell processor" (2008). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.

Classification Algorithms on the Cell Processor

by Mateusz Wyganowski

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering

Supervised by Dr. Muhammad Shaaban
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, NY
August 2008

Approved By:
Dr. Muhammad E. Shaaban, Primary Advisor, R.I.T. Dept. of Computer Engineering
Dr. Juan C. Cockburn, Secondary Advisor, R.I.T. Dept. of Computer Engineering
Dr. Roy W. Melton, Secondary Advisor, R.I.T. Dept. of Computer Engineering

Abstract

The rapid advancement in the capacity and reliability of data storage technology has allowed for the retention of a virtually limitless quantity and detail of digital information. Massive information databases are becoming more and more widespread among governmental, educational, scientific, and commercial organizations. By segregating this data into carefully defined input (e.g., images) and output (e.g., classification labels) sets, a classification algorithm can be used to develop an internal expert model of the data by employing a specialized training algorithm. A properly trained classifier is capable of predicting the output for future input data from the same input domain that it was trained on. Two popular classifiers are Neural Networks and Support Vector Machines. Both, as with most accurate classifiers, require massive computational resources to carry out the training step and can take months to complete when dealing with extremely large data sets. In most cases, utilizing larger training sets improves the final accuracy of the trained classifier. However, access to the kinds of computational resources required to do so is expensive and out of reach of private or underfunded institutions. The Cell Broadband Engine (CBE), developed by Sony, Toshiba, and IBM, has recently been introduced into the market. Its current most inexpensive iteration is available in the Sony Playstation 3 computer entertainment system. The CBE is a novel multi-core architecture which features many hardware enhancements designed to accelerate the processing of massive amounts of data. These characteristics, together with the cheap and widespread availability of this technology, make the Cell a prime candidate for the task of training classifiers. In this work, the feasibility of using the Cell processor to train Neural Networks and Support Vector Machines was explored. In the Neural Network family of classifiers, the fully connected Multilayer Perceptron and the Convolutional Network were implemented. In the Support Vector Machine family, a working set technique known as the Gradient Projection-based Decomposition Technique, as well as the Cascade SVM, were implemented.

Table of Contents

Chapter 1: Introduction
    The Problem Domain
    Classifiers
    The Cell Broadband Engine
    Organization
Chapter 2: Multi Layer Perceptron
    Chapter Introduction
    Background
    The Neuron Model
    The Linear Separator
    Learning Algorithms
    Training the Multi Layer Perceptron
    Network Architecture
    Convolutional Networks
    Related Work
Chapter 3: Support Vector Machines
    Chapter Introduction
    Background
    Training
    Implementations
    Conclusion
Chapter 4: The Cell Broadband Engine
    Chapter Introduction
    Design Challenges
    Top Level Design
    Low Level Design Decisions
    Power Processing Element
    Synergistic Processing Elements
    Floating Point Number Representation
    Element Interconnect Bus
    Memory Interface
    Previous Work on Cell Processor
Chapter 5: High Performance Programming on the Cell Processor
    Chapter Introduction
    Support and Development Tools
    CBE Embedded SPE Object Format
    Levels of Programming
    Programming the PPE
Chapter 6: Implementation of the Multi-Layer Perceptron on the Cell Processor
    Chapter Introduction
    High Level Implementation Overview
    Detailed Implementation
Chapter 7: Implementation of Support Vector Machines on the Cell Processor
    Chapter Introduction
    Parallel Gradient Projection-based Decomposition Technique
    Cascade SVM
Chapter 8: Test Methodology and Results
    Chapter Introduction
    Multi Layer Perceptron
    Gradient Projection-Based Decomposition Technique
    Other Results
Chapter 9: Conclusion and Future Work
    Chapter Introduction
    Multilayer Perceptron
    Gradient Projection-based Decomposition Technique
    Cascade SVM
    Overall Conclusion

Chapter 1: Introduction

1.1 The Problem Domain

The strong and steady improvement in data storage technology has led to an abundance of data being stored by virtually every existing organization. E-commerce companies store transaction details on a per-customer basis. Amazon, for example, generates millions of transactions per day. National Security organizations may collect information about suspected and unsuspected individuals such as travelers or internet users. Banks collect information about their customers which includes spending habits, credit debt, and transaction histories. Google, which specializes in data storage, collects information about every search term and every link clicked by every user. Most companies log some portion of incoming and outgoing packet data between their internal intranet(s) and the internet. There is a lot of valuable information within this data; the problem lies in extracting it. It should be possible, for example, for a bank to predict a new customer's likelihood to go bankrupt by studying all previous similar customers in terms of specified attributes. An advanced intrusion detection system may use a sequence of incoming packets along with timing information to detect novel incoming network attacks. A large e-commerce company may use the data as an aid in choosing which advertisements to show based on the currently logged-in user (or browser cookie received). The problem is how to search this massive data and create a knowledge model so that future examples from the same domain are classified into some predefined set of classes.

1.2 Classifiers

This is one problem in which classification algorithms, or classifiers, excel. These mathematical tools work by taking as input a set of features of an object, situation, or a piece of data and producing as an output a discrete value which denotes that input's class. In order to do so, an associated training algorithm is used to generate an internal classifier model, or knowledge, using previous input/output pairings from the same source (problem) domain. This process of building an internal prediction model is known as Data Mining or Machine Learning. Several classifiers have been invented over the course of the last half century, none of which are perfect, and none of them can be said to outperform the rest in all possible circumstances. Also, there is no optimal method for deciding which algorithm should be used for a particular problem or application. At best, heuristics are used, but even then a set of algorithm-specific training parameters must be chosen, tested for performance, adjusted, retested, and so on until satisfactory classification performance is achieved. Each training run may take a very long time to complete and often requires access to specialized hardware in order to complete within an acceptable time period. It is never possible to conclude that the chosen parameters have produced an optimal classifier, however.

The multiple algorithms and variations available, the parameter adjustability, and most importantly the computational intensity of training make it difficult to explore all possibilities within a reasonable amount of time. The most popular classification algorithms are the Neural Network (of which the Multilayer Perceptron is the most utilized), Support Vector Machines, k-Nearest Neighbours, Gaussian, Naïve Bayes, Decision Tree, and RBF classifiers. This work is concerned with the first two (Multilayer Perceptrons and Support Vector Machines).

1.3 The Cell Broadband Engine

The Cell Broadband Engine Architecture (Cell for short) was released in 2005 by Sony, Toshiba, and IBM. The processor consists of one Power Processor Element and nine Synergistic Processor Elements (six of which are accessible on the Playstation 3 system used in this work) on one chip, tied together with a high speed ring bus. The architecture features various novelties that suggest exceptional performance. However, in order to achieve this performance, programs for the Cell processor need to be explicitly written to take advantage of the hardware. Issues of memory latencies, reformatting of data, and workload distribution, all the way down to instruction scheduling, need to be carefully considered. The Playstation 3 is arguably the best performance-per-dollar system available, due to marketing and business reasons. The goal of this work is to implement and explore the performance of two novel Support Vector Machine implementations, the Parallel Gradient Projection-Based Decomposition Technique (PGPDT) and the Cascade SVM, along with the standard Multilayer Perceptron and a recent Convolutional Layer Neural Network architecture, on the Cell Broadband Engine Architecture.

1.4 Organization

The document was written with the assumption that it will be read in order and can be logically divided into four main sections. Chapters 2 and 3 make up the first section and introduce the two classifiers, placing detail on the variations that are relevant to this work. It is recommended that even those readers familiar with MLPs and SVMs read the sections describing the relatively new MLP convolution layers, and the sections about the novel GPDT working set method and Cascade SVM solvers. The next section, consisting of Chapters 4 and 5, changes course and is concerned with the Cell Broadband Engine and the programming strategies for writing high performance applications on that architecture. The programming strategies were included in a separate chapter due to the significance that they played in this work. Chapters 6 and 7 make up the third and most technical section. Here, the implementations of the algorithms described in Chapters 2 and 3 are discussed in detail. Graphics were used as much as possible to convey the concepts, especially where the pushing, storage, or processing of data is concerned.

Finally, Chapter 8 closes the document with concrete results that include detailed analysis and educated explanations. As a conclusion, arguments are put forth for the applicability and suitability of the Cell processor in this field, as well as hindsight about what could have been done differently.

Chapter 2: Multi Layer Perceptron

2.1 Chapter Introduction

In this chapter, the Multilayer Perceptron, the most popular member of the Artificial Neural Network (ANN) family, is introduced. The chapter begins by introducing the basic building block of any ANN: the single neuron model. This less technical section is broken into the most significant developments and improvements leading up to this day. Details covered are kept within the scope of this work. The following sections increase in technicality and expand on the concepts introduced in the first section. First, the Single-Layer Perceptron is dissected along with its learning methodology. Next, the Multi-Layer Perceptron is examined by expanding on the Single-Layer version. Next, the Backpropagation method, the heart of the training power of the MLP, is explained both mathematically and intuitively. Here, the pitfalls and shortcomings of the algorithm, as well as the many variations and parameters that attempt to reduce them, are further elaborated. Typical implementations of the algorithm on modern computer architectures, the feasibility of parallelization, and some recent work are discussed in the next section. Finally, the convolution layer is described. This layer type was designed to outperform a standard fully-connected layer when 2-dimensional images are the source of the training data.

2.2 Background

The Multi Layer Perceptron is a supervised learning algorithm; that is, it trains from a previously manually classified training set. Every instance of the training set is an input-output pair. This is in contrast to unsupervised training algorithms, which train on data that has not been previously classified and require that the algorithm generate its own data segregation rules (i.e., clustering algorithms). The MLP is built from multiple layers and implements a feedforward network architecture; that is, the network has one input layer and one output layer with no loops such as those appearing in recurrent networks. Each layer contains some number of neurons.

Two types of layers were implemented in this work: the fully-connected layer and the convolution layer. In the fully-connected layer, each neuron on a layer has a separate dedicated weighted connection to each neuron on the previous layer, provided that it is not the input layer. A convolution layer consists of a number of two-dimensional feature maps, each paired with a kernel that is used to generate the values of the feature map by performing a kernel pass over all of the previous layer's feature maps. While in the more general case it is possible to select which feature maps to process for each feature map on the given layer, the implementation in this work processes all input feature maps for each one. The learning algorithm used in the MLP is known as Backpropagation, which is a gradient descent method having the goal of minimizing an error function at the output layer generated by feeding the network with elements from the training set. The optimization variables are the weights in the fully-connected layers and the kernel elements in the convolution layers.

2.3 The Neuron Model

The artificial neuron is the basic building block of the MLP. The theory behind the MLP can be better understood by first obtaining an intuition into the theory behind the single neuron. The first neuron model was proposed by McCulloch and Pitts in 1943 [2]. This biologically inspired computational model (from hereon referred to as the M-P model) was very basic, capable of functioning only with binary inputs and outputs. Over the years, the original M-P model has received various modifications from the fields of statistics and probability theory which have allowed it to be applied to a broader range of problems. The modern revision of the neuron unit is shown in Fig. 2-1.

Figure 2-1: The Artificial Neuron Model

The neuron in the figure is the i-th neuron on layer n > 0 (n = 0 represents the input layer). The weighted arrows represent the connections between this neuron and each of the neurons on the previous layer (assuming a fully-connected architecture). An additional arrow is used to represent the bias $b_i$, the purpose of which will be discussed later.

There are two steps taken to calculate the scalar output $x_i$. The first step involves calculating the activation value $z$ as a function of the input vector and weight vector. Most commonly, $z$ is calculated by taking a weighted sum of all the inputs (a dot product of the weight and input vectors). The result is input to an activation function $F(\cdot)$ to produce the final output $x_i$. For the neuron in the figure, the two steps are summarized by:

(2.1)  $x_i^n = F\left( \sum_{l=0}^{L_{n-1}-1} w_{l,i}^n \, x_l^{n-1} + b_i^n \right)$

in which $L_{n-1}$ is the number of neurons in the previous layer. The M-P model used only +1 or -1 for the weights (known as excitatory and inhibitory inputs respectively), and used a binary threshold function for an activation. The modern updated model can be thought of as a generalization of the M-P model with the ability to work with real numbers. The binary threshold function has been generalized into the set of scalar-input, scalar-output functions. Typical choices of functions are nonlinear, continuous, and differentiable, with an output ranging between -1 and +1. These function characteristics make it possible to apply the Backpropagation algorithm, as will be shown in later sections of this chapter.

2.4 The Linear Separator

The neuron model described is nothing more than a biologically-inspired graphical representation of a simple, but powerful, mathematical function. This function, when set equal to 0, represents a hyperplane in an n-dimensional space and defines what is called a linear separator. The dimension in the case of the artificial neuron is the number of inputs plus one. For example, the equation for a neuron with two inputs is shown in Eq. 2.2:

(2.2)  $w_1 x_1 + w_2 x_2 + b = 0$

Given known weight and bias values, the graph of this function becomes a line (a two-dimensional hyperplane) with normal $\mathbf{w}$ and offset $-b/\|\mathbf{w}\|$, as shown in Fig. 2-2.

Figure 2-2: Linear separator of a two-input neuron

Since multiplying both sides of the equation by a nonzero value does not modify the orientation of the line, both sides are divided by the magnitude of the vector $\mathbf{w}$, normalizing the weight vector to obtain the same graph, but with updated labels. Fig. 2-3 shows the updated graph along with a new vector (dashed arrow) that is to be classified.

Figure 2-3: Linear separator after normalizing the weight vector. The dotted line is an input vector that is to be classified.

The power of the linear separator lies in its ability to classify any given input $\mathbf{x}$ by determining on which side of the hyperplane it sits. In the simple two-input example (Fig. 2-4) it can be shown that, given any 2-dimensional input vector, the output of the expression $\hat{\mathbf{w}} \cdot \mathbf{x} - b$ is the perpendicular distance of the point from the line of the linear separator.

Figure 2-4: Classification of an input vector x.

The dot product between the input and weight vectors gives the magnitude of the projection of the input vector onto the weight normal. Subtracting $b$ (adding $-b$) results in a positive value if the vector projects past the line, and vice versa. By encapsulating the function in a sign function, any input can be classified as belonging to either the positive or the negative class. This two-dimensional example generalizes to any number of dimensions, making it a powerful tool in various machine learning algorithms. After defining the outputs of the examples in the training set to be either +1 or -1, a value known as the margin can be calculated by taking the product of the signed distance of the input vector, as described in the previous paragraph, and the known output for that vector. The result is positive if the input is correctly classified, negative if not.

The usefulness of the margin value will become clear in the following section, in which learning methods are introduced.

2.5 Learning Algorithms

Single Layer Perceptron

The early artificial neuron was very capable but required that the weights be set manually. Many simple systems, such as the digital AND, OR, and NOT gates, were easily implemented. However, larger systems required quite a bit more work. In 1957, Frank Rosenblatt, in his published book, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms [3], introduced the Perceptron: a model based on the artificial neuron and including a learning algorithm which was the first big step in supervised training methods. The algorithm for training a single neuron, in a slightly altered form, is shown in Listing 1.

    Pick initial weight vector (including b), e.g.: [0 0]
    Repeat until all points are correctly classified
        Repeat for each input i
            Calculate margin y_i <w, x_i> for point i
            If margin > 0, point is correctly classified
            Else change weights to increase margin; change in weight proportional to y_i x_i
        Loop
    Loop

Listing 1: Frank Rosenblatt's Perceptron Learning Algorithm

This is an error-driven algorithm which modifies the weights in an effort to minimize misclassifications. The algorithm was proven to converge to a valid solution after a limited number of iterations under the condition that the data are linearly separable (there exists a hyperplane that can fully divide the positive and negative sets). In the case where the data are not linearly separable, the algorithm never terminates. The choice of $y_i \mathbf{x}_i$ as the weight increment is attributed to the gradient ascent function optimization strategy. The gradient ascent method uses the derivative (gradient) of the function surface with respect to the optimization variable as a means to maximize the function output. Once the gradient of the function is calculated, a step is taken in that direction. The size of the step depends on the gradient slope. In this case the adjustment is made over many input-output pairs, and thus a small fraction of the gradient is actually taken repeatedly over multiple iterations. The gradient is recalculated at every iteration and the algorithm terminates once it becomes small enough (a relatively flat surface is reached).
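To make Listing 1 concrete, the following is a minimal sketch in Python/NumPy (not the Cell implementation developed later in this work); the toy data set, the fixed learning rate, and the iteration cap are illustrative assumptions:

    import numpy as np

    def train_perceptron(X, y, lr=1.0, max_epochs=100):
        # X: (N, d) inputs, y: (N,) labels in {-1, +1}; bias is folded in as an extra +1 input.
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        w = np.zeros(Xb.shape[1])
        for _ in range(max_epochs):
            errors = 0
            for xi, yi in zip(Xb, y):
                margin = yi * np.dot(w, xi)        # margin y_i <w, x_i>
                if margin <= 0:                    # misclassified (or on the boundary)
                    w += lr * yi * xi              # step proportional to y_i x_i
                    errors += 1
            if errors == 0:                        # all points correctly classified
                break
        return w

    # Example on a linearly separable toy set (AND-like labels in {-1, +1})
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([-1, -1, -1, 1])
    w = train_perceptron(X, y)
    print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))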

Recalling the significance of the margin, negative values represent those inputs which were misclassified. A large magnitude implies a big adjustment. It therefore makes sense to define the optimization problem as maximizing the margins of the misclassified points. Optimally, the sum of the margins becomes zero, which implies perfect classification of the training set. The function to be optimized is therefore:

(2.3)  $f(\mathbf{w}) = \sum_{i \,\in\, \text{misclassified}} y_i \langle \mathbf{w}, \mathbf{x}_i \rangle$

The gradient of this function with respect to the weights happens to be:

(2.4)  $\nabla_{\mathbf{w}} f(\mathbf{w}) = \sum_{i \,\in\, \text{misclassified}} y_i \mathbf{x}_i$

Rosenblatt's research coined the term Single Layer Perceptron, in which there are one or more output neurons on the output layer and multiple inputs on the input layer. Other networks that Rosenblatt experimented with were the cross-coupled Perceptron, in which connections joined units of the same layer, and multilayer back-coupled Perceptrons, which had feedback paths from units located near the output. While he did propose a back-propagating error method, no one could come up with a way for it to converge. The discoveries made by Rosenblatt naturally brought new interest into the area of Neural Networks. Unfortunately, this interest dwindled in 1969 when Minsky and Papert published their book Perceptrons [4], in which they revealed their discoveries about the limitations of the applications of the Perceptron model. One elegant example used in their arguments was the basic XOR problem. This two-input, one-output system is linearly inseparable and cannot be learned using the Perceptron model. It is shown in Fig. 2-5.

Figure 2-5: Input vectors for the XOR problem

There is no possibility of drawing a straight line so that the four inputs, two of each class, are separated. Applying the Perceptron training algorithm would result in an infinite loop. This simple example, as well as their other discoveries, had huge implications for the applicability of the Perceptron to everyday problems. They did propose that adding another hidden layer between the input and output layers would theoretically allow for training of linearly inseparable problems, but they had no proposed method for training the weights on this layer. An example of a manually configured network capable of dividing the XOR problem space is shown in Fig. 2-6.

The addition of the hidden layer is equivalent to creating linear separators of linear separators. In other words, the outputs of two decision boundaries (at the hidden layer) are used as input to the final linear separator.

Figure 2-6: A multilayer solution to the XOR problem

The resulting outputs for each of the possible inputs are listed in Fig. 2-7 (columns x1, x2, o1, o2, y).

Figure 2-7: Outputs of each of the linear separators for each input combination

The hidden layer and its weights define two linear separators (Fig. 2-8a). The outputs of these separators are processed by the linear separator of the output neuron (Fig. 2-8b).

Figure 2-8: (a) Linear separators on the hidden layer. (b) Linear separator at the output neuron.

While their introduction of a hidden layer was an important step, Minsky and Papert's other, rather pessimistic, discoveries were enough to cause a widespread decline of further research into the subject.

The Multilayer Perceptron

In 1974, Paul Werbos presented a method for weight adjustment in the hidden layers in a dissertation at Harvard University [5]. Unfortunately, his work went largely unnoticed. It wasn't until 1986 that Rumelhart, Hinton, and Williams published Learning Internal Representations by Error Propagation [6], in which they proposed a modified multilayer neural network along with a training algorithm for adjusting weights connected into the hidden layers. Their work was independent of that of Werbos, but they are often credited with the discovery. The biggest addition in Rumelhart's, Hinton's, and Williams' model is the incorporation of a differentiable activation (or transfer) function. This simple modification allowed them to use an existing mathematical tool called the generalized delta rule (also known as the heavy ball method in the literature [7] [8] [9]), which is related to gradient ascent. The technique was termed Backpropagation in the context of the neural network and remains the most popular method for weight adaptation in the hidden layers. It functions similarly to Rosenblatt's method, but it minimizes the misclassification error rather than maximizing the margin of misclassified training pairs. The ability to adjust weights in the hidden layers opened up new possibilities such as the training of linearly inseparable data. With the birth of the Multi Layer Perceptron, interest in the field of neural networks rekindled. The key to the Backpropagation algorithm is the differentiability of the activation function, which implies smooth changes to the output values due to small weight changes in the network. The optimization function at the output is defined to be the training error over all training vectors:

(2.5)  $E = \frac{1}{2} \sum_{i=0}^{T} \left( y(\mathbf{x}_i, \mathbf{w}) - d_i \right)^2$

in which $y(\mathbf{x}_i, \mathbf{w})$ is the network output given the i-th input vector, $d_i$ is the expected output, and $T$ is the number of training vectors. It is clear that since $y(\mathbf{x}_i, \mathbf{w})$ is smooth and $d_i$ is constant, the error function is smooth as well. To utilize gradient descent, the gradient of this function is necessary (Eq. 2.6):

(2.6)  $\nabla_{\mathbf{w}} E = \sum_{i=0}^{T} \left[ \left( y(\mathbf{x}_i, \mathbf{w}) - d_i \right) \nabla_{\mathbf{w}} y(\mathbf{x}_i, \mathbf{w}) \right]$

The gradient of the error is the sum, over all training elements, of the differences of the actual and expected outputs multiplied by the gradient of the output with respect to the weights. An expression for the derivative of the output with respect to the weights needs to be found. Recall that an output $y$ is the weighted sum of the inputs processed by the activation function.

The gradient of the output with respect to the weights is obtained by applying the chain rule of differentiation, as shown in Eq. 2.7 and Fig. 2-9:

(2.7)  $\frac{\partial y}{\partial w_i} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial w_i} = \frac{\partial F(z)}{\partial z} \frac{\partial z}{\partial w_i} = \frac{\partial F(z)}{\partial z} \, x_i$

Figure 2-9: Calculating the output gradient with respect to the weights.

The derivative is broken down into a product of the derivative of the output with respect to the activation value (the slope of the activation function at the current value $z$) and the derivative of the activation value with respect to the current weights (for a Single Layer Perceptron, this value is simply the current input $x_i$). By the rule of gradient descent, for every neuron at the output layer, the weight vector should be updated by an element-by-element product of the output error vector and the activation function gradient at that activation value:

(2.8)  $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} E, \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta \left( y - d \right) \frac{\partial F(z)}{\partial z} \, \mathbf{x}$

The parameter $\eta$ is a small value known as the learning rate. If a hidden layer is introduced, which includes its own set of weights, a method is required for obtaining the gradient of the output $y$ with respect to one of the hidden layer weights. The same procedure is used, but since the input into the output layer is an output from the hidden layer, a recursive loop is observed, as shown in Eq. 2.9 and the accompanying Fig. 2-10.

(2.9)  $\frac{\partial y}{\partial w_j} = \frac{\partial y}{\partial z} \frac{\partial z}{\partial w_j} = \frac{\partial F(z)}{\partial z} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial w_j} = \frac{\partial F(z)}{\partial z} \, w_i \, \frac{\partial y_i}{\partial w_j}$

Figure 2-10: Chain rule for the output gradient with respect to the weights.

In more general networks, multiple weights near the input may affect the signal coming into a neuron on a later layer. In such cases, the gradient of the signal can be obtained by summing the effects over all possible paths. The intuition is built on the observation that the output gradient is a function of the signal on each of the inputs into the output neuron. Changing a weight on a hidden layer of some hidden neuron produces a gradient on the signal between the hidden neuron and the output neuron. Assuming that there is only one hidden layer, as in most networks, the gradient on any of the signals is calculated using the base case above. In order to implement Backpropagation, a new quantity $\delta$ is introduced. At the output layer, the delta for a single neuron is:

(2.10)  $\delta_i = \frac{\partial E}{\partial z_i} = \left( y_i - d_i \right) \frac{\partial F(z_i)}{\partial z_i}$

To obtain the deltas for the j-th neuron in the hidden layer $l$, the deltas from the $(l+1)$-th layer are propagated backward and multiplied by the derivative of the activation function acting on the input of the neuron:

(2.11)  $\delta_j^l = \frac{\partial F(z_j^l)}{\partial z_j^l} \sum_{k=0}^{L_{l+1}} \delta_k^{l+1} \, w_{jk}^{l+1}$

By substituting Eq. 2.11 into the general case of Eq. 2.8, the weight connecting neuron i in layer $l$ to neuron j in layer $l+1$ is updated by:

(2.12)  $w_{ij} \leftarrow w_{ij} - \eta \, \delta_j \, y_i$
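As an illustration of Eqs. 2.10 through 2.12, the following is a minimal incremental-update sketch for a network with one hidden layer, written in Python/NumPy with a tanh activation; the weight shapes, learning rate, and bias handling are illustrative assumptions, not the Cell implementation of Chapter 6:

    import numpy as np

    def dtanh_from_output(f):
        return 1.0 - f * f                      # F'(z) expressed through F(z), as in Table 1

    def backprop_step(x, d, W1, W2, eta=0.1):
        # One incremental update for a 1-hidden-layer MLP.
        # x: input with a constant bias input appended; d: target vector.
        # W1: (hidden, in+1), W2: (out, hidden+1).
        h = np.tanh(W1 @ x)                     # hidden outputs (forward pass)
        hb = np.append(h, 1.0)                  # append bias input for the next layer
        y = np.tanh(W2 @ hb)                    # network outputs
        delta_out = (y - d) * dtanh_from_output(y)             # Eq. 2.10
        delta_hid = dtanh_from_output(h) * (W2[:, :-1].T @ delta_out)   # Eq. 2.11
        W2 -= eta * np.outer(delta_out, hb)     # Eq. 2.12 for the output layer
        W1 -= eta * np.outer(delta_hid, x)      # Eq. 2.12 for the hidden layer
        return y

Batch (epoch) learning would accumulate the two outer products over the whole training set before applying them.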

The name Backpropagation comes from the way in which the $\delta$ values are propagated backwards through the network. Reusing the deltas as they are propagated back is the grounds for its efficiency. Listing 2 summarizes the Backpropagation algorithm.

    Initialize weights to small random values
    Repeat until error over entire training set is small enough
        Repeat for all training input-output pairs
            Propagate forward by computing the outputs of each layer in succession
            Compute δ at the output layer
            Propagate the δ values backward through the layers, updating the weights on each layer
        Loop
    Loop

Listing 2: Backpropagation Summary

2.6 Training the Multi Layer Perceptron

This section summarizes some of the issues that plague MLP training, as well as some of the parameters and modifications to the training algorithm that the implementer has control over. One of the major drawbacks of training the MLP is the lack of automated training parameter discovery. The implementer is forced to experiment with such things as the number of layers, layer sizes, learning rates, methods of learning, convergence conditions, length of training, etc. The decisions made are mostly based on heuristics and experience. Options exist that may simplify this process, but they usually involve tradeoffs and never completely eradicate the problem. Several issues due to improper setup, as well as those inherent in the MLP, are presented, along with methods that aim to reduce them. As will become evident, many design choices are influenced by what is desired and how much time and resources are available.

Combination Functions

The combination function defines the transformation of the two vectors (w, x) to the scalar (z) in the first neuron processing step. The MLP, in nearly all cases, uses the dot product. However, other networks exist, the most popular being the Radial Basis Function network in which, as the name implies, the RBF function is used instead [10].

Activation Functions

As discussed in the previous section, the activation function is utilized in the second step of the neuron's forward process, and its derivative is utilized during Backpropagation. The activation function in the hidden layer needs to be nonlinear for the network to be capable of learning from datasets that may not be linearly separable. Activation functions that produce values centered around zero (i.e., [-1.0, 1.0]) are usually preferred, as they generally converge quicker due to improved numerical conditioning [11]. The two most popular activation functions are the sigmoid and the hyperbolic tangent. Table 1 shows these two functions as well as an approximated version (used in this work) [12].

    Sigmoid:                               $F(z) = \frac{1}{1 + e^{-z}}$,   $F'(z) = F(z)\left(1 - F(z)\right)$
    Hyperbolic Tangent:                    $F(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$,   $F'(z) = 1 - F(z)^2$
    Hyperbolic Tangent (LeCun's approx.):  $F(z) = 1.7159 \tanh\!\left(\frac{2}{3} z\right)$,   $F'(z) = 1.7159 \cdot \frac{2}{3} \left(1 - \tanh^2\!\left(\frac{2}{3} z\right)\right)$

Table 1: Activation Functions

The popularity of these activation functions is attributed to the ease of obtaining the derivative using values already in memory. The graph of each is shown in Fig. 2-11.

Figure 2-11: Activation function curves

Bias Term

As described in the previous section, each neuron in the MLP model contains a bias term that shifts the linear separator's hyperplane along the weight normal. In many implementations of neural networks, the bias is represented simply as another weighted input with the constant value of +1 or -1 (neither has an advantage when training). The corresponding weight is trained just like any other on that neuron. The bias term is optional, although the theory of universal approximation does not hold without it [13]. One way in which the bias may be avoided is by forcing the condition that the outputs on any layer sum to a nonzero constant [14]. This is most easily done by preprocessing the input data so that each vector has elements which sum to the same constant value. The neuron processing equation, with the bias term treated as a weighted input, is

(2.13)  $x_i^n = F\left( \sum_{l=0}^{L_{n-1}} w_{l,i}^n \, x_l^{n-1} \right)$

in which the sum now runs over the original inputs plus one additional constant input, whose weight is the weight associated with the bias term. This simplified equation will make the implementation much simpler, as will be seen in Chapter 6.
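The ease of reusing already-computed values for the derivatives, noted above, is visible in a small Python/NumPy sketch of the Table 1 functions; the constants in LeCun's approximation follow the commonly published form and are an assumption here:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime_from_output(f):
        return f * (1.0 - f)                    # F'(z) = F(z) (1 - F(z))

    def tanh_prime_from_output(f):
        return 1.0 - f * f                      # F'(z) = 1 - F(z)^2

    def lecun_tanh(z):
        return 1.7159 * np.tanh(2.0 * z / 3.0)

    def lecun_tanh_prime(z):
        t = np.tanh(2.0 * z / 3.0)
        return 1.7159 * (2.0 / 3.0) * (1.0 - t * t)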

Initial Weights

The choice of initial weight values is quite important when initializing the network, since it decides the initial position in the search space. Values should be small enough to prevent the neuron outputs from saturating. Recall that the slope of the activation function becomes flat at the input extremes. Because weight updates are proportional to this slope, learning may become extremely sluggish. A good choice of weights, therefore, is one that produces midrange function signals. A good practice is to select zero-centered random weights within a small range.

Local Minima

One of the major issues that plague MLP training is the possibility of the gradient descent method becoming trapped in a local minimum. This is a direct result of the fact that the error surface being searched is non-convex. The strategies described in the following sections may help in this matter.

Learning Rate

An important parameter that needs to be selected if performing incremental learning (in which weights are updated after every training pair) is the learning rate $\eta$. The purpose of this parameter is to subdue the effect of the gradient descent on the values of the weights. Every application of the gradient descent method is valid only locally for the input-output pair currently applied. Too big a change would likely negatively affect the accuracy on the remaining set. Values are typically in the range (0, 1]. The learning rate value should be selected based on, among other things, the number and size of the layers and the number of training vectors. Too high a value can cause chaotic oscillations around the solution. Values too low may cause a very long training time. One option is to implement a dynamic learning rate that uses information about the convergence progress to adapt the learning rate. There is a good amount of research on the topic. However, a simple and effective method is to decrease the learning rate periodically over time. This practice also guarantees convergence.

Momentum Term

By using nothing but the current error gradient information in choosing the direction and step size on the error surface, the weight vector can become very erratic and constantly change direction at the application of each new training sample. Having no mass, the search has no memory of the tendency of the search direction and can be thought of as taking the first opportunity of a minimum that it gets. In order to introduce memory, the weight step at each sample is stored for use by the next iteration. The momentum term specifies how much influence the previous step has on the current iteration's step vector. Having a momentum has the effect of averaging out the gradients through time, helping the search overcome any minima encountered along the general downward slope of the gradient.

The updated weight update step, utilizing the delta of the previous weight update, is:

(2.14)  $w^{t+1} = w^{t} - \eta \, \nabla_{w} E\big|_{w^{t}} + \alpha \left( w^{t} - w^{t-1} \right)$

Note the addition of the momentum term $\alpha$, which needs to be selected. The effect of the change is that sequential weight changes in the same direction accelerate, while those of opposite signs cause the changes to decelerate, as expected.

Multiple Training Sessions

Another simple, brute-force method to improve the chance of finding the global minimum is to train on the same data, modifying only the initial weight values. Of course, this method can significantly increase the learning time and has no guarantees.

Incremental vs. Batch Learning

Weight updates may be performed after the calculation of the error of each individual input-output sample or at the end of a full epoch. The former case is called incremental learning (also instantaneous or pattern learning). In this mode, the delta is propagated back as described in the previous section, and the weights are updated following the backpropagation of each training vector's output error. It is important that the deltas are pushed onto the neurons of the previous layer before modifying the weights themselves on any particular layer, so that the error blame is placed on the neurons for which the connections were strongest during the forward propagation. When performing incremental training, a recommended strategy is to present the network with the training samples in random order within the epoch. This prevents the network from giving the early samples in the set a greater priority and may reduce the risk of getting stuck in a local minimum. In batch learning mode (also epoch learning), the samples from the entire training set are propagated forward and back through the network, accumulating all the weight adjustments for each layer. Training samples do not need to be applied randomly, as this would make no difference in the final values. The asynchronous nature between iterations of one epoch makes batch learning a better candidate for parallel implementations. The greatest benefit of batch learning, however, is the possibility of applying the adaptive learning methods that are described next.

Adaptive Learning Techniques

By incorporating batch learning, several newer adaptive weight adjustment rules can be applied that have been shown to provide quicker convergence. While theoretically not as accurate as incremental learning (if good incremental learning parameters are chosen), they can be orders of magnitude faster at reaching convergence. Two techniques that receive much praise are the Resilient Backpropagation (RPROP) [15] and Quickprop (Qprop) [16] algorithms.
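Before turning to those adaptive rules, the plain momentum update of Eq. 2.14, together with the simple periodic learning-rate decrease mentioned under Learning Rate, can be sketched as follows (Python/NumPy; the decay schedule and constants are illustrative assumptions, and grad_E is a placeholder for the error-gradient routine):

    import numpy as np

    def momentum_update(w, prev_step, grad, eta=0.1, alpha=0.9):
        # Eq. 2.14: gradient step plus a fraction alpha of the previous step.
        step = -eta * grad + alpha * prev_step
        return w + step, step

    def decay_learning_rate(eta0, epoch, period=10, factor=0.5):
        # Periodically decrease the learning rate over time (illustrative schedule).
        return eta0 * (factor ** (epoch // period))

    # usage inside a training loop (sketch):
    # w, prev = momentum_update(w, prev, grad_E(w), eta=decay_learning_rate(0.1, epoch))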

Standard BP uses the derivative of the activation function as a part of the weight update calculation. Due to the shallow slope at high-magnitude input values, the change in weight may be very small even for large error values. RPROP, on the other hand, uses only the sign of the $\partial E / \partial w$ value, along with an externally defined step value, to update the weights. The weight change is calculated as shown in Eq. 2.15:

(2.15)  $\Delta w_{ij}^{t} = \begin{cases} -\Delta_{ij}^{t}, & \text{if } \frac{\partial E}{\partial w_{ij}} > 0 \\ +\Delta_{ij}^{t}, & \text{if } \frac{\partial E}{\partial w_{ij}} < 0 \\ 0, & \text{else} \end{cases}$

The adaptive feature of the RPROP algorithm comes from the self-adjustment of the weight delta values, as follows:

(2.16)  $\Delta_{ij}^{t} = \begin{cases} \eta^{+} \, \Delta_{ij}^{t-1}, & \text{if } \frac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \frac{\partial E}{\partial w_{ij}}^{(t)} > 0 \\ \eta^{-} \, \Delta_{ij}^{t-1}, & \text{if } \frac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \frac{\partial E}{\partial w_{ij}}^{(t)} < 0 \\ \Delta_{ij}^{t-1}, & \text{else} \end{cases}$

in which $0 < \eta^{-} < 1 < \eta^{+}$. In words, if the slope of the error curve with respect to some weight reversed sign, this signifies an overshoot, and the multiplier for that weight is decreased. In this situation, the weight is not modified at this epoch. If the slope direction remained the same, the convergence can be accelerated by increasing the multiplier for that weight. Some parameters that can be modified are the values of $\eta^{+}$ and $\eta^{-}$, and the maximum and minimum of the deltas.
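A compact sketch of the RPROP rule of Eqs. 2.15 and 2.16 (Python/NumPy); the step-size bounds and the zeroing of the stored gradient after a sign change follow a commonly published variant and are assumptions here:

    import numpy as np

    def rprop_update(w, grad, prev_grad, delta, eta_plus=1.2, eta_minus=0.5,
                     delta_min=1e-6, delta_max=50.0):
        # One batch (epoch) update of all weights using only the gradient signs.
        sign_change = grad * prev_grad
        # Eq. 2.16: grow the per-weight step if the sign persisted, shrink it if it flipped.
        delta = np.where(sign_change > 0, np.minimum(delta * eta_plus, delta_max), delta)
        delta = np.where(sign_change < 0, np.maximum(delta * eta_minus, delta_min), delta)
        # Eq. 2.15: step against the sign of the gradient; hold weights where the sign flipped.
        step = np.where(sign_change < 0, 0.0, -np.sign(grad) * delta)
        # A weight whose gradient sign flipped is not modified this epoch; its stored
        # gradient is zeroed so the next epoch starts fresh (common variant, an assumption).
        next_grad = np.where(sign_change < 0, 0.0, grad)
        return w + step, delta, next_grad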

The Quickprop (or Qprop) algorithm is a secant method which uses the property that the gradient of the error surface with respect to the weights is zero at all local minima. One way to obtain the optimal weight values is to solve the following system of equations:

(2.17)  $\frac{\partial E}{\partial w_0}(w_0, w_1, \ldots, w_n) = 0, \quad \frac{\partial E}{\partial w_1}(w_0, w_1, \ldots, w_n) = 0, \quad \ldots, \quad \frac{\partial E}{\partial w_n}(w_0, w_1, \ldots, w_n) = 0$

in which $\frac{\partial E}{\partial w_i}$ is the i-th coordinate of the error gradient. The weights are updated as follows:

(2.18)  $w_i^{t+1} = w_i^{t} - \eta \, \frac{\frac{\partial E}{\partial w_i}(w^{t})}{\frac{\partial E}{\partial w_i}(w^{t}) - \frac{\partial E}{\partial w_i}(w^{t-1})} \left( w_i^{t} - w_i^{t-1} \right)$

Generalization and Overfitting

Definition

The typical purpose of a neural network is to be accurate not only on the training set, but on any new, novel input from the same population. A network capable of classifying many novel inputs correctly has a good generalizing characteristic. Training a network so that it retains high generalizing capability is not easy. There is much research in this area that is beyond the scope of this thesis. The following are just several recommendations that should be followed when choosing the training data [14]:

1. The choice of attributes in the input data must make sense with that of the output class. In other words, there needs to be some correlation, even if theoretical, between the cause and effect.
2. For the case of continuous functions, the function that is being learned should be smooth. While this is not a necessity, it is beneficial for the Backpropagation algorithm, which relies on the gradient of the function as a choice for weight updates. In some cases, a preprocessing of the input data, such as by using Principal Component Analysis, can help.
3. The training set should be sufficiently large and be a good representation of the data that the neural network is expected to come across. A good representation is defined as input samples that cover a large part of the input space and are evenly separated.

Following these rules requires some prior knowledge of the function being approximated. Generalization can be lost if the training algorithm is run for too long on the training data. In this phenomenon, known as overfitting, the surface of the hypothesis function, in an effort to touch every point in the training set, may become jagged, having many irregularities between the training points. This in turn leads to the loss of interpolating ability and lack of generalization. It is therefore necessary to prematurely stop the training process before it enters the overfitting stage. This is difficult, if not impossible, to achieve if using only the errors in the training set as the criteria for the stopping condition.

Cross-Validation and Early Stopping

A common and simple method used to reduce overfitting is cross-validation, a technique first introduced by Seymour Geisser [17] and used in many machine learning algorithms. The technique involves segmenting the original training data into training and validation sets. The idea is to train on the training set and monitor the generalization on the validation set. The information gained from the validation set can be used to select model parameters (such as network size), or for determining when to stop training (early stopping).

Various partitioning schemes exist based on the cross-validation technique. In the simplest method, called holdout validation, the initial training data is split into the two aforementioned sets. The training data is used to train the model. After each iteration over the training set, the validation set is used to compute the accuracy. Fig. 2-12 shows the typical behavior of the errors on the training and validation sets as a function of the iteration number. Early stopping often incorporates the holdout validation scheme. The point at which the algorithm should stop is marked by a vertical dotted line. Note that the error on the training set continues to decrease.

Figure 2-12: Training set error and validation set error as a function of training iteration number.

A more advanced partitioning scheme, known as K-fold cross-validation, entails partitioning the original set into k subsets. The same algorithm is run k times, each time leaving one of the k subsets out and using it as the validation set. The results from the k runs are accumulated and processed (often by taking an average). While it is not obvious how to use this partitioning scheme for early stopping, it can be used to evaluate the network structure for generalization. A third partitioning scheme, known as leave-one-out cross-validation, is simply K-fold cross-validation with k set to the number of training samples. In other words, the network is trained k times, reserving only one of the training samples for validation. As expected, this method takes a very long time and is more applicable to smaller problems.

Jitter

This simple but powerful method works on the principle that input vectors that are close together should produce outputs that are close together. The method involves modifying the input vectors by changing their continuous attribute values by very small percentages. This can also be effective when the available training set is quite small, and it works particularly well with images, where the 2-dimensional image is systematically rotated, stretched, etc.
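A minimal sketch of the jitter idea for 2-dimensional image inputs (Python/NumPy; the perturbation magnitude and the use of a one-pixel random shift are illustrative assumptions):

    import numpy as np

    def jitter(image, noise_pct=0.02, max_shift=1, rng=None):
        # Return a slightly perturbed copy of a 2-D image: small multiplicative noise
        # plus a small random shift, so nearby inputs should map to the same class.
        if rng is None:
            rng = np.random.default_rng()
        noisy = image * (1.0 + noise_pct * rng.standard_normal(image.shape))
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        return np.roll(np.roll(noisy, dy, axis=0), dx, axis=1)

    # Each epoch can train on jittered copies instead of (or in addition to) the originals:
    # augmented = [jitter(img) for img in training_images]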

2.6.9 Weight Decay

Large weights in an MLP are known to negatively affect generalization performance. Having large weights at the hidden layer can easily cause the hypothesis function to become rough, with small changes in the input producing near-discontinuities. Large weights leading into the output layer can cause the output to become too large and leave the range of the possible outputs. Weight decay is a method in which the error function is augmented with a weight penalty term. An example of a penalty term is a fraction of the squared sum of the weights (Eq. 2.19). This simple addition forces the network to try to keep the weight values down and prevents these kinds of problems.

(2.19)  $E = \frac{1}{2} \sum_{i=0}^{T} \left( y(\mathbf{x}_i, \mathbf{w}) - d_i \right)^2 + \lambda \sum_{k} w_k^2$

in which $k$ ranges over the set of indices for all the weight parameters and $\lambda$ is the small penalty fraction.

2.7 Network Architecture

The most obvious, and arguably most critical, parameter that the implementer has control over is the number of neurons in the hidden layer, or whether to include a hidden layer at all. Naturally, choosing to forego the hidden layer would speed up the learning process considerably. In contrast, there are rare cases in which there is a need for multiple hidden layers. The size of the input layer is set by the dimension of the input data, as is the output layer by the output data. The hidden layer, however, is under full control. Selecting a proper number of units in the hidden layer(s) takes into consideration the dimension of the inputs and outputs, the size of the training set, the complexity of the function to be learned, the number of layers and network architecture, the type of activation function used, the training algorithm, and other factors. Often, the best action is to simply train multiple networks and see which one works best. As part of an effort to automate the task of choosing the network architecture, new mechanisms which dynamically adapt the number of neurons as well as the connections between them have been developed. These generally fall into constructive and pruning algorithms, in which neurons are added or removed automatically as the learning progresses. These techniques are beyond the scope of this work. A summary of popular constructive algorithms is presented in [18], [19] and [20]. Pruning algorithms are summarized in [21] and [20]. These are only some of the options that the designer is able to tune and experiment with. There is much research being done in many areas, whether for speed or accuracy of training, or for simplifying the implementation steps by making the network more adaptive to the problem at hand. When tweaking a network that is to function on large data sets (or that has a large input space), considering that a single training session can take days or weeks to complete, experimentation can be very time consuming and would benefit largely from an optimized, modular, and expandable toolset running on affordable hardware.

2.8 Convolutional Networks

The standard Multi-Layer Perceptron model consists of two or more layers of neurons. Each layer is usually fully connected to the layer before it, meaning that every neuron on layer l is connected to every neuron on layer l-1. Traditionally, when applying neural networks to 2-D inputs, such as images, so-called feature extractors are placed between the raw image input and the input layer of the network. These feature extractors, or kernels, are hand tuned to extract information such as edges and discard irrelevant variables. The process is called convolution. LeCun et al. [22] [23] have tried to remove this extra step of needing to manually create the kernels. Instead, their design, known as the Convolutional network, uses Backpropagation to automatically train the feature extractors. Convolutional networks (consisting of convolution and subsampling layers) are similar to fully-connected networks (consisting of fully-connected layers), and can in fact be interconnected with them, with convolution layers appearing before fully-connected layers in the network. Convolutional layers borrow the idea of an image feature extractor (or kernel) from image processing and train these kernels automatically (just as fully-connected layers train the weights going into them).

Figure 2-13: Convolution layer network architecture.

Fig. 2-13 shows the concept of convolution layers. In this architecture, each layer has multiple kernels that it trains independently. Each kernel is paired with its own feature map, which is the output produced after the kernel traverses over an input image. The feature maps are analogous to the neuron values in the fully-connected network. The kernels are analogous to the weights, although there is a significantly smaller number of them, given that the same n×n kernel is applied to every possible block of n×n input pixels. By substantially reducing the number of weights, the network is much less prone to overfitting, is quicker to learn, and greatly decreases memory requirements.
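As an illustration of the kernel traversal just described, a naive forward pass for a single feature map might look like the following Python/NumPy sketch; the bias term, the tanh activation, and the stride parameter (which, as discussed below, this work sets to two to fold in subsampling) are assumptions:

    import numpy as np

    def convolve_feature_map(inputs, kernel, bias=0.0, stride=1):
        # Forward pass for one feature map: slide an n x n kernel over every input
        # feature map, sum across input maps, add a bias, and apply tanh.
        # inputs: (num_maps, H, W); kernel: (num_maps, n, n).
        num_maps, H, W = inputs.shape
        n = kernel.shape[-1]
        out_h = (H - n) // stride + 1
        out_w = (W - n) // stride + 1
        out = np.zeros((out_h, out_w))
        for r in range(out_h):
            for c in range(out_w):
                patch = inputs[:, r*stride:r*stride+n, c*stride:c*stride+n]
                out[r, c] = np.sum(patch * kernel) + bias
        return np.tanh(out)

    # A stride of 2 folds the subsampling step into the convolution itself.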

The main advantage of convolutional layers in comparison to standard layers, however, is their inherent invariance to image transformations such as shifts, scales, and rotations. A standard layer, upon learning from a training set of shapes, for example, would end up having similar weights repeated at certain intervals. On the other hand, the convolutional layer kernel, which is made of a relatively small set of weights, traverses the entire image. This shared-weight concept is not new, but it is found to work well for 2-dimensional inputs. An example of LeCun's networks is shown in Fig. 2-14. It is the LeNet-5 [12], which was designed to recognize handwritten characters.

Figure 2-14: LeNet-5 Network

Note that the LeNet-5 architecture consists of convolution layers, which function as described above, and subsampling layers, which take averages of closely spaced pixels to produce a smaller feature map. Subsampling helps reduce the effect of shifts of the input image between training samples. The implementation in this work achieves a similar effect by combining the convolution and subsampling layers into one by moving the n×n kernel two pixels at a time in the x and y directions.

2.9 Related Work

Being a relatively easily parallelizable algorithm, the standard MLP algorithm (consisting of only fully-connected layers) has been parallelized on a vast number of devices, including graphics processing units [24], computer clusters [25], distributed systems [26], FPGAs [27], etc. Convolutional network implementations, however, remain scarce, largely due to the increased complexity involved in Backpropagation.

Chapter 3: Support Vector Machines

3.1 Chapter Introduction

Support Vector Machines (SVMs) are similar to the Multi-Layer Perceptron (MLP) in that they generalize into the family of linear classifiers, or separators, introduced in the previous chapter. The methodology for training, however, is quite different. While MLPs are largely based on heuristics (even the very first neuron model was based on heuristics), SVMs borrow proven theorems and tools from the fields of optimization theory, generalization theory, and statistical learning theory, making them better understood. SVMs, as will be revealed, hold many important advantages compared to other training methods such as the MLP. The chapter discusses early motivation for this learning system and introduces some of the concepts which serve as the foundations for its functionality.

3.2 Background

It is difficult to pinpoint the exact point in history at which Support Vector Machines appeared, but much of their advancement is credited to Vladimir Vapnik, who is considered by many to be the main inventor and contributor. It has been widely accepted that the beginnings of SVMs appeared with the publishing of the book Estimation of Dependences Based on Empirical Data (1979) [28], in which Vapnik provided the foundations of statistical learning theory. The SVM algorithm itself was introduced in [29]. The main purpose of the SVM was to overcome some of the shortcomings of existing classifiers, especially those of the MLP. Both the SVM and the MLP belong to the class of linear learning machines that were introduced in the previous chapter, but they are very different in the techniques that they employ. This difference is largely attributed to the dissimilar processes by which the two came about. SVMs were designed using existing proven theories and concepts taken from optimization theory, generalization theory, statistical learning theory, etc. The MLP takes concepts from existing theories as well, but it is also largely based on heuristic models, such as the very first McCulloch and Pitts neuron model itself (which in some way was based on the yet little-understood neuron inside the brain).

3.3 Training

From a high-level perspective, training SVMs is very similar to training the MLP. A set of training samples is provided along with a number of training parameters, the learning process is started, and after some time a classifier model that can be used for future classification is obtained. It is the internal workings, and the representation of the final trained model, that really set the two apart.

Also, unlike the MLP, the SVM is inherently only capable of distinguishing between two classes, making it a binary classifier. For the purpose of this chapter, elements are classified as either positive (+) or negative (-). Like the MLP, the goal of SVM training is to produce a linear (n-1)-dimensional hyperplane in an n-dimensional input space that separates the two classes. The SVM is known as a maximal-margin classifier because it attempts to find a hyperplane that not only classifies each training vector correctly, but also optimizes its position and orientation by taking into account the perpendicular distances of the closest points to the hyperplane. In other words, the algorithm attempts to find the widest possible street (of widest margin) that can still classify all training vectors. This constraint can be relaxed, as will be explained later. The name support vectors comes from the algorithm's objective to discard any vectors that do not affect the margin: those that do not touch the edges of the street. A toy example of a linearly separable two-dimensional problem is shown in Fig. 3-1, with the support vectors circled in red.

Figure 3-1: A toy example of a widest-margin classifier in two dimensions

Linearly Separable Case

The SVM can be applied to many real problems. However, for the purpose of this section, the theory is first introduced for the simplest case, in which all training vectors are linearly separable. Once the main concept is introduced, further modifications will be described that extend the SVM's application to more realistic cases. As a review, the hyperplane of a linear learning machine is defined by the function:

(3.1)  $f(\mathbf{x}) = \sum_{k=1}^{d} w_k x_k + b \quad \text{or} \quad f(\mathbf{x}) = \mathbf{x} \cdot \mathbf{w} + b$

in which $d$ is the number of dimensions of the input space. The perpendicular distance from the hyperplane to the origin is $\frac{|b|}{\|\mathbf{w}\|}$, in which $\|\mathbf{w}\|$ is the Euclidean norm of the vector $\mathbf{w}$. Given a set of training vectors within the input space that are correctly classified by the hyperplane, let $d_{+}$ be the distance to the closest positively classified training vector and $d_{-}$ be the distance to the closest negatively classified training vector; maximizing the margin implies maximizing $d_{+} + d_{-}$.
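These distances are easy to compute directly. The following Python/NumPy sketch evaluates the signed perpendicular distance $f(\mathbf{x})/\|\mathbf{w}\|$ of each training vector and the resulting $d_{+}$ and $d_{-}$; the toy data and candidate hyperplane are illustrative assumptions:

    import numpy as np

    def signed_distances(X, w, b):
        # Signed perpendicular distance of each row of X from the hyperplane w.x + b = 0.
        return (X @ w + b) / np.linalg.norm(w)

    # Toy example: points labeled +1 / -1 and a candidate separating hyperplane.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    w, b = np.array([1.0, 1.0]), 0.0

    dist = signed_distances(X, w, b)
    d_plus = np.min(dist[y == +1])      # distance to the closest positive point
    d_minus = np.min(-dist[y == -1])    # distance to the closest negative point
    print(d_plus + d_minus)             # the quantity a maximal-margin classifier maximizes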

For the purposes of optimization, the goal is to find a representation of the margin width in terms of the optimization variables $\mathbf{w}$ and/or $b$. It is a known property of linear classifiers that scaling the values $\mathbf{w}$, $b$ by the same positive factor does not change the function. On the other hand, the values of $d_{+}$ and $d_{-}$ will be impacted. For derivation purposes, $d_{+}$ and $d_{-}$ can be forced to 1, resulting in the following two inequalities:

(3.2)  (a)  $\mathbf{x}_i \cdot \mathbf{w} + b \geq +1$  for $\{ y_i = +1 \}$
       (b)  $\mathbf{x}_i \cdot \mathbf{w} + b \leq -1$  for $\{ y_i = -1 \}$

which can be combined into the one convenient expression of Eq. 3.3:

(3.3)  $y_i \left( \mathbf{x}_i \cdot \mathbf{w} + b \right) - 1 \geq 0 \quad \forall i$

Assume that there exist points for which the equality in Eq. 3.2a holds and another set of points for which the equality in Eq. 3.2b holds (this implies a trained $\mathbf{w}$ and $b$). The points satisfying Eq. 3.2a lie on the hyperplane $\mathbf{x}_i \cdot \mathbf{w} + b = 1$, which has a perpendicular distance from the origin of $\frac{|1 - b|}{\|\mathbf{w}\|}$. Similarly, those satisfying Eq. 3.2b lie on the hyperplane $\mathbf{x}_i \cdot \mathbf{w} + b = -1$, which has a perpendicular distance from the origin of $\frac{|-1 - b|}{\|\mathbf{w}\|}$. These points are known as the support vectors, as they touch the extremities of the margin. All other points (satisfying the strict inequality of Eqs. 3.2a and 3.2b) are non-support vectors and would not change the final values of $\mathbf{w}$ and $b$ should they be removed. Since both hyperplanes are parallel, the distance $m$ between them is obtained by taking the difference between their perpendicular distances from the origin:

(3.4)  $m = \frac{(1 - b)}{\|\mathbf{w}\|} - \frac{(-1 - b)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$

Maximizing the margin, therefore, amounts to minimizing $\|\mathbf{w}\|^2$. The problem can be reformulated into Lagrangian form, as shown in Eq. 3.5. By doing so, the constraint in Eq. 3.2 is replaced with constraints on Lagrange multipliers $\alpha_i$, $i = 1, \ldots, l$ (one for each of the $l$ input training vectors), which are easier to handle. Also, the training data will appear during the training and classification phases in the form of dot products, a property that is central to the SVM's ability to generalize to the nonlinear case, as will be shown in later sections. The primal Lagrangian is

(3.5)  $L_P = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i \left( \mathbf{x}_i \cdot \mathbf{w} + b \right) + \sum_{i=1}^{l} \alpha_i$

Because the objective function itself is convex, and the points which satisfy the constraints form a convex set, the problem becomes a quadratic programming problem. With the help of optimization theory, the problem can be reformulated once again into a dual form, which is known as the Wolfe dual [30]. The dual formulation is often easier to solve due to the difficulty of handling inequality constraints. The procedure for transforming a primal into a dual is to zero the derivatives of the primal function with respect to the optimization variables, and substitute the resulting relations into the primal. This process removes the dependence on these variables. For the primal obtained above, the process is as follows. Differentiating the primal form with respect to $\mathbf{w}$ and $b$ and setting the results to zero gives:

(3.6)  $\frac{\partial L_P}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_P}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$

Substituting back into the primal results in the dual problem:

(3.7)  $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{(i,j)=(1,1)}^{(l,l)} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$

The constraints for this problem, borrowed from the primal version, are:

(3.8)  $\alpha_i \geq 0, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0$

The Wolfe dual maximum occurs at the same point of $\mathbf{w}$, $b$ and $\boldsymbol{\alpha}$ as the primal Lagrangian's minimum, subject to the two problems' corresponding constraints. The values of $\alpha_i$ at the solution are greater than zero for those training vectors which lie on one of the two hyperplanes (the support vectors) and equal to zero for those vectors which lie outside of the two hyperplanes (all other vectors, which would not influence the resulting hyperplane if removed).
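For illustration, the dual objective of Eq. 3.7 and the constraints of Eq. 3.8 are straightforward to express once the matrix of pairwise dot products is formed. The Python/NumPy sketch below only evaluates the objective for a given $\boldsymbol{\alpha}$; actually solving the quadratic program (e.g., by the decomposition techniques implemented in Chapter 7) is a separate matter:

    import numpy as np

    def dual_objective(alpha, X, y):
        # Wolfe dual L_D of Eq. 3.7 for the linearly separable formulation.
        G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j x_i . x_j
        return np.sum(alpha) - 0.5 * alpha @ G @ alpha

    def satisfies_constraints(alpha, y, tol=1e-8):
        # Eq. 3.8: alpha_i >= 0 and sum_i alpha_i y_i = 0.
        return np.all(alpha >= -tol) and abs(alpha @ y) < tol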

The Karush-Kuhn-Tucker (KKT) conditions, also borrowed from the field of nonlinear programming, are conditions that, if satisfied, are necessary and sufficient for optimality. For the case of the primal problem, they are:

(3.9)
$\dfrac{\partial L_P}{\partial w_v} = w_v - \sum_{i=1}^{l} \alpha_i y_i x_{iv} = 0, \quad v = 1, \dots, d$
$\dfrac{\partial L_P}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0$
$y_i (x_i \cdot w + b) - 1 \ge 0, \quad i = 1, \dots, l$
$\alpha_i \ge 0 \quad \forall i$
$\alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] = 0 \quad \forall i$

Once the $\alpha_i$ are found, $w$ is obtained directly from one of the primal constraints:

(3.10)  $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

Once $w$ is calculated, $b$ can be obtained. Since $b$ does not appear in the dual problem, a safe way to obtain it is to use the mean of the solutions of the complementarity KKT condition:

(3.11)  $\alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] = 0$

taken over the indices $i$ corresponding to non-zero Lagrange multipliers ($\alpha_i > 0$). Another method is to solve the following:

(3.12)  $b = -\dfrac{1}{2} \left[ \max_{\{i : y_i = -1\}} (w \cdot x_i) + \min_{\{i : y_i = +1\}} (w \cdot x_i) \right]$

Classifying a novel vector then becomes a simple matter of calculating:

(3.13)  $f(x) = \operatorname{sgn}(w \cdot x + b)$

Non-separable Case

Up to now, it was assumed that the training vectors are linearly separable. This assumption rarely holds in real-world problems. Thankfully, there is a solution, provided by Cortes & Vapnik in 1995 [31]. By introducing slack variables $\xi_i$, the constraints given in Eq. 3.2 can be relaxed. The new constraints are as follows:

(3.14)
$x_i \cdot w + b \ge +1 - \xi_i$ for $\{i : y_i = +1\}$
$x_i \cdot w + b \le -1 + \xi_i$ for $\{i : y_i = -1\}$
$\xi_i \ge 0 \quad \forall i$
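As an aside (an illustrative addition, not part of the thesis text), the classification step of Eq. 3.13 maps directly to a few lines of code. The sketch below is only a minimal example; the dense array layout and names such as sv, alpha, and n_sv are assumptions:

/* Evaluate f(x) = sgn(w.x + b), with w expanded through Eq. 3.10 as
 * w = sum_i alpha_i * y_i * sv_i.  sv is an n_sv x dim row-major array of
 * support vectors; alpha and y hold the corresponding multipliers and labels. */
static int svm_classify_linear(const float *sv, const float *alpha, const float *y,
                               int n_sv, int dim, const float *x, float b)
{
    float acc = b;
    for (int i = 0; i < n_sv; ++i) {
        float dot = 0.0f;
        for (int d = 0; d < dim; ++d)
            dot += sv[i * dim + d] * x[d];
        acc += alpha[i] * y[i] * dot;   /* equivalently, precompute w once and use w.x */
    }
    return (acc >= 0.0f) ? +1 : -1;
}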

The goal is to keep the slack variables $\xi_i$ of Eq. 3.14 as small as possible. A value greater than one implies a misclassification. The sum of the slack variables is, therefore, an upper bound on the number of classification errors. The original objective function is augmented with a penalty term for the slack variables, to form

(3.15)  $\dfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$

where $C$ is user-modifiable. Higher values of $C$ put a higher penalty on misclassifications. The derivation of the dual Lagrange problem is similar to that above; details can be found in [32]. The new dual optimization problem becomes:

(3.16)  $L_D = \sum_{i=1}^{l} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j$

with the constraints:

(3.17)  $0 \le \alpha_i \le C$, and $\sum_{i=1}^{l} \alpha_i y_i = 0$

The only difference is an upper bound on the $\alpha_i$, which turns this into a box-constrained problem. Note that the slack variables disappear. For reference, the new primal Lagrangian is:

(3.18)  $L_P = \dfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} \mu_i \xi_i$

in which the $\mu_i$ are new Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The complete KKT conditions for the primal problem are shown below. Note that $d$ is the number of dimensions in the input space and $l$ is the number of training samples.

(3.19)
$\dfrac{\partial L_P}{\partial w_v} = w_v - \sum_{i=1}^{l} \alpha_i y_i x_{iv} = 0, \quad v = 1, \dots, d$
$\dfrac{\partial L_P}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0$
$\dfrac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0, \quad i = 1, \dots, l$
$y_i (x_i \cdot w + b) - 1 + \xi_i \ge 0, \quad i = 1, \dots, l$
$\xi_i \ge 0, \quad \alpha_i \ge 0, \quad \mu_i \ge 0$
$\alpha_i \left[ y_i (x_i \cdot w + b) - 1 + \xi_i \right] = 0$
$\mu_i \xi_i = 0$

The higher complexity of the primal Lagrangian with the addition of slack variables further validates the preference for solving the simpler Wolfe dual problem. From this point on, only the dual formulation will be considered.

Non-linear Case

The Wolfe dual can be rewritten to conform to the representation of a linearly constrained quadratic programming problem:

(3.20)  $\min_{\alpha} \; \psi(\alpha) = \dfrac{1}{2} \alpha^{T} G \alpha + c^{T} \alpha$

in which $c$ is a vector of all $-1$s. The problem solution is subject to the conditions:

(3.21)  $0 \le \alpha_i \le C$, $i = 1, \dots, n$, and $\sum_{i=1}^{n} y_i \alpha_i = 0$

in which $G$ takes on the values $G_{ij} = y_i y_j \, x_i \cdot x_j$. In order to extend the application of SVMs to nonlinear problems, [29] applied an older kernel technique, known as kernel-induced spaces, in which the values of the matrix $G$ are replaced with $G_{ij} = y_i y_j K(x_i, x_j)$. In effect, dot product operations are replaced with a kernel function $K(a, b)$, hence the name kernel matrix for $G$. The kernel function may be nonlinear and often performs a transformation onto a higher-dimensional space in which the dot product is performed. Common kernel functions are:

Polynomial (homogeneous):  $K(a, b) = (a \cdot b)^d$

Polynomial (inhomogeneous):  $K(a, b) = (a \cdot b + 1)^d$

Radial Basis Function:  $K(a, b) = \exp(-\gamma \|a - b\|^2)$, for $\gamma > 0$

Gaussian Radial Basis Function:  $K(a, b) = \exp\!\left(-\dfrac{\|a - b\|^2}{2\sigma^2}\right)$

Sigmoid:  $K(a, b) = \tanh(\kappa \, a \cdot b + c)$, for some $\kappa > 0$ and $c < 0$

The kernel operation is simplified before implementation whenever possible. For example, in the case of the inhomogeneous polynomial kernel with $d = 2$:

(3.22)  $K(a, b) = (a \cdot b + c)^2 = \left( \sum_{i=1}^{n} a_i b_i + c \right)^2 = \sum_{i,j=1}^{n} (a_i a_j)(b_i b_j) + \sum_{i=1}^{n} (\sqrt{2c}\, a_i)(\sqrt{2c}\, b_i) + c^2$

Since only the kernel matrix is affected by the input vectors, the kernel function decouples the data from the problem. The quadratic programming problem is solved in exactly the same way, but with a different kernel matrix. More often than not, mapping the input vectors using the nonlinear kernel makes them linearly separable in the induced feature space. The classification step is modified in the same way, with dot products replaced by kernel operations.

3.4 Implementations

As shown in the previous section, the underlying operation for the training of SVMs is finding the solution to a linearly constrained quadratic programming problem. The

37 ramfcatons of ths dscovery were sgnfcant n that t was possble to apply exstng well-developed solvers. Unfortunately, exstng solvers act on the assumpton that the entre kernel matrx s n fast memory (usually RAM) at all tmes. The sze of ths matrx n the context of SVMs grows quadratcally wth the number of nput tranng vectors. Wth number of tranng vectors n the hundreds of thousands beng not uncommon, the memory requrement for the tranng from such datasets becomes too large. Even wth the rapd ncrease n memory capacty n modern hardware, the poor scalablty to ncreased tranng set szes called for new, memory effcent methods. Two such methods whch ganed wde acceptance are the Workng Set technque (e.g.: Chunkng and Decomposton) and the Sequental Mnmal Optmzaton technque Workng Sets The man concern, as expressed above, s the lack of memory for the kernel matrx due to the large sze of tranng sets. The workng set technque, ntroduced by Osuna [], attempts to remedy ths ssue by workng on a relatvely small subset of the tranng set at a tme. The generalzed pseudo-code s shown n Lstng 3. Intalzaton - gven tranng set S - select set Ŝ from S - α 0 repeat generate problem data based on Ŝ (.e.: kernel matrx) run QP optmzer on the subproblem select new workng set Ŝ based on KKT volatons untl stoppng crteron satsfed return α Lstng 3: Generc Workng Set Pseudocode Intally, the subset of sze N s chosen arbtrarly, usually randomly and evenly dstrbuted. The quadratc programmng solver s put to work on ths subset and produces some set ofα s. Those tranng vectors for whchα s zero (non-support vectors) are dscarded, and n ther place M new vectors are ncluded based on the magntude of ther KKT volatons n the most-current soluton. As the algorthm contnues, non-support vectors are elmnated untl only support vectors are left. Chunkng does not completely elmnate the memory problem as there s no guarantee that the fnal number of support vectors does not produce a kernel matrx of a sze that cannot ft nto memory. A more complex method whch also falls nto the workng set method s the Decomposton technque. The man dfference n the decomposton algorthm s that the sze of the workng set stays constant and does not exceed the lmt of the hardware s avalable memory. Only thoseα s whch correspond to tranng vectors wthn the current workng set are modfed. All others are kept fxed. Ths

method optimizes the problem by working on a small subset at a time rather than trying to find all the constraints and optimizing on all of them at once. These algorithms have not been proven to be optimal, but they have shown very good results. The decomposition technique will be further explored in the Gradient Projection-based Decomposition Technique section of this chapter.

Sequential Minimal Optimization

The Sequential Minimal Optimization (SMO) algorithm is an extreme case of the working set technique, as it analyzes exactly two training vectors at a time, the smallest number possible given the equality constraint. This method is very attractive for many reasons. Having only two free variables allows the use of analytical methods instead of complex QP solvers. This property makes the algorithm much easier to implement, and it requires less computational and storage resources. While generally needing many more iterations for convergence, convergence time may theoretically be cut by up to a thousand times [33]. Another advantage of the SMO algorithm is that it does not utilize a kernel matrix. Not only does this relax memory requirements, it is also much less susceptible to precision errors arising from the nature of floating point operations on modern hardware.

SMO consists of two main parts: the analytic method for updating the two Lagrange multipliers, and a method for choosing the next two variables to update. The analytical method is based on the dual Lagrange formulation, whose constraints are given in Eq. 3.17. It follows from the equality constraint that, given two free Lagrange multipliers, both need to be modified in such a way that they lie on the line:

(3.23)  $\alpha_1^{new} y_1 + \alpha_2^{new} y_2 = \alpha_1^{old} y_1 + \alpha_2^{old} y_2 = \text{const}$

The bound constraint $0 \le \alpha_i \le C$ restricts the values to a box. The freedom of choice for the two Lagrange multipliers is represented graphically in Fig. 3-2. There are two cases: one in which $y_1 = y_2$ (Fig. 3-2a) and one in which $y_1 \ne y_2$ (Fig. 3-2b).

Figure 3-2: Box constrained search
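For a concrete illustration (an added numerical example): suppose $y_1 = +1$, $y_2 = -1$, $C = 1$, and the current values are $\alpha_1 = 0.2$ and $\alpha_2 = 0.7$. Eq. 3.23 forces $\alpha_1 - \alpha_2 = -0.5$ for any update, so the pair can only move along the line $\alpha_2 = \alpha_1 + 0.5$ inside the box $[0, 1]^2$, which confines $\alpha_2$ to the segment $[0.5, 1]$. These segment endpoints are exactly the bounds $L$ and $H$ derived next.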

The value of $\alpha_2$ is calculated first. Analytically, the bounds are:

(3.24)
$L = \max(0, \alpha_2 - \alpha_1)$ and $H = \min(C, C + \alpha_2 - \alpha_1)$  if $y_1 \ne y_2$;
$L = \max(0, \alpha_2 + \alpha_1 - C)$ and $H = \min(C, \alpha_2 + \alpha_1)$  if $y_1 = y_2$

In order to present the algorithm for calculating the new Lagrange multipliers, two mathematical definitions are needed:

(3.25)  $\eta = K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2)$, and $E_i = u_i - y_i$

$\eta$ is the second derivative of the objective function along the diagonal (the constraint line of Eq. 3.23), and $u_i$ is the actual output of the classifier for $x_i$. Calculating $\alpha_2^{new}$ and $\alpha_1^{new}$ is performed as shown in Eq. 3.26 and Eq. 3.27, respectively.

(3.26)  $\alpha_2^{new,unclipped} = \alpha_2 + \dfrac{y_2 (E_1 - E_2)}{\eta}$, then
$\alpha_2^{new} = \begin{cases} H & \text{if } \alpha_2^{new,unclipped} \ge H \\ \alpha_2^{new,unclipped} & \text{if } L < \alpha_2^{new,unclipped} < H \\ L & \text{if } \alpha_2^{new,unclipped} \le L \end{cases}$

(3.27)  $\alpha_1^{new} = \alpha_1 + y_1 y_2 \left( \alpha_2 - \alpha_2^{new} \right)$

It is important to note that for this technique to work, the chosen kernel function needs to satisfy Mercer's condition [32] and no two training vectors in the set may be equal. The first requirement guarantees a non-negative $\eta$ and the second guarantees a non-zero $\eta$. There is a method to work around the positivity requirement that requires more computations [33], but it is beyond the scope of this thesis.

The second part of the SMO algorithm is used for selecting the next pair of Lagrange multipliers to be modified. According to Osuna's theorem [34], as long as one

40 of the chosen multplers volated the KKT condton, the objectve functon wll decrease and convergence wll eventually occur. Convergence s sped up by applyng heurstcs. The premse behnd the procedure s to focus on KKT volatng bound tranng samples that s those samples for whch 0 < α < C and the KKT test does not pass. Samples whch are bounded α = 0 or α = C are lkely to stay bounded, and thus are only checked once all bound samples meet the KKT condtons. Two samples need to be selected. The outer loop of the heurstc algorthm s responsble for choosng the frst sample. Intally, all samples are terated and those whch volate the KKT condtons are processed. The next teratons of the loop only work on the nonbound samples that volate the KKT condtons. The loop contnues untl all of the nonbound samples obey the KKT condtons wthn a threshold ε. Once ths occurs, the entre set s processed agan. The loop contnues untl all α s obey the KKT condtons wthn the threshold. For every selecton of the frst parameter a second one s chosen so that good progress s made n the optmzaton functon (large step sze). Ths step sze can be approxmated by takng the absolute values between E 1 and E 2. The E values for all the non-bound samples are cached to reduce kernel evaluatons. For more detals of the algorthm, ncludng choces for the threshold ε, unusual crcumstances and ther workarounds, as well as a pseudo code of the entre process, the reader s drected to [33] Gradent Projecton-based Decomposton Technque The defnton of the workng set technque functons only as a hgh level framework for the peces that make up the algorthm. The choce of QP solver, workng set replacement heurstcs, termnaton condtons, etc. s up to the mplementer. Research has advanced greatly n the feld of these ndvdual sub-algorthms, makng ths a hot area n the feld of machne learnng. The Gradent Projecton-based Decomposton Technque (GPDT) [35] s an mplementaton of the decomposton technque whch wraps several recent promsng works nto a self-contaned package. The research papers and the C source code mplementaton of the algorthm, called the Parallel Gradent Projecton-based Decomposton Technque (PGPDT) are avalable onlne [70]. The source code avalable was the startng pont of the SVM porton of ths work. In ths secton, only those features that are relevant to ths work are descrbed. The GPDT algorthm was desgned to be effectve on medum-to-large szed workng sets (O(10 2 ) or O(10 3 )) and s touted as beng the frst mplementaton that s well sutable for applcaton on mult-processor systems. Parallelsm exsts manly due to the desgn of the QP Solver and a global gradent updatng step used for the replacement algorthm. The QP solver, whch solves each of the workng set subproblems, s based on a gradent projecton method whch has a large matrx-vector multplcaton operaton for the root of

its computation time. As part of the subproblem training set replacement strategy, the gradient of the global objective function must be calculated within each iteration. This step, which involves another matrix-vector multiplication, can also be parallelized. Special steps need to be taken, however, as the matrix is assumed to be out of memory and must be generated, or recalled from a cache, on the fly.

Given n training samples, the top-level algorithm is shown in Listing 4. Note that this is just a more detailed version of Listing 3.

Initialization
  - given training set S
  - given n > n_sp > n_c > 0, n_c even
  - select set Ŝ from S, s.t. #{Ŝ} = n_sp
  - α ← 0
repeat
  generate problem data based on Ŝ (i.e.: kernel matrix)
  run gradient projection-based QP optimizer on the subproblem
  update global gradient vector based on results
  create new working set Ŝ of size n_sp
    - select up to n_c vectors based on gradient values
    - retain some elements of the current working set so that #{Ŝ} = n_sp
until stopping criterion satisfied
return α

Listing 4: GPDT Algorithm, top level.

Let $G$, $\alpha$, and $y$ be the global kernel matrix, Lagrange multipliers, and classifications $\{-1, +1\}$ for the entire training set (with $G$ being out of memory), let $\beta$ be the set of training vector indices within the working set of the current iteration, and let $\delta$ be the complementary set of indices not in the current working set. The decomposition is performed by separating the data into:

(3.28)  $G = \begin{bmatrix} G_{\beta\beta} & G_{\beta\delta} \\ G_{\delta\beta} & G_{\delta\delta} \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_\beta \\ \alpha_\delta \end{bmatrix}, \quad y = \begin{bmatrix} y_\beta \\ y_\delta \end{bmatrix}$

QP Solver

The solver used in the GPDT implementation is itself made up of existing, well-developed ideas with several modifications added for more efficient convergence. The method is described in detail in [36]. It is based on gradient projection onto a box-constrained space, a method that is relatively trivial to implement. The problem is restated below.

(3.29)  $\min_x \; f(x) = \dfrac{1}{2} x^T A x - c^T x$

s.t.:

(3.30)  $a^T x = b$, and $L_i \le x_i \le U_i$, $i = 1, \dots, n$

Note that when training SVMs, $n$ is the number of training samples in the subproblem, $A$ is $G_{\beta\beta}$, $c$ is a vector of ones, $L_i$ is 0, and $U_i$ is specified as a training parameter. If the matrix $A$ is positive definite and diagonal ($A = \mathrm{diag}(d_1, \dots, d_n)$, $d_i > 0$), the projection step alone is enough to find the solution. Otherwise, a line search and a step length calculation must also be performed for convergence to occur. The projection step will be described first, followed by the line search and step size algorithms.

The first step is to represent the problem in a different way, by transforming the first constraint (the equality $a^T x = b$) into a penalty term in the objective function:

(3.31)  $\varphi(x; \lambda) = \dfrac{1}{2} x^T A x - c^T x - \lambda (a^T x - b)$

in which $\lambda$ is a scalar parameter that needs to be found. The term $a^T x - b$ is derived from the constraint $a^T x = b$ of the original optimization problem. Next, an educated guess for an initial value of $\lambda$ is taken, resulting in the problem:

(3.32)  $\min_x \; \varphi(x; \bar{\lambda}) = \dfrac{1}{2} x^T A x - c^T x - \bar{\lambda} (a^T x - b)$, s.t. $L_i \le x_i \le U_i$, $i = 1, \dots, n$, with $\bar{\lambda}$ constant.

The method for finding the minimizer $x(\lambda)$ for some fixed $\lambda$ is very easy, as the problem separates into $n$ independent problems, each with one variable. Each variable $x_i$ is found by solving:

(3.33)  $d_i x_i = c_i + \lambda a_i$

Rearranged, this gives:

(3.34)  $x_i = \dfrac{c_i + \lambda a_i}{d_i}$

Due to the box constraint, each variable is clamped as shown in Eq. 3.35:

(3.35)  $x_{i,\text{clamped}} = \begin{cases} U_i & \text{if } x_i > U_i \\ x_i & \text{if } L_i \le x_i \le U_i \\ L_i & \text{if } x_i < L_i \end{cases}$

Due to the clamping, the constraint $a^T x(\lambda) - b = 0$ will most likely not be met, and thus $\lambda$ will need to be readjusted iteratively, using an outer secant-like method, until the equality in Eq. 3.36 holds:

(3.36)  $r(\lambda) := a^T x(\lambda) - b = 0$
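The inner computation of $x(\lambda)$ and $r(\lambda)$ in Eqs. 3.34-3.36 is simple enough to state directly in code. The following is an illustrative C sketch, not taken from the PGPDT source; the function and parameter names are assumptions:

/* Given lambda, compute x_i = (c_i + lambda * a_i) / d_i clamped to [L_i, U_i]
 * (Eqs. 3.34-3.35) and return r(lambda) = a'x - b (Eq. 3.36). */
static double projection_residual(double lambda,
                                  const double *a, const double *c, const double *d,
                                  const double *L, const double *U,
                                  double *x, int n, double b)
{
    double r = -b;
    for (int i = 0; i < n; ++i) {
        double xi = (c[i] + lambda * a[i]) / d[i];
        if (xi > U[i]) xi = U[i];           /* clamp to the box (Eq. 3.35) */
        else if (xi < L[i]) xi = L[i];
        x[i] = xi;
        r += a[i] * xi;                     /* accumulate a'x */
    }
    return r;                               /* the secant iteration drives this to 0 */
}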

43 To summarze, the controlled varable s λ, and the output s r whch s to be equal to 0 to satsfy the second of the two constrants n the orgnal problem. The value of r s determned by the ntermedate soluton to the optmzaton problem The task, therefore, s to fnd the pont of ntersecton on the graph of λ vs. r. Ths graph represents a monotoncally ncreasng pecewse lnear contnuous functon of λ. However, t s not smooth, and therefore common gradent methods cannot be easly appled. Instead, a modfed secant-lke method s used. Ths algorthm, from here on referred to as Algorthm 1, s dvded nto two parts: the bracketng phase and the secant phase. The bracketng phase s desgned to fnd a mnmum and maxmum of the λ between whch the soluton les. The pseudo code s shown n Lstng 5. Calculate x by (3.34) and (3.35) r = a T f( r < 0 ) else end x b λ = λ; = r; λ = λ + λ l r l Calculate x by (3.34) and (3.35) r = a T x b whle( r < 0 ) end λ λ λ = λ; r l l = r; s = max ( r r 1,0.1) λ = λ + λ s; λ = λ + λ Calculate x by (3.34) and (3.35) r = a u = ; r u = T r x b λ = λ; = r; λ = λ λ u r u Calculate x by (3.34) and (3.35) r = a T x b whle( r > 0 ) end λ λ λ = λ; r u u = r; s = max Lstng 5: Bracketng phase of Algorthm 1 The modfed secant search step searches wthn ths space untl the soluton s found. Its pseudo code s shown n Lstng 6. l ( r r 1,0.1) λ = λ + λ s; λ = λ λ Calculate x by (3.34) and (3.35) r = a l = ; r l = r T x b u

44 ( r r ); λ = λ λ = λ λ s = 1 s ; l u Calculate x by (3.34) and (3.35) r = a T x b whle not converged f( r > 0 ) f( s <= 2 ) end else else end λ = λ; r u λ = λ f( s >= 2 ) else = r; s = 1 r ( λ λ ) s ; λ = λ λ u s = max new u λ = λ; r u s = λ = l ( ru r 1,0.1 ); λ = ( λu λ) ( λ λ,0.75λ λ ) = max = r; λ = λ ( λ λ ) ( λ λ) u λ = λ; r λ l l u l Lstng 6: Secand phase of Algorthm 1 As mentoned, the projecton method on ts own s only vald for problems n whch matrx A s postve-defnte and dagonal a case that s hghly unlkely n the case of SVMs. It s possble to extend the algorthm to the general non-dagonal case by encapsulatng the algorthm nto an outer loop along wth the lnesearch and step length algorthms that were mentoned prevously. Ths new framework s smlar to that n [37] and s shown n Lstng 7. ntalzaton repeat Calculate Projecton usng Alg 1 Possbly carry out a lne search Calculate BB-lke step length Update the lne search control parameters untl stoppng crteron satsfed u new = r; s = 1 r ( λ λ ) s ; λ = λ λ u s = max new λ = λ; r l s = l Lstng 7: Outer Loop for QP solver allowng for non-dagonal matrces A. l l ( rl r 1,0.1 ); λ = ( λ λl ) ( λ + λ,0.75λ λ ) = mn end end Calculate x by (3.34) and (3.35) r = a T x b = r; λ = λ ( λ λ ) ( λ λ) u l l u new u r u r u u l ul s s

The first step in the loop uses Algorithm 1 to take a steepest descent step from the current location $x_k$ with a fixed step length and to project the result onto the feasible space defined by the constraints of the original problem. The result of this operation gives a feasible step direction. This is shown mathematically as

(3.37)  $d_k = P_\Omega(x_k - \alpha_k g_k) - x_k$

in which $g_k = A x_k - c$ and $P_\Omega(z)$ is the projection of a vector $z$ onto the feasible space $\Omega$.

The second step of the algorithm is to carry out a line search along the step direction. This step is not required when minimizing unconstrained quadratic problems, but it has been shown that the algorithm may fail in the constrained case without it [38]. The reasoning for including the line search step is that the step length calculation, which is the next part of the algorithm, relies on keeping the growth of the objective function $f(x)$ under control in order to work properly. The line search algorithm only performs the search when necessary, thus cutting down on the computations needed. It relies on a control parameter $f_{ref}$, which, along with certain other parameters, is dynamically updated in the final step of the main loop. The line search step executes only if $f(x + d) \ge f_{ref}$. The process is carried out by quadratic interpolation along $x + \lambda d$, using the objective function values $f(x)$ and $f(x + d)$ and the slope $g^T d$.

The third step of the algorithm uses a modified Barzilai-Borwein (BB) step size formula for computing the step size to take. The BB steepest descent method has been used extensively for large-scale optimization problems; Fletcher gives an overview of the recent developments in its application to large-scale unconstrained optimization in [39]. The step size selected by this algorithm is:

(3.38)  $\alpha_{k+1} = \dfrac{\sum_{i=0}^{\bar{m}-1} s_{k-i}^T s_{k-i}}{\sum_{i=0}^{\bar{m}-1} s_{k-i}^T y_{k-i}}$

in which:

(3.39)  $s_k = x_{k+1} - x_k$, and $y_k = g_{k+1} - g_k$

In the formula above, $m$ is a preset integer, with 2 being a common value, and $\bar{m}$ is defined as the maximal integer for which $s_{k-i}^T y_{k-i} > 0$ for all $0 \le i < \bar{m}$. In the implementation, $\bar{m}$ is replaced by $\min(m, \bar{m})$. The resulting step size is then clamped between two extremes, $[\alpha_{min}, \alpha_{max}]$. If $\bar{m}$ is 0, the value $\alpha_{max}$ is used.
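A compact sketch of the step-length rule of Eqs. 3.38-3.39 follows. It is illustrative only and differs from PGPDT's actual code in details such as history management; HIST_LEN and the parameter names are assumptions:

#define HIST_LEN 2   /* m: how many recent (s, y) pairs to use; 2 is a common choice */

/* Modified Barzilai-Borwein step length (Eqs. 3.38-3.39).
 * s_hist and y_hist are HIST_LEN x n row-major arrays holding the most recent
 * s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k, newest first. */
static double bb_step_length(const double *s_hist, const double *y_hist, int n,
                             double alpha_min, double alpha_max)
{
    double num = 0.0, den = 0.0;
    int used = 0;
    for (int j = 0; j < HIST_LEN; ++j) {
        double ss = 0.0, sy = 0.0;
        for (int i = 0; i < n; ++i) {
            ss += s_hist[j * n + i] * s_hist[j * n + i];
            sy += s_hist[j * n + i] * y_hist[j * n + i];
        }
        if (sy <= 0.0) break;        /* m-bar: stop at the first non-positive s'y */
        num += ss;
        den += sy;
        used++;
    }
    if (used == 0) return alpha_max; /* Eq. 3.38 undefined; fall back to the largest step */
    double alpha = num / den;
    if (alpha < alpha_min) alpha = alpha_min;   /* clamp to [alpha_min, alpha_max] */
    if (alpha > alpha_max) alpha = alpha_max;
    return alpha;
}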

46 The last step n the algorthm s the updatng of the control parameters for the lne search algorthm. These parameters are f ref, descrbed above, a canddate value f c for possble reducton of f ref, the current best value f best, a counter l of consecutve teratons durng whch f ( x) f best, and the number L of such teratons allowed before reducng f ref to f c. The algorthm s shown n Lstng 8. f ( f k < f best ) f best = f k f c = f k l = 0 else f c = max( f c, f k ) l = l + 1 f ( l == L ) f ref = f c f c = f k l = 0 endf endf Lstng 8: Updatng lne search control parameters The descrpton and reasonng for ths algorthm can be found n [36], and [38]. Convergence of the algorthm s montored durng the projecton step. The soluton to the problem occurs when PΩ ( x k g k ) = x k ; n other words, when PΩ ( x k g k ) x k = 0. In practce, a threshold value s used nstead of 0. To prevent an extra projecton step, gven a proper sze of a k an alternatve calculaton of d a may be used to smlar effect. In the full algorthm, the most computatonally expensve part s the calculaton of the gradent g k = A x k c. Ths matrx-vector operaton s easly parallelzable. Intalzng the subproblem matrx s also expensve due to the kernel calculatons requred. Dependng on the low level mplementaton, ths step can also be optmzed for parallel systems. The procedure of dong so n ths work s descrbed n Chapter 7. For further detals on ths algorthm, common alteratons, as well as workarounds to the case that the orgnal constraned problem s not solvable, the reader s nvted to refer to [36] Updatng the Global Gradent The QP solver descrbed above can only functon on problems small enough that they ft nto the avalable memory. After each subproblem soluton, t s necessary to update the global gradent whose dmenson s the sze of the global set. The gradent s necessary so that the elements for the subsequent workng set can be effcently selected. Ths step s smlar to the step wthn the QP solver namely the calculaton of g k = A x k c before projectng nto the feasble space, wth the excepton that A s most lkely out of memory and unavalable and thus ts elements must be calculated on the fly. k
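An illustrative sketch of this on-the-fly update is shown below. It is not PGPDT's actual code; kernel_column() is a placeholder for whatever routine computes, or retrieves from the cache, one column of the global kernel matrix, and the other names are assumptions:

/* Update the global gradient after a subproblem solve: for every working-set
 * index j whose multiplier changed, add (alpha_new[j] - alpha_old[j]) * G_ij
 * to gradient[i] for all i, where G_ij = y_i * y_j * K(x_i, x_j). */
static void update_global_gradient(double *gradient, const double *alpha_old,
                                   const double *alpha_new, const int *work_set,
                                   int n_work, int n_total, double *col_buf,
                                   void (*kernel_column)(int j, double *col, int n_total))
{
    for (int k = 0; k < n_work; ++k) {
        int j = work_set[k];
        double delta = alpha_new[j] - alpha_old[j];
        if (delta == 0.0)                       /* skip unchanged multipliers */
            continue;
        kernel_column(j, col_buf, n_total);     /* col_buf[i] = G_ij, computed on the fly */
        for (int i = 0; i < n_total; ++i)
            gradient[i] += delta * col_buf[i];
    }
}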

The large number of kernel operations typically required in this step tends to make it the most computationally expensive part of the entire algorithm. One way in which the PGPDT implementation reduces the required kernel operations is by introducing a least-recently-used caching strategy (which is carried over to the implementation in this work). The program accepts a command line parameter specifying the amount of memory to reserve for the cache, which stores the most recently calculated columns of the global kernel matrix. The other trick that PGPDT utilizes is to process only those input vectors within the subproblem for which $\alpha$ changed by more than a small threshold. Let $N_p$ be the total number of training examples and $G_i$ denote the $i$-th column of the global kernel matrix. Let

(3.40)  $\beta_{gu} = \{\, i \in \beta : \; \mathrm{abs}(\alpha_{i,k+1} - \alpha_{i,k}) > sv0 \,\}$

in which $sv0$ is a value close to 0, and let $G^{gu}$ be the kernel matrix including only the column indices from the set $\beta_{gu}$. The gradient is updated using:

(3.41)  $\nabla F(\alpha_{k+1}) = \nabla F(\alpha_k) + G^{gu} (\alpha_{k+1} - \alpha_k)$

Note that, for simplicity, the cache strategy as outlined in the original paper [35] has been omitted here. The implementation in this work differs, and will be detailed in Chapter 7.

Working Set Replacement Strategy

For the final step of the algorithm, a new working set must be chosen for the following iteration. The selection procedure was introduced in [40] and tested in [41]. The procedure is to solve the problem:

$\min_d \; \nabla F(\alpha_{k+1})^T d$
s.t.: $y^T d = 0$
$d_i \ge 0$ for $i$ such that $\alpha_{i,k+1} = 0$
$d_i \le 0$ for $i$ such that $\alpha_{i,k+1} = C$
$-1 \le d_i \le 1$
$\#\{\, d_i : d_i \ne 0 \,\} \le n_c$

This portion was not modified in any way within the PGPDT implementation, as it was deemed to be insignificant in the overall computation time. The procedure for the solution of this problem was implemented as shown in Listing 9.

48 Sort the ndces of the varables accordng to ( ) T I,..., be the sorted lst. 1, 2 n ( k + 1 y F α ) n decreasng order and let,, wth, as follows: Repeat the selecton of a par ( b t ) I I t < b - movng down from the top of the sorted lst, choose t I top ( α k+1 ) - movng up from the bottom of the sorted lst, choose ( α ) t I top k+1 untl n c ndces are selected or a par wth the above propertes cannot be found Let βˆ be the set of these selected ndces Fll βˆ up to n sp entres by addng the most recent (least amount of consecutve appearances n the workng set and currently n the workng set) ndces j satsfyng β 0 α C ; f these ndces are not enough, then add the most recent ndces j β < j, k +1 < α, and eventually. the most recent ndces j β satsfyng C such that 0 j, k+ 1 = Set n mn{ n, max{ 10, J, n } =, where J s the largest even nteger such that c c new J n sp 10 and n new s the largest even nteger such that n { ˆ new # j, j β \ β} = ˆ β, k k + 1 β. Lstng 9: Workng Set Selecton Algorthm α. j, k +1 = ; set The Cascade SVM The recent advancement and wder avalablty of parallel archtectures and systems has placed new emphass on research toward parallel-frendly mplementatons for many exstng algorthms n scentfc lterature. In the case of the SVM, the workng set methods, ncludng chunkng and decomposton, are not suted for hgh-level parallelzaton due to the dependences between the major steps of the algorthm. The synchronous nature of the framework lmts parallelzaton to only wthn the subalgorthms themselves, as exhbted n the PGPDT desgn descrbed n the prevous secton. The cascade method [42] s one new framework whch promotes asynchronous executon of ndependent subproblems generated from a global set. Whle these algorthms are not optmal, the ntroducton of several new theorems and proofs argues for an acceptable level of correctness for some applcatons. By takng advantage of modern parallel hardware, the ncreased speed of convergence makes these algorthms very attractve when perfect accuracy s not requred. The concept of the Cascade SVM s the constructon of a tree of SVM solvers, as shown n Fg. 3.3, wth results generated by any one solver beng consumed by the parent node. The left-most chldren of the cascade are gven some porton of the global tranng vectors as nputs. The portons may be exclusve, or may repeat. Each SVM functons as a flter by producng an output consstng of only support vectors and ther correspondng

49 alpha values. Groups of outputs of the SVMs n the frst layer are combned usng the unon operaton and used as nputs for the next, smaller, layer of SVMs whch perform more flterng. The process contnues untl the fnal node (root node) s reached. The exact combnaton and network rules are not defned, but several theorems exst [43] [42] that should be followed for proper operaton. The cascade framework has several attractve propertes. Each layer n the network s guaranteed to advance the optmzaton functon, the level of communcaton s mnmal between the layers, and convergence has been shown to be relatvely quck. Examples of exstng publshed works follow. Fgure 3-3: Cascade SVM Network In [42], the authors created a bnary network, as shown n Fg At the frst layer, all tranng vectors are splt evenly and exclusvely for each of the solvers. Results from the solvers are combned n pars and nput nto a solver n the next layer. Thus there s exactly one solver for every two solvers n the prevous layer. The premse s that vectors elmnated from a subset of the global set are unlkely to be support vectors n the global set. Ths dea s shown graphcally usng a 2-dmensonal feature space n Fg. 3.3.
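The overall control flow of such a binary cascade can be summarized in a few lines of C. The sketch below is added for illustration and is not taken from any of the cited implementations; index_set_t, train_subproblem(), and union_sets() are hypothetical helpers:

typedef struct { int *idx; int n; } index_set_t;       /* indices into the training set */

/* Hypothetical helpers: a standard SVM solver that returns only the support
 * vectors of its input, and a set union. */
index_set_t train_subproblem(index_set_t in);
index_set_t union_sets(index_set_t a, index_set_t b);

/* One pass of a binary Cascade SVM with n_leaf leaf solvers (a power of two). */
index_set_t cascade_pass(index_set_t *sets, int n_leaf)
{
    for (int i = 0; i < n_leaf; ++i)                    /* layer 1: filter each partition */
        sets[i] = train_subproblem(sets[i]);

    for (int width = n_leaf; width > 1; width /= 2)     /* merge pairwise up the tree */
        for (int i = 0; i < width / 2; ++i)
            sets[i] = train_subproblem(union_sets(sets[2 * i], sets[2 * i + 1]));

    /* sets[0] holds the root solver's support vectors; the full algorithm feeds
     * these back into the first layer and repeats until the result stabilizes. */
    return sets[0];
}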

50 Fgure 3-4: Cascade SVM tranng subset concept The output from each solver s a subset of the nput vectors along wth each output vector s nonzero alpha value. There are several ways to combne outputs from multple solvers for use as nput nto a new solver. In [42], Graf et al. look at two possbltes. In the frst, alpha values from one are combned wth those of the second. In the other, alpha values generated by the second SVM are dscarded and set to 0. In both stuatons, the condton a T y = 0 holds. The frst opton should be used when the two solvers operate on completely ndependent vectors. The other extreme s the case that both solvers are operatng on exactly the same vectors, n whch the second opton should be used. In general, the optmal opton s somewhere n between. The theores presented n the research suggest that the global optmum can be reached f the best set of support vectors produced n one layer s used n at least one of the subsets of the next layer. The bnary archtecture shown n Fg. 3-3 accomplshes that by provdng a feedback lnk n whch the vectors produced by the fnal solver are combned wth each of the orgnal nputs and the process s repeated. The number of passes requred, accordng to ther results, s around 2 to 5. Smlar networks, also based on the Cascade SVM method, have been proposed n [44] and [45]. In the M 3 -SVM, ntal nput vectors are frst dvded nto two sets: one wth only postvely classfed tranng vectors, and one wth only negatvely classfed tranng vectors. The resultng postve set s randomly dvded nto N+ equally szed subsets and the negatve set s randomly dvded nto N- equally szed subsets. A total of N+ * N- problems are generated by generatng every possble par of a postve and negatve set. A problem T,j s one that ncludes the th postve subset and j th negatve subset. These subproblems are used as the ntal problems and determne the number of leaf nodes n the cascade network. The leaf nodes can be further subdvded f necessary. The authors recommend ths step f a good processng load balance s desred.

51 Two mathematcal constructs the MIN and MAX mathematcal ntegraton unts are defned as: MIN : = q = mn MAX : = q = max ( p1, p2,..., pn ) ( p, p,..., p ) 1 2 n These two unts are used to ntegrate the resultng transfer functons to obtan the overall transfer functons. Frst, all subproblems are grouped accordng to common postve subsets. Each of these groups s then ntegrated usng a MIN unt and one transfer s produced for each group. The resultng transfer functons are ntegrated usng the MAX unt, producng the fnal transfer functon. The archtecture s dsplayed n Fg Fgure 3-5: M 3 -SVM Lu et al. also desgned a standard cascade SVM, but ntroduced a new method nput vector selecton n generatng subproblems for each of the SVMs n the cascade tree. Ther network s shown n Fg Frst, both the postve classfed tranng vectors and negatve classfed tranng vectors are dvded nto two subsets usng the same rato r (0 < r < 1). Next, the four ntal subproblems are generated by choosng all four possble combnatons of selectng from one of the two postve subsets and the two negatve subsets.

52 Fgure 3-6: Cascade SVM by Lu et. al. The tranng vectors for the two ntermedate SVMs are chosen by performng a unon operaton on the resultng support vectors from the correspondng chldren nodes. Smlarly, the resultng vectors from the second layer are combned usng the unon operator and used to generate the subproblem for the fnal SVM. The algorthm suggests that alpha values are dscarded on the completon of a tranng process. Only the nput vectors are needed to generate new problems. Agan, the leaf node problems can be further subdvded f necessary. The authors also proposed a slghtly mproved verson of ther cascade archtecture by removng the two ntermedate SVMs and combne the output support vectors from the frst layer by performng the unon operaton on all four SVMs. The revsed verson s shown n Fg Fgure 3-7: Improved Cascade SVM 3.5 Concluson Support Vector Machne tranng can be performed by solvng the Quadratc Programmng Optmzaton Problem. Usng exstng solvers s not possble, however, when very large tranng data sets are used. The workng set technque was developed to

53 overcome ths problem. The Gradent Projecton-Based Decomposton Technque, a dervatve of the workng set technque has not only shown to be hghly effcent wth large problem sets, but t also has the attractve property of beng easly parallelzed. The Cascade SVM s a radcally new method that was desgned for parallelsm from the start. Due to the novelty of these technques, the applcablty of them to modern parallel hardware s rpe for exploraton.

54 Chapter 4: The Cell Broadband Engne 4.1 Chapter Introducton The Cell Broadband Engne Archtecture (CBEA) has been desgned by Sony, Toshba, and IBM (STI allance) n an effort to fll a vod between general purpose processors, such as the AMD and Intel famly of desktop/laptop processors, and specalzed processors such as graphcs processng unts made by Nvda and ATI. The flexblty of such a system s very benefcal n many modern meda-rch applcatons. The new archtecture, an extenson of the PowerPC Archtecture, defnes one or more Power Processng Elements (PPEs) and multple hgh performance Synergstc Processng Elements (SPEs). The phlosophy of the desgn gves the PPE the role of managng and employng the avalable SPEs for performng computatonally ntensve work. The PPE, whch s based on the exstng IBM Power Archtecture, s capable of runnng exstng unmodfed 64-bt and 32-bt applcatons and operatng systems, but dong so takes absolutely no advantage of the extra power avalable. Applcatons must be wrtten or rewrtten to take advantage of the extra cores avalable. The Playstaton 3 (PS3) System released by Sony n 2006 s one of the frst CBEA mplementatons made avalable on the market and, due to ts use n ths work, wll be focused on n ths chapter. 4.2 Desgn Challenges Multmeda performance has always been lmted by the problem of unacceptable memory latency and bandwdth (known as the memory wall) as well as problems of dmnshng returns arsng from ncreasng the ppelne depth and decreasng the work done per cycle. It has been wdely known that memory smply cannot keep up wth the rapd ncrease n CPU performance. Varous trcks have been mplemented wth the goal of hdng ths latency, but all come at a hefty prce n transstor count, crcut complexty, and power consumpton (e.g.: speculatve nstructons and branch predcton logc). The frst desgn challenge, therefore, was to come up wth a way to allow more memory bandwdth at lower latences. Another challenge faced was power effcency. Modern processors can harness only so much performance per Watt wth the current CMOS transstor technology before sufferng from heat ssues.

55 The thrd challenge related to performance was overcomng the dmnshng returns of ncreasng the ppelne depth whle mantanng nstructon latences. Long ppelne depths mply more logc for dependency trackng and result n sgnfcant penaltes for ncorrectly predcted branches. A desgn goal was set to mnmze the ppelne depth and maxmze the effcency of ssue slots for ncomng nstructons. The fnal product was to be hghly responsve and reactve to the outsde world. Ths ncludes, for example, real-tme output to stmulate the gamer, and nput requred for broadband nternet applcatons. In ths respect, the processor was to exercse real-tme operatons for the workloads demanded from t. Beyond the Playstaton 3 system, the plan was to contnue developng the archtecture so as to mplement t n varous other future multmeda devces. Ths requred that the desgn be flexble and extendable. The hardware needed to be easly modfable and the software wthn reach of the software communty. A Lnux-based software development was planned to be developed concurrently for ths purpose. The three companes determned that no current organzatonal archtecture was capable of the computatonal power fttng ther vson of future multmeda devces. The IBM Research Dvson was frst n lne to explore new archtectural desgns for the proposed processor. Over the course of a half a year, IBM consdered a wde range of mult-core organzatons borrowng concepts from broadband nterconnect entertanment systems to supercomputer structures. Fnally, a desgn was agreed on at the end of The desgn was to be based on the currently avalable 64-bt Power Archtecture (to meet the four year deadlne) wth a memory flow controller and ndvdual processors, or cells, whch were termed synergstc processors. Almost mmedately, desgn commenced on a $400,000,000 start-up budget. 4.3 Top Level Desgn The mplementaton of the Cell on the Playstaton 3 features a sngle PPE and eght SPEs for a total of nne processng elements on the chp, ted together usng the Elementary Interconnect Bus (EIB). Whle a total of eght SPEs are ncluded on the chp, only 6 are avalable for use by the programmer. It s speculated that one of the SPEs has been purposefully dsabled to ncrease the producton yeld, and that the other s dedcated for runnng the software-hardware securty-focused nteracton layer known as the hypervsor. Ths SPE runs n a so called solaton mode [46].

56 Fgure 4-1: Archtecture of frst teraton of the CBEA (source: CBE Tutoral v2.1 [60]) The PowerPC based PPE s a fully featured processor capable of runnng any operatng system that supports t. On ts own, however, t s easly overpowered even by competng processors of the same generaton. It s the remanng 8 SPEs that provde the hgh degree of processng power, that f optmally programmed, are clamed to outperform specalzed systems that are consderably more costly. A sgnfcant desgn aspect of the CBEA s the concept of ndependent Local Stores (LS) on each ndvdual SPE. Each SPE s ted to a local LS unt (256 KB of SDRAM on the PS3) whch provdes t wth very fast memory. The contents of the LS are manually controlled by employng the SPE-resdent Memory Flow Controller (MFC). Data s transferred between the LS and system memory as well as between multple LSs va MFC-bound GET and PUT commands. Ths model provdes each SPU wth ts own memory address space that t can use as an explctly controlled cache. DMA transfers are globally coherent, meanng that the on-chp cache (on the PPU) may be used transparently. The entre desgn strves on parallelsm. The PPE ncludes hardware for two smultaneously executed resource-sharng threads and nherts the Altvec vector nstructon set, ncorporatng both thread-level and nstructon-level parallelsm respectvely. The SPEs feature a dual-ssue nstructon queue, SIMD (sngle nstructon multple data) functonal unts, and a dedcated asynchronously controlled MFC (memory flow controller). Contrary to many exstng processor desgns, the Cell has been desgned around parallelsm nstead of havng t ncorporated as an afterthought. Fully explotng the capabltes of the Cell processor nvolves not only takng advantage of all of these per-processor capabltes, but also effcently dstrbutng the workload among the avalable processng unts. The next chapter dscusses programmng technques that help n ths regard. In an effort to keep the heat down and clock rates hgh, and keepng wth the orgnal desgn phlosophy, the PPE and SPEs were desgned wth smplcty n mnd. The SPEs

57 do not nclude any exotc commodtes such as out of order executon or dynamc branch predcton. Logc crcutry s kept to a mnmum, shftng the responsblty onto the hands of the compler and software developers for generatng computatonally effcent code. An IBM desgned on-chp memory controller s connected to an XIO nterface desgned by Rambus to keep memory latences at a mnmum. Rambus' flexble and confgurable I/O nterface, termed FlexIO s used to support the hgh I/O bandwdth requrement mposed by multmeda applcatons. A custom desgned on-chp coherent nterconnect bus, known as Element Interconnect Bus (EIB) s used to supply the necessary bandwdth requred for the nne processors, memory controller, and bus nterface. The desgn of the bus allows for parallel memory transactons as long as the paths do not ntercept. All modules on the Cell are nterfaced wth the processng elements usng memorymapped control and I/O regsters. These regsters are grouped nto one of three classes: Prvlege 1, Prvlege 2, or Problem State. Prvlege 1 regsters are accessed only by the hypervsor or external frmware. Prvlege 2 regsters are those that should only be accessed by elevated prvlege OS modules and are defned for those cases n whch a hypervsor s not present. Problem State regster access s not enforced by hardware, but may be enforced by the OS or optonally by the hypervsor. The memory mapped regsters are used to nterface wth each of the SPEs, the Pervasve module (used for montorng performance and temperature, and managng power, and Relablty, Avalablty, Servceablty debuggng), the Memory Interface Controller (MIC) and Token Manager (TKM), the I/O Controller (IOC) Address Translaton module, the Bus Interface Controller (BIC), and the Elementary Interconnect Bus (EIB). Detals on all these regsters are avalable n [47]. In addton to supportng the 64-bt Power Archtecture ISA, CBEA-complant processors nhert the memory translaton, protecton and SMP coherence model of 64 bt Power processors. Other nclusons are vrtualzaton for allowng the smultaneous runnng of multple operatng systems, and large page szes whch are benefcal n many multmeda and scentfc applcatons. The overall desgn admts several favorable features to the equaton. The smplcty of each processor requres a smaller transstor count and thus allows for hgher clock frequences, lower power dsspaton, and better computaton/watt rato. The Power Archtecture n the PPE allows for easy transton for programmers already profcent at software development wth ths ISA. Beng a Sngle-Instructon-Multple-Data (SIMD) archtecture, and supportng vector meda extensons along wth provdng each processor much on-chp memory and many regsters only renforces the chp s performance capabltes n parallel applcatons. 4.4 Low Level Desgn Decsons The frst release of the PS3 featured a Cell processor utlzng a 90 nm SOI processng technology on a mm 2 de wth each processor ted to a 3.2 GHz clock. At the tme

58 of ths wrtng, a newer 65 nm verson ( mm 2 de) has been made avalable and s used n currently sellng PS3s. A 45 nm verson of the chp s n the works. 4.5 Power Processng Element The PPE s a smplfed dervatve of IBM s prevous 64-bt RISC Power Archtectures. The CBEA specfcatons dctate that the PPE module(s) are 64-bt based (wth 32-bt complance) and nclude a vector/simd multmeda extenson unt. The PPE s complant to the specfcatons outlned n the PowerPC Archtecture Books I [48], II [49], and III [50], wth a few ponters. CBEA complance requres the ncluson of several nstructons whch are only optonal n the PowerPC Archtecture. Two nstructons from the graphcs group n the PowerPC specfcatons floatng recprocal estmate sngle A-form (fres) and Floatng recprocal square-root estmate A-form (frsqte) are requred. Also, the Data cache block touch X-form (dcbt), s requred, whch s used by a program to provde a hnt as to what data to place nto the cache before t s accessed. The PPE also ncludes the optonal Altvec Vector Instructon Set n the Power Archtecture whch the CBEA makes mandatory. The ppelned hardware logc allows the PPU to perform two double precson operatons per clock cycle (6.4 GFLOPS at 3.2 Ghz) or eght sngle precson operatons per clock cycle (25.6 GFLOPS at 3.2 Ghz). Fgure 4-2: PPE Block Dagram (source: Cell Broadband Engne Programmng Handbook v1.1 [51]) The PPE conssts of the PPU and local L1 and L2 caches. 32 Kb of L1 cache s used for nstructons, and 32 Kb s used for data. L2 cache holds both nstructons and data and has a sze of 512 Kb. The PPE also supports two-way smultaneous multthreadng (SMT), whch s very smlar to Intel s Hyper-Threadng technology, exposng tself as two logcal processors to the operatng system.

59 In accordance wth the smplcty n the desgn the processor ncludes a two-ssue norder core. Ths approach drastcally cuts down on the transstor count, allowng for greater power effcency and hgher clock rates. The PPE also manages many of the on-chp and off-chp resources for the proper ntegraton wth the rest of the system and hence runs the operatng system. Hardware resources on a CBEA system are memory mapped. The PPE has exclusve access to these resources usng these real addresses. In summary, the PPE s a smplfed and power-effcent processor desgned to work at a hgh clock rate. In most applcatons, the PPE s responsble for conductng the executon flow among the SPEs. 4.6 Synergstc Processng Elements On ts own, the PPE s no match for modern processors. By the phlosophy of the desgn, the PPE s only there to offload and coordnate all compute ntensve tasks on the remanng processng elements known as the Synergstc Processng Elements (SPEs). Smlar to the desgn of the PPE, the SPEs are meant to be smple, power-effcent, and fast. The desgn and ncluson of the SPEs s meant to fll the vod between general purpose processors, whch are meant to acheve good performance on a wde range of applcatons, and specal-purpose processors, whch are optmzed for a specfc task (such as graphcs processng unts). The eght SPEs that are avalable on the Cell processor n the PS3 are based on a novel SIMD archtecture tweaked for a hgh throughput of manly floatng pont operatons. Each SPE conssts of the Synergstc Processng Unt (SPU) core, a Local Store (LS), and Memory Flow Controller (MFC). Fgure 4-3: SPE Block Dagram (source: Cell Broadband Engne Programmng Handbook v1.1 [51])

60 The floatng pont unt on the SPEs s heavly optmzed for speed and ppelne length. In ths regard, as were prevous desgns for the Playstaton 2, the SPE processor's sngle pont floatng pont calculatons are not fully complant wth the IEEE754 specfcaton (see secton 4.7) whch may detract researchers from usng the processor for crtcal scentfc smulatons. However, these decsons were deemed acceptable for the mprovement n performance and n many cases are not detrmental to obtanng vald results n the aforementoned applcatons. The SPE does contan a double precson unt whch s complant to the IEEE854 specfcaton, although t s sgnfcantly (10-fold accordng to IBM at the ISSCC 2005) slower than ts sngle precson brother. Therefore, whle boastng a 256 GFLOPS sngle-pont capablty, t s only capable of about GFLOPS n double precson mode. Smlar to the PPE, the SPEs are n-order two-ssue cores. In addton, the SPEs do not nclude hardware branch predcton logc, placng a heavy burden on the compler. Sx executon unts are dvded among the odd and even ppelne on each SPE. The floatng pont and fxed pont unts resde on the even ppelne, whle the permute, local store, channel, and branch unts exst n the odd ppelne. The followng table shows the cycle tmes and ppelne for each nstructon type handled. Unt Instructons Executon Ppe Unt Ppelne Depth Instructon Latency Smple Fxed word arthmetc, logcals, countng Even 2 2 leadng zeros, selects and compares Smple Fxed word shfts and rotates Even 3 4 Sngle multply-accumulate Even 6 6 Precson Sngle nteger multply-accumulate Even 7 7 Precson Byte pop count, absolute sum of Even 3 4 dfferences, byte average, byte sum Permute Quadword shfts, rotates, gathers, Odd 3 4 shuffles, recprocal estmates Load Store Load, store Odd 6 6 Channel Channel Read/Wrte Odd 5 6 Branch Branches Odd 3 4 Table 2: Instructon types and ppelne relatonshps The novel SIMD-based ISA desgned specfcally for the SPUs s capable of operatng on sxteen 8-bt ntegers, eght 16-bt ntegers, four 32-bt ntegers, or four sngle precson floatng pont operatons n one cycle. Double precson floatng pont operatons are supported, but are not fully ppelned by the FPU and take sgnfcantly more clock cycles. An nterestng feature s that the same arrays n the floatng pont unt are used for floatng pont computaton and nteger multplcaton. In ths way, nteger multplcatons are passed nto the FP ppelne whch bypasses the FP handlng to perform the multply. Table 3 descrbes the 6 unts n the two ppelnes:

61 Unt Floatng Pont Unt (SFP) Even Fxed Pont Unt (SFS) Odd Fxed Pont Unt (SFX) Control Unt (SCN) Load and Store Unt (SLS) Channel and DMA Unt (SSC) Responsbltes sngle-precson double-precson 16-bt nteger multples conversons byte operatons arthmetc nstructons logcal nstructons word SIMD shfts and rotates floatng-pont compares floatng-pont recprocal and square root estmates byte granularty shft rotate mask shuffle operatons on quadwords fetchng and ssung of nstructons to ppelnes branch nstructons arbtraton of access between LS and regster fle other control functons load and store nstructons hnt for branch nstructons DMA requests to the LS communcaton data transfer control nto and out of the SPE Table 3: SPE Executon Modules The SPE executon unt has access to a large unfed regster fle wth a total of 128 regsters, each 128 bts n sze. Most nstructons operate on the 128 bt operands by treatng them as four separate 32 bt operands. The regster fle contans 6 read ports and 2 wrte ports. Wth most nstructons havng 3 source operands and 1 destnaton operand, ths meets the requrement for havng two nstructons execute per cycle. The processor makes heavy use of a forward-and-delay concept to avod access latency of a regster fle access durng successve dependent nstructons n the ppelne. It s easy to see how relant such a desgn s on havng nstructons and data at hand n tme for executon. A cache, no matter how effcent, ntroduces an unrelable and unpredctable performance element, and therefore must be abandoned. Instead, each SPE ncludes 256 Kb of exclusve local memory known as the Local Store (LS). The LS s a prvate, non-coherent address space that s separate from the system address space and holds both data and nstructons. It s mplemented usng ECC protected arrays of sngle ported SRAM. Local Store access tmes are equvalent to that of a cache at 6 cycles per access. The programmer must manually transfer contents between the LS and man memory (as well as between LSs of dfferent SPEs) usng specal commands bound for the MFC. The MFC s controlled asynchronously and runs n parallel to the SPE. It s capable of sustanng 16 outstandng commands and uses the Power Archtecture page protecton model for ts DMA-based nterface. Ths model mples a consstency n memory mappng across the heterogeneous devces on the chp. The result of ths s that memory addresses can be nterchanged wthout ssue.

62 The LS s physcally the largest part on the SPE cell and s mplemented n four separate arrays of 64 KB each. It s nterfaced va a sngle port to reduce chp area, makng t necessary to arbtrate between DMA reads, wrtes, nstructon fetches, loads, and stores. The LS has a narrow (128 bt) and wde (128 byte) read/wrte port. The wde port s used for DMA reads and wrtes and nstructon fetchng and prefetchng. Hghest prorty s gven to DMA commands, followed by loads and stores. Instructon fetchng occurs at all other free cycles. There exsts a specal no-op nstructon avalable to the programmer whch allocates cycles for nstructon fetchng. The LS s connected to the man memory bus va a 128-bt memory bus. As mentoned, the LS s not transparent and the programmer/compler has full control of ts contents. To take advantage of the two avalable ppelnes, memory and core nstructons are capable of beng executed smultaneously. IBM has coned a new form of parallelsm called Compute-transfer parallelsm. An applcaton thread on an SPE has two threads of control - the SPU thread and SMF thread. Ths feature allows the compler/programmer to decouple data fetch and use. The operatons utlzed by the SPU thread tend to execute on the even ppelne, whle those for the MFC execute on the odd ppelne. These desgn decsons do place a larger porton of the burden onto the programmer and compler. However, they are stll deemed more flexble than those of a specal-purpose processor. 4.7 Floatng Pont Number Representaton The Cell processor was orgnally ntended for multmeda applcatons, such as real-tme 3-D gamng, meda streamng, and sgnal processng. In contrast to scentfc applcatons, these applcatons do not requre many of the features ncluded n the IEEE 754 floatng pont standard. Exact roundng, exceptons, and de-norm number handlng are not crtcal as they make an ndscernble dfference to the eye or ear of the customer. To take advantage of ths fact, the SIMD floatng pont unt on the SPU takes several shortcuts thus devatng from the IEEE 754 standard. Double precson support s IEEE 754 complant, but t almost an order of magntude slower. Newer mplementatons of the CBEA archtecture wll feature fully ppelned double precson floatng pont support. The SPU floatng pont mplementaton s capable of representng a slghtly larger range of normalzed numbers by utlzng the least sgnfcant bt of the exponent feld for the maxmum value (see Fg. 4-4). The representaton of postve, nonzero numbers ranges 126 from S mn = to S max = ( 2 2 ) 2. The correspondng numbers n the IEEE standard are and ( 2 2 ) s one least sgnfcant bt less than 2. Results whch exceed Smax are clamped to Smax wth the approprate sgn; those that are smaller than Smn are set to 0 (always postve sgn). Infnty values are not supported. n whch the value ( )

63 Fgure 4-4: Sngle Precson Floatng Pont Representaton (source: Cell Broadband Engne Programmng Handbook v1.1 [51]) Two other devatons from the IEEE standard are (a) the omsson of denormalzed number support wth such numbers beng treated as zero and (b) the ncluson of only the round towards zero (truncaton) roundng mode. Double-precson floatng pont operatons are performed on the FPU unt and do support the IEEE 754 standard, but are not fully ppelned. They are performed as two doubleprecson operatons n 2-way SIMD fashon. The operatons are performed back to back n consecutve nstructon slots n the ppelne and cannot be dual ssued wth any other nstructons. Of the 13 clock cycles requred, only 7 are ppelned. In addton, no nstructon can be ssued for sx cycles after the double precson nstructon s ssued [51]. 4.8 Element Interconnect Bus The PPE, SPEs, memory controller (MIC), and a bus nterface controller (BIC) are all connected usng an on-chp, low-latency, hgh-bandwdth rng bus known as the Element Interconnect Bus (EIB). Each unt s connected va ts own Bus Interface Unt (BIU). The EIB conssts of the data network (four rngs), command network (tree), and data arbter network (star). The rng bus runs at half the clock speed of the core clock frequency and conssts of ndependent data and command networks. Due to the hgh bandwdth capablty of each unt on the network (51.2 Gb/s aggregate njecton and recepton bandwdth) the EIB must be fast enough to avod beng a bottleneck. A total of four 16B drectonal data rngs are ncluded: two clockwse, and two counterclockwse. Each unt s allowed to transfer one 16B block every bus cycle and each rng s capable of processng three concurrent transfers smultaneously so long as ther paths do not overlap and the source and destnaton dstance s no more than half the dstance of the entre rng. Access to the data network s credt based wth each unt startng off wth a number of credts lnked to the sze of the command buffer wthn the EIB for that unt. A credt s used on the executon of a request and returned when the request moves nto a further

64 stage n the EIB request ppelne. The central Data arbter, connected to the unts by a star network s responsble for grantng access to the data rngs. Whle the MIC s gven greater prorty to mnmze stallng, the SPEs and other unts are gven equal access by usng a round-robn scheme. The decson for grantng access s based on two man factors: whether the total dstance of the transfer s less than half the total rng dstance (two of the four channels always meet ths crteron) and whether the transfer would nterfere wth an exstng transfer. Each element s allowed to have up to 64 outstandng requests (SPEs support only 16 outstandng requests). The sngle shared command network s arbtrated through the use of a tree of fve fully ppelned address concentrators (ACs) whch handle collson detecton and preventon. Each command s propagated up the tree up to the root address concentrator AC0, whch can process one command per every two bus cycles. The BIC controller s splt nto two separate noncoherent nterfaces (IOIF0 and IOIF1). Multple Cell chps may be nterconnected va the IOIF0 nterface to form one coherent rng wth the help of the Broadband Interface (BIF) protocol. 4.9 Memory Interface The Cell Processor s sandwched n between two Rambus nterfaces - the XIO memory nterface and the FlexIO host data and control bus nterface. External Rambus XDR memory connects through two XIO channels also a Rambus desgn. Together wth the memory nterface controller (MIC) on the chp, desgned by IBM, memory bandwdth s rated at 3.2 GHz/s (revew) per channel, whch translates to a theoretcal maxmum of 25.6 GB/s (assumng that the memory banks are kept actve contnuously by request streams of the same type and 128B request szes). Actual memory bandwdth s also lmted due to typcal memory operatons such as refreshng and scrubbng. Interleaved read and wrte requests result n an effectve bandwdth of about 21 GB/s due to the need for repettve overturnng of the MIC-to-XIO bdrectonal bus. Both channels can operate on eght banks smultaneously, and can operate on a maxmum of 256 MB of memory (512 MB for both channels). The MIC contans two queues for each channel one for readng and one for wrtng. It performs all arbtraton control ensurng hgh data rates. A hgh prorty read request s supported, and takes precedence over normal reads and wrtes. The system lnk (FlexIO) s also desgned by Rambus. The parallel nterface conssts of seven transmt and fve receve RAMBUS Redwood Rambus ASIC Cell (RRAC) FlexIO lnks. Each lnk s 1 byte wde. The nterface s capable of beng clocked ndependently, (5 GHz on the Playstaton 3). Several desgn challenges had to be overcome to allow for the mult-hundred Ggabt aggregate bandwdth that s demanded from the Cell processor. These nclude channel dstorton, temperature drft, and supply nose.

The FlexIO is theoretically capable of 35 GB/s total outbound and 25 GB/s total inbound raw bandwidth at 5 GHz. The name "flex" can be attributed to the interface's capability of being divided into two logical interfaces, each allocated a portion of the pins and bandwidth. This way, it is possible to optimize the interface scheme based on the devices present external to the Cell chip. The Cell processor itself reserves 4 inbound and 4 outbound lanes for memory coherency. There is considerable overhead being transmitted on the I/O interface during data transmission. Data and commands are encapsulated in packets which contain information such as tags, data size, command ID, flow control information, etc. This overhead can have a large impact on actual throughput performance.

4.10 Previous Work on Cell Processor

The Cell architecture, while originally designed mainly for multimedia applications, has gained much respect in the scientific community over the course of its lifespan. It turns out that the non-IEEE-compliant hardware is still applicable to many applications. While not entirely a new concept, the CBEA does hold potential to accelerate the computation of existing real-world problems. In [52], Williams et al. implemented several common scientific calculations on the Cell processor, including dense matrix multiplication, sparse matrix multiplication, stencil computations, and 1D/2D FFTs, and compared the results to those on existing hardware. The Playstation 3, which has been used in this research, has been successfully utilized for scientific research in many projects since its launch in late 2006. The most popular example is the Folding@Home project [53], a distributed computing project aimed at simulating protein folding with the goal of better understanding the development of human diseases. Playstation 3 users are given the option to download a Cell-optimized client and run it when not using their system for other tasks. On September 15, 2007, the release of a new client led to the breaking of the petaflops barrier, which was a first for any computing system in history and was recognized by the Guinness Book of Records [54]. Another example that has recently been getting attention is the simulation of the collision of black holes and the gravitational waves that they produce, in a project named the PS3 Gravity Grid led by Dr. Gaurav Khanna at the University of Massachusetts [55]. The tightly-coupled Beowulf cluster is composed of 16 Playstation 3 systems. Other examples include Axion Racing's utilization of the PS3 for its stereo vision algorithms in its 2007 entry into the DARPA Urban Challenge [56], Security-Assessment.com's password cracking [57], and intrusion detection pattern matching algorithms [58].

Chapter 5: High Performance Programming on the Cell Processor

5.1 Chapter Introduction

This chapter is intended as an introduction to the tools and recommended methods and strategies for programming high performance applications on the Cell Processor. Its inclusion was deemed fitting due to the substantial effort placed into the optimization of the Multi-Layer Perceptron and Support Vector Machine algorithms for the Cell architecture. While the topics covered are described in the context of the Cell architecture, many apply to and have roots in the field of general high performance parallel as well as vector programming. In fact, the practices described in this chapter can likely be applied to other similar architectures. The objective is to collect the major strategies into one text, serving as a reference. The chapter starts with a description of the development tools that were freely available at the time of this work, as well as the API and other programming facilities and intrinsics for common tasks such as inter-processor communication and data transfer. The available levels of programming are compared in terms of development effort and control granularity. Next, programming strategies are explained in detail, starting at the lowest levels. Instruction issue timings, branch minimization, vectorization, and similar topics are covered first. Discussion extends to higher level strategies, such as data management and job distribution. Finally, high level development strategies that have been made public by several knowledgeable Cell programmers are outlined, and the type of work that is suitable for the SPEs is characterized, providing a path into the next chapters, which discuss the specific MLP and SVM Cell implementations.

5.2 Support and Development Tools

It can be a daunting task to begin development on a completely new and complex architecture such as the Cell Processor. The decision to do so is even more risky if the financial future of the developer (such as a game developer in the case of the Playstation 3) is at stake. IBM, Sony, and Toshiba understood this concern and made it in their best interest to educate potential programmers as well as facilitate the actual development process. For example, although occurring after the release of the Playstation 3, Sony did hold sessions in which it educated certain companies on the topic of proper exploitation of the Cell's hardware for maximum performance.

With the Cell being an open platform, there is an abundance of information posted in the form of papers, digital books, training documents, manuals, and community forums on IBM's servers. This material, which is updated on a regular basis, details both the hardware and software sides of the system. Anyone interested in writing applications for the Cell, or simply in understanding the hardware, can easily obtain the technical documentation. Recommended documents are included in the reference section at the end of this chapter. In addition to supporting documentation, there are many freely available, yet quite advanced, tools written specifically to help in maximizing this hidden performance. The Software Development Kit (version 3.0 at the time of this writing) that is available for download from the IBM developerWorks website is a CD image containing all that is necessary to get started given a supported Linux operating system. An optional extra CD image containing prewritten code examples, some lesser used tools, and useful SPE-targeted libraries is available as well.

The Full System Simulator

The most useful tool within the SDK while developing code is systemsim, the Full System Simulator. This piece of software (which runs on PowerPC, x86, and x86-64) simulates systems based on PowerPC (or related) architectures such as the Cell itself. Cell support was designed and utilized throughout, and after, the product development cycle of the actual Cell hardware chip as a means of obtaining vital feedback that is just not obtainable from actual hardware. In fact, Linux was running on the simulator two years before the actual hardware was available. Today, with the processor available on the market, the application is freely available for Cell developers, who use it as a debugging and/or performance evaluation tool [59]. While the simulator is not perfect, it is considered to be fairly accurate and complete. All the elements needed to run an operating system such as Linux are simulated. In fact, when running Linux-based applications, the simulator first loads a PowerPC-based Linux kernel from which the application under test is executed via the simulated command shell. It is also possible to run in standalone mode, in which no underlying Linux OS is utilized. The usefulness of the Full System Simulator lies in its ability to expose the programmer to every essential component of the chip at any cycle in the simulated program's execution. Simulated components include all the elements pictured in Fig. 5-1. One notable difference from the actual hardware is the simulation of DDR2 memory instead of RAMBUS. The Instruction Set Architecture is also modeled down to the specifications.

Figure 5-1: The layers of the Cell simulator (source: CBE Tutorial v2.1 [60])

Upon starting the application, three windows are presented, as shown in Fig. 5-2: a command line and graphical interface into the simulator, and a text console as it would appear on the screen of the simulated system. The graphical simulator interface provides a display of the state of the simulated system, including the PPEs and SPEs. It allows for the viewing and modification of various items such as memory, register, and channel contents through dialogs, and the graphical representation of system state, history, and statistics.

Figure 5-2: The UI of the Cell simulator (source: CBE Tutorial v2.1 [60])

The most useful feature of the simulator, which has been utilized several times in this work, is the ability to collect a summary of per-SPE performance statistics for any portion of the code. The programmer uses three function calls to utilize this facility: prof_clear, prof_start, and prof_end. The actual code generated has no performance effect when running on real hardware and is utilized only by the simulator environment. prof_clear zeroes all collected statistics for that particular SPE. The simulator begins analyzing and collecting simulated events on prof_start and ends at prof_end. An example of the resulting report, which is updated in real time during program execution, is shown in Fig. 5-3.
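For reference, a minimal sketch of how these checkpoints are typically wrapped around a section of SPE code is shown below. The function names follow the text above; the exact spelling (for example prof_stop rather than prof_end) and the header name may differ between SDK versions, so treat this as an assumption rather than the definitive API.

#include <profile.h>   /* simulator profile checkpoints (header name per SDK) */

void profiled_section(float *buf, int n)
{
    prof_clear();               /* zero this SPE's collected statistics        */
    prof_start();               /* begin collecting simulated events           */

    for (int i = 0; i < n; i++) /* the code region under measurement           */
        buf[i] *= 2.0f;

    prof_end();                 /* stop collection; the statistics window in   */
                                /* systemsim then reflects this region only    */
}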

Figure 5-3: Profiling results (source: SystemSim User's Guide)

The information presented in the profiling statistics window gives a good representation of how well the code is utilizing the available hardware. As evident in the example, only 4% of the cycles utilized dual issue, and nearly half the cycles were spent waiting for a data dependency resolution. Looking at the number of DPFP instructions executed, and the small number of registers used, it is likely that the code performs a simple DPFP operation in a loop. DPFP operations are not well pipelined on the SPEs and are likely the source of the dependency stalls. The simulator has access to the host system via two interfaces: the callthru utility, which allows for the transfer of files, and the BogusNet interface, which sets up a virtual network interface with the host (BogusNet is actually an extension of callthru). By enabling the BogusNet interface, the possibility for remote debugging of simulated applications via the GDB (GNU Project Debugger) debugging toolset is opened. The user starts the program

under test (PUT) under the context of gdb-server in the simulator and then attaches to the server using a GDB client on the host computer over BogusNet. The GDB toolset included in the Cell SDK has been enhanced with some useful Cell-based functionality. The user is capable of viewing and stepping through the source code for both PPE and SPE modules (separate PPE and SPE executables are usually necessary, as will be mentioned next). The enhanced debugger has the ability to examine events, signals, mailbox contents, and DMA transfers via new "info spu [item]" commands, as well as detect bus errors on DMA transfers. Common functionality such as setting of breakpoints and viewing and modification of raw memory is also included. The simulator includes a Tcl interpreter which allows for the creation of scripts that can be programmed to trigger on a certain simulator event. These scripts may be used to control the simulator via the available simulator commands. For example, profiling events can be bound to a script which captures the statistics and files them for later processing. This allows defining multiple profiling sections within the code, each filed separately. The Emitter Framework is a facility for decoupling the production and processing of simulator events. Emitter readers can be written and attached to buffers associated with events for the purpose of graphing or collecting event statistics. Examples are included on the SDK CDs. Further information is included in the SystemSim User's Guide. Two other useful tools available in the SDK are Asmvis (Assembly Visualizer for Cell Broadband Engine) and FDPR (Feedback Directed Program Restructuring) Pro. The first is a static performance-tuning utility. This tool allows the programmer to manually open a generated assembly file, navigate, and reorder instructions by hand using an easy-to-use GUI interface. The tool's built-in ruleset prevents the user from accidentally breaking the dependency logic. This tool can be used to optimize specific sections in the code that are known to be executed often [61]. FDPR Pro is a dynamic performance-tuning utility. This utility optimizes the execution of a program image by collecting runtime information under a typical workload. Once the information is collected, the image is restructured to optimize performance [62].

5.3 CBE Embedded SPE Object Format

When writing software for the Cell Processor and using a dual-source compiler (such as the GNU GCC or IBM's dual-source XLC compiler), the PPE and SPE binaries are written and compiled separately using different compilers, each specialized for one of the two instruction sets. The traditional tool chain infrastructure and Executable and Linking Format (ELF) defined in the Tool Interface Standard does not support inter-architectural linking and therefore makes it difficult to define inter-architectural symbols and bindings. The CBE Embedded SPE Object Format (CESOF) was developed for this reason and allows for the embedding of the SPE executable within the PPE executable. The CESOF format is an application and extension of the ELF standard; it does not modify it in any way. As another option, it is possible to load the SPE executable at runtime using the underlying operating system's file operations, thus keeping the two binaries separate. In addition to the two methods above, there is also a runtime

environment for running standalone SPE programs, known as SPUlets, by simply executing them on the command line. For more information, see [].

It is important to note that shared data structures may be represented differently on the two architectures, and special care must be taken when including the same header files between source code targeted for the two architectures. For example, when compiled to make use of the PPE's 64-bit architecture, pointers on the PPU are 64 bits long, while on the SPE they are always 32 bits long. In this situation, it is advisable to use a custom address data type such as shown in Listing 10.

typedef union {
    unsigned long long ull;
    unsigned int ui[2];
} addr64;

Listing 10: Shared data structure of a pointer type

It allows the SPE to receive effective addresses via any of the communication mechanisms available between the PPU and SPE.

5.4 Levels of Programming

On the SPEs, the programmer has the option to write code on one of several levels, including hand-crafting assembly code, as well as C/C++ code utilizing some of the available intrinsics for SIMD processing and MFC interfacing. The compilers available, while generally very capable, are unable to vectorize scalar code and can only exploit optimization opportunities that are given to them by the programmer. It is, therefore, a big task for the programmer to write code in a manner that the compiler can recognize. Naturally, programming methods may be combined so that hot code is hand optimized for best performance. Compiler-generated assembly code can also be examined and tweaked, although this is generally very difficult unless the developer has a deep understanding of the chosen compiler's methods. The following sections expose several design strategies that can have a huge impact on execution time and hardware utilization. Tips are also provided to ease development and testing on this complex architecture. The concepts are presented in order from low level to high level. In general, the hardware has a greater influence on code at the low level, and the algorithm has a greater influence at the higher levels. In practice, the tactics described in these sections would generally be applied in reverse order, starting with a high level code/algorithm organization, and optimizing segments of code as time goes on. The reverse approach in this document makes more sense when introducing the ideas.

Low Level

As expected, the low level design choices are made almost exclusively due to the underlying hardware. On the Cell, it tends to be the case that faster versions of code are more complex and are more prone to programmer mistakes (and frustration). As a good

practice, it is useful to write non-optimized and easy-to-understand code sections first as a baseline. Once verified, new and optimized sections can be appended and surrounded by conditional compiler directives so that the multiple versions can be toggled for debugging. Retaining the non-optimized sections in the source code also makes it easier to understand the purpose of code sections written previously or by someone else. When experimenting with these optimizations, it is useful to employ systemsim and utilize the selective profiling feature (profile_start, profile_stop) to study the improvements.

SIMD

One of the major factors to consider when writing software for the SPEs is that, due to the 128 x 128-bit unified register file and SIMD nature of the function units, they can load and store only 128 bits at a time (known as quadwords, or qwords for short). With the qword being the only native data type on the architecture, reading a scalar value of any size in the Local Store (stack or global space) requires the loading of that value as well as the spatial overhead that surrounds it. In addition to this overhead, for the functional units to process a scalar value, it needs to be loaded into a preferred slot within that 128-bit register. For this to occur, the compiler (given low optimization options) generates code that rotates the 128-bit register so that the value is placed in the correct slot. Writing the variable requires reading the content of the destination memory into a register, inserting the scalar value, and writing back the result. Higher compiler optimization options align the scalar variables so that the value is already in the proper slot. In all cases, however, the LS space is not utilized efficiently, making it a priority to minimize the use of scalar code whenever possible. In cases where scalars are necessary, the __attribute__((aligned(16))) directive should be used to help the compiler align the variable on a 128-bit (quadword) boundary, placing it into the proper slot. Utilizing the SIMD hardware on the SPEs is one of the strongest methods for exposing parallelism on the entire Cell architecture. To encourage its use, C/C++ language extensions are provided which define vector data types and so-called intrinsics. All vector types are 128 bits in length and are always aligned on 128 bits (fitting the hardware model), thus eliminating the need for shuffling of data when loaded from memory. Intrinsics are functions which provide all the necessary operations for the processing of vector types. They include floating point operations, bit operations, comparisons, quadword shifts and shuffles, element-wise rotates, conversion between float and scalar, etc. Two levels of intrinsics are provided: composite and specific. The composite intrinsics are easier to use as they automatically select the proper assembly instructions based on the data types, and may generate multiple assembly instructions depending on the context. The programmer does not need to remember every version of the instruction. Composite intrinsics have the form spu_instruction. A specific intrinsic, on the other hand, is mapped to exactly one assembly instruction and is called using the si_assemblyinstruction form. In both cases, the programmer has more direct control over the instruction code that the compiler generates. For example, the programmer can use the shuffling and rotating intrinsics to override (and possibly optimize) any data alignment that the compiler would otherwise generate.
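As a small illustration of the difference, the following sketch contrasts a scalar loop with a 4-way SIMD version written with the composite intrinsics. It is only an example; the vectorized version assumes the arrays are 16-byte aligned and that n is a multiple of four.

#include <spu_intrinsics.h>

/* Scalar version: y[i] = a * x[i] + y[i] */
void saxpy_scalar(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* SIMD version: one quadword (four floats) per iteration. */
void saxpy_simd(vector float *y, const vector float *x, float a, int n)
{
    vector float av = spu_splats(a);          /* replicate a into all four slots */
    for (int i = 0; i < n / 4; i++)
        y[i] = spu_madd(av, x[i], y[i]);      /* fused multiply-add per quadword */
}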

Branch Reduction

Due to the hardware's lack of dynamic branch prediction, there are numerous strategies that can be utilized to improve performance. Often, the value of a variable is determined based on some condition on the value or values of other variables. Listing 11 summarizes the idea.

if( test_var meets some condition )
    determine value_a
    dest_var = value_a
else
    determine value_b
    dest_var = value_b

Listing 11: Generic branch structure

The SPE instruction set features two separate instructions, vector comparison and select, that can be utilized in tandem to obtain the same result. The method often takes fewer cycles than it would to resolve a branch condition, especially if making use of the entire set of elements within the vector. The compare and select instructions are exposed using intrinsics as discussed in the previous section. Sticking with the same variable names as above, the procedure utilizing these intrinsics is shown in Listing 12.

Determine value_a
Determine value_b
select_vector = spu_comparison( test_var, compare_value )
dest_var = spu_sel( value_a, value_b, select_vector )

Listing 12: Alternative without conditional control flow

As the general example shows, the code required to calculate both value_a and value_b needs to be executed, but only one of the values is actually kept. The decision to use this method often comes down to the relative computational requirements of computing both values versus resolving a branch condition (and performing one of the computations). In some cases, it is known that the condition evaluated will be either true or false for a majority of the time. The instruction set provides special branch hint instructions that, if strategically placed, can be used to explicitly direct program execution. If the hint is incorrect, the pipeline ends up being flushed and a stall occurs as new instructions are loaded from the correct location. In order to be effective, the branch hint instruction needs to appear several cycles before the branch resolution instruction.

Dual Issue

As covered in Chapter 4, the SPEs have two instruction pipelines and allow for the dispatching of two instructions in one cycle. While the compiler does a good job of aligning instructions in the proper locations by inserting NOPs where necessary, the programmer needs to choose instructions carefully to satisfy the remaining conditions that need to be met for dual issue to occur. In many situations, it is often possible to perform the same function using different sequences of instructions. The programmer

should be aware which instructions execute on which pipeline and attempt to distribute them evenly between the two. Loop unrolling, a common practice in code optimization, is an excellent way to increase dual issue rates. Doing so reduces the dependency chain between instructions, giving the compiler more instructions to shuffle around. The technique also reduces branches, further improving performance. In many cases, when the number of loop iterations is unknown, extra effort must be put into writing the preamble and/or epilogue. When the number of iterations is small and known, the loop can be removed altogether. The main disadvantage of loop unrolling is an increased code size, something the SPE gives the programmer a limited budget of. It is up to the programmer to determine those loops which are executed most frequently and with the most iterations and give those higher priority.

Mid Level

The optimization of code sections can only help as long as there is a smooth and constant flow of work in the form of commands and/or associated data. Fast code is only fast when it is executing. By analyzing the implemented algorithm, mid level optimization can be exercised to keep the processors busy and keep them from stalling. This section focuses on that very task.

Programming the DMA Controller

Each SPE contains its own dedicated Memory Flow Controller. This module is primarily responsible for the transfer of data between the given SPE's LS and all system-wide addressable hardware. The SPE communicates with its MFC via so-called channels. The C/C++ SPE extensions include MFC functions which automatically generate sequences of channel commands that perform common tasks such as queuing a DMA transfer request or checking on the status of an active request. The most basic command is the request for the start of a DMA transfer. Parameters include the system-wide effective address of the source, the local destination address within the LS of the current SPE, the size of the transfer in bytes, and the tag for the DMA request. The source and destination addresses need to be 16-byte aligned (128-byte aligned for optimum performance), and the size of the transfer needs to be a multiple of 16 bytes (again, 128 bytes is faster). The tag parameter does not need to be unique and is used as a means for subsequent inquiries into the status of the transfer(s). It can also be used as a way to enforce transfer order, as will be explained in the next paragraph. Two additional parameters, the transfer class ID (tid) and replacement class ID (rid), are available. They may vary between CBE implementations. On the Cell, the rid influences L2-cache and TLB (translation lookaside buffer) replacement. The tid influences the allocation of bus bandwidth. These two parameters were not used in this work. More information can be found in [51]. By default, there is no guarantee of the order in which the DMA controller executes the transfer requests. The DMA controller has its own techniques for minimizing total transfer time and may choose requests in any order it deems fit. To force the order of transfer requests, two command modifiers are available: fence and barrier. By including the fence modifier,

the controller is forced to wait until all previous transfers with the same tag are complete before continuing with the current one. A barrier is a stricter version of a fence, in that the current transfer is forced to complete before any subsequent transfers with the same tag. The asynchronous nature of the MFC in relation to the SPE allows the programmer to efficiently overlap in-flight DMA transfers and data processing and, in effect, hide or minimize stalls that result from a lack of input data. This technique is known as multi-buffering, and is described next.

Multi-Buffering

A traditional method for hiding memory access latencies in similar architectures (such as graphics chipsets) is known as double, triple, or multi-buffering. On the Cell, implementing double-buffering requires the reservation/allocation of two buffers in the LS, preferably of equal size. Both the MFC and SPE have direct access to the LS and can read and write from it at the same time. The concept is to always have the SPE working on one buffer while the DMA controller is sending or receiving contents into the other buffer. The idea is expressed in Fig. 5-4. Essentially, the buffers make up a circular queue of size 2. By extending the number of buffers, the concept generalizes to multi-buffering. The SPE always works on one buffer at a time and, optimally, will have the succeeding buffer filled and ready by the time it is done with the current one.

Figure 5-4: Double buffering

The main drawback to multi-buffering is the need for additional memory for the extra buffers. Given that there is M memory available and n buffers, the size of each buffer becomes floor(M/n). In the case of the SPEs, the buffers need to be aligned on a 16-byte boundary and be a multiple of 16 bytes each, introducing some padding overhead. Although usually insignificant, the additional multi-buffering code and status variables add to the overhead as well.
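A minimal double-buffered input loop might look like the following sketch, assuming a fixed chunk size and a placeholder process() routine (not part of the original code); tags 0 and 1 track the two buffers independently.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per buffer; a multiple of 16 (128 preferred) */

extern void process(void *buf, unsigned int size);   /* placeholder consumer */

static volatile unsigned char buf[2][CHUNK] __attribute__((aligned(128)));

void stream_in(unsigned long long ea, unsigned int nchunks)
{
    int cur = 0, nxt = 1;

    /* prime the first buffer (tag 0) */
    mfc_get((void *)buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned int i = 0; i < nchunks; i++) {
        /* kick off the next transfer into the other buffer */
        if (i + 1 < nchunks)
            mfc_get((void *)buf[nxt],
                    ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);

        /* wait only on the current buffer's tag, then work on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process((void *)buf[cur], CHUNK);

        /* swap buffer roles for the next iteration */
        cur ^= 1;
        nxt ^= 1;
    }
}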

Inter-SPE Communication

Each SPE's MFC is capable of reading from and writing to all of system-wide memory. During the initialization of the SPEs, the programmer has the option to map the LSs and so-called problem state of each SPE into this globally accessible address space. The problem state area takes the form of a C struct and includes the addresses of various SPE registers related to multi-source synchronization, proxy DMAs, mailboxes, and signal notifiers. The problem state area and LS base addresses can be obtained by using functions in the SPE library on the SPE context objects during SPE initialization. Once the SPEs are running, a collection of this data can be sent and stored in a local table on each SPE. Having this information, any SPE can DMA contents into the LS of any other SPE as well as communicate with the other SPEs via the memory-mapped mailbox and signal registers taken from the problem state area. Giving each SPE this much power gives the developer much flexibility in the communication model. It is up to the programmer to develop a communication model that fits well into the application.

SPE Shaders

A flexible, but potentially difficult, programming technique is to design small, dynamic programs which are automatically sent via DMA transfers onto the SPE whenever necessary. These dynamic code fragments are made to be self-contained and are transferred onto the SPEs in the same manner as normal data. In fact, code can be handled in the same way that data is on the Cell. Multi-buffering of code fragments, for example, is not out of the question. Insomniac Games has shared its method for doing so in the context of the Cell processor [1].

Figure 5-5: SPE Shader concept

The idea of using dynamic code fragments was inspired by the shader programming model that is utilized when programming graphics cards. In fact, Insomniac refers to its system as SPE shaders. A shader is a specialized fragment of code used in an existing system that modifies the system data using a custom input/output interface. The location of the fragments is predetermined by the system they are part of. Logically, the fragments may represent asynchronous data processing utilities or be part of a system pipeline, etc.

The usefulness of this technique on the Cell is apparent when the application implementation requires more storage than the SPE's LS can provide, as well as at times when specific code sections are needed only sparsely. The method, once designed, is easy to conceptualize and maintain. Also, being self-maintained and treated as data, the shader management library utilizes the same techniques as those for data transfer. The Cell development library has support for the programming of so-called code overlays. It is a method, similar to the one described above, in which designated code is not loaded into the LS until it is necessary. Code is divided into segments. When one segment calls another segment, a table in memory is indexed to see if it is in memory. If not, the segment is transferred in, possibly overwriting another segment, before execution continues. For an introduction to overlays on the Cell, see [51]. It was Insomniac's opinion that this model was too complex and had unnecessary overhead for their purposes, leading them to come up with their own version. The difficulty with developing a custom implementation of this technique is implementing the library for managing these fragments. It can be very time consuming to develop a library that is both efficient and flexible. Also, as is evident from the above, the final product may work well for some applications, but not others.

Job Distribution and Synchronization

Besides multi-buffering, there are other implementation decisions that can influence the utilization ratio of the SPEs. These decisions are largely influenced by the application being implemented. The following are common scenarios and good practices given the most widely used server programming model, in which each SPE receives one or multiple jobs from the PPE to process. When writing parallel code, one of the first tasks is to find a method for dividing the overall work into a number of tasks, or jobs. In the best case scenario, this number is known ahead of time, the jobs are not dependent on one another, and each job takes the same amount of time to complete. A fast implementation would involve sending each SPE an equally sized list of the jobs it is responsible for. Each job list element should include information about the effective addresses of any inputs, the effective addresses of output destinations for any produced data, and any other required job-related information. Another, slightly better, method would be to have the SPEs all receive the same exact job information and a personal SPE ID. This way, each SPE would select and transfer its jobs as a function of its ID. These methods require minimal communication between the PPE and SPEs during the job processing step and, if proper multi-buffering is used, make it easy to obtain close to optimal performance. The first complication may arise due to uneven or unpredictable job completion times. An effective solution may be to create a separate job buffer for each SPE that the PPE checks and fills periodically from the main job buffer. Completed jobs should be marked as such in the SPE's job buffer in main memory so that the PPE can replace them. While it is possible for the SPEs to arbitrate on a common buffer by performing atomic operations, this would require far more synchronization. A simpler solution is to send exclusive batches of jobs to all SPEs. The number of jobs should be large at the beginning, but

decrease in size as there are fewer jobs available in the global job pool. Smaller batches imply more communication overhead in exchange for better job distribution. Another complication may be interdependence between different jobs. If the dependencies are known, the PPE can dispatch batches of independent jobs at a time. Otherwise, if the dependencies are dynamic, the PPE can use, for example, a dependency tree and evenly queue up those jobs which are ready to run among the job buffers previously described. If there is too much dependence, however, at least two options exist. The first option involves finding parallelism within the jobs themselves and distributing that between the processors. The second option is to forgo the data distribution model for something else, such as the pipeline programming model described later. A global strategy that applies to all these methods is to keep the large portion of the input and output data out of the PPE's L2 cache. To promote this, the PPE should not be accessing the data until the SPEs are finished. DMA transfers between the LS and system memory are characterized as having high bandwidth and moderate latency; those between the LS and the PPE's L2 cache are characterized as having moderate bandwidth and low latency.

SPE-Initiated DMA Transfers

Each SPE's DMA controller has a proxy DMA queue that functions similarly to the main DMA transfer request queue, with the exception that it may be accessed from other processing elements. The PPE, for example, is able to queue up DMA transfers between main memory and the local SPE's LS. While attractive at first, the use of this queue is not recommended if high performance is desired. Rather, the SPEs should perform their own DMA transfers for several reasons: there are more of them, the proxy queue is half as deep as the main queue, it is easier to verify DMA completion when pulling data on the consumer side, and the number of cycles to queue a request locally is smaller.

High Level

In this section, issues surrounding the decisions for programming models, the structuring of the algorithm itself, distribution of workload, and development and testing strategies are laid out. Many of the tips have been shared by Insomniac at the Game Developers Conference 2008 in San Francisco.

Data Design over Code Design

When designing high performance code on the Cell, it quickly becomes evident how important data management is. Designing for data is just as, if not more, important than code design. For example, an algorithm may have a multi-stage data processing step that needs to execute on small chunks of data. The traditional, synchronous model in which each chunk is processed individually would be very inefficient if implemented on the Cell. First, the processing of each chunk on an SPE includes the time needed for synchronization between the PPE and SPE. Secondly, such a model is inherently scalar, causing underutilization of the SIMD functional units, with many dependency chains causing dependency stalls. The SPE local store would most likely not be utilized to the

full extent that it could be, either. The final code would most likely not scale well to larger data sets and would have poor data locality, hence poor cache utilization. The proper procedure would be to group, or compress, the data together, while possibly breaking down the multi-stage data processing step into shaders or overlays, for example. Grouping data together reduces synchronization requirements and introduces opportunities for instruction level parallelism via SIMD utilization, loop unrolling, dual-issue, etc. Having explicit control over DMA transfers puts the burden of managing what the code has access to at any point on the programmer. This may be irritating, although having control over what is placed into the LS (which is technically a user-controlled cache) encourages the grouping of data into greater chunks and results in excellent data locality. The challenge is to place much more emphasis on data management rather than only on execution flow.

Minimize Synchronization

When laying out the data and program flow, there will inevitably be positions at which synchronization between processing elements is necessary. Due to the expensive overhead, synchronization should be minimized whenever possible. One way, for example, may be to combine multiple synchronization points into one by combining independent processing steps into the same block.

Maximize SPE Usage

The SPEs are designed to be very fast data processors, in most cases faster than the PPE. It is therefore in the best interest of the designer to keep the SPEs doing as much productive work as possible. In some cases, it may be acceptable to offload even scalar code that has many branches. The PPE's main task should be simply to shuffle things around, acting as an SPE controller. It is best to think of the SPEs as streamed data processors. With the help of an SPE shader-like model and well defined data flow, many algorithms can be structured to function well using this model.

Pipeline Programming Model

The server programming model has been described already in the previous sections and will not be repeated here. Another programming model, which may be advantageous in rare situations, is the pipeline programming model. Given the flexibility that the hardware exposes in terms of creating communication models, the SPEs can be logically arranged in a chain-like fashion such that the PPE has access to the two ends (for input and output). Each SPE in the chain, or pipeline, is responsible for a sub-portion of the entire process. It provides its output to the next processing element (PE) in the pipeline and obtains inputs from the previous PE at every time step. The main difficulty, and the reason for its sparse usage, is not only the added communication complexity, but the difficulty in splitting a process into a predefined number of equally work-intensive and deterministic subprocesses. For this reason, the data distribution model is much more popular.

Reverse Pipeline Design

The pipeline model may be more common in gaming applications, however. A main point driven home by Insomniac [63] is the importance of designing multi-stage algorithms in reverse order. Using a transformation pipeline for glass physics as an example, they discussed the inter-stage interfacing issues that appeared only after certain parts were already completed. Their argument for this design process is that it forces the programmer to think about earlier portions of the pipeline in advance and prevents the "code as you design" syndrome. It is easier to fix problems in the front of the final code as opposed to wildly patching code all over the place. As stages are designed, they are tested by submitting dummy inputs and checking for valid outputs.

Generality vs. Performance

In developing applications, code generality is highly regarded. A software company is often defined by the libraries of existing code that it can reuse in its current and future products. Developing highly tuned and optimized applications, especially on complex hardware such as the Cell, however, has different priorities. For best performance, code is highly tailored to the data flow stemming from the algorithm. Generality of Cell software can be increased by writing helpful libraries, such as Insomniac's SPE shader library. A programmer may also develop low level chunks, or modules, of code to perform common operations, such as a matrix multiplication module or a binary search module. The potential for code reuse does exist, but may require extra thought.

5.5 Programming the PPE

Programming the PPE for performance is not as critical as it is in the case of the SPEs, although there are some strategies that may be followed. In some cases, it may be beneficial to accelerate PPE processing so that it can keep up with the SPEs and be ready to provide new tasks or data. In other cases, the PPE may be individually programmed to act as another major processing element in the application.

AltiVec

Because it is based on the PowerPC 970 architecture, the PPE has built-in SIMD support in the form of the AltiVec instruction set. This allows the PPE to perform SIMD operations similar to those performed on the SPEs. Likewise, similar alignment constraints apply. The AltiVec instruction set is accessed using similar, but differently named, intrinsic functions and contains an equivalent for most of the SPE intrinsics and more (a short example appears after the multi-threading discussion below). In addition, the instruction set contains predicates, which begin with vec_all_ or vec_any_. The instructions in this class operate on the entire vector in an AND or OR fashion and return a scalar value.

Multi-threading

Being a two-way multithreaded architecture, the PPE can run two threads simultaneously, with the condition that the two threads don't access a shared resource at the same time. Multithreading is, therefore, recommended when at least one of the threads experiences heavy stalls due to cache misses or instruction dependencies.
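As a brief illustration of the AltiVec style mentioned above, the following sketch combines a fused multiply-add with one of the predicate intrinsics; the function itself is only an example and is not taken from the thesis code.

#include <altivec.h>

/* Compute r = a*b + c on the PPE with VMX/AltiVec and report whether
 * every element of the result is strictly positive. */
int madd_all_positive(vector float a, vector float b, vector float c)
{
    vector float r    = vec_madd(a, b, c);
    vector float zero = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
    return vec_all_gt(r, zero);     /* predicate: 1 if all four elements > 0 */
}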

5.5.3 Self-managed cache

If data access is predictable, the programmer has the option to manually pre-fetch cache blocks using special dcbt (data cache block touch) instructions. Two versions of the instruction exist: classic and enhanced. The classic version loads data into the L1 cache, and the enhanced version loads data into the L2.
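A possible use of the classic form is sketched below, assuming the __dcbt intrinsic as exposed by the SDK's PPU headers; the prefetch distance and stride are illustrative only.

#include <ppu_intrinsics.h>

/* Sum a large array while touching a cache block well ahead of the
 * current position.  dcbt is only a hint, so touching slightly past
 * the end of the array is harmless. */
float sum_with_prefetch(float *data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if ((i & 31) == 0)                    /* once per 128-byte cache block */
            __dcbt((void *)(data + i + 128)); /* touch a block ~512 bytes ahead */
        sum += data[i];
    }
    return sum;
}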

Chapter 6: Implementation of the Multi-Layer Perceptron on the Cell Processor

6.1 Chapter Introduction

In this chapter, and those following, the focus will shift to the actual implementation. The current chapter describes the implementation of the Multi-Layer Perceptron (MLP). Starting out with a top-level description of the programming model and decisions, the focus moves into a detailed description of each step along with reasoning for the decisions made. Due to the effort placed into optimization, design generality was reduced in some areas, and thus it may be necessary to reference ideas and design details across sections within the chapter. It is recommended that the chapter be read in order. For a detailed description and brief history of the MLP algorithm, see Chapter 2. As a review, the algorithm consists of two tasks that are repeated until convergence. Both tasks, forward and back propagation, are attractive candidates for parallelization due to the heavy use of matrix-vector multiplication. In the standard MLP architecture, each neuron within a layer (hidden and output) is connected to all neurons on the previous layer. Each of these connections is characterized by a weight (usually a floating point representation in software). As introduced in Chapter 2, the forward propagation step from the point of view of one neuron is taken by performing the following calculation:

(6.1)    $x^{n}_{j} = F\left( \sum_{l=0}^{L_{n-1}-1} w^{n}_{j,l} \, x^{n-1}_{l} \right)$

It is easy to see how this operation can be expanded to multiple neurons on a layer by introducing a matrix-vector operation, as so:

(6.2)    $F_{v}(\mathbf{A}\mathbf{x}) = \mathbf{y}$

The matrix A is a weight matrix in which each row represents the weights connected to exactly one receiving neuron. The vector x embodies the output values from all the neurons on the previous layer (input or hidden). The function F_v is a vector version of function F. Backpropagation may also be represented as a matrix-vector product. As introduced in Chapter 2, the following equation is used to calculate the delta values for each neuron:

(6.3)    $\delta^{l}_{j} = \frac{\partial F(z^{l}_{j})}{\partial z^{l}_{j}} \sum_{k=0}^{L_{l+1}-1} \delta^{l+1}_{k} \, w_{jk}$

The equation is easily represented as a matrix multiplication:

(6.4)    $\boldsymbol{\delta}^{l} = \frac{\partial F_{v}(\mathbf{z}^{l})}{\partial \mathbf{z}^{l}} \, \mathbf{A}^{T} \boldsymbol{\delta}^{l+1}$

The problem with this approach, as shown, is that the backpropagation step requires a transpose of the weight matrix. When performing batch learning, a good design strategy is to store a separate copy of the matrix in its transpose form. Because values within the matrix are modified only at the end of each epoch, random matrix accesses are infrequent and do not introduce a large computational penalty. This approach is not used in this work. Instead, a column-wise matrix-vector product algorithm was devised. It will be explained in this chapter. The MLP software in this work supports both fully-connected and convolution layers. While the fully-connected layer implementation makes use of the matrix-vector multiplication, exposing parallelization potential, the convolution layer implementation cannot utilize this approach and required more algorithmic analysis. Details of the final implementation follow in the corresponding sections. In the following chapters, any reference to the next (or right) layer refers to the layer closer to the output, and vice versa.

6.2 High Level Implementation Overview

The MLP code was intended to be used as a library, and thus can be broken down into well defined steps that are taken during initialization of the network and the main training forward/backward propagation loop. In this section, a summary of the steps is given. Both the fully-connected and convolution layers have similar APIs and are treated in a similar way. The internal operations and algorithms differ, however, and are described in much more detail in the sections that follow.

Programming Model

The high-level programming model used in the implementation is the function offload model utilizing data parallelism. Layers are processed synchronously as data propagates forward and backward through the MLP. The computation required at each layer is distributed among the SPEs. All SPEs contain the same code, containing all the functionality needed for the operations required of them. The SPE program consists of a main outer loop in which it waits for a command from the PPU (via a mailbox channel). The incoming command specifies the operation and parameters. The 32-bit command consists of 8 bits of command identifier and 24 bits of parameters, in that order. On completion of the operation requested, the SPE signals the PPE and blocks waiting for additional commands.
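A small helper pair matching the 8-bit/24-bit layout described above might look as follows; the specific command identifier values are hypothetical and not taken from the actual implementation.

/* 32-bit mailbox command: 8-bit command identifier in the upper byte,
 * 24 bits of parameters (here, the layer ID) in the lower bytes. */
#define CMD_FC_FORWARD   0x01u   /* example identifiers only */
#define CMD_FC_BACKWARD  0x02u

static inline unsigned int make_cmd(unsigned int id, unsigned int layer_id)
{
    return (id << 24) | (layer_id & 0x00FFFFFFu);
}

static inline unsigned int cmd_id(unsigned int cmd)       { return cmd >> 24; }
static inline unsigned int cmd_layer_id(unsigned int cmd) { return cmd & 0x00FFFFFFu; }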

All SPE-unique memory addresses and job information are uploaded to the SPEs during layer initialization. Each SPE thus holds a table of personalized parameters. Throughout execution, this table is indexed for these parameters, eliminating the need for additional communication of parameters. In effect, each SPE has the same code but is initialized with unique and tailored information. Common synchronization points (barriers) occur at the completion of each layer by having the PPE wait until all SPEs reply with a mailbox message. This model is simple to implement and applicable to the problem due to the static and deterministic computation times of assigned workloads.

Implementation Overview

In the following steps, unless otherwise noted, the operation occurs on the PPE.

Initialize SPEs

A single SPE binary is uploaded to all the SPEs. The SPE software contexts are created and the SPEs themselves begin executing the binary. At this point, each SPE halts, waiting for a command from the PPE.

Create Layers and Set Up Values

At this point, every layer that makes up the full network is created individually on the PPE. Layers may be created, or loaded from a file.

Convert Each Layer into Optimized SPE Layers

Each created layer is converted into a special format optimized for use by the SPEs. Each optimized layer allocates its own memory that is padded and memory aligned per DMA requirements.

Connect SPE Layers

An API function is called taking two layers as parameters. Internally, the function completes the setup of any parameters of each layer. This includes, for example, the source and destination addresses for input and output data. It is at this point that much of the neuron and weight memory is allocated.

Divide Workload Among SPEs

With all required information present, the workload at each layer is distributed algorithmically in an effort to balance computation load and reduce any PPU involvement if possible. The product of this step is a list of children jobs belonging to that layer. These jobs are assigned evenly across the active SPEs. Any job-distribution-related layer parameters are also filled in.

Initialize SPEs with Personalized Data

Because the job distribution for each layer does not change throughout execution, each SPE is given information about the location of its jobs in main memory on a per-layer basis. Each SPE holds a table within the LS in which it stores this information. Each layer is assigned a unique ID (starting at 0). This ID is used by the SPEs to index their tables.

Thus, only the ID, type of layer, and operation (forward or backward propagation) need to be transferred to the SPE at each step.

Start Learning Loop

The top-level loop consists of executing the forward propagation and back propagation functions for each layer in the right order. For any given layer, each SPE is given a command specifying the requested task and the ID of the current layer. The SPEs reference their job information in their lookup tables, from which they obtain all the main memory addresses used in the transfer of required data into their Local Store. Once done with their jobs, they signal the PPU. Depending on certain conditions, the activation function may be performed inline, or it may be performed synchronously as another step. If batch learning is enabled, weight updates are performed on the PPU at the end of a batch.

6.3 Detailed Implementation

In this section, the individual steps are given more detail. While originally intended to generalize to a common layer type, the fully-connected layer and convolution layer diverged enough in their implementation to necessitate separate sections. The simpler fully-connected layer is described first, followed by the convolution layer.

Conventions

The following notations are used in pseudo-code listings:

DMA (n.b.): Start of a non-blocking DMA transfer.
LS_Rsv[(n)]: Reserve memory in the LS (Local Store) of a size that is a function of n; not necessarily of size n.

The technique for reserving memory is to increment a running pointer into a globally declared array of fixed size before copying the pointer value to the requested array identifier. As an example:

// Global
char fixed_size_array[MAX_SIZE];
char *ptr;

// Local. Allocate (reserve) space for n floats.
{
    float *a;
    a = ptr;
    ptr += n * sizeof(float);
}

Listing 13: Manual memory allocation on the SPEs

SPE Initialization (Common)

This step is fairly straightforward. For all available SPEs (the number is specified as a compiler directive), the SPE binary is uploaded, and execution begins. The SPEs enter the main processing loop waiting for a command. Because no inter-SPE communication is necessary, the problem state and LS areas are not memory mapped. The SPE context

creation, loading of the binary, and thread creation are all aided by the SPE library that is provided in the SDK.

Fully-Connected Layers

Layer Initialization and Representation

As described previously, from the implementation perspective it is beneficial to represent the forward and backpropagation procedures of the algorithm as matrix-vector multiplications, in which the vector multiplicand is the output from the previous layer and the vector product is the output of the current layer, representing every neuron on the layer. Similarly, the gradients and errors for a layer can be represented as arrays. This is the reason why most implementations of the MLP training algorithm follow an approach that caters to operations on layers, as opposed to on individual neurons. By doing so, data is organized into Structures of Arrays (SOA) as opposed to Arrays of Structures (AOS). The former method is recommended for the Cell Processor [53] and is used in this work. Initially, each fully-connected layer is created by supplying as parameters the number of neurons on that layer, the number of neurons on the layer just before it, and optional initialization settings. The only available option is the range of the randomized weights ([-0.005, 0.005] was used in the experiments). Alternatively, each layer may be read in from a file that was previously saved. Loading saved layers is useful since it can take many hours or even days to train a network. The data in the layer created includes a vector of outputs (neuron values) and a weight matrix (connections into each of the neurons on the layer). Recalling Chapter 2, backpropagation involves the propagation of the error back through the network. For the remainder of this section, two new symbols will be used:

(6.5)    $\gamma^{n} = \frac{dE}{d\mathbf{x}^{n}}, \qquad \delta^{n} = \frac{dE}{d\mathbf{z}^{n}}$

Backpropagation through fully-connected layers is a sequence of the following two operations:

(6.6)    a) $\boldsymbol{\gamma}^{\,n-1} = \mathbf{A}^{T} \boldsymbol{\delta}^{\,n}$        b) $\boldsymbol{\delta}^{\,n-1} = F'_{v}(\mathbf{z}^{\,n-1}) \cdot \boldsymbol{\gamma}^{\,n-1}$

(A plain C sketch of these two steps is given at the end of this passage.) As observed, two additional arrays need to be stored, namely the δ and γ vectors. Additional information stored includes the attributes related to the RPROP adaptive training mode (if enabled), the learning rate, and the size of the total weight matrix (for file operations). The final product of this step is the standard layer, which is only good for the single-threaded version of the implementation.
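The following plain C sketch of equation (6.6) shows the two steps for a single layer. It is a scalar illustration only, not the optimized, column-wise SPE kernel described later; F_prime() stands in for the derivative of the activation function.

/* A: n_out x n_in weight matrix, row-major, as in the standard layer.
 * delta: deltas of the current layer (length n_out).
 * z_prev, gamma_prev, delta_prev: arrays of length n_in. */
extern float F_prime(float z);   /* derivative of the activation (placeholder) */

void backprop_fc(const float *A, const float *delta, const float *z_prev,
                 float *gamma_prev, float *delta_prev, int n_out, int n_in)
{
    for (int j = 0; j < n_in; j++)
        gamma_prev[j] = 0.0f;

    /* (6.6a) gamma_prev = A^T * delta, accumulated row by row so the
     * row-major matrix can be reused without an explicit transpose. */
    for (int i = 0; i < n_out; i++)
        for (int j = 0; j < n_in; j++)
            gamma_prev[j] += A[i * n_in + j] * delta[i];

    /* (6.6b) delta_prev = F'(z_prev) times gamma_prev, element-wise */
    for (int j = 0; j < n_in; j++)
        delta_prev[j] = F_prime(z_prev[j]) * gamma_prev[j];
}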

Figure 6-1: Standard Fully-connected Layer

The standard layer needs to be reformatted and extended with additional information for the parallel multi-SPE implementation. The standard layer is converted into an SPE-optimized layer (or SPE layer) and encapsulated into an SPE-optimized layer info (or SPE layer info) object. The SPE layer becomes the global job to be partitioned into smaller jobs, which are also placed within the SPE layer info object.

Figure 6-2: SPE Layer Info struct

Several steps are taken when generating the parameters for the SPE layer info. Initially, only the global job is generated by using the standard layer as an input parameter. Output, gradient, and error arrays are copied from the standard arrays into 128-byte aligned memory locations so that they meet the DMA criteria for improved transfer performance. Attributes such as the number of neurons and learning rate are copied as well. The number of neurons is rounded up to the nearest multiple of four and stored in a new attribute. The original value is not used by the SPEs but is kept for reference. By working with multiples of four, the 4-way SIMD architecture on the SPEs can be utilized effectively. Fig. 6-3 shows a simplified SPE-optimized layer. Note that there are pointers in this structure that point to memory belonging to the previous layer. The process of setting these pointers is described in the next paragraph.

Figure 6-3: SPE Optimized Layer

Once the SPE layer info structures are created, they are connected to one another. Connecting two layers (left layer and right layer for this discussion) updates the attributes within the right layer info structure. Specifically, the global layer inside the info structure is given information about the number of inputs it has and the initial weight values. Additional memory is allocated that is used by the SPEs for various reasons. Connecting layers links up their pointers (Fig. 6-4). For example, the location of the neuron output values for the previous layer and the destination for the calculated weighted sums of deltas (for the previous layer) during backpropagation are set.

Figure 6-4: Connecting two layers

Having all the information within the global layer structure, it is now possible to partition it into smaller jobs. The process is done internally using an algorithm that takes into account the size of the weight matrix and the number of available SPEs. Parallelization is implemented on a per-layer basis. The partitioning scheme for fully-connected layers is done by block-wise partitioning of the weight matrix, with each block being small enough to fit into the local store of an SPE. The number of jobs may exceed

the number of active SPEs, and thus a single SPE usually performs multiple jobs synchronously. The SPE layer info struct created contains high level information that is common to all jobs on that layer, including the number of jobs for forward propagation, the number of jobs for back propagation, the number of rows and columns in each matrix block, and some allocated sandbox memory that the SPEs are given DMA access to. This partitioning scheme has each SPE update a partial weighted sum of a part of the output vector. For example, in Fig. 6-5, given that an SPE performs the jobs containing Block B and Block C, it will have a partial weighted sum of the shaded portion of the outputs.

Figure 6-5: Weight matrix partition such that inline activation cannot be performed

If, on the other hand, a single SPE performs the jobs containing Block A and Block B as in Fig. 6-6, it has the full weighted sum of the shaded region of the output neurons. Two advantages arise from the second case. Firstly, the partial output neuron values originating from the separate SPEs do not have to be summed together, which requires inter-SPE or SPE-PPE communication, since each SPE works on an exclusive region of memory and is guaranteed to have the final value for that portion once all jobs are completed. Secondly, the activation function can be performed inline on the individual output regions within the SPE code. As a counterexample, if the final outputs were only partial sums, one way of achieving the same result would be to have the PPU wait for all partial sums, then add them up, and then perform the activation function on the result as a separate step.

Figure 6-6: Weight matrix partition supporting inline activation

For this reason the job partition algorithm is biased toward distributing matrix blocks in row-wise fashion. Other goals of the algorithm are to optimize local store memory usage by maximizing the size of each block, to keep the number of rows and columns a multiple of four, and to put as many SPEs into use as possible. The partitioning heuristic algorithm devised is shown in Listing 14.

// Find the maximum number of inputs with the minimum number of outputs
// for each job.
Set the number of inputs per job to the maximum (number of neurons on the previous layer)
Set the number of outputs to the minimum so that the SIMD architecture can be utilized (4)
Set the input divisor to 1 (no divisions)
Start Loop
    Calculate the memory requirement for backpropagation on the LS given this combination of input and output sizes
    If the memory required does not exceed local store space
        Exit Loop and continue
    Increment the divisor
    Recalculate the number of inputs per job based on the new divisor
        - Make sure that the new number is a multiple of four (including the remainder)
    If the new number is the same as the previous
        Exit Loop and notify user that there is not enough local store space

// Choose the number of output divisions for forward propagation.
// The following starting choice of the output divisor attempts to find a balance between
// using as many SPEs as possible and maximizing individual job size (reducing overhead)
Set the output divisor to (2*NUM_ACTIVE_SPES) / num_input_divisions
Calculate the number of outputs based on the output divisor
    - Make sure that the new number is a multiple of four (including the remainder)
Start Loop
    Calculate the memory requirement for forward propagation given this combination of input and output sizes
    If the memory required does not exceed the local store space
        Exit Loop and continue
    Increment the divisor
    Recalculate the number of outputs per job based on the new divisor
        - Make sure that the new number is a multiple of four (including the remainder)
    If the new number is the same as the previous
        Exit Loop and notify user that there is not enough local store space

// Repeat the above loop for backpropagation. The only difference is in the calculation of
// the memory requirement. Backpropagation requires more space as a function of the
// number of inputs and outputs.

Listing 14: Job generation for forward and backpropagation for fully-connected layers
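The repeated "make sure the new number is a multiple of four" step in the heuristic above amounts to rounding up to the next multiple of four, for example:

/* Round a job dimension up to the next multiple of four so that the
 * 4-way SIMD units always operate on full quadwords. */
static inline int round_up4(int n)
{
    return (n + 3) & ~3;    /* e.g. round_up4(13) == 16, round_up4(16) == 16 */
}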

Once the job partitioning parameters are obtained, the weight matrix is reformatted and the child jobs are created. Knowing the column width of each block, the matrix is reformatted so that block division can be used as shown in Fig. 6-7. Note that padding must be inserted so that the column width is divisible by four. No new memory is allocated for child jobs. Instead, each child job is given pointers into the parent job's memory portions.

Figure 6-7: Reorganization of the weight matrix in main memory

Each child job's pointers are set to some location in the global job's allocated arrays. The location of these pointers is based on the input and output indices. An example is shown in Fig. 6-8.

Figure 6-8: One of several jobs for a fully-connected layer

SPE Initialization

Two convenient characteristics of the individual jobs created during initialization are that they do not change throughout the algorithm and that their completion time is deterministic. These are very useful characteristics, as they do not require the use of dynamic job scheduling, which usually implies additional overhead. Instead, for each layer, an SPE is given a list of the jobs that it is responsible for. This list, along with several other attributes, is stored in a local table and is indexed by a global and unique layer id. As a consequence, when it comes to processing a given layer, the SPEs simply receive the layer id and use it as a look-up token within the table to obtain the information they need.

Forward Propagation

During forward propagation, when a fully-connected layer is to be processed, the PPE sends exactly one message, encapsulating the command and layer id, to each active SPE. Every message is the same and is shown in Fig. 6-9.

FC Forward Command (8 bits) | Layer ID (24 bits)
Figure 6-9: Fully-connected Forward Propagation Command

Once an SPE receives the command, it performs its portion of the work according to the pseudo-code in Listing 15.

Obtain the EA and size of the job list from the local layer table
my_output_neurons <- LS_Rsv (partial or full* sum of output neurons)
While there are jobs remaining
    DMA up to MAX_JOBS_PER_SPU jobs from the job list into the job data buffer
    For each job in the job data buffer
        input_vector  <- LS_Rsv(job specification)
        weight_matrix <- LS_Rsv(job specification)
        DMA the input vector and weight matrix into the local store
        Matrix Vector Multiplication (see Alg <>)
        If inline activation option is enabled*
            Perform activation
            DMA (non-blocking) the activated partial output vector into main memory
        Release memory for input and weight matrix
    Loop
Loop
If inline activation option is disabled
    DMA the output neuron partial (or full*) sums back into main memory
Wait for DMA transfers to complete
Signal the PPE

*Applies if the global layer matrix was small enough to be separated row-wise only. If so, each job results in the completion of some part of the output vector. Otherwise, only partial sums are generated and need to be summed on the PPE before performing activation.

Listing 15: Fully-connected Layer Forward Propagation SPE Pseudo-code
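The command word in Fig. 6-9 packs an 8-bit opcode together with a 24-bit layer id into a single 32-bit mailbox message. A minimal sketch of how such a word could be packed and unpacked is shown below; the constant and helper names are illustrative assumptions, not identifiers from the thesis code.

#include <stdint.h>

#define CMD_FC_FORWARD  0x01u   /* hypothetical opcode value */

/* Pack an 8-bit command and a 24-bit layer id into one mailbox word. */
static inline uint32_t pack_cmd(uint8_t cmd, uint32_t layer_id)
{
    return ((uint32_t)cmd << 24) | (layer_id & 0x00FFFFFFu);
}

/* Unpack on the SPE side. */
static inline void unpack_cmd(uint32_t word, uint8_t *cmd, uint32_t *layer_id)
{
    *cmd = (uint8_t)(word >> 24);
    *layer_id = word & 0x00FFFFFFu;
}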

The corresponding PPE pseudo-code is shown in Listing 16.

For all SPEs:
    Send command (SPEs begin processing their jobs)
If inline activation is enabled, zero out global neuron array
For all SPEs
    Wait for completion of SPE
    If inline activation is disabled
        Add partial output neuron array values to global neuron array
Loop
If inline activation is disabled
    Perform activation on global neuron array as a separate step

Listing 16: Fully-connected Layer Forward Propagation PPE Pseudo-Code

The matrix-vector multiplication step operates on the weight matrix and the input array. The implementation is optimized by unrolling the outer loop and utilizing the SIMD instruction set, as shown graphically in Fig. 6-10.

Figure 6-10: Matrix Vector multiplication

Each iteration of the outer loop performs four dot products at a time. The inner loop keeps a running sum of four partial sums of the four dot products. Once the inner loop

exits, each of the partial sum vectors is summed into a scalar value, which is placed in the correct location of the output vector.

Once a job is processed and if inline activation is enabled, the activation function is applied to the portion of the output vector just calculated, and the result is transferred into main memory via DMA without blocking (each job has access to an exclusive portion of the output vector, so no two jobs will modify the same portion, making it safe to transfer a portion once its corresponding job is done). If inline activation is not enabled, then multiple jobs may access the same portion of the output vector. In this situation, a running partial sum is kept of the full output vector (the size of the actual global layer). Each job is aware of the portion of this output vector that it needs to write to. Once all assigned jobs are processed, the full output vector is transferred into memory using a blocking DMA transfer. In this situation, the PPE receives the output vectors from each SPE and sums them all up before executing the activation step separately.

Figure 6-11: Optional distribution of activation function

The activation step simply transfers exclusive portions of the output vector to each SPE (Fig. 6-11). The SPEs use SIMD instructions to perform the activation on their vector portion and signal the PPE when they have transferred the results back into main memory. At this point, the forward propagation through the fully-connected layer is complete.

Backpropagation

The SPE layer info struct also holds a list of jobs that are performed during Backpropagation. Because of the way that the matrix is represented in memory after being reformatted, the Backpropagation jobs need to have the same block width (number of inputs per job) as the forward propagation jobs. The block height (number of outputs per job), however, may be smaller due to the need for storing more data on the local store in the Backpropagation algorithm. It is for this reason that a separate list of back propagation jobs is stored. Similarly to the forward propagation step, the message shown in Fig. 6-12 is sent to all SPEs.

FC Backprop Command (8 bits) | Layer ID (24 bits)
Figure 6-12: Fully-connected Backpropagation Command

The top-level pseudo-code for the Backpropagation algorithm is shown in Listing 17. Recall that the γ for any neuron is obtained by taking the weighted sum of the δ's from the next layer. The δ is then calculated by multiplying the γ value by the derivative of the activation function with respect to the input. Also note that online learning is selected using a compiler directive and not during runtime. This way, there is no conditional branching required.

Obtain the EA and size of the job list from the local layer table
partial_delta_sum <- LS_Rsv (partial delta sums for the previous layer)
While there are jobs remaining
    DMA up to MAX_JOBS_PER_SPU jobs from the job list into the job data buffer
    For each job in the job data buffer
        input_vector   <- LS_Rsv(job.num_inputs)
        output_vectors <- LS_Rsv(job.num_outputs)
        weight_matrix  <- LS_Rsv(job.num_inputs * job.num_outputs)
        weight_deltas  <- LS_Rsv(job.num_inputs * job.num_outputs)
        delta_vector   <- LS_Rsv(job.num_outputs)
        DMA in the input vector, output vector, and weight matrix
        If inline activation
            DMA the γ vector into the delta_vector
            Obtain the δ vector by processing the γ using the derivative of the activation function w.r.t. the inputs
        Else
            DMA in the δ vector (already calculated in a previous step)
        If online learning, zero out the weight_delta array
        Else, DMA in the latest weight_delta from main memory
        Transpose Matrix Vector Multiplication (see Alg <>)
            (modifies the weight_deltas matrix and wds for the previous layer)
        If online learning, update the weight matrix block for this job and send it back to the proper place in main memory
        Else DMA the updated weight_deltas back to main memory
        Release all job related memory
    Loop
Loop
DMA the partial wds to main memory (results from each SPE are summed on the PPE)
Signal the PPE

Listing 17: Fully-connected layer Backpropagation SPE pseudo-code

The associated PPE pseudo-code follows:

If inline activation is disabled
    obtain the deltas by performing a separate step on the weighted delta sums
For all SPEs:
    Send command (SPEs begin processing their jobs)
Zero out the global wds on the previous layer
For all SPEs
    Wait for completion of SPE
    Add partial wds result to global wds on previous layer
Loop

Listing 18: Fully-connected layer Backpropagation PPE pseudo-code

The first part of the algorithm depends on whether inline activation is enabled; this again depends on whether the global weight matrix for the layer was split vertically between jobs. If inline activation is enabled, the SPE performs a DMA transfer of the γ vector onto the LS and obtains the δ vector by performing an element-by-element product of the γ vector and the derivative of the activation function with respect to the input of the neuron at that iteration. Vector operations are used in this step. Otherwise, the δ's have already been computed by a separate step and are transferred onto the LS.

The second part of the Backpropagation algorithm is in some ways similar to the forward propagation algorithm in that it is a matrix-vector multiplication. The matrix in this operation, however, is the transpose of the matrix block. The vector is now the delta array for the neurons on this layer associated with the current job. Instead of first performing the transpose and then performing the multiplication, the operations are reformulated to work directly with the weight matrix while still taking advantage of the SIMD units. As a separate advantage, it is not necessary to reserve additional space for the transpose of the matrix block. The concept is shown in Fig. 6-13. In the implementation, the outer loop was unrolled and the code was optimized to reduce dependencies between steps.

Figure 6-13: Matrix vector multiplication requiring no transpose to be taken

The updating of the changes in the weights was also performed in this inner loop. The procedure is illustrated in Fig. 6-14.

Figure 6-14: Updating of weights
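As a reference for the reformulated operation, here is a simplified scalar C sketch of the transpose-free multiplication together with the weight-delta accumulation. It processes one element at a time, whereas the actual SPE code is unrolled and works on 4-wide vectors; all names are illustrative.

#include <stddef.h>

/* Back-propagate through one weight block without materializing its transpose.
 * weights is num_outputs x num_inputs (row-major), delta holds the δ values of
 * this job's output neurons, input holds the previous layer's activations.
 * wds accumulates the weighted delta sums for the previous layer and
 * weight_deltas accumulates dE/dw for each connection. */
void backprop_block(const float *weights, const float *delta,
                    const float *input, float *wds, float *weight_deltas,
                    int num_inputs, int num_outputs)
{
    for (int o = 0; o < num_outputs; o++) {
        const float d = delta[o];
        const float *wrow = weights + (size_t)o * num_inputs;
        float *dwrow = weight_deltas + (size_t)o * num_inputs;
        for (int i = 0; i < num_inputs; i++) {
            wds[i]   += wrow[i] * d;     /* column of Wᵀ·δ, read row-wise      */
            dwrow[i] += d * input[i];    /* dE/dw = δ · previous-layer output  */
        }
    }
}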

If incremental learning is enabled, the next step is to update the weights in the matrix block and send the result back into main memory. The matrix block update is easily performed using SIMD multiply-and-subtract operations on the weight deltas and a learning rate vector obtained by copying the learning rate into each of the four vector elements. If incremental learning is not enabled, the updated weight deltas are transferred back to main memory. Once all jobs are completed, the γ vector for the previous layer is transferred into main memory for consumption by the previous layer.

Convolution Layers

The procedures for initializing the general convolution layer and converting it into an SPE-optimized info structure, as well as the forward and Backpropagation processes, are similar to those of the fully-connected layer. However, the actual data present in each object and the methodology for dividing the workload differ significantly, as will be detailed in this section. To reduce redundancy, some portions of this section may refer to the fully-connected section.

In the implementation, both layer types may be used in a single network. Going forward, convolution layers may be connected to fully-connected layers but not vice versa. Also, for simplicity of design, for any two connected convolution layers, every kernel operates on every input feature map. LeCun experimented with custom connections to see which worked best [12]; fully-connected convolution layers also performed very well.

Layer Initialization and Representation

Several analogies can be made between the fully-connected layer and the convolution layer. The neurons, and outputs, on a convolution layer are termed feature maps and represent the filtered output images after application of a kernel matrix to each of the input feature maps (from the previous layer). The number of feature maps is specified as a network parameter and equals the number of kernel weight matrices (one kernel per output feature map). Kernel weight matrices, or kernels, are analogous to the weight matrix. Creating a general convolution layer involves specifying the kernel length, the number of output feature maps on the layer, the height and width of each of the input feature maps from the previous layer, and specifying whether random weights in the range [-0.005, 0.005] are to be used. Alternatively, layers may be loaded from a file to restore previously learned kernel weights. The general convolution layer, therefore, has a contiguous array for output feature maps, a contiguous array for kernels, an array for the γ and δ vectors used in Backpropagation, and attributes for the kernel length, the number of output feature maps, and the height and width of each output feature map (calculated automatically based on the height and width of the input feature maps and the kernel size). Additional arrays and attributes for RPROP Backpropagation are also included. Fig. 6-15 displays a simplified general convolution layer.

Figure 6-15: General Convolution Layer

An SPE-optimized layer information structure is created from a general convolution layer in the same manner as before. The first step generates the global SPE layer job (Fig. 6-16) within the SPE layer information structure by copying array elements and attributes from the general layer structure.

Figure 6-16: Optimized Convolution Layer

Because all the arrays represent two-dimensional structures (i.e., kernel matrix, feature map), each row of every structure is padded to a multiple of four due to implementation details stemming from DMA requirements and vector operations. An example of a padded 2-D map is shown in Fig. 6-17.

Figure 6-17: Padded Feature Map

Another step, the reasoning for which will become apparent shortly, is to generate four copies of the kernel matrix, each shifted by a different amount as shown in Fig. 6-18.

Figure 6-18: Example of a 5x5 convolution kernel as it is stored in memory. Four identical versions are generated, each with a different column offset.

Next, adjacent layers in the network are connected, which fills out the remaining attributes of the global job by linking up the inputs and outputs between the layers as shown in Fig. 6-19. Finally, the global job is partitioned into smaller jobs.

Figure 6-19: Connecting two convolution layers

The method for partitioning the global job differs from the method used in the case of the fully-connected layer. A major factor in the partitioning scheme for the fully-connected layers was that each job must fit into the limited memory of the local store of an SPE. This requirement is markedly relaxed when dealing with convolution layers, since the weight kernels take up significantly less space. For example, a typical kernel has a size of 5x5 (8x5 when padded). Recalling that there are four versions of each kernel, having 50 feature maps implies 200 kernels for a total weight count of 5,000. The space required would be about 20 KB if using 32-bit floating point numbers.

With storage space not being a limiting factor in most cases, the number of jobs created is always less than or equal to the number of available SPEs. Recalling the convolution layer architecture, every output feature map is influenced by every input feature map through a corresponding kernel. It is logical, therefore, to partition the global job by assigning a subset of the output and/or input feature maps to each child job. The task is to decide on a strategy. Recalling the goals for partitioning fully-connected layers, inline activation is possible only if an SPE has exclusive access to a portion of the output array, that is, it computes the final values locally. It makes sense, therefore, to partition convolution jobs such that output feature map exclusivity is attained. This is done by assigning all input feature maps to each SPE, but only a subset of the output feature maps. As expected, iterating over all input feature maps produces the final pre-activation values of the owned output feature maps, and the activation function can be performed locally on the local store.

In conclusion, the partitioning algorithm simply divides the total number of output feature maps by the total number of available SPEs and assigns the resulting number to each SPE (accounting for the remainder). All SPE jobs include all input feature maps. An assumption is made that a job will not exceed the space of the local store. Fig. 6-20 summarizes the concept, and a short sketch of the division follows.
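A minimal sketch of that division, with illustrative names, might look as follows; the real code also stores the per-SPE starting map index so each job knows which output maps it owns.

/* Assign output feature maps to SPE jobs: every job sees all input maps,
 * but owns an exclusive, contiguous range of output maps. */
typedef struct {
    int first_output_map;   /* index of the first owned output map */
    int num_output_maps;    /* how many output maps this SPE owns  */
} conv_job_t;

int partition_conv_layer(int total_output_maps, int num_spes, conv_job_t *jobs)
{
    int base = total_output_maps / num_spes;
    int rem  = total_output_maps % num_spes;  /* first 'rem' SPEs get one extra map */
    int next = 0, used = 0;

    for (int s = 0; s < num_spes && next < total_output_maps; s++) {
        int count = base + (s < rem ? 1 : 0);
        jobs[s].first_output_map = next;
        jobs[s].num_output_maps  = count;
        next += count;
        used++;
    }
    return used;   /* number of jobs actually created (<= num_spes) */
}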

Figure 6-20: Job partitioning for convolution layers

SPE- and layer-unique job information is uploaded and stored in the local store of each SPE during initialization in the same manner as for the fully-connected layers. Due to the different sizes of fully-connected jobs and convolution jobs, two separate tables are used. Layer IDs might, therefore, overlap between the two versions.

Forward Propagation

Again, processing a convolution layer is initiated by sending a single command to each SPE (Fig. 6-21).

CV Forward Command (8 bits) | Layer ID (24 bits)
Figure 6-21: Convolution Forward Command

The job handling overhead is much smaller due to there being exactly one job per SPE. The pseudo-code for the SPE, once the command is received, is as follows:

input_feature_maps  <- LS_Rsv(num_input_maps * size_one_input_fm)
kernel_weights      <- LS_Rsv(num_output_maps * 4 * size_one_kernel)
output_feature_maps <- LS_Rsv(num_output_maps * size_one_output_fm)
DMA in input_feature_maps and kernel_weights
output_feature_maps <- 0
For every output feature map: fm_out
    For every input feature map: fm_in
        Update fm_out (See Algorithm <>)
    Loop
Loop
If inline activation is enabled
    Perform activation on entire output_feature_maps array
DMA out output_feature_maps
Signal the PPE

Listing 19: Convolution forward propagation SPE pseudo-code

The PPE pseudo-code is exactly the same as the fully-connected pseudo-code for the forward propagation case:

For all SPEs:
    Send command (SPEs begin processing their jobs)
If inline activation is enabled, zero out global neuron array
For all SPEs
    Wait for completion of SPE
    If inline activation is disabled
        Add partial output neuron array values to global neuron array
Repeat
If inline activation is disabled
    Perform activation on global neuron array as a separate step

Listing 20: Convolution forward propagation PPE pseudo-code

Unlike in the fully-connected layer, updating the output neurons (output feature maps) is not performed using a matrix-vector multiplication. While technically possible, the repetition of weight values would make it very inefficient in terms of storage.

Figure 6-22: Convolution operation

Updating an output feature map is performed by traversing an nxn kernel across the current input feature map from left to right, top to bottom, one pixel at a time as shown in Fig. 6-22. This is known as the convolution operation in image processing. On scalar

CPU architectures, this is easily done using systematic array indexing. However, to utilize the SIMD hardware of the SPE, a new method needed to be devised.

As mentioned before, four versions of the weight matrix are generated for every one matrix in the general layer structure, such that each one is offset by a different amount. Because there are four elements per vector, the kernel's rows are padded to a multiple of four with 0's. For example, as shown in Fig. 6-23, if the kernel size is 5x5, two vectors are used per row for a total of 8 values per row. There are four versions of the kernel. In the first version, in each row the first vector takes on the first four values, and the second vector has the fifth kernel value followed by three 0's. In the second version, the first value of the first vector in each row is 0, followed by 3 values; the second vector of each row now has two valid values, followed by two 0's. The third version's first vector in each row has two 0's followed by two valid values, and the second vector of each row has three valid values followed by one 0. The fourth version follows the same pattern, with the first vectors starting with three 0's followed by one valid value and the second vectors having all valid values. Having these four versions allows the algorithm to iterate over vectors (four elements) from left to right, as seen in Fig. 6-23, instead of one pixel at a time. At each vector location all four versions of the kernel are applied. This process allows for bigger loop blocks and utilizes the vector operations provided by the SPEs.

Figure 6-23: SIMD-optimized convolution

An important option that is more often enabled than not is subsampling. Subsampling involves taking steps of two pixels at a time, as opposed to one, when traversing the kernel over the input feature map. The implementation in this work supports subsampling by utilizing only two of the four copies of the kernel matrix: version A and version C. Because several changes in indexing needed to be made, the option of subsampling can only be enabled or disabled by changing a compiler directive and recompiling.

Fig. 6-24 shows a graphical representation of the processing of a single patch of the input feature map. The patch travels across the input feature map from left to right, top to bottom. All four kernel versions (A, B, C, and D) are applied to the current patch at the

same time. The actual implementation in the software is unrolled and optimized for dependency reduction.

Figure 6-24: Detailed SIMD-optimized convolution

The figure represents forward propagation for the case that subsampling is disabled. When enabled, only two of the kernels (A and C) are used, and the output feature map is traversed at half the speed (since it is about half the size). The implementation is more complex, as it involves careful shuffling of results into the proper location of the output feature map.

Backpropagation

The command sent to each layer follows the standard pattern:

CV Backprop Command (8 bits) | Layer ID (24 bits)
Figure 6-25: Convolution Backpropagation Command

The pseudo-code for the SPE follows:

input_feature_maps <- LS_Rsv(num_input_maps * size_one_input_fm)
kernel_weights     <- LS_Rsv(num_output_maps * 4 * size_one_kernel)
de_wrt_w_maps      <- LS_Rsv(num_output_maps * 4 * size_one_kernel)
de_wrt_w_total     <- LS_Rsv(num_output_maps * size_one_kernel)
γ_map              <- LS_Rsv(num_output_maps * size_one_output_map)
δ_map_prev         <- LS_Rsv(num_input_maps * size_one_input_map)
If inline activation enabled, output_feature_maps <- LS_Rsv(num_output_maps * size_one_output_fm)
DMA in input_feature_maps, kernel_weights, kernel_masks*
If inline activation enabled,
    DMA in δ array of this layer into γ_map
    DMA in output_feature_maps
Else
    DMA in γ array of this layer into γ_map
If using RPROP, DMA in the de_wrt_w_total
Else de_wrt_w_total <- 0
de_wrt_x_prev, de_wrt_w <- 0
If inline activation enabled, generate γ_map from current δ values in γ_map (derivative of activation function)
For every output feature map: fm_out
    For every input feature map: fm_in
        Update δ_map_prev
        Update de_wrt_w_maps
    Loop
    de_wrt_w_total += versions A, B, C, and D of de_wrt_w_maps
    If incremental learning, multiply all elements of de_wrt_w_total by the learning rate
Loop
DMA out de_wrt_w_total and de_wrt_x_prev
Signal the PPE

Listing 21: Convolution Backpropagation SPE pseudo-code

The PPE pseudo-code is shown below:

If inline activation is disabled, perform SPE-optimized derivative activation function
For all SPEs:
    Send command (SPEs begin processing their jobs)
Zero out δ_map_prev for the global job
For all SPEs
    Wait for completion of SPE
    If this is not the first layer, add partial δ_map_prev from job to δ_map_prev of global job
    If online learning is enabled, use the returned de_wrt_w_total values to update all the weights
Repeat
If online learning is enabled, generate updated four kernel versions (A, B, C, D) for each kernel in the layer

Listing 22: Convolution Backpropagation PPE pseudo-code
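One step both listings rely on is turning the weighted delta sums into δ values. A scalar sketch of that element-wise pass is given below; it assumes a logistic sigmoid activation (so the derivative can be computed from the stored outputs), which is an illustrative choice rather than the thesis's fixed activation function.

/* Convert the weighted delta sums (γ) of this layer into δ values by
 * multiplying element-wise with the activation derivative, in place.
 * Assumes a logistic sigmoid, whose derivative is y(1 - y). */
void gamma_to_delta(float *map, const float *outputs, int n)
{
    for (int i = 0; i < n; i++) {
        float y = outputs[i];
        map[i] *= y * (1.0f - y);   /* δ = γ · f'(x) */
    }
}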

The first step is to obtain δ from the γ values that were calculated by the next adjacent layer (closer to the output). This is easily handled by performing SIMD operations on the entire δ map, which is represented as one long array.

During forward propagation, every pixel on an output feature map is generated by summing up the convolution results of the same coordinates of all the input feature maps. This is demonstrated graphically in Fig. 6-26.

Figure 6-26: Convolution Layer forward propagation

A pixel on an output feature map is analogous to a single neuron in a fully-connected layer, which implies that each pixel has a δ value associated with it. The task, therefore, becomes to backpropagate every pixel's δ value to the corresponding pixels (neurons) of each of the input feature maps which had an effect on the output pixel during forward propagation.

The implementation traverses over the de/dx map one vector (four elements) at a time. At each selection, the four vector elements are each expanded into separate matrices A, B, C, and D. In the actual implementation, only one row of the matrix is generated since each row is exactly the same; for the purposes of explanation, matrices will be used. Fig. 6-27 shows the process.

Figure 6-27: Generating the γ matrices

Next, the δ values are backpropagated through the kernels by superimposing each version of the γ matrix onto the corresponding kernel matrix. The two matrices are multiplied element by element (SIMD operations) and the result is added to the proper patch of each of the γ maps for the previous layer. Fig. 6-28 shows the process for one input and output feature map pair.

Figure 6-28: Convolution Layer Backpropagation

A separate array is used to keep track of the de/dw values. Just as with the kernel matrices, there are four versions of the de/dw matrix. If RPROP is used, the values placed into the array are loaded from main memory; otherwise, the contents of the array are zeroed. Each version of each kernel has an associated de/dw matrix. Recalling the Backpropagation algorithm as described in Chapter 2, de/dw is obtained by multiplying δ of the current neuron by the output of the neuron on the other side of the weighted connection (previous layer).
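A scalar sketch of these two per-pixel updates (spreading δ back through the kernel and accumulating de/dw) is given below for a single input/output feature map pair without subsampling. Names are illustrative, and the SPE version works on four shifted kernel copies at a time.

/* Backpropagate one output feature map through one k x k kernel.
 * delta_out: δ values of the output map (oh x ow), input: the input map (width iw),
 * gamma_prev: accumulated weighted delta sums for the previous layer,
 * de_dw: accumulated gradient for this kernel. All maps are row-major. */
void conv_backprop_pair(const float *delta_out, const float *input,
                        const float *kernel, float *gamma_prev, float *de_dw,
                        int iw, int ow, int oh, int k)
{
    for (int y = 0; y < oh; y++) {
        for (int x = 0; x < ow; x++) {
            float d = delta_out[y * ow + x];
            for (int ky = 0; ky < k; ky++) {
                for (int kx = 0; kx < k; kx++) {
                    int in_idx = (y + ky) * iw + (x + kx);
                    gamma_prev[in_idx] += d * kernel[ky * k + kx]; /* spread δ back  */
                    de_dw[ky * k + kx] += d * input[in_idx];       /* dE/dw = δ · in */
                }
            }
        }
    }
}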

It is convenient that indexing into the input feature maps corresponds with the indexing into the γ maps of the previous layer, because the updating of the de/dw matrices can be performed in the same innermost loop. The process is shown in Fig. 6-29.

Figure 6-29: Convolution layer kernel updating

Before starting the processing of the next output feature map, the four γ matrices are aligned and summed into one matrix. If incremental learning is enabled, the aligned de/dw matrix is multiplied by the learning rate; the actual updating of the weight matrices is performed on the PPE once the SPE is finished. Otherwise, the matrix is left as it is. Again, if subsampling is used, the indexing scheme is modified; while the same methods as stated above are used, the intricacies of the algorithm become somewhat more complex.

Once all output feature maps are processed, the SPE transfers the de/dw matrix and γ map (for the previous layer in the network) to main memory. The PPE sums up the γ maps from each SPE by utilizing the Altivec engine. If incremental learning is enabled, the weights are updated using the obtained de/dw values, and the new weight kernels are regenerated for use by the SPEs by recreating the four kernel versions (A, B, C, and D) for each one. As always, the loops in this algorithm were heavily unrolled, and the usage of variables was optimized to minimize data dependencies and stalls.
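For reference, regenerating the four column-offset kernel copies (versions A-D) amounts to rewriting each padded kernel row at four different shifts. A scalar sketch with illustrative names is shown below, assuming the padded row width is a multiple of four and at least k + 3 so that the largest shift still fits.

/* Build the four shifted copies of one k x k kernel. Each padded row holds
 * padded_w floats; version v places the k original values starting at column
 * offset v, with zeros elsewhere. */
void make_shifted_kernels(const float *kernel, int k, int padded_w,
                          float *versions /* [4][k][padded_w], contiguous */)
{
    for (int v = 0; v < 4; v++) {
        for (int row = 0; row < k; row++) {
            float *dst = versions + ((v * k) + row) * padded_w;
            for (int col = 0; col < padded_w; col++)
                dst[col] = 0.0f;                      /* zero padding   */
            for (int col = 0; col < k; col++)
                dst[col + v] = kernel[row * k + col]; /* shifted values */
        }
    }
}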

Chapter 7: Implementation of Support Vector Machines on the Cell Processor

7.1 Chapter Introduction

As established in Chapter 3, there are multiple techniques for training Support Vector Machines (SVMs), including the popular working set technique, the sequential minimal optimization technique (an extreme case of the working set technique), and the Cascade SVM and its variations. Each technique and its implementations have been developed to target and correct some unfavorable characteristic. The working set technique relaxes memory requirements, the SMO technique further reduces memory requirements and drastically simplifies the algorithm, and the Cascade SVM introduces parallelism via independent problem solving. Beyond these methods, other techniques exist that claim to improve accuracy and decrease training time.

This work did not try to develop a new algorithm altogether. Instead, the purpose was to examine existing implementations in the literature and pick the ones that were considered a good fit for the architecture of the Cell Processor and the style of its programming models. Two implementations, the Gradient Projection-based Decomposition Technique (GPDT) and the Cascade SVM, were selected, mostly for their focus on parallelization. Parallel hardware has only recently become widespread, as many consider that current processing technology is beginning to reach its limits in terms of transistor size and clock speeds. Unlike the MLP, SVM training does not easily break down into matrix multiplications or any other inherently parallelizable process. The GPDT and Cascade SVM are among the first to add parallelization potential and were deemed good candidates for the Cell Processor.

This chapter explains the implementation details of the GPDT and Cascade SVM on the Cell Processor. Similarly to the previous chapter, many of the concepts are described using images rather than pseudo-code whenever possible. The order in which the chapter is laid out may not follow closely the order of development, as it was written with comprehension in mind.

7.2 Parallel Gradient Projection-based Decomposition Technique

The implementation of the GPDT is based on the freely available Parallel GPDT (PGPDT) source code that is licensed under the General Public License (GPL). The GPL license allows for the modification of the code for experimental and educational

purposes. In this work, heavy modifications were made to those parts which were most computationally intensive, while those parts which had little impact on the overall running time were kept so as to reduce sources of bugs. The final product consists mostly of original custom code and uses heavily optimized data representations.

As a simplified high-level review, the PGPDT algorithm is structured as shown in Fig. 7-1. The shaded portion of the figure is that which has been optimized in this work.

Figure 7-1: Parallel Gradient Projection-based Decomposition Technique

Of the major steps in the PGPDT loop, updating the gradient of the objective function in terms of the Lagrange multipliers usually takes the most time, followed by generating and solving the subproblem based on the current training vector working set. The selection of the subsequent working set has not been modified, as it is insignificant in terms of the total running time. The caching access points have been modified, but the cache handling logic has remained the same. The sections that follow dig deeper into the low-level details of the individual steps of the Cell implementation. Where applicable, differences between the Cell and original implementations are brought to light.

Training Data Representation

In the original PGPDT source code, all training input vectors are represented in memory as sparse vectors. The skernel class kept track of these vectors using several data members:

int lx[ell];     // the number of non-zero elements in each vector
int *ix[ell];    // the indices of the non-zero elements
float *x[ell];   // the actual values of the non-zero elements

The memory layout of this data is shown graphically in Fig. 7-2.

Figure 7-2: Original Training Set (Sparse Vector) data representation

To efficiently utilize the SIMD architecture of the SPEs, the data was converted from sparse arrays into 4-element sparse blocks (which is the vector size for the float datatype on the SPEs). Any 4-element vec_float4 value which has at least one non-zero element is stored in memory; a non-zero block is therefore defined as a 4-element block in which at least one element is non-zero. Sparse index numbers represent an entire block instead of a single value. There is some padding necessary as well for optimizing DMA transfers. The new data representation is a much better fit for the SPEs at the expense of a slightly greater memory requirement. The new data members are:

int vlx[ell];        // the number of non-zero vec_float4 elements in each training input
int *vixm;           // the indices of non-zero vec_float4 elements (contiguous in memory)
int *vix[ell];       // pointers into vixm for each input vector
vec_float4 *vec_xm;  // the non-zero 4-element blocks (contiguous in memory)
vec_float4 **vec_x;  // pointers into vec_xm for each input vector

For example, the sparse array in Fig. 7-2 can be represented using the new format as shown in Fig. 7-3.
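Programmatically, the conversion from a dense row to the 4-element sparse-block form is a single scan in chunks of four. A minimal sketch with illustrative names follows; the actual implementation also handles alignment padding for DMA.

/* Convert one dense training vector into 4-element sparse blocks.
 * blocks receives the kept 4-float chunks, block_idx their chunk indices.
 * Returns the number of non-zero blocks written. */
int dense_to_sparse_blocks(const float *dense, int dim,
                           float blocks[][4], int *block_idx)
{
    int nblocks = 0;
    for (int b = 0; b * 4 < dim; b++) {
        float chunk[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        int nonzero = 0;
        for (int k = 0; k < 4 && b * 4 + k < dim; k++) {
            chunk[k] = dense[b * 4 + k];
            if (chunk[k] != 0.0f)
                nonzero = 1;
        }
        if (nonzero) {                /* keep only blocks with a non-zero element */
            for (int k = 0; k < 4; k++)
                blocks[nblocks][k] = chunk[k];
            block_idx[nblocks] = b;   /* index refers to the whole 4-element block */
            nblocks++;
        }
    }
    return nblocks;
}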

Figure 7-3: SIMD-optimized training set data representation

All references to sparse input vectors in the sections below refer to this sparse-block representation.

Generating and Solving the Subproblem

The first item that was focused on in this thesis was the subproblem QP solver. The original PGPDT source code was written in C and C++ and implemented the QP solver as a separate module, making it relatively easy to replace with a modified one. For this reason, this portion of the algorithm was targeted first. The subproblem solver attempts to find a solution to problem 7.1:

(7.1)   \min_{w \in \Omega} f(w) = \tfrac{1}{2} w^T A w + b^T w, \qquad
        \Omega = \{\, w \in \mathbb{R}^{n_{sp}} : 0 \le w \le u,\; y^T w = e \,\},

where A \in \mathbb{R}^{n_{sp} \times n_{sp}}, w, b, u, y \in \mathbb{R}^{n_{sp}}, and e \in \mathbb{R}.

Recall from Chapter 3 that the full matrix of the entire problem, as well as the other data, is subdivided into subproblem data as follows:

(7.2)   G = \begin{bmatrix} G_{\beta\beta} & G_{\beta\delta} \\ G_{\delta\beta} & G_{\delta\delta} \end{bmatrix}, \qquad
        \alpha = \begin{bmatrix} \alpha_\beta \\ \alpha_\delta \end{bmatrix}, \qquad
        y = \begin{bmatrix} y_\beta \\ y_\delta \end{bmatrix}.

In problem 7.1, A is G_{\beta\beta}, w is \alpha_\beta, and y is y_\beta.

Revision 1

The first revision of the solver module was written to take advantage of the SPE SIMD instruction set, reduce branches, unroll and clump loops together, and conform to the

general SPE programming guidelines. The inputs and outputs of this first version of the solver module were:

Inputs:
    n (<200)   size of problem
    A[n*n]     kernel matrix
    b[n]       linear portion
    y[n]       output class
    x[n]       initial alpha values
    u          u parameter
    e          e parameter
    tol¹       tolerance
Outputs:
    x[n]       final alpha values
    ls¹        number of line searches
    proj¹      number of projections
    iter¹      total number of iterations

Table 4: Inputs and outputs of first revision of SPE solver
1. These parameters are specific to the GPDT algorithm.

The simplest way to plug an SPE module into existing code is to write an interface function that runs on the PPE and replaces the original single-threaded QP function. This is called the function offload programming model. The interface function simply farms out the work onto the SPEs, usually by data parallelization, and waits until they are completed. The function offload technique was implemented in this work. The basic pseudo-code for the interface function written is shown in Listing 23.

Collect all input data into an SPE-optimized structure.
    - Align all arrays to 128-byte memory boundaries, pad if necessary.
If the SPE is not running, start it
Send the memory address of the SPE struct to the SPE
Wait for SPE completion
Move result from SPE struct back into original x[n] input
Clean up
Return

Listing 23: PPE Interface function for Revision 1 of subproblem solver

The SPE module, once started, communicates directly with the interface function by waiting for and processing commands via the Cell mailbox functionality. An overview of the SPE steps is shown in Listing 24.

Wait for address of data structure
DMA in the structure
Manually organize memory that will be needed
DMA all arrays into the LS
Execute the algorithm
Place results back into main memory
Signal the PPE

Listing 24: SPE module for Revision 1 of subproblem solver
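As a sketch of the function offload pattern described above (not the thesis's actual interface code, and with illustrative placeholder names standing in for the SPE runtime calls), the PPE-side wrapper essentially marshals aligned data, kicks the SPE, and blocks until completion:

#include <stdint.h>

/* Placeholders for the SPE runtime calls (assumed names, not a real API). */
extern void start_spe_solver(void);
extern void send_mailbox(uint32_t word);
extern void wait_for_spe(void);

typedef struct qp_args {
    int    n;             /* subproblem size (n < 200 in Revision 1)        */
    float *A, *b, *y, *x; /* 128-byte aligned copies of the problem arrays  */
    float  u, e, tol;
} qp_args_t;

/* PPE-side replacement for the original single-threaded QP routine. */
void qp_solve_offload(qp_args_t *args)
{
    start_spe_solver();                          /* load/start the SPE program   */
    send_mailbox((uint32_t)(uintptr_t)args);     /* hand over the struct address */
    wait_for_spe();                              /* block until the SPE signals  */
    /* on return, args->x holds the final alpha values */
}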

While the SPE module was retained, and found to be useful later in the Cascade SVM implementation, as will be described in the proper sections, it was quickly identified that there were several limitations to this initial revision. First, there was quite a bit of overhead associated with the copying and restructuring of data at every invocation of the interface function. Second, due to the SPE's limited room on the local store, only a kernel matrix of size ~200 by 200 could fit, resulting in a maximal subproblem size of 200. The GPDT algorithm was underutilized, as it was designed to work on subproblem sizes of O(10^2) and O(10^3). Third, it would be quite difficult to parallelize this module across multiple SPEs, since only one solver is allowed to run at a time in the working set technique. Fourth, the SPE module did not perform any subproblem generation, leaving a lot of the heavy work to the PPE. And lastly, any other portions of the algorithm would either take up more space on the SPE, reducing the total subproblem size, or require the reservation of their own SPEs, keeping the number of active SPEs below the maximum.

During the course of actual development, work was temporarily halted on the QP solver while focus was placed on the gradient updating step. However, to keep this chapter in coherent order, the next revision of the QP solver, which attempted to fix all of these problems, will be discussed next.

Revision 2

Before the subproblem can be solved, the data for it must be generated. The generation step is computationally expensive yet was performed solely on the PPE in Revision 1. Looking back at problem 7.1, it is apparent that the kernel matrix A needs to be generated before initiating the QP solver. While α_β and y_β are easily generated by random indexing and copying from the corresponding global arrays, the kernel matrix A and the linear term b require many kernel computations. It should be pointed out that there is a shortcut for the calculation of the linear term once the global gradient is known. This shortcut is used for all but the first iteration. The standard method is:

(7.3)   b = G_{\beta\delta}\,\alpha_\delta - 1.

The shortcut is:

(7.4)   b = \nabla F(\alpha)_\beta - G_{\beta\beta}\,\alpha_\beta.

For details, see [64] and [35]. The significance of the shortcut is that it only requires working with n dimensions (the dimension of the subproblem) and that, once the kernel is generated, no additional kernel iterations are necessary. This assumes that the global values of the gradient have been calculated, as they are in GPDT.

As mentioned in Chapter 3, the most expensive portion of the QP solver algorithm is the matrix multiplication used for finding the gradient of the subproblem objective function. There are two locations in which the multiplication is performed, differing only by the vector being multiplied. Because the size of the problem is selected by the user and kept unchanged, so is the kernel matrix. A proper strategy (as recommended in [35]) is to

divide the kernel matrix row-wise so that each SPE generates a portion of the output vector. In this implementation, all non-intensive steps are performed on the PPE, only outsourcing the matrix multiplication step onto the SPEs. The QP solver itself, therefore, is rather trivial to implement.

The kernel matrix for a problem, once initialized, does not change for the duration of the running of the QP solver. This suggests that once the matrix contents are placed into the local stores of each SPE, they should remain there until the QP solver finishes. The task, therefore, is to efficiently place the rows of the matrix into the proper locations of each SPE. This task, however, is not that simple. An important feature of every generated kernel matrix is that it is symmetric, meaning that the upper and lower triangular portions are mirror images of one another. In the original PGPDT source code, this fact is used to an advantage by only calculating the upper triangle of the matrix and then mirroring it across the diagonal, thus cutting kernel calculations by nearly half (the diagonal is calculated as well). Recall that calculating an element a_{i,j} of the kernel matrix requires the i-th and j-th input training vectors, as well as the i-th and j-th outputs, as shown below:

(7.5)   a_{i,j} = y_i \, y_j \, K(x_i, x_j)

Because generating kernel elements, whether for subproblems or for the gradient updating step described later, is very expensive, this task is a prime target for optimization. The details of this process will be postponed until the following sections. For the moment, assume that there exist utilities within the SPE module that perform this very step and that each training vector is assigned a unique id, or index value.

The second revision of the QP solver makes use of one SPE module, which is uploaded to all the SPEs and performs both subproblem data generation and matrix multiplication. A big challenge to overcome in this revision was to balance the workload among the SPEs during these two steps. If the kernel matrix were not symmetric along the diagonal, this would be a trivial matter: each SPE would be programmed to generate a unique, equally sized portion of the matrix within its local store and, ignoring the differences in input vector sparsity, perfect workload balance between the SPEs would be achieved. This workload partitioning concept is shown in Fig. 7-4. Of course, this same balance can be achieved by ignoring the symmetry and just calculating each and every element, but this is unnecessary overhead. On the other hand, the advantage of this scheme is that the matrix multiplication step of the solver is perfectly balanced between the SPEs.

Figure 7-4: Calculating all kernel elements

On the other extreme, if only half the matrix is generated by distributing the workload as shown in Fig. 7-5, the kernel generation step is equally balanced. However, it becomes difficult to reflect the elements to form a complete matrix. Even if that were accomplished, the matrix multiplication step would be highly unbalanced.

Figure 7-5: Equal balance of upper triangle, difficult reflection

The final solution that was implemented strikes a balance between the two extremes. The concept is to divide the kernel matrix into NxN blocks, each of which represents one job. The number of blocks to divide into depends on several factors, including the number of available SPEs and the size of the matrix. With this method, the number of kernel evaluations is just over half of the number of evaluations performed in the first case, and slightly over that of the second case. During the initialization of the solver, each SPE is given a table of the blocks for which it is responsible. Because the subproblem size does not change between iterations, this table is reused, saving on communication (similar to how the job information for each layer in the MLP algorithm was stored in a table on each SPE, as described in Chapter 6). To enforce balance during the subproblem data generation step, each SPE is given about an equal number of blocks to process.

The first task is to figure out the size N of the length and height of each block, including the rightmost and bottom ones, and the number of rows of blocks to be placed into each SPE. The limitation is the maximum block size (MAX_BLOCK_SIZE), which is set by a macro. The idea behind this algorithm is to find the minimum number of block rows per SPE such that the size of each block is within the maximum block size. The block size b_size is related to the number of block rows per SPE, br_spe, by equation 7.6:

(7.6)   b_{size} = \left\lceil \left\lceil \frac{n}{NUM\_SPEs \cdot br_{spe}} \right\rceil \right\rceil_4

in which n is the subproblem size and \lceil \cdot \rceil_4 rounds up to the nearest multiple of four. Next, based on the number of blocks per column and row, n_{bpcr}, of the full kernel matrix, the number of block jobs is calculated and memory is allocated. The number of jobs is simply:

(7.7)   T_n = \sum_{k=1}^{n_{bpcr}} k = \frac{n_{bpcr}\,(n_{bpcr}+1)}{2}.

Once the block jobs are allocated, initialization information for each of the NUM_SPE SPEs is generated and sent to the SPEs using the CMD_INIT command. The contents of this initialization structure are shown in Fig. 7-6.

Figure 7-6: SPE-initialization structure

Each SPE is given an equal number of block jobs to generate (the number of jobs should be roughly the same; Fig. 7-7). The actual jobs have not yet been set up, as they require the internal LS pointers of all the SPEs so that inter-SPE DMAs can occur. Recall that effective addresses can also map into the LSs of the SPEs; DMA transfers between SPEs utilize this effective address for direct SPE-to-SPE communication. Because of the dynamic allocation of memory, the local LS buffer addresses of the partial kernel matrices are not known until the SPEs are initialized (using the CMD_INIT command). The CMD_GET_KERNEL_LS command is sent after the CMD_INIT command, which the SPEs respond to by returning their local LS addresses for these buffers. The system-wide effective addresses of the partial kernel buffers are obtained by adding the returned local LS addresses to the base LS address obtained when first setting up the SPEs.

Figure 7-7: Block job SPE distribution

At this point, with the starting addresses of all partial kernel buffers in each of the SPEs known, the block jobs themselves are set up. The contents of a block job struct are shown in Fig. 7-8.

Figure 7-8: Job block structure

Each block is given a block row and column number which represents where in the full kernel matrix the block resides. Based on these two values and the previously obtained global effective addresses of each of the SPEs' partial kernel buffers, the destinations of the block and its transpose are calculated. So far, the number of kernel evaluations has been minimized, and each SPE is given an equal number of block generation tasks. The next goal is to ensure that at the end of the data generation step, each SPE holds an equal number of consecutive matrix rows, so that the matrix multiplication step is also balanced. This implies that the destination addresses of all the blocks during kernel generation should be equalized between the SPEs. There is no bind between a block job and the SPE at which it was generated. Blocks generated on one SPE may end up on any of the other active SPEs depending on their destination addresses. In fact, most of the time, the destination address of the block and that of its transpose belong to two different SPEs (Fig. 7-9). The destination SPE is chosen based on the number of block rows per SPE and the row number of the block (and the column number of the transpose of the block).

Figure 7-9: Final locations of all the blocks (B: Block, T: Transpose)

Once all block jobs are created, the CMD_JOBS_READY command is sent to each SPE, at which point the SPEs DMA in their personal job lists from main memory into their LSs.

At each iteration of the main loop, there is a subproblem that needs to be solved. The PPE is responsible for choosing those indices from the global training set that will make up this subproblem based on previous iterations. Once chosen, the PPE generates several arrays in memory that will be used by the SPEs, including the DMA lists for the sparse training vectors in the current subproblem. The pointers to this data are encapsulated into a problem_data structure shown in Fig. 7-10.

Figure 7-10: Problem data structure

Next, the CMD_GENERATE_KERNEL command is sent to all SPEs along with the effective address of the data generated above. The SPEs begin generating their portions of the kernel matrix for this subproblem. The block generation algorithm is shown in Listing 25. Generation and sending of blocks is intertwined by using multi-buffering. Increasing the number of buffers allows for having more blocks (and their transposes) in transit at any one time. In software, the buffers are encapsulated into job contexts, which are accessed sequentially while processing block jobs. Having four job contexts, each with their own buffers and DMA transfer statuses, showed the best size vs. speed performance.

DMA in data given in the problem data structure
For all block jobs assigned to this SPE
    Select next block job context (round robin)
    Wait for DMA transfer completions on this context
    Clear block and block's transpose buffer memory
    Generate the block
    Begin DMA transfer of block to its final destination
    Generate transpose of block
    Begin DMA transfer of transpose to its final destination
End Loop

Listing 25: Block generation algorithm

The actual task of generating each block has also been heavily optimized. Recall that in order to generate the (i,j)-th element in the kernel matrix, the i-th and j-th sparse training vectors are needed. Simplified pseudo-code is shown in Listing 26.

While there are rows remaining to be generated
    DMAL 4 sparse input vectors (four rows)
    Expand the sparse vectors into non-sparse format
    While there are columns remaining
        DMAL up to 16 sparse input vectors (columns)
        Use optimized kernel generation module to generate four rows of the kernel block
    Loop
Loop

Listing 26: Generating the actual block

The function generating the transpose has been optimized as well. It involves a loop in which a sequence of vector instructions is used to generate index values into the source block and destination transpose block. Once the source and destination pointers are obtained, simple vector shuffling instructions are used to copy the data. The loop is designed to process a four-by-four block on every iteration.

An additional step of subproblem data generation is creating the linear term of the Lagrange function. As mentioned earlier in this section, there is a shortcut for creating this array (Eq. 7.4). Each SPE generates a portion of the linear term based on the rows of the kernel that are in its memory and DMAs it into main memory (the effective address is stored in the initialization data obtained along with the CMD_INIT command). To initiate this step, the PPE must wait for an incoming signal from each SPE after issuing the CMD_GENERATE_PKERENL command. This synchronization step acts as a barrier, ensuring that the final subproblem kernel is ready and distributed among the SPEs. At this point, the CMD_GENERATE_PLINTERM command is sent. The SPE simply DMAs in the data it needs and performs a matrix-vector multiplication corresponding to Eq. 7.4.

Having the data generated, the subproblem solver can be run. To shorten development time, the Revision 1 solver code was ported to run on the PPE by utilizing the spu2vmx and vec_types.h header files. Doing so enforced the use of the PPE's Altivec unit for vector operations. The code was further modified, replacing the two expensive matrix multiplications with interface functions into the SPEs. The two matrix multiplications that are performed differ only by the vector being multiplied. Because each matrix multiplication operation requires the effective address of the input vector and output vector, these addresses are given to the SPE along with the CMD_INIT command inside the initialization structure (Fig. 7-10). These pointers are kept until the subproblem is solved. The correct choice of the input and output effective addresses is made based on a parameter in the CMD_PERFORM_MULT command, which is shown in Fig. 7-11.

Multiply Command (8 bits) | Set (24 bits)
Figure 7-11: Multiply command

The parameter Set is used to select the corresponding effective address for the source of the vector multiplied and the destination of the product vector. Because the QP solver runs over many iterations, requiring only one mailbox message per multiplication reduces communication overhead.

Kernel Element Generation

In this work, only the Gaussian kernel was implemented due to lack of time. The Gaussian kernel, however, is the most common and more than sufficient for the purposes of this work. The time it would take to implement the other kernels was better spent on optimizations so that the potential of the Cell Processor could be fully exhibited. In general (ignoring the working set method), the Gaussian kernel matrix elements are generated using:

(7.8)   a_{i,j} = \exp\!\left( -\left( \mathrm{norm}_i + \mathrm{norm}_j - 2.0\,\langle \mathrm{vec}_i, \mathrm{vec}_j \rangle \right)\sigma \right),

in which i and j are two training vector indices and a_{i,j} is the element in the i-th row and j-th column (as well as the j-th row and i-th column due to symmetry). Note that when generating kernel matrices for subproblems, the i's and j's take on values from the subset of the indices in the current working set. As mentioned previously, besides caching, the optimization of the kernel generation code is without question one of the greatest improvements that can be made to speed up the GPDT algorithm. In that regard, an SPE-optimized function was written, designed to efficiently calculate several elements of a single row of the kernel matrix at a time. The function signature is shown in Listing 27.

void GetKgaussKernelRow( vec_float4 *vec_output, vec_float4 *vec_exp_row, float exp_row_nor, vec_float4 *vec_input_rows, int *vix, int *vlx, float *sparse_nors, int num_ell )

vec_output: Pointer to the result vector.
vec_exp_row: The main input training vector (non-sparse).
exp_row_nor: The norm value for the row above (pre-calculated during initialization).
vec_input_rows: All the input training vectors (in sparse form) which will be evaluated against the main input training vector using the kernel.
vix: Sparse indices of the non-zero values for each of the training vectors above.
vlx: Sparse lengths for all the training vectors above.
sparse_nors: The norms for all the sparse input rows.
num_ell: The number of non-main sparse training vectors (the number of kernel elements that will be calculated). This number must be a multiple of four.

Listing 27: Kernel row generation function signature

Note that all the arrays corresponding to the sparse input rows vary in length (the vlx array has information about all their lengths) and each array is contiguous in memory. Using function 7.8 as a reference, it is clear that the dot product is the first step performed for any pair of input training vectors. In GetKgaussKernelRow, the dot product is performed for four sparse training vectors at a time, each paired with the main non-sparse vector. For every four sparse vectors (outer loop), the inner loop performing the dot product executes a number of times equal to the length of the longest sparse vector. Once the loop exceeds any of the four vector lengths, the dot product at that iteration for that element is calculated but not added to the running sum (the dot product is a running sum of products for each element in the vector). This is done by using the SPE vector-wise compare and select instructions, as demonstrated in an earlier section.

To better explain the procedure, a toy example in which four elements are to be calculated follows. Vec_exp_row is the main non-sparse vector against which the four sparse vectors in vec_input_rows are processed. The lengths of the sparse rows are 2, 1, 3, 1 in that order.

Figure 7-12: Parameters into GetKgaussKernelRow function

Because there are four elements, there is only one iteration of the outer loop. During this loop, the following pointers will be assigned:

Figure 7-13: Pointers assigned at first iteration of loop

Next the inner loop (jj is the loop control variable) starts and iterates 3 times (MAX(2,1,3,1) = 3, so jj = 0, 1, 2). All the vec_add_[n] values are calculated whether or not the results will be added to the main dot product vector (see the next step). Each of the dot product vectors vec_add_[n] is summed across and inserted into element positions 0, 1, 2, and 3 of a temporary vector.

Figure 7-14: Updating of the running dot product

Because it is very likely that the sparse vectors are not all the same length, at every iteration the loop counter jj is made into a vector by copying the value four times. The resulting jj vector is compared to the vlx_max vector. A select mask vector is formed that is used to modify the contents of the temporary vector created in the previous step. Those elements for which vlx is less than jj are set to zero (using the select masks created and the vector select instruction). The adjusted temporary vector is added to the current position in the main dot product array.

Figure 7-15: Adding the result to the running dot product array

Finally, once the inner loop is finished, the rest of the operations in function 7.8 can be performed as simple aligned vector operations on the current location of the main dot product array.

Updating the Gradient

Because existing GPL code was available, it was relatively easy to optimize one section of the algorithm while keeping the rest unchanged. For example, the original gradient updating code was used throughout the development of Revision 1 of the QP solver. A selection of small training problems was used to verify that the obtained results did not change from the original, unmodified code.

The gradient updating portion of the algorithm is most computationally expensive when the number of total training vectors becomes large compared to the number of training vectors in a single working set. The task boils down to a large matrix-vector multiplication. The problem is that most of the contents of this matrix are not in memory and must be calculated. The caching technique in the PGPDT implementation is very useful here, but only up to some size of the training set. Nevertheless, it was updated to work with the Cell implementation and could be turned on and off as seen fit. As a review, the formula to calculate the gradient of the dual Lagrange objective function is:

(7.9)   \nabla F(\alpha^{k+1}) = \nabla F(\alpha^{k}) + \begin{bmatrix} G_{\beta\beta} \\ G_{\delta\beta} \end{bmatrix} \left( \alpha_\beta^{k+1} - \alpha_\beta^{k} \right).

Note that the columns of the matrix [G_{ββ}; G_{δβ}] are the columns from the global kernel matrix whose column indices correspond to the training vector indices of the current working set.

To reduce the number of calculations, the PPE obtains the value (α_β^{k+1} − α_β^{k}) and discards those indices for which the result is less than a small value (DELTAsv). The remaining set of indices is evenly distributed for processing among the available SPEs. Each SPE produces a partial sum of the [G_{ββ}; G_{δβ}](α_β^{k+1} − α_β^{k}) value, which is added to the global sum by the PPE.

Figure 7-16: Kernel Matrix generation job distribution among SPEs

In the first revision, the gradient updating code was placed into a separate SPE software module and ran on different SPEs than the QP solver (five gradient updating SPEs and one QP solver SPE). Once Revision 2 of the QP solver module was finished, however, the gradient updating module was merged with it to produce one SPE module capable of performing both procedures. While this made the code size slightly larger, some code was shared, including the DMA handling and kernel element generation. Merging the two modules also made it unnecessary to perform expensive context switching and allowed for maximizing the utilization of all six available SPEs. The increased complexity came from the management of memory (pointer handling), since the usage of the LS space between the two was intentionally overlapped for higher efficiency.

The first step that needs to occur after uploading the SPE module to the SPEs is the sending of initialization information. This information includes any kernel parameters, the number of total training vectors, the dimension of the training vectors, and effective addresses of various training vector data (sparse lengths, sparse indices, sparse values, norm values). Once the information is received, the LS memory is organized by setting various pointers for both the gradient updating step and the QP solver step. For both cases, any leftover memory is maximized for critical buffer usage such as double buffering. This way, the LS space is optimized.

Performing gradient updating follows the same PPE-SPE programming model as the QP solver. However, extra information needs to be sent to the SPEs at each step. Three mailbox messages are sent to all the SPEs, as shown in Fig. 7-17:

Update Gradient Command ID (8 b) | NumRows (24 b)
Effective address of input vector data array (high 32 bits)
Effective address of input vector data array (low 32 bits)
Figure 7-17: Sequence of mailbox messages sent to SPEs at gradient updating step

Note that the NumRows and effective address are different for each SPE, causing each SPE to DMA different information (data parallelism). The SPE uses the received effective address to download the array of NumRows input vector information entries from main memory that it is responsible for. Table 5 shows the contents of a single instance of the vector information struct.

Name              Type      Description
vec_id            uint32_t  The global index of the training vector
sparse_vlen       uint32_t  The sparse length of the vector
norm              float     The vector norm
grad              float     The value (α_β^{k+1} − α_β^{k}) for this vector
EA_row_start      addr64    The memory location of the sparse vector
EA_row_idxs       addr64    The memory location of the sparse vector indices
KernelRowInCache  char      Is the kernel row in the cache
CacheOn           char      Should the cache be used/updated
EA_kernel_row     addr64    If the cache is on, the memory location of the kernel row

Table 5: Vector information struct

The pseudo-code in Listing 28 shows a very simplified and non-optimized version of the procedure. The text following gives an overview and comments on each step.

The pseudo-code in Listing 28 shows a very simplified and non-optimized version of the procedure. The text following gives an overview and comments on each step.

training_vec_remaining = total_training_vectors (matrix rows)
While training_vec_remaining > 0 (Loop A)
    ws_size = MIN(MAX_WORKING_SET, training_vec_remaining)
    DMA in the next ws_size norm values (norm) and sparse vector lengths (vlx) for this set of training vectors
    ws_size_remaining = ws_size
    While ws_size_remaining > 0 (Loop B)
        inner_ws_size = maximum sparse vectors that can fit into the buffer on the LS
        DMA in the next inner_ws_size input vectors (sparse values and indices) from the current working set (call this set r_ws)
        For all rows (r) assigned to this SPE (Loop C)
            If this kernel row is in the cache in main memory
                DMA in the current portion of the kernel row (size = inner_ws_size)
            Else, need to generate it:
                DMA in the sparse input vector values and sparse indices for this row
                Expand the sparse row into non-sparse format
                Calculate the kernel values that can be obtained from the current row r and all sparse rows in the inner working set (r_ws)
                If cache is enabled
                    DMA the portion of the kernel row back into the PPE cache
            Update the delta gradient (st_out) for this section of the current working set
        Loop (C)
        ws_size_remaining = ws_size_remaining - inner_ws_size
    Loop (B)
    Send the final partial gradient to the PPU
    training_vec_remaining = training_vec_remaining - ws_size
Loop (A)

Listing 28: Pseudo-code of gradient updating on the SPEs

The total training vectors, as the name implies, is the total number of training vectors in the full set. Because this number can become very large, only a portion of the total training vectors are worked with at any time. The outermost loop (Loop A) is responsible for sequencing over the entire set in chunks that won't overwhelm the LS. The maximum working set size for each iteration is MAX_WORKING_SET. Two partial arrays of data are downloaded from main memory: vlx (the sparse lengths of each training vector) and norm (the norm of each of the training vectors in the current portion, or working set).

Figure 7-18: The contents of the vlx and nor vectors received by the SPE in Loop A

The next inner loop (Loop B) is responsible for downloading as many sparse vectors from the working set as possible into the available buffer space. The vlx array, which holds the

sparse length of the vectors in the current working set, is used to figure out this number, knowing that each sparse entry in the sparse vectors requires twenty bytes (a 16-byte 4-element vector and a 4-byte sparse vector index).

Figure 7-19: The actual sparse vector contents received within Loop B

The two loops so far execute identically on all SPEs since the set of all global training vectors is common between them. The next inner loop (Loop C), however, iterates over all the training vector indices that were assigned to that SPE. When generating the row information for the SPEs, the PPE sets a flag in the information struct indicating whether that row is currently in the cache. The SPE checks this flag here and, if it is set, downloads the current portion of the row (the portion is determined by Loop B's current training vector index and inner_ws_size). If the flag is not set, the contents of the current training vector (sparse values and indices) are downloaded and expanded into non-sparse form.

The expanded training vector array is used as the second parameter of the GetKgaussKernelRow function. The remaining parameters include the current set of sparse training vectors downloaded by the first inner loop, the corresponding nor and vlx values downloaded in the outer loop, and the number of elements to generate, which is equal to inner_ws_size.

Once this portion of the kernel row is generated, column-based matrix-vector multiplication is used to obtain the partial change of the gradient due to this row. In other words, all elements in the resulting kernel row portion are multiplied by the (α_β^(k+1) − α_β^k) value associated with the current row of the inner-most loop. This is easily done by copying this value into a four-element vector and performing a set of SIMD operations. The partial change of the gradient is added to the proper location of st_out, which is the final partial change of the gradient for this SPE, of size equal to ws_size (determined in Loop A). Fig. 7-20 shows a summary of the order in which the loops generate the matrix and produce the change in the gradient.
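A minimal sketch of that splat-and-accumulate step, assuming spu-gcc with the SPU intrinsics header, 16-byte-aligned buffers, and an inner_ws_size that is a multiple of four (function and parameter names are illustrative):

#include <spu_intrinsics.h>

/* st_out += kernel_row * d_alpha, four floats per iteration. */
static void accumulate_row(const vector float *kernel_row, /* generated kernel row portion     */
                           vector float *st_out,           /* partial gradient change (output) */
                           float d_alpha,                  /* (alpha_new - alpha_old) for row  */
                           unsigned int inner_ws_size)
{
    vector float d = spu_splats(d_alpha);     /* copy the scalar into all four lanes */
    unsigned int n = inner_ws_size >> 2;      /* number of 4-float vectors           */
    for (unsigned int i = 0; i < n; i++)
        st_out[i] = spu_madd(kernel_row[i], d, st_out[i]);
}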

Figure 7-20: Updating a portion of the product vector by a single SPE

In the first revision of the gradient updating code, double buffering was performed in Loop B over the contents of the sparse vector values and indices, as shown in Fig. 7-21 and in the simplified pseudo-code in Listing 29.

Figure 7-21: Double buffering in the first revision of the gradient updating SPE module

training_vec_remaining = total_training_vectors
While training_vec_remaining > 0 (Loop A)
    ws_size = MIN(MAX_WORKING_SET, training_vec_remaining)
    DMA in the next ws_size norm values (norm) and sparse vector lengths (vlx) for this set of training vectors
    ws_size_remaining = ws_size
    inner_ws_size_0 = max that can fit into buffer 0
    inner_ws_size_1 = max that can fit into buffer 1
    Start DMA of sparse vector data into Buffer 0 (inner_ws_size_0) and Buffer 1 (inner_ws_size_1)
    While ws_size_remaining > 0 (Loop B)
        wait for Buffer 0 DMA to complete
        Loop C for Buffer 0
        Start DMA of next inner_ws_set into Buffer 0
        wait for Buffer 1 DMA to complete
        Loop C for Buffer 1
        Start DMA of next inner_ws_set into Buffer 1
    Loop (B)
    Send the final partial gradient to the PPU
    training_vec_remaining = training_vec_remaining - ws_size
Loop (A)

Listing 29: Double buffering in the first revision of the gradient updating SPE module

It was soon discovered that double buffering at this level provided very little advantage, as most of the DMA stalls occurred in Loop C when downloading the sparse row contents (or partial kernel row) for each row index assigned to the SPE. The second revision of the gradient updating code, therefore, removed double buffering in Loop B and implemented multi-buffering in Loop C. Fig. 7-22 graphically shows a toy example of a double-buffering setup in Loop C.

Figure 7-22: Double buffering in the second revision of the gradient updating SPE module
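The Loop C buffering follows the usual MFC pattern of prefetching the next row while the current one is processed. Below is a minimal double-buffered sketch using the SDK's MFC intrinsics; the buffer size, tag usage, and helper names are assumptions, and DMA sizes are assumed to be 16-byte multiples of at most 16 KB:

#include <stdint.h>
#include <spu_mfcio.h>

#define BUF_SIZE 16384                                  /* one LS buffer (assumed size) */
static char buf[2][BUF_SIZE] __attribute__((aligned(128)));

static void fetch_row(int b, uint64_t ea, uint32_t size)
{
    mfc_get(buf[b], ea, size, b, 0, 0);                 /* asynchronous get on tag b */
}

static void wait_for(int b)
{
    mfc_write_tag_mask(1 << b);                         /* select tag b              */
    mfc_read_tag_status_all();                          /* block until it completes  */
}

void process_rows(const uint64_t *row_ea, const uint32_t *row_size, int num_rows)
{
    int cur = 0;
    if (num_rows > 0)
        fetch_row(cur, row_ea[0], row_size[0]);
    for (int r = 0; r < num_rows; r++) {
        int nxt = cur ^ 1;
        if (r + 1 < num_rows)
            fetch_row(nxt, row_ea[r + 1], row_size[r + 1]);  /* prefetch the next row */
        wait_for(cur);
        /* ... generate kernel elements and update st_out using buf[cur] ... */
        cur = nxt;
    }
}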

7.3 Cascade SVM

As introduced in Chapter 3, the Cascade SVM achieves its level of parallelism by generating multiple asynchronous and independent QP problems. Each problem's data is generated from a certain subset of the full set of training vectors. By analyzing the α values obtained by solving each QP problem, the support vectors for that QP training subset can be deduced and combined with support vectors from other solutions to form a new training set for the next problem in the tree.

The output set of support vectors is a subset of the input training vectors originally chosen for a particular problem, and therefore the output is always smaller than or equal to the input. There is no guarantee, however, that combining the outputs from two or more solvers will produce an input small enough to fit in the available LS space for a single SPE solver. The problem, therefore, is to guarantee that the number of training vectors for any new problem does not exceed a fixed maximum amount. In this implementation, the Cascade SVM method was implemented to function as a filter into the GPDT solver. In other words, the Cascade SVM was not capable of solving the problem all by itself.

7.3.1 Dependency Tree and Job Queue

In the implementation, the PPE keeps track of a dynamic dependency tree for all the currently active QP solvers. Active solvers may or may not be running due to dependencies. The dependency tree is formed from individual smaller trees called Cascade421's. Each Cascade421 has three layers with a total of seven elements as shown in Fig. 7-23.

Figure 7-23: Dependency tree of Cascade421 objects

Before proceeding, the reader is reminded that each training vector is indexed by a unique index value. A subset of training vectors is, therefore, represented as a subset of indices. In the figure, solid arrows show the dependencies between the elements. Each element holds a set of training indices. If the set of indices is too large for a single solver, the element spawns (farms out) a new Cascade421 element which automatically divides the set into four subsets for its first layer. This process effectively creates new dependencies that must be resolved before that element can proceed. Any of the seven elements in a Cascade421 object can spawn a child Cascade421 object, hence creating a tree-of-trees structure. If the number of indices is below the maximum, a TrainingIndices object is created holding the indices of that element. The TrainingIndices object is queued in an independent job queue for processing (also shown in Fig. 7-23) and the element is marked as ready. Whether a node farms out its training indices or not is transparent to the parent Cascade421 object, as it simply waits until that node is marked done.

Each time an element is marked ready for processing (that is, the number of training indices was less than the maximum allowed) and the given training set is placed on the ready queue, the element holds a pointer to that set, as shown in the figure using dotted arrows. The PPE effectively alternates between processing the dependency tree and processing the ready queue. Following the ready queue processing, at the next

dependency processing, it will check if the problem has been solved and change its status accordingly.

Elements in a Cascade421 are numbered 0-6, starting at the left layer and moving from top to bottom. In Fig. 7-23, elements 2 and 4 are farmed out from the topmost Cascade421 object. The status of the tree represents the time after the tree has been processed for dependency resolutions but before processing the job queue. Element 6 is ready, which implies that its indices are in the job queue in the form of a TrainingIndices object and are ready to be processed. The reason for choosing the Cascade421 representation is that it could easily generalize to the improved cascade SVM or M3-SVM, as described in Chapter 3.

As mentioned above, the ready queue holds instances of TrainingIndices objects, and not actual problems. A single iteration of the dependency tree could create thousands of new ready elements. If the problem data were generated before being queued, all the memory taken up would defeat the purpose of this divide-and-conquer technique. Instead, the problems are generated as the TrainingIndices objects are dequeued during the job queue processing stage, as will be described later.

7.3.2 Desired Filter Ratio Parameter

With the addition of the Cascade SVM implementation came a new parameter, the desired filter ratio (-R), which had the effect of controlling the amount of filtering performed by this step. The Cascade421 element keeps track of the number of support vectors coming into and leaving each layer. Fig. 7-24 shows the locations at which these numbers are calculated. When all layer 1 solvers are finished, the ratio b/a is calculated. When all of layer 2 is finished, the ratio c/b is calculated. If either one of these ratios is greater than the value passed in using the -R parameter, the Cascade421 element will finish early and notify its parent Cascade421 object. A high desired filter ratio will force more Cascade421 elements to run to completion. A lower value will cause the Cascade421 elements to quit early due to loss of filtering efficiency (large ratio).

Figure 7-24: Points in the Cascade421 element at which the number of support vectors are checked
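The check itself is a one-liner; a sketch of how it might look at a layer boundary (the function name and return convention are illustrative):

/* sv_in is the support-vector count entering a layer (a or b in Fig. 7-24)
 * and sv_out the count leaving it (b or c); desired_ratio is the -R value. */
static int should_finish_early(unsigned int sv_in, unsigned int sv_out, float desired_ratio)
{
    /* a ratio near 1.0 means the layer filtered out almost nothing */
    float ratio = (sv_in > 0) ? (float)sv_out / (float)sv_in : 0.0f;
    return ratio > desired_ratio;   /* finish early and notify the parent Cascade421 */
}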

7.3.3 Improved QP Problem Solver

Although the first revision of the QP solver was completely replaced by the second revision in the Cell GPDT implementation, it was an excellent starting point for designing the problem solver for the Cascade SVM implementation. Here are the problems that were inherent to the first revision of the problem solver in the context of the GPDT implementation:

o Overhead due to copying of data into DMA-able (aligned, padded) arrays
o A maximal subproblem size of 200
o No potential for parallelization
o No subproblem data generation

The overhead of the first problem is now a necessary step rather than overhead. In other words, the problem does not exist in memory beforehand and is generated at the time when it is dequeued. This implies the generation of all DMA-able arrays at that time. There is no conversion being done, as was the case in the GPDT implementation. Furthermore, the generation of the DMA-able arrays is built into the new PPE side of the solver and is executed in a pipelined manner. Basically, the PPE readies arrays for the next problem in the queue while existing problems are being solved on the SPEs. So, not only is this not considered overhead, but it can be hidden from the total execution time.

The problem size limitation, unfortunately, still exists in the design of the Cascade SVM solver. In the GPDT implementation, there was no potential for parallelization due to the linear and sequential problems. Here, the queue often holds hundreds of TrainingIndices which can be transformed into independent problems. After all, the main idea behind the Cascade SVM is the generation of multiple parallelizable problems. The last problem in the list was solved by devising a clever method for generating the subproblem data on the SPE. Granted, this was not really a problem, just a lack of a feature that turned out to be feasible.

The problem solver (named SPQProblem), just as most of the code, was written using C and C++ and currently consists of two main components. The PPE component functions as an interface for assigning and processing problems and controls the execution of the SPE component (an SPE module). An earlier version of the solver, before the new problem generation technique was devised, included a second SPE module, for a total of three components. The task of the second SPE module (the generator module) was to generate the data for the problem and send it directly to the solver SPE module. The generator SPE and the single solver SPE were tied using quite a complex communication scheme and worked in a pipelined fashion. The number of SPE generator modules and the selection and amount of work done on each was adjustable. Two architectures were tested: three solvers having one generator each (Fig. 7-25a), and two solvers having two generators each (Fig. 7-25b).

Figure 7-25: Pipelined QP solver SPE modules

Eventually, this concept turned out to be quite unnecessary and had an unresolved bug due to a race condition between the SPEs. Results were therefore not collected with this method. The realization of a new technique that performed generation of the problem data within the solver module led to the abandoning of the overly complex pipelined version. The rest of this section refers to the new concept, although the interested reader is invited to examine the source code, which is still present, as the SPQProblem class was written to be versatile enough to support both methods.

The SPQProblem class takes care of reserving and initializing an SPE before use. Since there is one SPE module per solver, the number of SPQProblem objects created is equal to the number of available SPEs for maximum parallelization. Initializing the SPE modules follows the same pattern as elsewhere in this work. Namely, an initialization command is sent, followed by the high and low 32-bit portions of the effective address of the initialization structure located in main memory. The SPE downloads and stores the structure, which contains information about the kernel parameters and the dimension of the training vectors. The PPE module initialization consists of having various pointers assigned to the global problem data so that subproblems can be generated given a TrainingIndices object as a parameter. The data includes the global Lagrange multipliers (the α's), the training vector output values y, and the sKernel object which holds the actual training vector data.

The SPQProblem class was designed to be controlled asynchronously and contains a two-element pipeline: one stage for allocating memory and gathering the required information from the global data, and one for running the actual solver. This model makes it easier to interface with the SPEs in the job queue handling code (Listing 30).

Do
    Loop over all active SPQProblem solvers
        Process SPQProblem solver
        If SPQProblem solver is ready for new input
            dequeue the next problem from the job queue
            initialize the dequeued problem and use AssignProblemData to assign the problem to the solver
            Process SPQProblem solver
While there are more problems in the job queue

Listing 30: Job queue handling

Two interfaces into the SPQProblem are AssignProblem and Process. Assigning a problem involves creating an internal context representing the problem and placing it into the internal pipeline. The SPE command data structure is generated as well. It is sent to the SPE in the next stage of the pipeline. AssignProblem takes a TrainingIndices instance as a parameter and uses it to selectively copy values from the global problem arrays. Fig. 7-26 shows the idea.

Figure 7-26: PPE AssignProblem random indexing of global problem arrays

The core interface function is Process, which manages the internal problem contexts by pushing them along the pipeline. Each time the Process function is called, the PPE updates the status of the currently active problem contexts in its pipeline. The problem contexts can be in any one of the three slots (init, gen, or solver) at a time. When a problem context enters the solver slot, three commands are sent to the SPE, following the usual pattern:

  1. New Problem Command | PPE Rank (=0)
  2. Effective address of problem command data (high 32 bits)
  3. Effective address of problem command data (low 32 bits)

Figure 7-27: Commands sent to the SPE when starting the QP solver

The parameter of the first command is the source rank. It is used so that the receiving SPE knows which processing element the command came from. Each SPE has a table of information about the other processing elements that is indexed by their unique rank number.
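On the PPE side, this three-message sequence maps onto libspe2 inbound-mailbox writes. The sketch below assumes the same 8-bit command / 24-bit payload packing shown in Fig. 7-17; the command ID value and function name are illustrative:

#include <stdint.h>
#include <libspe2.h>

#define CMD_NEW_PROBLEM 0x03u   /* hypothetical command ID */

static int send_new_problem(spe_context_ptr_t spe, uint64_t problem_ea, unsigned int src_rank)
{
    unsigned int msg[3];
    msg[0] = (CMD_NEW_PROBLEM << 24) | (src_rank & 0x00FFFFFFu); /* command + source rank */
    msg[1] = (unsigned int)(problem_ea >> 32);                   /* EA, high 32 bits      */
    msg[2] = (unsigned int)(problem_ea & 0xFFFFFFFFu);           /* EA, low 32 bits       */

    /* block until all three words are in the SPE's inbound mailbox */
    return spe_in_mbox_write(spe, msg, 3, SPE_MBOX_ALL_BLOCKING);
}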

The pseudo-code in Listing 31 gives an overview of the operations that occur on the SPE in the most recent version.

Loop Indefinitely
    Wait for command
    Case: Received new problem command
        DMA in the problem information structure
        DMA in the y output array for this problem
        Generate the kernel
        DMA in the α array for this problem
        If using the standard linear term (vector of -1's)
            Generate the standard linear term
        Else if generating the linear term locally
            Generate the linear term
        Run main solver loop
        Place updated α's back into main memory
        Signal PPE
Repeat

Listing 31: SPE solver pseudo-code

The problem information structure is downloaded from main memory. It contains effective addresses of various data that will be needed and where the results are to be placed in main memory, the problem id, the size of the training set, a number of solver options, the quadratic optimization problem parameters, etc. The generation of the kernel matrix is one of the new items added for the Cascade SVM implementation. The procedure is explained in the following section. Once the matrix is generated, the rest of the data that is required for solving the problem is DMA'ed in. In the case of the Cascade SVM, the linear term is simply a vector of -1's. When done, the resulting Lagrange multipliers are uploaded back into main memory and the PPE is notified. On the next call to the Process function, the problem is flushed out of the pipeline and the next problem is sent to the SPE.
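The "Wait for command" step at the top of Listing 31 is a blocking mailbox read followed by a dispatch on the command word. A sketch, again assuming the 8-bit command / 24-bit payload packing and with the handler contents elided (command IDs and names are illustrative):

#include <spu_mfcio.h>

#define CMD_INIT        0x01u   /* hypothetical command IDs */
#define CMD_NEW_PROBLEM 0x03u

int main(void)
{
    for (;;) {
        unsigned int msg = spu_read_in_mbox();         /* blocks until the PPE writes */
        unsigned int cmd = msg >> 24;                  /* command ID (8 bits)         */
        unsigned int arg = msg & 0x00FFFFFFu;          /* payload, e.g. rank/NumRows  */

        unsigned int ea_hi = spu_read_in_mbox();       /* next two words carry a      */
        unsigned int ea_lo = spu_read_in_mbox();       /* 64-bit effective address    */
        unsigned long long ea = ((unsigned long long)ea_hi << 32) | ea_lo;

        switch (cmd) {
        case CMD_NEW_PROBLEM:
            /* DMA in the problem structure at ea, generate the kernel, run the
             * solver, write the alphas back, and signal the PPE (Listing 31). */
            break;
        case CMD_INIT:
            /* download the initialization structure at ea, set up LS pointers */
            break;
        default:
            break;
        }
        (void)arg;   /* payload interpretation depends on the command */
    }
    return 0;        /* never reached */
}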

Kernel Generation

Only the minimal amount of data is transferred onto the LS for the generation of the kernel matrix. Since the matrix will be positive definite, only the upper triangle of the matrix is generated. However, it is not a perfect triangular matrix. To best utilize the SIMD units on the SPE, it is required to work with four values at a time. In effect, 4x4 blocks are generated at a time. The data is placed into memory in a compressed form so as to maximize memory efficiency for double buffering (Fig. 7-28a). The dashed lines in the figure represent available space for double buffering. The 4x4 matrix blocks are placed into memory as shown.

Figure 7-28: Kernel Generation on the SPEs

The algorithm for generating the compressed kernel is shown in Listing 32.

Given problem size, calculate size of compressed kernel
compressed_kernel   LS_Rsv(size of compressed kernel)
nonexpanded_rows    LS_Rsv(4 * dimension)
expanded_rows       LS_Rsv(4 * dimension)
DMA_list_buffer     LS_Rsv(problem size)
norms               LS_Rsv(problem size)
vlxs                LS_Rsv(problem size)
outputs             LS_Rsv(problem size)
sparse_vec_buff     LS_Rsv(remaining space on LS)
Download DMA lists into DMA_list_buffer
Download norm and vlx data into norms and vlxs
Iterate over rows of blocks 4 at a time
    DMA.L in the next 4 sparse vectors into nonexpanded_rows
    Expand the sparse vectors into non-sparse form into expanded_rows
    Generate 4x4 block (don't need any more training vectors; this is a block on the diagonal)
    Iterate over the remaining sparse training vectors for this row (loop until all processed)
        DMA.L in as many sparse vectors as can fit into sparse_vec_buff
        Generate kernel matrix elements using expanded_rows, sparse_vec_buff, and outputs
    Repeat
Repeat

Listing 32: Kernel Generation on the SPEs

Essentially, the outer loop moves the generation sequence down the matrix and the inner loop moves from left to right. The outer loop grabs four training vectors at a time and
