Matrix-Matrix Multiplication Using Systolic Array Architecture in Bluespec

Team SegFault: Chaitanya Peddawad (EE11B096), Aman Goel (EE11B087), Dheeraj B (EE11B090)
Oct. 25, 2015

1 Theoretical Background

1.1 Matrix-Matrix Multiplication on Hardware

Computing matrix products is both a central operation in many numerical algorithms and potentially time consuming, making it one of the most well-studied problems in numerical computing. Various algorithms have been devised for computing C = AB, especially for large matrices. Mapping such algorithms to custom or general-purpose hardware architectures is always a challenging task. With custom or ASIC hardware, matrix-matrix multiplication can be implemented using the least resources and can be accelerated to a large extent. Mapping the same algorithms onto general-purpose hardware, for example a general-purpose Xilinx FPGA board, always has inherent trade-offs such as area (power), time (maximum operating clock frequency), latency, hardware utilization efficiency and so on. The realistic way to compare two solutions is to assign weights to each of these factors and choose a solution among the multiple possible Pareto-optimal solutions.

1.2 Systolic Array Architecture

Systolic architectures (also referred to as systolic arrays) represent a network of processing elements (PEs) that rhythmically compute and pass data through the system. These PEs regularly pump data in and out such that a regular flow of data is maintained [1], [2]. As a result, systolic systems feature two important properties for VLSI design:

Modularity: The various functional blocks which make up the larger system have well-defined functions and interfaces. Hence, the concept of modularity enables the parallelisation of the design process.

Regularity: Hierarchical decomposition of a large system results in not only simple, but also similar blocks, as much as possible.

The systolic array may be used as a coprocessor in combination with a host computer, where the data samples received from the host computer pass through the PEs and the final result is returned to the host computer (see Fig. 1).
This operation is analogous to the flow of blood through the heart, thus the name systolic. Typically, all the PEs in a systolic array are uniform and fully pipelined, i.e., all communicating edges among the PEs contain delay elements, and the whole system usually contains only local interconnections [3]. However, some relaxations have been introduced to increase the utility of systolic arrays. These relaxations include the use of not only local but also neighbor (near, but not nearest) interconnections, the use of data broadcast operations, and the use of different PEs in the system, especially at the boundaries. With these relaxations, a family of modular, regular, and efficient data-driven array architectures can be designed for DSP applications, one of which is matrix-matrix multiplication.

CS6230: CAD for VLSI, Systolic Array & Bluespec

Figure 1: Basic principle of a systolic system. (Blocks shown: Host Processor, The Systolic Array.)

1.3 Systolic Array Design Methodology

We use the systolic architecture design methodology in which many systolic architectures can be designed for any given regular iterative algorithm using linear mapping or projection techniques. The dependency graph (DG) corresponds to a space representation where no time instance is assigned to any computation. Typically this corresponds to a t = 0 plane. The mapping technique transforms a space representation into a space-time representation where each node is mapped to a certain processing element and is scheduled to a certain time instance. The systolic design methodology that we adopt here maps a 3-dimensional DG to a 1D or 2D systolic architecture. We now define the basis vectors involved in the systolic array design:

Projection vector (also called iteration vector), $d = [d_1 \; d_2]^T$: Two nodes that are displaced by d or multiples of d are executed by the same processor.

Processor space vector, $p^T = [p_1 \; p_2]$: Any node with index $I^T = [i \; j]$ would be executed by processor $p^T I = [p_1 \; p_2] \, [i \; j]^T$.

Scheduling vector, $s^T = [s_1 \; s_2]$: Any node with index I would be executed at time $s^T I$.

Hardware Utilization Efficiency, $HUE = 1 / |s^T d|$: This is because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.

These vectors must satisfy the feasibility constraints stated below:

The processor space vector and the projection vector must be orthogonal to each other. If points A and B differ by the projection vector, i.e., $I_A - I_B$ is the same as d, then they must be executed by the same processor. In other words, $p^T I_A = p^T I_B$. This leads to $p^T (I_A - I_B) = 0 \Rightarrow p^T d = 0$.

If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e., $s^T I_A \neq s^T I_B$, i.e., $s^T d \neq 0$.

Edge mapping: If an edge e exists in the space representation or DG, then an edge $p^T e$ is introduced in the systolic array with $s^T e$ delays.

Given two matrices A and B, we can denote their product as C = AB, where A, B and C are n × n matrices.
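For n = 2, we have

\[
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
=
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\]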

Expanding the product element by element gives

\[
\begin{aligned}
c_{11} &= a_{11} b_{11} + a_{12} b_{21} \\
c_{12} &= a_{11} b_{12} + a_{12} b_{22} \\
c_{21} &= a_{21} b_{11} + a_{22} b_{21} \\
c_{22} &= a_{21} b_{12} + a_{22} b_{22}
\end{aligned}
\]

These equations can be represented in a space representation as shown in Fig. 2.

Figure 2: Systolic array architecture of the matrix product computation.

From the space diagram, we can write the iteration in standard output regular iterative algorithm (RIA) form as follows:

\[
\begin{aligned}
a(i, j, k) &= a(i, j-1, k) \\
b(i, j, k) &= b(i-1, j, k) \\
c(i, j, k) &= c(i, j, k-1) + a(i, j, k) \, b(i, j, k)
\end{aligned}
\]

With linear mapping, this 3D space representation is mapped onto 2D space to design 2D systolic arrays for matrix-matrix multiplication. With different choices of the projection vector (d), processor space vector ($p^T$) and scheduling vector ($s^T$) that satisfy the feasibility constraints, we get different edge mappings and hence different systolic array architectures. Some of the different possible solutions are derived in Tab. 1. The systolic array architectures Arch 1 and Arch 2, using general processor elements, are drawn in Fig. 3.

Table 1: Different solutions to systolic array architecture. For each of Arch 1 to Arch 4, the table lists $s^T$, $p^T$, d, and the edge mappings $p^T e$ and $s^T e$ of the fundamental edges a(0, 1, 0), b(1, 0, 0) and c(0, 0, 1).
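The feasibility constraints and the edge-mapping rule can be checked mechanically for any candidate design. The sketch below is a minimal illustration in plain Python, not part of the Bluespec implementation. The particular values of `d`, `P_T` and `s_T` are an assumed textbook-style choice and are not taken from Table 1. For a 3D-to-2D mapping the processor space vector is written here as a 2 × 3 matrix `P_T`, which is also an assumption of this sketch. The helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(u, v))


def mat_vec(M, v):
    """Multiply a matrix (tuple of rows) by a vector."""
    return tuple(dot(row, v) for row in M)


def is_feasible(P_T, s_T, d):
    """Check p^T d = 0 (orthogonality) and s^T d != 0 (no time clash)."""
    return all(x == 0 for x in mat_vec(P_T, d)) and dot(s_T, d) != 0


def hue(s_T, d):
    """Hardware utilization efficiency, HUE = 1 / |s^T d|."""
    return 1 / abs(dot(s_T, d))


def map_edges(P_T, s_T, edges):
    """Map each DG edge e to (p^T e, s^T e): array edge and its delay count."""
    return {name: (mat_vec(P_T, e), dot(s_T, e)) for name, e in edges.items()}


# Fundamental edges of the matrix-product RIA: a(0,1,0), b(1,0,0), c(0,0,1).
EDGES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}

# Assumed design choice (illustrative only, not read from Table 1).
d = (0, 0, 1)                    # projection vector
P_T = ((1, 0, 0), (0, 1, 0))     # processor space matrix
s_T = (1, 1, 1)                  # scheduling vector

if __name__ == "__main__":
    print("feasible:", is_feasible(P_T, s_T, d))
    print("HUE:", hue(s_T, d))
    for name, (array_edge, delays) in map_edges(P_T, s_T, EDGES).items():
        print(name, "->", array_edge, "with", delays, "delay(s)")
```

Under this assumed choice the c edge maps to the zero vector, which means each partial sum stays inside one PE while a and b move between neighbouring PEs.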

Figure 3: Two-dimensional systolic array for matrix-matrix multiplication. (a) Arch 1. (b) Arch 2.

2 Our Implementation: Design & Evaluation of Different Architectures

We implemented 4 different solutions for matrix-matrix multiplication, ranging from a single processor element to a 2D array of processor elements. The design supports multiple matrix-matrix multiplications in a single stretch in a continuous fashion. We coded and simulated all four solutions in Bluespec to verify the accuracy of each of them. We further synthesized the solutions to compare the trade-offs each of them faces in hardware in terms of maximum operating clock frequency (change of critical path), area (hardware utilization) and theoretical throughput. The solutions are stated below:

Solution 1: Perform matrix-matrix multiplication using a 2D systolic array of processor elements.
Solution 2: Perform matrix-matrix multiplication using a linear array of processor elements.
Solution 3: Perform matrix-matrix multiplication using a single processor element.
Solution 4: Perform matrix-matrix multiplication using a linear bidirectional 2D systolic array of processor elements.

Note that the synthesis results were obtained for 5 × 5 matrices, performing 4 such multiplications one after the other in a single run.

2.1 Solution 1

This solution is implemented using a 2D systolic array of processor elements, which is nothing but a systolic array of sequential multiply and accumulate (MAC) units. In one step, $K^2$ MAC units perform $K^2$ multiplications of two numbers $a_{ik}$ and $b_{kj}$, with accumulation if applicable. But since the outputs propagate from one PE to another, the number of steps required for the multiplication of two K × K matrices to complete is $3K - 2$. The architecture is shown in Fig. 4.

Synthesis Results
Slice logic utilization:
Number of slice registers: 508/126800 = 4%
Number of slice LUTs: 5632/63400 = 8%
Number used as logic: 5632/63400 = 8%
Minimum period: 13.008 ns (Maximum frequency: 76.876 MHz)
Critical path: from matnum 6 (FF) to out 66 (FF), delay = 13.008 ns, levels of logic = 23

Figure 4: Architecture for solution 1: K = 3. (a) Solution 1. (b) Operation of MAC unit: $c_n = c_{n-1} + ab$.

2.2 Solution 2

This solution is implemented using a linear array of K processor elements, which is nothing but an array of sequential multiply and accumulate (MAC) units. In one step, K MAC units perform K multiplications of two numbers $a_{ik}$ and $b_{kj}$, with accumulation if applicable. Hence, for multiplication of two K × K matrices, the number of steps required is $K^2$. The architecture is shown in Fig. 5.

Figure 5: Architecture for solution 2: K = 3.

Synthesis Results
Slice logic utilization:
Number of slice registers: 3689/126800 = 2%
Number of slice LUTs: 16986/63400 = 26%
Number used as logic: 16986/63400 = 26%
Minimum period: 12.008 ns (Maximum frequency: 83.278 MHz)
Critical path: from matnum 6 (FF) to out 0 (FF), delay = 12.008 ns
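The $3K - 2$ step count quoted for the 2D array of Solution 1 can be reproduced with a small cycle-level software model. The sketch below is written in Python rather than Bluespec and is not the authors' design. It assumes that each PE keeps its own accumulator, that a values move one PE to the right and b values one PE down per step, and that row i and column j of the inputs are delayed by i and j steps. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Model a K x K array of MAC units; return (C, number_of_steps).

    Assumed dataflow: a moves right, b moves down, c accumulates in place.
    """
    K = len(A)
    a_reg = [[0] * K for _ in range(K)]   # a operand held by each PE
    b_reg = [[0] * K for _ in range(K)]   # b operand held by each PE
    acc = [[0] * K for _ in range(K)]     # per-PE accumulator (c)
    steps = 3 * K - 2

    for t in range(steps):
        new_a = [[0] * K for _ in range(K)]
        new_b = [[0] * K for _ in range(K)]
        for i in range(K):
            for j in range(K):
                if j == 0:
                    # Left boundary: row i of A enters, skewed by i steps.
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    # Top boundary: column j of B enters, skewed by j steps.
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b

        # Every PE performs one multiply-accumulate per step.
        for i in range(K):
            for j in range(K):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, steps


if __name__ == "__main__":
    C, n_steps = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, n_steps)
```

In this model PE (i, j) sees the operand pair with index k = t - i - j at step t, so the last PE finishes its final product at step t = 3K - 3, giving 3K - 2 steps in total.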

2.3 Solution 3

This solution is implemented using only one processor element, which is nothing but a sequential multiply and accumulate (MAC) unit. In one step, the MAC unit performs a single multiplication of two numbers $a_{ik}$ and $b_{kj}$, with accumulation if applicable. Hence, for multiplication of two K × K matrices, the number of steps required is $K^3$. The architecture is shown in Fig. 6.

Figure 6: Architecture for solution 3.

Synthesis Results
Slice logic utilization:
Number of slice registers: 3333/126800 = 2%
Number of slice LUTs: 4798/63400 = 7%
Number used as logic: 4798/63400 = 7%
Minimum period: 10.938 ns (Maximum frequency: 91.424 MHz)
Critical path: from matnum 6 (FF) to out 0 (FF), delay = 10.938 ns, levels of logic = 10

2.4 Solution 4

This solution is implemented using K Bidirectional Linear Systolic Arrays (BLSA) of combinational PEs, each of which takes 3 inputs a, b and c and produces the output c + ab, as described in [4]. Note that combinational PEs are used for the implementation and for performance comparison against the sequential counterpart that has been implemented and used in the first 3 solutions. It can be shown that the number of steps required for the multiplication of two K × K matrices to complete is $3K - 2$. Note that the K columns of the output matrix are computed simultaneously, with each linear array computing one column. Hence K BLSA structures are repeated regularly and work independently of each other. The architecture is shown in Fig. 7.

Figure 7: Architecture for solution 4: K = 3. PE operation: $c_{out} = c_{in} + ab$.
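The step counts quoted for the four solutions can be tabulated for any K to see how they scale. The short Python sketch below only evaluates the formulas stated in the text ($3K - 2$, $K^2$, $K^3$, $3K - 2$) and takes throughput as the reciprocal of the step count. The function names are hypothetical, and clock frequency is deliberately left out of this estimate.

```python
def steps_per_product(K):
    """Number of steps each solution needs for one K x K matrix product."""
    return {
        "solution 1": 3 * K - 2,   # 2D systolic array of MAC units
        "solution 2": K ** 2,      # linear array of K MAC units
        "solution 3": K ** 3,      # single MAC unit
        "solution 4": 3 * K - 2,   # K bidirectional linear systolic arrays
    }


def throughput(K):
    """Products per step, i.e. 1 / (number of steps)."""
    return {name: 1 / n for name, n in steps_per_product(K).items()}


if __name__ == "__main__":
    for name, n in steps_per_product(5).items():
        print(f"{name}: {n} steps, throughput {1 / n:.3f}")
```

For K = 5 this gives 13, 25, 125 and 13 steps respectively.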

Synthesis Results
Slice logic utilization:
Number of slice registers: 1207/126800 = 0.95%
Number of slice LUTs: 647/63400 = 1%
Number used as logic: 647/63400 = 1%
Minimum period: 9.374 ns (Maximum frequency: 106.657 MHz)
Critical path: from mas 0 /Mmult ans a MUL ans d (DSP) to reg 0 3 (FF), delay = 9.374 ns, levels of logic = 8

2.5 Pareto Curve: Face-off Between Solutions

Architecture   Slice reg. utilization   Logic LUT utilization   Clock frequency   Throughput (no. of steps)
Solution 1     4%                       8%                      76.87 MHz         3K - 2
Solution 2     2%                       26%                     83.27 MHz         K^2
Solution 3     2%                       7%                      91.42 MHz         K^3
Solution 4     0.95%                    1%                      106.67 MHz        3K - 2

Table 2: Trade-offs in different solutions (K = 5)

The trade-offs are clear from Tab. 2. Note that solution 1 uses custom sequential MAC units while solution 4 uses default combinational processor elements, hence the significant difference in hardware utilization and clock period. Solutions 1-3 use sequential PEs and are therefore included in the Pareto curve, where we evaluate the solutions based on the trade-off between hardware utilization and throughput, or between clock frequency and throughput. Either way, all three are Pareto optimal. The Pareto curve is drawn in Fig. 8.

Figure 8: Pareto curve (not to scale). Vertical axis: clock frequency (MHz). Horizontal axis: throughput (1 / no. of steps) for K = 5. Plotted points: solution 3 at (0.008, 91.4), solution 2 at (0.04, 83.2), solution 1 at (0.077, 76.8).

3 Conclusion

We have explored four different solutions to implement matrix-matrix multiplication. The selection of a solution for a particular application depends on several factors, such as the dimensions of the matrices and the constraints on throughput, resource utilization and makespan demanded by the application. In general, we can say that for smaller matrices with a stringent hardware utilization constraint, solution 3 performs well. On the other hand, to get better throughput at the cost of extra hardware, solution 1 is a better choice than solution 3. Solution 2 takes a middle ground between solution 1 and solution 3. However, we can see that the best results are obtained when the processing element is implemented using combinational logic, as in solution 4. Similar results can also be obtained from solution 1 if the basic PE is combinational. While solution 4 demands more complicated scheduling of the inputs, solution 1 requires only a simple sequencing of inputs and can thus lead to interesting results if its basic PE is implemented using combinational logic.

References

[1] H. T. Kung and C. E. Leiserson, Systolic arrays (for VLSI), Sparse Matrix Symposium, SIAM, pp. 256-282, 1978.
[2] H. T. Kung, Why systolic architectures?, IEEE Computers Magazine, vol. 15, pp. 37-45, Jan. 1982.
[3] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[4] E. I. Milovanović et al., Matrix Multiplication on Linear Bidirectional Systolic Arrays, Ser. A: Appl. Math. Inform. and Mech., vol. 2, no. 1, pp. 11-20, 2010.