High performance CUDA based CNN image processor

GEORGE VALENTIN STOICA, RADU DOGARU, ELENA CRISTINA STOICA
Department of Applied Electronics and Information Engineering
University Politehnica of Bucharest
1-3, Iuliu Maniu Blvd., Sector 6, Bucharest
ROMANIA
vstoica@yahoo.com, radu_d@ieee.org, ce_stoica@yahoo.com

Abstract: - Cellular neural networks (CNNs) have been adopted as a solution in various fields due to their powerful yet simple architecture. Practical implementations using VLSI or FPGA are very efficient but difficult to use in the development or simulation stages, when widespread, cost effective, easy to learn, high performance solutions are required. GPU and, more specifically, CUDA based simulators can provide the computing power required for developing, simulating and running CNNs. This paper investigates solutions to optimize the utilization of nvidia's Kepler architecture, achieving a performance of up to 9100 million cell iterations per second.

Key-Words: - CUDA enabled GPU, high performance CNN simulator, image processing

1 Introduction
Developing and simulating cellular neural networks helps in finding the right genes for specific problems or in discovering potential new applications. Speeding up the simulation is desirable, but it should come with minimal development and implementation costs. A CUDA enabled GPU seems the perfect solution: its massively parallel architecture matches the CNN architecture, its throughput-oriented design can supply the computing power required for running CNNs, and its compatibility with current programming languages (e.g. C, Python, Fortran) and with libraries or middleware (e.g. OpenACC, Matlab, OpenCL) eases the migration of applications from CPU to GPU platforms. The availability and cost of CUDA enabled GPUs also favor this route for implementing high performance, high productivity solutions [1], [2]. There is an increased interest in adopting CUDA as a high performance, high productivity platform; combined with the continuous development of CUDA enabled GPUs, this requires continuous research into finding efficient implementations for specific problems [3].
Previous work related to CNN implementations on GPUs uses earlier CUDA architectures (e.g. the Tesla or Fermi architectures), with notable results over CPU or dedicated image processing libraries (e.g. OpenCV), typical accelerations of 7-12x being obtained [4], [5], [9], [10]. Although the CNN specific data-parallel computation model fits the GPU architecture, high performance implementations must consider the GPU resources and their specific limitations. As visible in Fig. 1, there are some notable differences between the GPU and CPU architectures: smaller cache memory and simpler control units, which lead to a higher global memory access latency, and customizable memory types (shared memory, constant memory, texture memory, registers). This paper deals with such aspects and proposes a new implementation model for the discrete time CNN image processor on the CUDA platform, using the more recent nvidia Kepler architecture.

[Block diagram contrasting the CPU, with few large cores and a large cache, and the GPU, with many small cores.] Fig. 1: CPU vs. GPU architectural differences [6]

This paper analyses the implementation of a CNN image processor on nvidia's Kepler architecture. The discrete time CNN model, as described in [7], [8], is presented. Memory types (e.g. global memory, shared memory, texture memory) and access patterns are analyzed in order to find the optimal configuration for the implementation of the discrete CNN model. The memory access pattern of the CNN simulation makes this a memory-bandwidth bounded problem [5]. Specific techniques can be applied to improve the performance (e.g. the use of shared memory, memory cache and coalesced memory reads, as presented in [3]), but we cannot go beyond some limiting factors:
- it is desirable, but we cannot fit the whole image into the fast, low latency, on-chip shared memory;
- it is desirable, but we cannot pack all the reads from within a block of threads into a single coalesced memory read, because there is a limit of 128 bytes per read transaction, and even so there is still a significant 200-400 clock cycle read latency [6];
- it is desirable, but we cannot avoid global memory reads/writes for the cell states, since the initial image and the final result are placed in the global memory.
Our approach is to consider the compute to global memory access (GMA) ratio, defined as the number of floating-point calculations performed for each access to the global memory: by increasing this ratio we can improve the performance of the implementation model.
ISBN: 978-1-61804-329-0

2 The CNN image processor
The discrete time CNN model is described by the following equation [5]:

x_ij(t+1) = x_ij(t) + h·( Σ_{kl∈S_r(ij)} A(ij,kl)·y_kl(t) + Σ_{kl∈S_r(ij)} B(ij,kl)·u_kl(t) + z )   (1)

where x_ij(t) represents the state of cell ij at time t, kl is an element in the S_r(ij) neighborhood, A(ij,kl) and B(ij,kl) are the feed-back and feed-forward templates, u_kl(t) is the input image, z is the offset and y_kl(t) is the output, calculated according to the following formula:

y_ij = 0.5·( |x_ij + 1| − |x_ij − 1| )   (2)

or using the equivalent form:

y_ij = 1 for x_ij ≥ 1;  y_ij = x_ij for −1 < x_ij < 1;  y_ij = −1 for x_ij ≤ −1   (3)

Assuming that the input image is constant during the iterations (i.e. u_kl(t) = u_kl), (1) can be divided in two parts: the feed-forward part and the feed-back part. The feed-forward part must be computed only once, at the beginning of the iterative process:

g_ij = Σ_{kl∈S_r(ij)} B(ij,kl)·u_kl + z   (4)

The CNN process can then be expressed as follows:

x_ij(t+1) = x_ij(t) + h·( Σ_{kl∈S_r(ij)} A(ij,kl)·y_kl(t) + g_ij )   (5)

G(A,B,z) are called genes, and a specific combination of values for A, B and z determines the behavior of the CNN (i.e. a specific image filter): sharpening, softening, edge detection, threshold, dithering, etc.

2.1 The implementation model
Efficient GPU programming patterns are based on dividing the problem into a large number of threads, each thread executing the same code but on different data.
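Before mapping the model to CUDA threads, the per-cell update of (5) together with the output function (2) can be sketched as a plain, single-threaded C reference; on the GPU, the body of the inner (i, j) loop below is exactly the work assigned to one thread. The function names and the zero-padded border convention are illustrative assumptions, not the authors' code:

```c
#include <math.h>

/* Output function, eq. (2): y = 0.5*(|x+1| - |x-1|). */
static float cnn_output(float x) {
    return 0.5f * (fabsf(x + 1.0f) - fabsf(x - 1.0f));
}

/* One discrete-time CNN iteration, eq. (5):
 *   x(t+1) = x(t) + h*(sum over the 3x3 neighborhood of A*y + g).
 * States are stored with a one-cell zero border, so the 3x3 neighborhood
 * read never leaves the array; W and H include the border. */
static void cnn_iterate(const float *x, float *x_next, const float *g,
                        const float A[3][3], float h, int W, int H) {
    for (int i = 1; i < H - 1; i++)
        for (int j = 1; j < W - 1; j++) {
            float acc = 0.0f;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    /* one fused multiply-add per template element */
                    acc += A[di + 1][dj + 1]
                         * cnn_output(x[(i + di) * W + (j + dj)]);
            x_next[i * W + j] = x[i * W + j] + h * (acc + g[i * W + j]);
        }
}
```

Writing the new states into a separate buffer (x_next) mirrors the GPU situation, where all threads must read the state of iteration t before any of them publishes the state of iteration t+1.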
Rather than dividing the problem into a few large blocks, as customary in multithreaded CPU implementations, the GPU allows (and benefits from) computing each cell in a separate thread, thus obtaining thousands of threads that are efficiently managed by the GPU control unit. A two dimensional structure of blocks of threads processes a two dimensional region of cells.

[Flowcharts. CPU model: load image; normalize and compute g_ij; for each t < T compute x_ij(t) over all cells; denormalize; save image. GPU model: load image; for each t < T one kernel computes all cells in parallel, with synchronization between kernel calls; save image.] Fig. 2: CPU (sequential) and GPU (parallel) implementation models of the CNN

3 High performance CNN
The implementation model described in the previous section is based on an iterative process: in each iteration (t), each thread computes the new cell state based on the previous cell state, on its neighborhood states and on the corresponding feed-forward constant value. Based on (5), assuming a 3x3 A template and states stored in the global memory, each thread performs 3·3 reads and one write to the global memory, and executes 3·3 floating point multiplications and additions, each pair packed into a single FMA (fused multiply-add) operation, plus two additions. We can compute the GMA ratio for a single cell iteration as follows:

GMA_CNN-GM = (3·3 + 2) / (3·3 + 1) ≈ 1   (6)

In order to increase the GMA ratio, and thus the efficiency, we can reduce the number of global memory reads by organizing the memory reads at block level and splitting the cell iteration in two parts: each thread reads the corresponding cell state from the global memory and saves it into the shared memory and, after a synchronization point, each thread computes the new cell state reading data from the shared memory. This approach increases the GMA from about 1 to 5.5:

GMA_CNN-SM = (3·3 + 2) / 2   (7)

[3-D surface plot of execution time (us) versus horizontal and vertical block size (pixels).] Fig. 3: Execution time (T_GM) using global memory, for a 1024x1024 image size

Note that the reads from the shared memory are ignored while computing the GMA ratio, since the shared memory access has a much lower latency than reads from the global memory. Also note that (7) does not include the special case of border cells, when the corresponding thread must perform another read for the cell outside the block, or the case of corner cells, when two extra reads are required. In a similar way we can obtain the GMA ratio for the case when the data is placed into the texture memory (textures are also stored in the global memory):

GMA_CNN-TM = (3·3 + 2) / 2   (8)

We simulate the three cases described above on the same 1024x1024 pixels image. Working on a gray scale image, each pixel is one byte containing the gray level in the [0, 255] range of integer values. Before starting the iterative process described in Section 2.1, the data must be normalized, i.e. the [0, 255] integer pixel values are transformed into [-1.0, 1.0] floating point values corresponding to the initial state values. Also at this moment we can compute the constant g_ij according to (4).
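The normalization step and the feed-forward precompute of (4) just described can be sketched as follows; this is a minimal C sketch assuming a linear mapping between [0, 255] gray levels and [-1.0, 1.0] states, with illustrative function names:

```c
/* Map a [0, 255] gray level to a [-1.0, 1.0] state value
 * (a linear mapping is assumed): 0 -> -1.0, 255 -> 1.0. */
static float normalize_pixel(unsigned char p) {
    return (float)p / 127.5f - 1.0f;
}

/* Inverse mapping, used after the last iteration, with clamping. */
static unsigned char denormalize_state(float x) {
    float p = (x + 1.0f) * 127.5f;
    if (p < 0.0f)   p = 0.0f;
    if (p > 255.0f) p = 255.0f;
    return (unsigned char)(p + 0.5f);
}

/* Feed-forward constant of eq. (4): g = sum of B*u over the 3x3
 * neighborhood, plus z, computed only once before the iterations start.
 * u is the normalized input image stored with a one-cell zero border;
 * W and H include the border. */
static void cnn_feedforward(const float *u, float *g, const float B[3][3],
                            float z, int W, int H) {
    for (int i = 1; i < H - 1; i++)
        for (int j = 1; j < W - 1; j++) {
            float acc = z;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    acc += B[di + 1][dj + 1] * u[(i + di) * W + (j + dj)];
            g[i * W + j] = acc;
        }
}
```

Since the input image is constant, g needs to be written to global memory once and is then only read, which is what makes splitting (1) into (4) and (5) worthwhile.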
Separate CUDA kernels are executed by the GPU, using only global memory, global memory and shared memory, and texture memory and shared memory, respectively. The experimental results are focused on measuring the execution time on the GPU.

[3-D surface plot of execution time (us) versus horizontal and vertical block size (pixels).] Fig. 4: Execution time (T_SM) using global memory and shared memory, for a 1024x1024 image size

[3-D surface plot of execution time (us) versus horizontal and vertical block size (pixels).] Fig. 5: Execution time (T_TM) using texture memory, for a 1024x1024 image size
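Per thread, the three kernels differ mostly in how many accesses actually reach the global memory. A small C helper can tabulate the resulting GMA ratios; the cost counting used here (3·3 FMAs plus two additions per cell, one global write, and either 3·3 or a single global read) is our reading of the text, and the struct and function names are illustrative:

```c
/* Per-thread cost of one cell iteration of eq. (5) with a k x k template.
 * flops counts k*k fused multiply-adds plus two additions (each taken as
 * one floating-point calculation); reads/writes count only the accesses
 * that reach the global memory. */
typedef struct { int flops; int reads; int writes; } cell_cost;

/* GMA ratio: floating-point calculations per global memory access. */
static double gma_ratio(cell_cost c) {
    return (double)c.flops / (double)(c.reads + c.writes);
}

/* Global memory only: k*k neighbor reads and one state write. */
static cell_cost cost_global_only(int k) {
    cell_cost c = { k * k + 2, k * k, 1 };
    return c;
}

/* Global or texture memory staged through shared memory: one global read
 * and one global write per thread; shared memory traffic is not counted. */
static cell_cost cost_staged(int k) {
    cell_cost c = { k * k + 2, 1, 1 };
    return c;
}
```

For a 3x3 template this yields about 1.1 for the global-only kernel and 5.5 for the variants staged through shared memory, consistent with the measured gap between the kernels.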
As presented in Fig. 3, Fig. 4 and Fig. 5, there is a consistency between the calculated GMA and the measured execution time. Comparing (6) and (7), for example, we can notice that using the shared memory to store intermediate reads from the global memory produces a roughly five-fold increase of the GMA, which is confirmed by the experimental results presented in Fig. 3 and Fig. 4. Note that the execution time axis values in Fig. 3 are about 10 times bigger than the values presented in Fig. 4. Using shared memory combined with global memory or with texture memory produces similar results, but two important observations must be made. The first observation is that reads from the global memory can be coalesced, meaning that reads from threads within a block of threads are packed into a single transaction if they are made on consecutive bytes of memory. By convention, in our experiments the two dimensional image is stored into the global memory in a row major configuration. In this case the reads of a horizontal block of threads are packed into a single transaction and the memory access latency is reduced over the case of the vertical block configuration [3]. The second observation is that texture reads are not coalesced, but can benefit from locality optimizations: reads are faster if neighboring memory locations are accessed. Fig. 5 shows that, irrespective of the use of vertical, square or horizontal blocks, the execution time is consistent when compared with the access patterns from the global memory, as presented in Fig. 3 or Fig. 4. A deeper investigation shows that the best performance is obtained when using horizontal blocks of threads and global memory, as presented in Table 1.

Table 1: Execution time comparison for shared memory (T_SM) and texture memory (T_TM), depending on image size and block configuration

Execution time (us)      vertical   square (32x32)   horizontal
512x512     T_SM             47          62               27
            T_TM             75          57                7
1024x1024   T_SM            362         444              783
            T_TM             45         473              542
2048x2048   T_SM            369         324
            T_TM             89         683              276
4096x4096   T_SM           5584        6296               85
            T_TM           5869         668             7925

3.1 The new model
A new approach is proposed in order to further improve the GMA ratio. Analyzing the existing CNN implementation model, there is an incremental process in which each iteration is computed in one step and consists of the following operations: reading the current state from the memory (global memory or texture memory), computing the new state, saving the new state back into the memory and synchronizing among all the blocks of threads, as described in Fig. 6.

[Flowchart: read from global/texture memory; save to shared memory; synchronize between the threads within a block; compute the new state; write to global memory; synchronize between kernel calls.] Fig. 6: One iteration per step

Increasing the GMA ratio by reducing the global memory operations can be obtained if we combine more iterations into a single step. One iteration per step must read the states of the block plus one outer layer of neighborhood cells, and compute the new state only for the cells within the block. Two iterations per step must read the states of the block plus two outer layers of neighborhood cells, and compute two iterations for the cells within the block plus one iteration for the cells in the first outer layer, as presented in Fig. 7. More iterations can be performed in a single step, with additional layers to be read and computed.

[Diagram: an MxN block of cells surrounded by one, respectively two, outer neighborhood layers.] Fig. 7: One and two iterations per step

The GMA ratio at block level, for one iteration per step and an MxN block, can be calculated as follows:
GMA_Block-1IpS = (3·3·M·N) / (M·N + 2·(M+2) + 2·N)   (9)

In a similar way, the GMA ratio per iteration in the case of two iterations per step can be calculated:

GMA_Block-2IpS = (2·3·3·M·N + 3·3·(2·(M+2) + 2·N)) / (2·(M·N + 2·(M+2) + 2·N + 2·(M+4) + 2·(N+2)))   (10)

Assuming a CNN process consisting of T iterations, using two iterations per step only T/2 steps are required, so fewer memory reads from and writes to the global memory produce a better GMA ratio, which impacts the execution time, as presented in Table 2.

Table 2: Execution time for one, two and four iterations per step, for a 1024x1024 pixels image and T=200 iterations

                                               One it./step   Two it./step   Four it./step
Execution time (ms)                                 63             38             23
Speed-up                                             1              1.67           2.73
Execution time per cell iteration T_cit (ns)         0.3            0.18           0.11
Cell iterations/s (x10^6)                         3330           5570           9100

4 Conclusion
CUDA enabled GPU platforms provide developers with a different architecture when compared with the traditional CPU. Highly parallel computing power, a large but high latency global memory, low latency but limited cache and shared memory, and a locality optimized, cached texture memory can be efficiently used and combined to implement high performance algorithms.
Measurements were performed on the following hardware/software configuration: Windows 7 32-bit operating system, nvidia CUDA Toolkit v5.5, Intel Core 2 Duo E6320 CPU running at 1.86 GHz, 2 GB DDR2 DRAM, nvidia GeForce GTX 650 Ti Boost GPU using the Kepler architecture (compute capability 3.0), 768 cores in four multiprocessors with a 980 MHz base clock, 1 GB GDDR5 DRAM with 144.2 GB/s bandwidth.

References:
[1] R. Dogaru, I. Dogaru, High Productivity Cellular Neural Network Implementation on GPU using Python, Proceedings of the Workshop on Information Technology and Bionics, Symposium in Memory of Tamas Roska, Budapest, Hungary, 23-24 June, 2015, pp. 23-27.
[2] R. Dogaru, I. Dogaru, A Low Cost High Performance Computing Platform for Cellular Nonlinear Networks using Python for CUDA, 20th International Conference on Control Systems and Computer Science, 2015, pp. 593-598.
[3] G.V. Stoica, R. Dogaru, C.E.
Stoica, Speeding-up Image Processing in Reaction-Diffusion Cellular Neural Networks using CUDA enabled GPU Platforms, International Conference on Electronics, Computers and Artificial Intelligence, Bucharest, Oct. 2014, Vol. 2, pp. 39-42.
[4] K.V. Kalgin, Implementation of algorithms with a fine-grained parallelism on GPUs, Numerical Analysis and Applications, Vol. 4, No. 1, 2011, pp. 46-55.
[5] E. Laszlo, P. Szolgay and Z. Nagy, Analysis of a GPU based CNN implementation, 13th International Workshop on Cellular Nanoscale Networks and Their Applications (CNNA), Turin, Aug. 29-31, 2012.
[6] CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[7] T. Roska and L.O. Chua, The CNN universal machine: an analogic array computer, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 40, No. 3, 1993, pp. 163-173.
[8] L.O. Chua and L. Yang, Cellular Neural Networks: Theory, IEEE Transactions on Circuits and Systems, Vol. 35, No. 10, 1988, pp. 1257-1272.
[9] R. Dolan and G. DeSouza, GPU-Based Simulation of Cellular Neural Networks for Image Processing, Proceedings of the International Joint Conference on Neural Networks, Atlanta, Georgia, USA, 2009, pp. 730-735.
[10] S. Potluri, A. Fasih, L.K. Vutukuru, F. Al Machot, K. Kyamakya, CNN Based High Performance Computing for Real Time Image Processing on GPU, Nonlinear Dynamics and Synchronization (INDS) & 16th Int'l Symposium on Theoretical Electrical Engineering (ISTET), Klagenfurt, Austria, 2011, pp. 1-7.