Speedup of Type-1 Fuzzy Logic Systems on Graphics Processing Units Using CUDA

Speedup of Type-1 Fuzzy Logc Systems on Graphcs Processng Unts Usng CUDA Durlabh Chauhan 1, Satvr Sngh 2, Sarabjeet Sngh 3 and Vjay Kumar Banga 4 1,2 Department of Electroncs & Communcaton Engneerng, SBS State Techncal Campus, Ferozepur 152004, (Punjab) Inda 3 Department of Computer Scence & Engneerng, SBS State Techncal Campus, Ferozepur 152004, (Punjab) Inda 4 Department of Electroncs & Communcaton Engneerng, Amrtsar College of Engg. & Tech., Amrtsar 143001, (Punjab) Inda E-mal: 1 er.durlabh@gmal.com, 2 drsatvr.n@gmal.com, 3 sarabjeet_sngh13@yahoo.co, 4 v_banga@redffmal.com Abstract Parallelcomputng s one of sgnfcant components of the Hgh Performance Computng (HPC) and s beng used to solve problems, whch are large and complex n nature. Fuzzy Logc System (FLS) s a problem that becomes computatonally ntensve wth ncrease n number of nputs and/or fuzzy rules. Runnng an FLS s hghly parallel n nature, therefore, can be mplemented n parallel on GPU usng CUDA. In ths paper, varous fuzzy computatons vz. rule frng, mplcaton, aggregaton and defuzzfcaton are performed n parallel. Multple threads are run at a tme to fre multple fuzzy rules, smultaneously, that reduces the overall FLS executon tme. It s observed from smulaton results that GPU works faster as compared to CPU when ether number of nputs s ncreased or number of fuzzy rules. Keywords: Hgh Performance Computng, Fuzzy Logc Systems, General Purpose Computng on Graphcs Processng Unt, Compute Unfed Desgn Archtecture I. INTRODUCTION Artfcal Intellgence (AI) undoubtedly s the backbone of the emergng technologcal advancements, however, mplementaton nvolves ntensve computaton owng to ncreased data szes. Methods to mprove the runtme performance construe the areas of research to reduce mathematcal computatons and parallelzaton of algorthm n hardware. Graphcs Processor Unts (GPU), plays a major role n gamng and graphcs applcatons and allows for general purpose programmng on remarkably fast parallel hardware usng a Sngle Instructon Multple Data (SIMD) programmng archtecture. Fuzzy Logc, ntroduced by Zadeh [1], [2], s one of the exgent part of AI and possess nherent parallel nature. General Purpose computng on GPU (GPGPU) s targeted by many researchers to speed up complcated algorthms especally, AI algorthms and ther applcatons [3]. Ander-son et al. presented a GPU soluton for the fuzzy C-means clusterng algorthms [4]. Earler ths soluton used OpenGL and Cg (graphcs lbrares) to acheve approxmately two folds of computatonal speedup for some clusterng profles usng NVIDIA 8800 GPU. They later generalzed the system for the use of non- Eucldean metrcs [5]. Further, Sejun Km descrbes the method used to adapt a multlayer tree structure composed of fuzzy adaptve unts nto CUDA (Compute Unfed Devce Archtecture) platforms [6]. In [7], Chosa and Kolb present a framework for mesh clusterng solely mplemented on the GPU wth a new generc multlevel clusterng technque. Cha et al., have proposed the mplementaton of a zero-order TSK-Fuzzy Neural Network (FNN) on GPUs to reduce tranng tme n [8]. Harvey et al., have presented a GPU soluton for fuzzy nference system n [9]. Anderson et al., present a parallel mplementaton of fuzzy nference on GPU usng CUDA n[10]. Two folds of speed mprovement of ths naturally parallel algorthm have been acheved under typcal nference profles. One problem wth ths system and the mplementaton on GPU s that they both rely upon OpenGL and Cg lbrares, whch makes system generalzaton dffcult for new comers to GPGPU. Further, Ngo et al., report an mplementaton of Interval Type-2 FLS on GPU usng CUDA wth a tremendous speedup of 30 folds n [11]. In ths paper, parallel functonalty wth CUDA programmng model, for parallel mplementaton of a Type-1 Sugeno FLS on GPU, s nvestgated wth ncreased number of nputs and fuzzy rules. After ths bref hstorcal background rest of ths paper s outlned as follows: Secton II targets CUDA programmng model for GPGPU. Secton III ntroduces Type 1 FLS along wth the scope of parallelsm wherever possble. Secton IV presents smulaton results and a speedup comparson of seral and parallel computng usng CPU and GPU, respectvely. Fnally, secton V concludes the research workand presents future off-shoots. II. CUDA PROGRAMMING MODEL CUDA s a parallel computng platform and programmng model ntroduced by NVIDIA that ncreases computng performance substantally by harnessng the power of parallelsm of the GPU. CUDA gves program developers the drect access to the vrtual nstructon set and memory of the parallel computatonal elements n CUDA enabled GPUs. The CUDA platform s accessble to software developers through CUDA accelerated lbrares, compler drectves (such as Open ACC), and extensons to ndustry-standard programmng languages, ncludng C, C++ and FORTRAN. C/C++ programmers use

Internatonal Conference on Communcaton, Computng & Systems (ICCCS 2014) CUDAC/C++, compled wth nvcc whch s NVIDIA S LLVM-based C/C++ compler. A C/C++ program usng CUDA can nterface wth one GPU or multple GPUs and can be dentfed and utlzed n parallel, allowng for unprecedented processng power on desktop computers. CUDA allows multple kernels to be run smultaneously on GPU cores. CUDA refers to each kernel as a grd. A grd s a collecton of blocks. Each block runs the same kernel, however, s ndependent of each other (ths has sgnfcance n terms of access to memory types). A block contans threads, whch are the smallest dvsble unt on a GPU. A thread block s a number of SIMD threads that work on core at a gven tme. Threads can exchange nformaton through the shared memory and can be synchronzed. The operatons are systematzed as a grd of thread blocks. For parallel operaton the programmng model allows a developer to partton a program nto several subprograms, each of whch s executed ndependently on a block. Each subprogram can be further dvded nto fner peces that perform the same functon but execute on dfferent threads ndependently wthn the same block. For data set parallelsm, data sets can be dvded n to smaller chunks that are stored n the shared memory, and each chunk s vsble to all threads of the same block. Ths local data arrangement approach reduces the need to access off-chp global memory, whch reduces data access tme. memory reads and wrtes. Ths s typcally acheved by havng each thread read ts data from global memory and store ts content nto shared memory. The basc structure of a CUDA code comprses of allocaton of memory space (usng cuda Malloc functon) on devce (GPU) and (usng regular malloc functon) on host (CPU). Data whch s coped from the host to the devce for the call of kernel routne to be executed on the GPU (usng functon cuda Memcpy) also defnes the number of threads and ther physcal structure. Kernel s prefxed wth the global keyword. Results are transferred from GPU to CPU n the same fashon as data s coped from host to devce. III. SCOPE OF PARALLELISM IN TYPE-1 FLS Theory of FLS gven by Zadeh a fuzzy set s defned for a partcular doman, and t s characterzed by a membershp functon that maps elements from the doman to a real valued numbers [1], [2]. Mendel and many researchers gave numerous methods to desgn and mplement FLS for varous applcatons [12] [15]. Theory of FLS gven by Zadeh a fuzzy set s defned for a partcular doman, and t s characterzed by a membershp functon that maps elements from the doman to a real valued numbers. Fg. 1 CUDA Archtecture The next crtcal component of a CUDA applcaton s the memory model. There are multple types of memory and each has dfferent access tmes. The GPU s broken up nto read-wrte per thread regsters, read-wrte per thread local memory, read-wrte per-block shared memory, read-wrte per-grd global memory, read-only per-grd constant memory, and read-only per-grd texture memory. Texture and constant memory have relatvely small access latency tmes, whle global memory has the largest access latency tme. Applcatons should mnmze the number of global Fg. 2 CUDA Process Flow In ths paper every nput has been fuzzfed usng four arbtrary located Gaussan membershp functons that are characterzed by two parameters,.e., mean (m) and standard devaton (). Gaussan membershp functon s symmetrcal about ts mean and s expressed mathematcally as (1) ( x) ( x m) exp 2 2 2 (1) In ths paper, an FLS wth 4 nputs and 1 output s subjectedto parallel computaton for forecastng of Mackey-Glass tmeseres [14]. For mplcaton and aggregaton, mn and maxoperator are nvestgated. The defuzzfcaton s performedwth the heght defuzzfcaton method as symmetrcal shaped Gaussan fuzzy sets have been used to fuzzfy consequent.output of the heght defuzzfer s gven by (2) 268

Speedup of Type-1 Fuzzy Logc Systems on Graphcs Processng Unts Usng CUDA y M 1 M C ( x ) 1 ( x ) Here C represents the locaton of sngleton consequents fuzzy sets and (x ) represent clppng level for each rule after mplcaton and M represents total number of fuzzy rules. However, for mplementaton of the FLS defned above n CUDA, the foremost and the most crucal step s to allocate the memory space for the data sets to be used n the system,as the choce and format of data affects performance of thealgorthm. Scope of possble parallel computatonal processng s dscussed as follows: Harvey et al., [9] and Anderson et al., [10] n ther respectve work have gven a vast scope of parallelsm for Type-1 FLS. In the same fashon Ngo et al., [11] presented a novel scope of parallelsm for Interval Type-2 FLS. However, n all these mplementatons the emphass was lad to parallelze the number of fuzzy rules and dscrete levels for a typcal FLS wth two nputs and sngle output. So, a sngle FLS was computed wth parallel rule nference on GPU usng CUDA. However, a typcal FLS can be processed multple tmes wth fxed fuzzy rules and dscrete levels but varyng nputs. Here les our scope of parallelsm, we construe our code to compute multple FLS n parallel on GPU as a FLS serally on CPU wll consume more tme. IV. PARALLEL I MPLEMENTATION Two M N dmensonal Mean and Sgma matrces are used n ths mplementaton where M denotes number of fuzzy rules and N s number of antecedents. These matrces hold mean and sgma values of Gaussan membershp functon n accordance wth fuzzy rules used n the FLS. An M 1 dmensonal Consequent matrx contans only mean values of the consequent fuzzy sets as systems uses heght defuzzfcatonmethod gven by equaton (2). Multple nputs are provded to the systems collectvely n the form of an L N dmensonal nput matrx, X, where L s the number of nputs.the CPU executes rule frng sequentally wth a sngle nput at a tme and causes more computatonal tme. Whereas, multple threads are run at the same tme to fre multple rules smultaneously usng system matrces,.e., Mean, Sgma and Consequent matrces,whch reduce the executon tme for the FLS. A kernel functon s ntalzed from CPU to pass nputs n parallel to varous GPU cores. Copyng system matrces everytme along wth nput vectors and ncreases the GPU processng tme. Therefore, system matrces are coped only once and subsequently requre only nput vectors those are passed to GPU n parallel to enhance the GPU performance. (2) V. RESULTS DISCUSSION The speed up performance wth GPU mplementaton of a Type-1 Sugeno FLS was compared wth that of CPU mplementaton. Intel Core 2 Duo system under expermentaton has 2GB of system RAM, and Wndows 7 platform. The GPU used here works on nvidiageforce GTX 650 wth 1024 MB of texture memory, 192 stream processors, and PCI Express X16. The number of antecedents s fxed to 4 and consequentto 1, the number of rules s vared between 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 and nputs were vared as 128, 256, 512, and 1024. A set of nput data s obtaned from Mackey-Glass tme seres for expermentaton. Rato of CPU to GPU runtmes wth respect to number of fuzzy rules has been tabulated n Table I and presented graphcallyn Fg. 3. TABLE I FUZZY RULES VS. CPU TO GPU SPEEDUP RATIO CPU to GPU Runtme Speedup Rato Number of Fuzzy 128 256 512 1024 Rules Inputs Inputs Inputs Inputs 10 1.937 4.875 5.034 5.488 20 2.437 3.032 4.548 6.132 30 1.340 2.319 4.319 6.544 40 2.000 2.476 3.968 7.063 50 1.516 3.532 4.410 7.063 60 1.730 3.000 4.365 7.185 70 1.602 3.205 4.509 7.134 80 1.500 2.742 4.500 7.832 90 1.709 2.577 4.446 5.488 100 1.624 2.504 4.652 6.132 Here, t can be observed clearly that advantage of GPU computng has ncreased wth ncreased nput data vector szes. In another smulaton results, t s observed that CPU to GPU speedup tme mproves wth ncrease n fuzzy rules. Computatonal tme on CPU vares sgnfcantly due to already always runnng applcatons at the back end.therefore, to present far comparson seral and parallel tmng analyss experments wth same FLS have been repeated 30 tmes. Average of CPU to GPU speedups for 30 monte-carlosmulatons durng seral and parallel computatons of rule frng, mplcaton, aggregaton and defuzzfcaton have beenpresented n Fg. 4 7. Fg. 3 CPU to GPU Speedup Rato vs Inputs Data Length 269

Internatonal Conference on Communcaton, Computng & Systems (ICCCS 2014) The overall performance of CPU to GPU speedup tmngswth respect to collectve number of nputs, presented to the FLS, s shown n Fg. 8 that depcts that larger the number offuzzy rules better s the speedup performance. Fg. 6 CPU and GPU Runtme Comparson for 512 Inputs Fg. 4 CPU and GPU Runtme Comparson for 128 Inputs Fg. 7 CPU and GPU Runtme Comparson for 1024 Inputs Fg. 5 CPU and GPU Runtme Comparson for 256 Inputs VI. CONCLUSION AND FUTURE WORK Ths paper has demonstrated the mplementaton and comparatve runtme performances of a typcal Sugeno Type-1 FLSon a GPU and CPU wthout the use of a graphcs API whch s flexble, scalable, and can be used by any researcher wth knowledge of C. It has been demonstrated that the CPU works equally fast as GPU when the system s small. As the number of rules or the number of nputs ncrease the GPU outperforms the CPU runtme. Here, n the work nearly 7.83 tmes speedup could be acheved as 1024 nputs suppled n parallel to the FLS ported on GPU. The GPU has an ntal setup overhead of kernel loadng and memory transfer, however, subsequent parallel computatons leads to a small ncrease n processng tme despte a substantal ncrease n computatonal load. On the other hand, CPU has no ntal cost, but computaton tme grows lnearly wth computatonal load much beyond GPGPU runtme. Fg. 8 CPU to GPU Speedup Rato for Varous Input Data Sets The parallelzaton of more FLS applcatons s next on ouragenda. That wll follow mplementaton of Interval Type-2 FLSs and Generalzed Type-2 FLSs for varous applcatons that requre much more computatonal tme otherwse.gpgpu s also possbly nvestgated to on Evolutonary Algorthms those are computatonal ntensve and parallel nnature. REFERENCES [1] L.A. Zadeh, Fuzzy Sets, Informaton and Control, Vol. 8, No. 3, pp. 338 353, 1965. [2] L.A. Zadeh, Fuzzy Logc and Approxmate Reasonng, Synthese, Vol. 30, pp. 407 428, 1975. [3] S. Sngh, S. Sngh, V.K. Banga, and D. Chauhan, CUDA for GPGPU Applcatons-A Survey, n Natonal Conference on Contem-poraryTechnques & Technologes n Electroncs Engneerng, Murthal,Sonepat, Inda, March 2013, p. Accepted. 270

Speedup of Type-1 Fuzzy Logc Systems on Graphcs Processng Unts Usng CUDA [4] D.T. Anderson, R.H. Luke, and J.M. Keller, Speedup of FuzzyClusterng through Stream Processng on Graphcs Processng Unts, IEEE Transactons on Fuzzy Systems, Vol. 16, No. 4, pp. 1101 1106, 2008. [5] D. Anderson, R.H. Luke, and J.M. Keller, Incorporaton of non-eucldeandstance Metrcs nto Fuzzy Clusterng on Graphcs Processng Unts, n Analyss and Desgn of Intellgent Systems usng Soft Computng Technques. Sprnger, pp. 128 139, 2007. [6] S. Km and D. Wunsch, A GPU based Parallel Herarchcal Fuzzy ART Clusterng, n The 2011 Internatonal Jont Conference on Neural Networks (IJCNN), pp. 2778 2782, 2011. [7] I. Chosa and A. Kolb, GPU-based Multlevel Clusterng, IEEE Transactons on Vsualzaton and Computer Graphcs, Vol. 17, No. 2, pp. 132 145, 2011. [8] C.F. Juang, T.C. Chen, and W.Y. Cheng, Speedup of Implementng Fuzzy Neural Networks wth Hgh-dmensonal Inputs through Parallel Processng on Graphc Processng Unts, IEEE Trans-actons on Fuzzy Systems, Vol. 19, No. 4, pp. 717 728, 2011. [9] N. Harvey, R. Luke, J.M. Keller, and D. Anderson, Speedup of Fuzzy Logc through Stream Processng on Graphcs Processng Unts, n 2008. CEC IEEE World Congress on Computatonal Intellgence and Congress on Evolutonary Computaton, pp.3809 3815, 2008. [10] D. Anderson and S. Coupland, Parallelsaton of Fuzzy Inference on a Graphcs Processor Unt usng the Compute Unfed Desgn Archtecture, n Proceedngs of the UK Workshop on Computatonal Intellgence (UKCI 08), pp. 1 6, 2008. [11] L.T. Ngo, D.D. Nguyen, C.M. Luong et al.,., Speedup of Interval Type 2 Fuzzy Logc Systems based on GPU for Robot Navgaton, Advances n Fuzzy Systems, Vol. 2012, pp. 4, 2012. [12] J.M. Mendel, Fuzzy Logc Systems for Engneerng: A Tutoral, Proceedngs of the IEEE FUZZ, Vol. 83, No. 3, pp. 345 377, 1995. [13] G.C. Mouzours and J.M. Mendel, Non-sngleton Fuzzy Logc Systems: Theory and Applcaton, IEEE Transactons on Fuzzy Systems, Vol. 5, No. 1, pp. 56 71, 1997. [14] N.N. Karnk and J.M. Mendel, Applcatons of Type-2 Fuzzy Logc Systems to Forecastng of Tme-seres, Informaton Scences, Vol. 120, No. 1, pp. 89 111, 1999. [15] J.M. Mendel, Uncertanty, Fuzzy Logc, and Sgnal Processng, Sgnal Processng, Vol. 80, No. 6, pp. 913 933, 2000. 271