High performance CUDA based CNN image processor

GEORGE VALENTIN STOICA, RADU DOGARU, ELENA CRISTINA STOICA
Department of Applied Electronics and Information Engineering
University Politehnica of Bucharest
1-3 Iuliu Maniu Blvd., Sector 6, Bucharest, ROMANIA
vstoica@yahoo.com, radu_d@ieee.org, ce_stoica@yahoo.com

Abstract: Cellular neural networks (CNNs) have been adopted as a solution in various fields due to their powerful yet simple architecture. Practical implementations using VLSI or FPGA are very efficient but difficult to use in the development or simulation stages, when widespread, cost-effective, easy-to-learn, high-performance solutions are required. GPUs and, more specifically, CUDA based simulators can provide the computing power required for developing, simulating and running CNNs. This paper investigates solutions for optimizing the utilization of nvidia's Kepler architecture to achieve performance of up to 910 million cell iterations/s.

Key-Words: CUDA enabled GPU, high performance CNN simulator, image processing

1 Introduction
Developing and simulating cellular neural networks helps in finding the right genes for specific problems or in discovering potential new applications. Speeding up the simulation is desirable, but this should come with minimal development and implementation costs. CUDA enabled GPUs appear to be the perfect solution: their massively parallel architecture matches the CNN architecture, their throughput-oriented design can supply the computing power required for running CNNs, and their compatibility with current programming languages (e.g. C, Python, Fortran) and libraries or middleware (e.g. OpenACC, Matlab, OpenCL) eases the migration of applications from CPU to GPU platforms. The availability and cost of CUDA enabled GPUs also favour this route for implementing high performance, high productivity solutions [1], [2].

There is increased interest in adopting CUDA as a high performance, high productivity platform, and, combined with the continuous development of CUDA enabled GPUs, this requires continuous research into finding efficient implementations for specific problems [3]. Previous work related to CNN implementations on GPUs used earlier CUDA architectures (e.g. the Tesla or Fermi architectures), with notable results over CPUs or dedicated image processing libraries (e.g. OpenCV), where typical accelerations of 7-12x were obtained [4], [5], [9], [10]. Although the CNN-specific data-parallel computation model fits the GPU architecture, high performance implementations must consider the GPU resources and their specific limitations. As visible in Fig. 1, there are some notable differences between GPU and CPU architectures: smaller caches and simpler control units, which lead to higher global memory access latency, and customizable memory types (shared memory, constant memory, texture memory, registers). This paper deals with such aspects and proposes a new implementation model for the CNN discrete-time image processor on the CUDA platform, using the more recent nvidia Kepler architecture.

Fig. 1: CPU vs. GPU architectural differences [6]

This paper analyses the implementation of a CNN image processor on nvidia's Kepler architecture. The discrete-time CNN model, as described in [7], [8], is presented. Memory types (e.g. global memory, shared memory, texture memory) and access patterns are analyzed to find the optimal configuration for the implementation of the discrete CNN model. The memory access pattern of the CNN simulation makes this a memory-bandwidth bounded problem [5]. Specific techniques can be applied to improve the performance (e.g. the use of shared memory, memory caches and coalesced memory reads, as presented in [3]), but we cannot go beyond some limiting factors:
- although desirable, we cannot fit the image into the fast, low-latency, on-chip shared memory;
- although desirable, we cannot fit all the reads from within a block of threads into a single coalesced memory read, because a read transaction is limited to 128 bytes, and even then there is still a significant read latency of roughly 200-400 clock cycles [6];
- although desirable, we cannot avoid global memory reads/writes for the cell states, since the initial image and the final result are placed in global memory.

Our approach is to consider the compute to global memory access (GMA) ratio, defined as the number of floating-point calculations performed for each access to global memory: by increasing this ratio we can improve the performance of the implementation model.
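As a side note on the coalescing constraint above, the kernels discussed below keep the image in row-major order so that threadIdx.x runs along an image row. A minimal CUDA sketch of this access pattern is given here for illustration; the kernel name and launch parameters are assumptions for the example, not taken from the paper.

// Consecutive threads of a warp touch consecutive addresses of a
// row-major image, so each warp's reads/writes can be served by a
// small number of 128-byte transactions.
__global__ void copyRowMajor(const float *in, float *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // fast-moving index along a row
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int idx = row * width + col;                   // row-major: x-neighbours are adjacent in memory
        out[idx] = in[idx];
    }
}

Launched with, for example, dim3 block(32, 8), each warp maps to 32 consecutive pixels of one row, which is the favourable case analysed later for horizontal blocks of threads.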

2 The CNN image processor
The discrete-time CNN model is described by the following equation [5]:

x_l(t+1) = x_l(t) + h · ( Σ_{c∈S_l(1)} A(l,c) y_c(t) + Σ_{c∈S_l(1)} B(l,c) u_c(t) + z )    (1)

where x_l(t) represents the state of cell l at time t, c is an element of the S_l(1) neighbourhood, A(l,c) and B(l,c) are the feed-back and feed-forward templates, u_l(t) is the input image, z is the offset, and y_l(t) is the output, calculated according to the following formula:

y_l = 0.5 · ( |x_l + 1| − |x_l − 1| )    (2)

or using the equivalent form:

y_l = 1 for x_l ≥ 1;  y_l = x_l for −1 < x_l < 1;  y_l = −1 for x_l ≤ −1    (3)

Assuming that the input image is constant during the iterations (i.e. u_l(t) = u_l), (1) can be divided into two parts: the feed-forward and the feed-back part. The feed-forward part must be computed only once, at the beginning of the iterative process:

g_l = Σ_{c∈S_l(1)} B(l,c) u_c + z    (4)

The CNN process can then be expressed as follows:

x_l(t+1) = x_l(t) + h · ( Σ_{c∈S_l(1)} A(l,c) y_c(t) + g_l )    (5)

G = (A, B, z) are called genes, and a specific combination of values for A, B and z determines the behaviour of the CNN (i.e. a specific image filter): sharpening, softening, edge detection, threshold, dithering, etc.

2.1 The implementation model
Efficient GPU programming patterns are based on dividing the problem into a large number of threads, each thread executing the same code on different data. Rather than dividing the problem into a few large blocks, as is customary in multithreaded CPU implementations, the GPU allows (and benefits from) computing each cell in a separate thread, thus obtaining thousands of threads that are efficiently managed by the GPU control unit. A two-dimensional structure of blocks of threads processes a two-dimensional region of cells.

Fig. 2: CPU sequential and GPU parallel implementation models of the CNN (load image; normalize and compute g_l; iterate the cell update for t < T, synchronizing between iterations; denormalize; save image).
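To make the one-thread-per-cell mapping of (5) concrete, a minimal CUDA sketch of a single iteration for an interior cell is given below. It assumes the 3x3 feedback template is kept in constant memory and the feed-forward term g_l from (4) is precomputed; the kernel and helper names (cnnIterationNaive, outputFn, d_A) are illustrative assumptions, not the paper's code.

__constant__ float d_A[3][3];            // feedback template A(l,c), 3x3 neighbourhood

__device__ float outputFn(float x)       // piecewise-linear output, eq. (2)/(3)
{
    return 0.5f * (fabsf(x + 1.0f) - fabsf(x - 1.0f));
}

// One discrete-time iteration of eq. (5): x(t+1) = x(t) + h*(sum_A y + g).
__global__ void cnnIterationNaive(const float *x_in, float *x_out,
                                  const float *g, float h,
                                  int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < 1 || col >= width - 1 || row < 1 || row >= height - 1)
        return;                          // border cells are handled separately

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)     // 3x3 neighbourhood S_l(1)
        for (int dx = -1; dx <= 1; ++dx) {
            float y = outputFn(x_in[(row + dy) * width + (col + dx)]);
            acc = fmaf(d_A[dy + 1][dx + 1], y, acc);   // fused multiply-add
        }

    int idx = row * width + col;
    x_out[idx] = x_in[idx] + h * (acc + g[idx]);       // eq. (5)
}

In this version every neighbourhood read goes to global memory, which is exactly the situation analysed next.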

3 High performance CNN
The implementation model described in the previous section is based on an iterative process: in each iteration (t), each thread computes the new cell state from the previous cell state, the neighbourhood states and the corresponding feed-forward constant value. Based on (5), assuming a 3x3 A template and the states stored in global memory, each thread performs 3x3 reads and one write to global memory, and performs 3x3 floating-point multiplications and additions, each packed into a single FMA (fused multiply-add) operation, plus two additions. We can compute the GMA ratio for a single cell iteration as follows:

GMA_CNN-GM = (3·3 + 2) / (3·3 + 1) ≈ 1    (6)

In order to increase the GMA ratio, and thus the efficiency, we can reduce the number of global memory reads by organizing the memory reads at block level and splitting the cell iteration into two parts: each thread reads the corresponding cell state from global memory and saves it into shared memory and, after a synchronization point, each thread computes the new cell state reading its data from shared memory. This approach increases the GMA from about 1 to 11:

GMA_CNN-SM = (3·3 + 2) / 1 = 11    (7)

Note that the reads from shared memory are ignored when computing the GMA ratio, since shared memory accesses have a much lower latency than reads from global memory. Also note that (7) does not include the special case of border threads, which must perform one extra read for the cell outside the block, or the corner case, where two extra reads are required. In a similar way we can obtain the GMA ratio for the case when the data is placed in texture memory (textures are also stored in global memory):

GMA_CNN-TM = (3·3 + 2) / 1 = 11    (8)

We simulated the three cases described above on the same 1024x1024 pixel image. Working on a gray scale image, each pixel is a byte containing the gray level in the [0, 255] range of integer values. Before starting the iterative process described in Section 2.1, the data must be normalized, i.e. the [0, 255] integer pixel values are transformed into [-1.0, 1.0] floating-point values corresponding to the initial cell states. At this moment we can also compute the constant g_l according to (4). Separate CUDA kernels are executed by the GPU, using only global memory, global memory plus shared memory, and texture memory plus shared memory, respectively. The experimental results focus on measuring the execution time on the GPU.

Fig. 3: Execution time (T_GM) using global memory, for a 1024x1024 image (execution time in µs versus horizontal and vertical block size in pixels).

Fig. 4: Execution time (T_SM) using global memory and shared memory, for a 1024x1024 image (execution time in µs versus horizontal and vertical block size in pixels).

Fig. 5: Execution time (T_TM) using texture memory, for a 1024x1024 image (execution time in µs versus horizontal and vertical block size in pixels).
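For reference, a minimal sketch of how such a shared-memory variant could look is given below, reusing d_A and outputFn from the earlier sketch: each thread stages its own cell into on-chip shared memory, edge and corner threads additionally fetch the one-cell halo, and only then is the 3x3 update computed. The block dimensions BW/BH, the clamped border handling and all names are illustrative assumptions, not the paper's code.

#define BW 32                          // block width  (threads in x)
#define BH 8                           // block height (threads in y)

__device__ float loadClamped(const float *x, int r, int c, int width, int height)
{
    r = min(max(r, 0), height - 1);    // clamp so border blocks stay in bounds
    c = min(max(c, 0), width - 1);
    return x[r * width + c];
}

__global__ void cnnIterationShared(const float *x_in, float *x_out,
                                   const float *g, float h,
                                   int width, int height)
{
    __shared__ float tile[BH + 2][BW + 2];           // block tile plus one-cell halo

    int col = blockIdx.x * BW + threadIdx.x;
    int row = blockIdx.y * BH + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    tile[ty][tx] = loadClamped(x_in, row, col, width, height);                       // own cell
    if (threadIdx.x == 0)      tile[ty][0]      = loadClamped(x_in, row, col - 1, width, height);
    if (threadIdx.x == BW - 1) tile[ty][BW + 1] = loadClamped(x_in, row, col + 1, width, height);
    if (threadIdx.y == 0)      tile[0][tx]      = loadClamped(x_in, row - 1, col, width, height);
    if (threadIdx.y == BH - 1) tile[BH + 1][tx] = loadClamped(x_in, row + 1, col, width, height);
    if (threadIdx.x == 0      && threadIdx.y == 0)      tile[0][0]           = loadClamped(x_in, row - 1, col - 1, width, height);
    if (threadIdx.x == BW - 1 && threadIdx.y == 0)      tile[0][BW + 1]      = loadClamped(x_in, row - 1, col + 1, width, height);
    if (threadIdx.x == 0      && threadIdx.y == BH - 1) tile[BH + 1][0]      = loadClamped(x_in, row + 1, col - 1, width, height);
    if (threadIdx.x == BW - 1 && threadIdx.y == BH - 1) tile[BH + 1][BW + 1] = loadClamped(x_in, row + 1, col + 1, width, height);
    __syncthreads();                                  // whole tile and halo now in shared memory

    if (col >= width || row >= height) return;

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)                  // neighbourhood reads now hit shared memory
        for (int dx = -1; dx <= 1; ++dx)
            acc = fmaf(d_A[dy + 1][dx + 1], outputFn(tile[ty + dy][tx + dx]), acc);

    int idx = row * width + col;
    x_out[idx] = tile[ty][tx] + h * (acc + g[idx]);
}

With this organization the 3x3 neighbourhood reads are served from shared memory, so each cell update touches global memory far less often, which is the effect captured by (7).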

As presented in Figs. 3, 4 and 5, there is a consistency between the calculated GMA and the measured execution time. Comparing (6) and (7), for example, we notice that by using shared memory to store the intermediate reads from global memory there is about an order of magnitude increase of the GMA, which is confirmed by the experimental results presented in Figs. 3 and 4. Note that the execution time axis values in Fig. 3 are about ten times larger than the values presented in Fig. 4.

Using shared memory combined with either global memory or texture memory produces similar results, but two important observations must be made. The first observation is that reads from global memory can be coalesced, meaning that reads from threads within a block are packed into a single transaction if they are made on consecutive bytes of memory. By convention, in our experiments the two-dimensional image is stored in global memory in a row-major configuration. In this case, reads from a horizontal block of threads are packed into a single transaction and the memory access latency is reduced compared with the vertical block configuration [3]. The second observation is that texture reads are not coalesced but can benefit from locality-of-access optimizations: reads are faster if neighbouring memory locations are accessed. Fig. 5 shows that, irrespective of the use of vertical, square or horizontal blocks, the execution time is consistent when compared with the access patterns for global memory presented in Fig. 3 or Fig. 4. A deeper investigation shows that the best performance is obtained when using horizontal blocks of threads and global memory, as presented in Table 1.

Table 1: Execution time comparison for shared memory (T_SM) and texture memory (T_TM), depending on image size and block configuration.

Execution time (µs)      …      32x32      …
512x512 image size
  T_SM                   47     62         27
  T_TM                   75     57         7
1024x1024 image size
  T_SM                   362    444        783
  T_TM                   45     473        542
2048x2048 image size
  T_SM                   369    324        …
  T_TM                   89     683        276
4096x4096 image size
  T_SM                   5584   6296       85
  T_TM                   5869   668        7925

3.1 The new model
A new approach is proposed in order to further improve the GMA ratio. In the existing CNN implementation model there is an incremental process in which each iteration is computed in one step and consists of the following operations: read the current state from memory (global memory or texture memory), compute the new state, save the new state back into memory, and synchronize among all blocks of threads, as described in Fig. 6.

Fig. 6: One iteration per step (read from global/texture memory, save to shared memory, compute the new state, write to global memory; threads within a block synchronize inside the step, while synchronization between steps happens between kernel calls).

Increasing the GMA ratio by reducing the number of global memory operations can be achieved if we combine more iterations into a single step. One iteration per step must read the states of the block plus the outer layer of neighbouring cells, and compute the new state only for the cells within the block. Two iterations per step must read the states of the block plus the two outer layers of neighbouring cells, and compute two iterations of the new state for the cells within the block and one iteration for the cells in the first outer layer, as presented in Fig. 7. More iterations can be performed in a single step, with additional layers to be read and computed.

Fig. 7: One and two iterations per step (an MxN block of cells plus one, respectively two, outer neighbourhood layers).
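On the host side, the step granularity only changes how often the state makes a round trip through global memory. A sketch of the driver loop is shown below; the kernel names (cnnStepOneIter, cnnStepTwoIter), the ping-pong buffers and the launch configuration are illustrative assumptions rather than the paper's code. cnnStepOneIter can be the cnnIterationShared kernel sketched earlier; cnnStepTwoIter would be its two-iterations-per-step counterpart with a two-cell halo (not shown).

#include <utility>   // std::swap

// Kernels assumed to be defined elsewhere.
__global__ void cnnStepOneIter(const float*, float*, const float*, float, int, int);
__global__ void cnnStepTwoIter(const float*, float*, const float*, float, int, int);

// Host-side driver: runs T CNN iterations on the GPU, assuming the state
// ping-pong buffers d_x0/d_x1 and the feed-forward term d_g already live
// in global memory (allocated with cudaMalloc and initialized elsewhere).
void runCnn(float *d_x0, float *d_x1, const float *d_g, float h,
            int width, int height, int T, bool twoIterationsPerStep)
{
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    if (!twoIterationsPerStep) {
        // One iteration per step: T launches; every launch reads and writes the
        // full state, and grid-wide synchronization happens between launches.
        for (int t = 0; t < T; ++t) {
            cnnStepOneIter<<<grid, block>>>(d_x0, d_x1, d_g, h, width, height);
            std::swap(d_x0, d_x1);
        }
    } else {
        // Two iterations per step: T/2 launches. Each block loads its tile plus a
        // two-cell halo, iterates twice on-chip and writes back only the final
        // tile, so the state crosses global memory half as often per iteration.
        for (int t = 0; t < T; t += 2) {
            cnnStepTwoIter<<<grid, block>>>(d_x0, d_x1, d_g, h, width, height);
            std::swap(d_x0, d_x1);
        }
    }
}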

The GMA ratio at block level, for one iteration per step and an MxN block, can be calculated as follows:

GMA_Block-1IpS = (3·3·M·N) / (M·N + 2(M+2) + 2N)    (9)

In a similar way, the GMA ratio at block level in the case of two iterations per step can be calculated as:

GMA_Block-2IpS = 3·3·(2·M·N + 2(M+2) + 2N) / (M·N + 2(M+2) + 2N + 2(M+4) + 2(N+2))    (10)

Assuming a CNN process consisting of T iterations, using two iterations per step requires only T/2 steps, so fewer memory reads from and writes to global memory produce a better GMA ratio, with a direct impact on the execution time, as presented in Table 2.

Table 2: Execution time for one, two and four iterations per step, for a 1024x1024 pixel image and T = 20 iterations.

                                                   One iter./step  Two iter./step  Four iter./step
Execution time (ms)                                63              38              23
Speed-up                                           1               1.67            2.73
Execution time per cell and iteration T_cit (ns)   3.0             1.8             1.1
Cell iterations/s (x10^6)                          333             557             910

4 Conclusion
CUDA enabled GPU platforms provide developers with a different architecture compared with the traditional CPU. Highly parallel computing power, a large but high-latency global memory, low-latency but limited caches and shared memory, and locality-optimized, cached texture memory can be efficiently used and combined to implement high performance algorithms. Measurements were performed on the following hardware/software configuration: Windows 7 32-bit operating system, nvidia CUDA Toolkit v5.5, Intel Core 2 Duo E6320 CPU running at 1.86 GHz, 2 GB DDR2 DRAM, and an nvidia GeForce GTX 650 Ti Boost GPU (Kepler architecture, compute capability 3.0, 768 cores in four multiprocessors with a 980 MHz base clock, 1 GB GDDR5 DRAM with 144.2 GB/s bandwidth).

References:
[1] R. Dogaru, I. Dogaru, "High Productivity Cellular Neural Network Implementation on GPU using Python," Proceedings of the Workshop on Information Technology and Bionics, Symposium in Memory of Tamas Roska, Budapest, Hungary, 23-24 June 2015, pp. 23-27.
[2] R. Dogaru, I. Dogaru, "A Low Cost High Performance Computing Platform for Cellular Nonlinear Networks using Python for CUDA," 20th International Conference on Control Systems and Computer Science, 2015, pp. 593-598.
[3] G.V. Stoica, R. Dogaru, C.E. Stoica, "Speeding-up Image Processing in Reaction-Diffusion Cellular Neural Networks using CUDA-enabled GPU Platforms," International Conference on Electronics, Computers and Artificial Intelligence, Bucharest, Oct. 2014, Vol. 2, pp. 39-42.
[4] K.V. Kalgin, "Implementation of algorithms with a fine-grained parallelism on GPUs," Numerical Analysis and Applications, Vol. 4, No. 1, pp. 46-55, 2011.
[5] E. Laszlo, P. Szolgay and Z. Nagy, "Analysis of a GPU based CNN implementation," 13th International Workshop on Cellular Nanoscale Networks and Their Applications (CNNA), Turin, Aug. 29-31, 2012.
[6] CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[7] T. Roska and L.O. Chua, "The CNN universal machine: an analogic array computer," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 3, 1993, pp. 163-173.
[8] L.O. Chua and L. Yang, "Cellular Neural Networks: Theory," IEEE Transactions on Circuits and Systems, vol. 35, no. 10, 1988, pp. 1257-1272.
[9] R. Dolan and G. DeSouza, "GPU-Based Simulation of Cellular Neural Networks for Image Processing," Proceedings of the International Joint Conference on Neural Networks, Atlanta, Georgia, USA, 2009, pp. 730-735.
[10] S. Potluri, A. Fasih, L.K. Vutukuru, F. Al Machot, K. Kyamakya, "CNN Based High Performance Computing for Real Time Image Processing on GPU," Nonlinear Dynamics and Synchronization (INDS) & 16th Int'l Symposium on Theoretical Electrical Engineering (ISTET), Klagenfurt, Austria, 2011, pp. 1-7.