Neneta: Heterogeneous Computing Complex-Valued Neural Network Framework


Vladimir Lekić* and Zdenka Babić*
* Faculty of Electrical Engineering, University of Banja Luka, Banja Luka, Bosnia and Herzegovina
MIPRO 2017/DC VIS, pp. 209-213

Abstract

Due to the increased demand for computational efficiency in the training, validation and testing of artificial neural networks, many open source software frameworks have emerged. The GPU programming model of choice in such frameworks is almost exclusively CUDA. The lack of support for complex-valued neural networks is also symptomatic. With our research going exactly in that direction, we developed and made publicly available yet another software framework, based entirely on the C++ and OpenCL standards, with which we try to solve the problems we identified in the existing solutions.

I. INTRODUCTION

Complex machine learning algorithms have received tremendous attention in recent years. Research laboratories are competing in making their sets of training data publicly available [1],[2], along with software frameworks [3],[4],[5], tutorials and courses for using this data. On the other side, scientists, industry professionals and enthusiasts are competing in tuning the available models and tweaking algorithm performance. This is a perfectly valid approach, but what stays somewhat hidden in the machine learning hype is that everyone is building machine learning models with the same sets of training data on more or less the same hardware.

The neural network framework "neneta" introduced in this paper is a product of the research we are conducting on complex-valued neural networks [6],[7]. Our goal was never to compete with the state-of-the-art frameworks already available, but to build a tool that supports us efficiently through our research. On the other hand, we believe that the presented tool has the potential to attract the attention of the broader community, not only by offering the ability to efficiently design and train neural networks on a broader range of GPUs, but also by doing so in a more general way, using complex-valued neural networks.

II. DESIGN DECISIONS

Choosing programming languages and APIs when starting work on a project like this is anything but an easy task. The idea was also to allow the software to run on the most popular operating systems. The decisions made to meet these goals are as follows.

A. C++11

For all programming tasks related to network and GPU configuration, input data preprocessing and result presentation we use C++ [8]. There are two main reasons behind this decision. First, we already had enough knowledge of the language to make significant progress fast. Second, although preprocessing tasks in deep neural networks are relatively simple compared to the network model itself, they are not insignificant and can definitely impact the overall performance of the framework. We therefore needed a programming language with the minimum possible overhead that still supports object-oriented programming. C++ was an ideal candidate.

B. OpenCL

We wanted to be able to run our software on a wide range of available devices, and of course on those yet to come. By this we mean not only GPUs with OpenCL [9],[10] support, but also FPGAs and DSPs that use this standard [11].

C. Testing

Due to the significant complexity of the software, unit testing and component testing [12] were necessary and were done for most of the components.

D. Operating systems support

We use the gcc compiler and the CMake build system for development. In principle, any operating system that supports this compiler and build system and has appropriate OpenCL drivers can run neneta. So far, the software has been successfully tested on Microsoft Windows and Linux.

III. SOFTWARE ARCHITECTURE

The component diagram of neneta is shown in Fig. 1.
Components are compiled to static libraries and, at the end of the build, linked together into a single executable. As can be seen, the only requirement for the operating system is support for OpenCL. The number of GPUs is not limited and is completely abstracted away from the neuralnetwork component through interfaces provided by gpgpu, which is the only component interfacing directly with the GPU. The following subsections give a short description of the framework components.

Fig. 1. neneta component diagram.

A. confighandler

This component is responsible for parsing the XML configuration files. There are three types of configuration files:

- configuration.xml: holds general configuration information for logging (log level, log rotation, log format, etc.), plotting, input data sources, persistence, OpenCL kernel sources and the GPU.
- kernels.xml: holds the profiling configuration for all the OpenCL kernels used by the framework. On startup, all configured OpenCL kernel sources are compiled by the OpenCL driver. This also means that kernels can be added, removed or modified without recompiling any of the framework libraries.
- network_params_<id>.xml: holds the configuration for the neuralnetwork component.

As an example configuration, one of the complex-valued neural network layers is shown in Fig. 2. The configuration of the layers is parsed automatically. Based on the layer type, appropriate objects are instantiated and enqueued for execution on the GPU. At the moment the following layer types are supported:

- Input layer
- Convolution layer
- Fully connected layer
- FFT layer
- IFFT layer
- Projection layer
- Softmax layer
- Spectral-pooling layer
- Error calculation layer

    <layer type="conv" id="conv1">
      <input>input1</input>
      <channels>1</channels>
      <kernels>10</kernels>
      <kernelsize>5</kernelsize>
      <stride>1</stride>
      <actfunc>complextanh</actfunc>
      <weightsdev>1</weightsdev>
      <weightsmean>0</weightsmean>
      <weightstype>complex</weightstype>
      <biasre>0.0001</biasre>
      <biasim>0</biasim>
    </layer>

Fig. 2. Example of complex-valued neural network layer configuration.

B. plotting

The task of the plotting component is to abstract the data presentation tools away from the neuralnetwork component. Normally, some kind of plotting library is used for data presentation (for example gnuplot [13], but not limited to it).

C. imageprocessing

Input data comes in various formats. The task of the imageprocessing component is to convert, adapt, merge or filter input data based on the needs of the neuralnetwork component. This component runs only on the host CPU, so these operations should not have high complexity. If the input data preprocessing step consumes significant processing time (e.g.
FFT [14]), an additional layer type should be introduced to the neuralnetwork component.

D. neuralnetwork

This is the core component of the neneta framework. The entire neural network processing is done within this component. Its main features are:

- It consists of various types of layers available within the framework.
- It is straightforward to define and implement a new layer. The framework itself will enqueue it for execution on the GPU.
- Performance-critical functionality of the layers is moved to the OpenCL kernels. Changing the functionality within kernel source files doesn't require recompilation, only an application restart.

Fig. 3 shows a simplified class diagram of the ConvLayer layer. In this example, ConvLayer implements the IPersistedLayer interface. The functions store() and restore() are called for this layer after each training epoch. The other two classes that ConvLayer inherits from indicate the relation of the layer to the OpenCL execution plan. Being also an IOpenCLChainableExecutionPlan allows the layer to be linked with other layers. The functions setinputbuffer(bufferio)/setbkpinputbuffer(bufferio) are called from the left/right layer during forward/back-propagation configuration. Forward and back-propagation are configured once during startup, but executed many times during training. The input parameter BufferIO is a block in the GPU's global memory. The role of this memory block is to pass the needed information between layers, which means that the maximum size of the neural network model is directly determined by the size of the available GPU global memory.

E. imagehandler

Any set of training data can be used to train the modeled neural network. The task of the imagehandler component is to abstract the training set away from the neuralnetwork component. At the moment, MNIST [2] and IMAGENET [1] are supported.

F. logging

During long training periods some sort of logging system (for example the Boost logging library [15]) is desirable. This component provides logging capabilities to


the entire framework. The log file path, rotation size, logging level and logging format can be configured in the configuration.xml file.

Fig. 3. Simplified class diagram of ConvLayer.

G. persistence

The role of the persistence component is to ensure that network training can be interrupted and continued at will. The persistence interface store() can be called at the end of each batch or at the end of each training epoch. In contrast, restore() is called only once, during the initialization phase. Persistence data are stored as a binary blob on the hard disk.

H. gpgpu

Although OpenCL offers C++ interface wrappers [16], we introduced an even higher level of abstraction in order to incorporate the GPU execution plan into the model. The gpgpu component offers interfaces for planning kernel execution in a predefined order, while also making it possible to profile kernel execution if desired.

IV. CONFIGURATION EXAMPLE

As an example, we trained a network consisting of a single Soft-Max layer on the MNIST data set [2]. The data set consists of 60,000 training images of handwritten Arabic numerals and 10,000 test images. In the configuration, we split the training data set into 50,000 training and 10,000 validation images, as shown in Fig. 4. An example of such a network configuration is shown in Fig. 5. The input layer allocates a contiguous block of global memory on the GPU. Although not relevant for this example, this memory is always split equally to hold real and imaginary data, using the parameters rpipesize and ipipesize. The other parameters of the input layer are determined by the input data size (for MNIST these are 28x28 pixel gray-level images). The layer of type softmax is a real-valued layer. Its parameters are descriptive, as shown in Fig. 5, and require no further explanation. For loss calculation we used the cross-entropy function, defined simply through the errorcalc layer.
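The softmax output and the cross-entropy loss used in this example can be illustrated with a short host-side C++ sketch. This is not neneta code: in the framework these computations run in OpenCL kernels, and the function names softmax and crossEntropy, the use of double precision and the max-subtraction trick for numerical stability are assumptions made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference implementation, not part of neneta.
// Numerically stable softmax: subtracting the largest input before
// exponentiation avoids overflow and leaves the result unchanged.
std::vector<double> softmax(const std::vector<double>& logits) {
    assert(!logits.empty());
    const double maxLogit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - maxLogit);
        sum += probs[i];
    }
    for (double& p : probs) {
        p /= sum;  // normalize so that the outputs sum to one
    }
    return probs;
}

// Cross-entropy loss for one sample with a one-hot target: only the
// probability assigned to the correct class contributes to the loss.
double crossEntropy(const std::vector<double>& probs, std::size_t label) {
    assert(label < probs.size());
    const double eps = 1e-12;  // guards against log(0)
    return -std::log(std::max(probs[label], eps));
}
```

For the ten MNIST classes, ten equal inputs give a probability of 0.1 per class and a loss of ln 10, roughly 2.30, which is the value an untrained classifier with uniform outputs would report.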
    <images source="mnist">
      <trainset>
        <offset>0</offset>
        <size>50000</size>
        <minibatchsize>1</minibatchsize>
        <path>train-images.idx3-ubyte</path>
        <labels>train-labels.idx1-ubyte</labels>
      </trainset>
      <testset>
        <offset>0</offset>
        <size>10000</size>
        <path>t10k-images.idx3-ubyte</path>
        <labels>t10k-labels.idx1-ubyte</labels>
      </testset>
      <validationset>
        <offset>50000</offset>
        <size>10000</size>
        <path>train-images.idx3-ubyte</path>
        <labels>train-labels.idx1-ubyte</labels>
      </validationset>
    </images>

Fig. 4. Configuration of MNIST data-set.

    <?xml version="1.0"?>
    <neneta>
      <layer type="input" id="input1">
        <rpipesize> </rpipesize>
        <ipipesize> </ipipesize>
        <outputsize>10</outputsize>
        <inputchannels>1</inputchannels>
      </layer>
      <layer type="softmax" id="sm1">
        <input>input1</input>
        <channels>1</channels>
        <outputsize>10</outputsize>
        <actfunc>softmax</actfunc>
        <weightsdev>1</weightsdev>
        <weightsmean>0</weightsmean>
        <bias>0.1</bias>
      </layer>
      <layer type="errorcalc" id="err1">
        <input>sm1</input>
        <channels>10</channels>
        <errorfunc>crossentropy</errorfunc>
      </layer>
    </neneta>

Fig. 5. Simple Soft-Max layer configuration.
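The offset and size values in Fig. 4 select two ranges of images from the same 60,000-image training file. The following C++ sketch shows one way such a split could be checked for consistency. The ImageRange type and the helper functions are hypothetical and are not taken from the neneta sources; the sketch also assumes that offset counts images from the start of the file.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type, not part of neneta: a half-open range
// [offset, offset + size) of images inside one data-set file.
struct ImageRange {
    std::size_t offset;
    std::size_t size;
    std::size_t end() const { return offset + size; }
};

// True if the range lies completely inside a file holding fileImages images.
bool fitsInFile(const ImageRange& r, std::size_t fileImages) {
    return r.offset <= fileImages && r.size <= fileImages - r.offset;
}

// True if two ranges taken from the same file share at least one image.
bool overlaps(const ImageRange& a, const ImageRange& b) {
    return a.offset < b.end() && b.offset < a.end();
}
```

With the values from Fig. 4, the training range {0, 50000} and the validation range {50000, 10000} both fit into the 60,000-image file and do not overlap, so no validation image is also used for training.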

Fig. 6. Three training epochs on MNIST data-set.

Training progress was monitored using the plotting interface for gnuplot [13], as shown in Fig. 6. More detailed training results were obtained from the log file and are presented in Table I.

TABLE I
DETAILS OF THREE TRAINING EPOCHS ON MNIST DATA-SET.
Ep. | Train. Loss | Train. Acc. [%] | Val. Loss | Val. Acc. [%]

V. PERFORMANCE COMPARISON

We compared the performance of the CPU and the GPU running the example configuration described in the previous section. The properties of the OpenCL devices used to run the simulations are given in Table II.

TABLE II
CPU AND GPU DEVICE PROPERTIES
Property | CPU | GPU
Name | Phenom II X4 965 | Radeon HD 5770
Vendor | AMD | AMD
Max. proc. elements | |
Max. clock freq. [MHz] | |
Max. gl. mem. size [B] | |
Max. lo. mem. size [B] | |
Max. work group size | |
Max. work items sizes | 1024,1024, ,256,256

The measured execution time for forward and back-propagation on both devices is shown in Fig. 7. The measurement covers one entire training epoch (50,000 images). For most of the training epoch, execution is approximately 10 times faster on the GPU than on the CPU. In this simulation, neither the CPU nor the GPU was a dedicated computing device (the OS and graphics were running on them). This could explain some of the peaks in the graph.

Fig. 7. Performance comparison CPU-GPU.

It is interesting to briefly analyze the parameters of the devices and how they relate to the algorithm performance. The CPU clock speed is four times the GPU clock speed, but the number of processing elements on the GPU is 200 times higher: the AMD Radeon HD 5770 (Juniper) graphics card has 10 computing elements, each computing element has 16 stream cores and each stream core has 5 processing elements, that is, 10 x 16 x 5 = 800 processing elements against the four cores of the CPU. Based on this, one could expect an even higher performance gain when executing the algorithm on the GPU. To explain the obtained result, it must be taken into account that the GPU has a SIMD (Single Instruction Multiple Data) processor architecture.
That means that all processing elements within a given work group always execute the same instruction. If particular care is not taken during kernel development to avoid the problems that arise from the limitations of such an architecture (for example branch divergence), the full performance gain of the GPU cannot be achieved. Another point to consider (and maybe more relevant for this example) is the well-known memory transfer bottleneck that occurs during data transfer between host (CPU) memory and GPU memory. To cope with this problem, it is desirable to transfer as much data as possible at a time to the GPU global memory (entire mini-batches) and let the GPU perform multiple passes through the network on this data.

VI. CONCLUSIONS

Although initially intended to serve as a platform for research on complex-valued neural networks, neneta can, due to its simplicity and extensibility, be used for training real-valued neural networks as well. The platform already offers a number of different types of neural network layers, and by opening the code to the public we hope to attract more researchers to contribute to it. We are aware that some of the already implemented OpenCL kernels are far from optimal in terms of execution time. Our goal for the future is to further improve the code base in that sense, and to improve its design and quality as well.

REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

[2] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[5] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint.
[6] Akira Hirose. Complex-valued neural networks. Springer Science & Business Media.
[7] Danilo P. Mandic and Vanessa Su Lee Goh. Complex valued nonlinear adaptive filters: noncircularity, widely linear and neural models, volume 59. John Wiley & Sons.
[8] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley Professional, 4th edition.
[9] Jonathan Tompson and Kristofer Schlachter. An introduction to the OpenCL programming model. Person Education, 49.
[10] John E. Stone, David Gohara, and Guochun Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66-73, May.
[11] Deshanand Singh. Implementing FPGA design with the OpenCL standard. Altera whitepaper.
[12] Robert C. Martin. Clean code: a handbook of agile software craftsmanship. Pearson Education.
[13] T. Williams, C. Kelley, H. B. Bröker, J. Campbell, R. Cunningham, D. Denholm, E. Elber, R. Fearick, C. Grammes, and L. Hart. Gnuplot 5.0.5: An interactive plotting program. URL gnuplot.info.
[14] Keun-Yung Byun, Chun-Su Park, Jee-Young Sun, and Sung-Jea Ko. Vector radix 2 × 2 sliding fast Fourier transform. Mathematical Problems in Engineering, 2016.
[15] Boris Schäling. The Boost C++ Libraries. XML Press.
[16] Benedict R. Gaster. The OpenCL C++ wrapper API.
