Convolutional Neural Networks (CNNs) for Power System Big Data Analysis


Siby Jose Plathottam, Hossein Salehfar, Prakash Ranganathan
Electrical Engineering, University of North Dakota, Grand Forks, USA

Abstract: The concept of automated power system data analysis using Deep Neural Networks (as part of the routine tasks normally performed by Independent System Operators) is explored and developed in this paper. Specifically, we propose to use the widely used deep neural network architecture known as the Convolutional Neural Network (CNN). To this end, a 2-D representation of power system data is developed and proposed. To show the relevance of the proposed concept, a multi-class, multi-label classification problem is presented as an application example. Midcontinent ISO (MISO) data sets on wind power and load are used for this purpose. TensorFlow, an open-source machine learning platform, is used to construct and train the CNN. The results are discussed and compared with those from standard Feed Forward Networks for the same data.

Index Terms: Deep Learning, Machine Learning, Convolutional NN, Feed Forward NN, wind power generation, Artificial Intelligence

I. INTRODUCTION

The capabilities of Artificial Intelligence (AI) programs have grown in unforeseen ways during the last 3 to 4 years. These include tasks like computer vision with above 90% accuracy [1], [2], playing computer games with human-level skill [3], and defeating the reigning world champion in the ancient board game of Go [4], [5]. The last task was not expected to be accomplished by a computer until the next decade. Much of this progress can be attributed to a class of Machine Learning (ML) algorithms called Deep Neural Networks (DNNs), also known as Deep Learning (DL) [6]. It must be noted that the game-playing AI programs also used an ML concept known as Reinforcement Learning (RL) to perform their tasks. The architecture of DNNs is not task-specific; hence the same general learning algorithm may be repurposed for other tasks. One example is how Google used the same algorithm that learned to play Go to optimize the operation of their data center cooling systems, improving efficiency by 40% [7].

The success that AI applications using DNNs have achieved in solving tasks once thought to be solvable only by human experts lends hope that these techniques can also be applied to complex power system problems. One potential application is in the Independent System Operator (ISO) domain. ISOs are entities that coordinate the generation and transmission of electric power within their control area [8]. The human operators of an ISO make decisions, such as dispatching generation, scheduling tie-line interchanges, and fixing spot energy prices, every few minutes to ensure system stability, power quality, and fairness to all utilities. They are aided in these decisions by optimization algorithms such as Security Constrained Economic Dispatch and Optimal Power Flow programs. Thus, the final decision has a human in the loop. As distributed generation resources in the form of wind turbine generators and solar photovoltaics (PVs) are continually added, power system operation and control is becoming increasingly complex. In other words, a paradigm shift is happening in the electric power systems domain [9].
It is prudent that advanced computational tools be developed for ISOs to ensure that tomorrow's power grids are more reliable and cost-effective than those of today. An AI platform using ML that resides within an ISO's data center to actively monitor and react in real time to a continuous stream of data from the grid is such a tool. This is a feasible goal, given the continual improvements in ML algorithms and the ever-increasing capability of distributed computing techniques. However, as with most promising new technologies, success is not guaranteed. A truly successful ISO AI decision system would require multiple iterations of concepts and algorithms. One possible approach is to use an ensemble consisting of DNNs of different architectures with specific expertise; this may also require the use of RL. The various functionalities that an ISO AI decision system would comprise are illustrated in Fig. 1.

The objective of this paper is to use a type of DNN architecture called the Convolutional Neural Network (CNN) to classify large power system data sets. CNNs can be used as one of the building blocks of an ISO AI decision system, performing analysis of large volumes of historical data. The next section gives a brief background on CNNs and how they are useful in processing data sets whose data points are sequentially and spatially related to each other.

This work has been supported by the NSF and North Dakota EPSCoR Program through grant #IIA

Figure 1. Functionalities inside an ISO AI decision system.

II. CNN ARCHITECTURE

A. Convolutional layer operation

The simplest form of a DNN may be that of an ordinary Feed Forward Neural Network (FFNN) with 2 or more hidden layers. However, many of the best recent results using DNNs have come from the use of CNNs, originally proposed by LeCun et al. in 1998 [10]. An illustration of how the CNN architecture fits within the wider world of AI programs is shown in Fig. 2.

Figure 2. The CNN within the AI domain.

CNNs share many similarities with FFNNs. The main conceptual difference between the two is that CNNs preserve the spatial relationship between data points, while FFNNs do not. This is one reason why CNNs have found such success in image recognition and other similarly complex tasks. An image is a 2-dimensional (2-D) array of pixel values having fixed height, width, and color channels. A CNN can analyze an image as a 2-D array of numbers (or as a 3-D array, if there is more than one color channel) having the same shape as the original image. In the case of an FFNN, however, the 2-D array would need to be flattened into a 1-D array, potentially destroying any spatial relationship between data points. Another major difference is that, unlike in FFNNs, an individual neuron in a CNN is not connected to every pixel in the image at the same time. Instead, each neuron in a CNN has a window of specific height and width (i.e., a weight matrix) through which it analyzes an image patch having the same height and width. This window is known as the convolution filter, and it works by sliding over the entire area of the image one patch at a time. The convolution operation (i.e., the elementwise multiplication and summation of pixel values with the convolution filter weights) produces one value for each image patch. Sliding over the entire image matrix produces a feature map whose dimensions depend on both the dimensions of the image and those of the convolution filter. An illustration of the convolution operation is given in Fig. 3.

Figure 3. An illustration of the difference between operations in a convolutional layer and a feed forward layer.

The shape of the feature map, i.e., its width (FM_W) and height (FM_H), may be calculated using Equations (1) and (2), respectively, where I_W and I_H are the width and height of the 2-D array, CF_W and CF_H are the convolution filter width and height, and S_W and S_H are the strides of the sliding window along the width and height:

FM_W = (I_W - CF_W) / S_W + 1    (1)

FM_H = (I_H - CF_H) / S_H + 1    (2)

The size, number, and stride of the convolution filters are hyperparameters of the CNN that must be fine-tuned for each application.
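As a quick numerical check of Equations (1) and (2), the following sketch computes the feature-map shape for the 2 x 24 input arrays used later in this work. The helper function and example dimensions are illustrative, not from the original paper:

```python
def feature_map_shape(i_w, i_h, cf_w, cf_h, s_w=1, s_h=1):
    """Feature-map width and height per Equations (1) and (2):
    FM = (I - CF) / S + 1 along each dimension (valid padding)."""
    fm_w = (i_w - cf_w) // s_w + 1
    fm_h = (i_h - cf_h) // s_h + 1
    return fm_w, fm_h

# Example: a 2 x 24 input convolved with a 2 x 6 filter at stride 1
# yields a 1 x 19 feature map.
print(feature_map_shape(i_w=24, i_h=2, cf_w=6, cf_h=2))  # -> (19, 1)
```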

The input to a CNN is in no way limited to pixel values from image data; a CNN can be applied to any type of data with sequential information. One field where CNNs have been highly successful is computational biology, where they have been used to classify DNA sequences [11] and to predict the specificity of DNA-protein binding [12]. In the present paper, the authors extend the use of CNNs to processing the large sets of sequential data collected by ISOs, namely 24-hr power generation and load data. A recent work used a related Deep Learning technique, Auto Encoders, to predict solar PV power generation [13].

B. Training and inference using CNNs

Using an ML architecture like a CNN to write an AI program is conceptually different from traditional programming. Rather than writing instructions for each step of a task, one must specify a learning algorithm for the network and provide many training samples of input-output pairs. The process through which the weights and biases of the CNN adjust themselves using the learning algorithm and data samples is known as training. Almost all learning algorithms use some variation of the backpropagation algorithm [14]. In this work, a mini-batch gradient descent technique is used, where the training data set is split into multiple mini-batches and the network is trained on each batch consecutively. The learning algorithm uses a function known as the loss function to measure the difference between the output produced by the CNN and the actual targeted output. The selection of the loss function depends primarily on whether the neural network is performing a classification or a regression task. In this work, since the neural network is performing a classification task, a cross-entropy loss function [15] is used.

Training a network is generally a time-consuming process; it may take many hours, days, or even weeks to fully train a CNN from scratch, depending on the number of training data samples and the complexity of the CNN. The process in which input data is fed into an already trained network to produce outputs is known as inference. Inference can be performed quickly by any properly trained neural network and may take only a few milliseconds or less, because the underlying mathematical operations in the trained network are simple and are performed in parallel.

To test how well the network is learning during (or after) training, it is necessary to measure the loss on input data that was not part of the training data set. For this, a portion of the training data is separated before the start of training; this is referred to as the validation data. Only when the loss on the validation data decreases in tandem with the loss on the training data can the neural network be said to be learning. If the opposite happens, i.e., the training error decreases while the validation error increases, the network is said to be memorizing, and it will not be able to generalize and produce good results during inference.
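The mini-batch and validation bookkeeping described above can be sketched as follows. This is a minimal illustration assuming NumPy arrays; all array names, shapes, and the size of the held-out split are chosen for illustration rather than taken from the paper:

```python
import numpy as np

def minibatches(inputs, targets, batch_size, rng):
    """Yield shuffled mini-batches, as in mini-batch gradient descent."""
    order = rng.permutation(len(inputs))
    for start in range(0, len(inputs), batch_size):
        idx = order[start:start + batch_size]
        yield inputs[idx], targets[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2, 24))                       # placeholder (days, sources, hours)
Y = rng.integers(0, 2, size=(1000, 8)).astype("float32") # placeholder one-hot labels

# Hold out a slice of the data as validation data before training begins.
n_val = 100  # illustrative split, not the paper's
X_train, Y_train = X[:-n_val], Y[:-n_val]
X_val, Y_val = X[-n_val:], Y[-n_val:]

for x_batch, y_batch in minibatches(X_train, Y_train, batch_size=32, rng=rng):
    pass  # one optimizer step per mini-batch would go here
```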
III. CLASSIFICATION USING CNNs

The various tasks performed by ML algorithms can be broadly classified into two areas, namely classification and regression. CNNs can perform both of these tasks, though classification is the more widely used application. In this work, the CNN is trained to perform a multi-class, multi-label classification task, as illustrated in Fig. 4. Each input sample can have features from multiple classes, but within each class it can be assigned only one label.

Figure 4. A multi-class and multi-label classification task.

This work uses the one-hot encoding concept to represent the labels, whereby each neuron in the output layer corresponds to a unique label [16]. Correspondingly, a sample of the output data used for training is a vector with a size equal to the total number of labels, and the values within this vector may be either 0 or 1. It is possible to force the output of a neuron to take a value of either 0 or 1 using a discrete step function, but this severely limits the learning ability of the network. Instead, it is more advantageous to compute a probability value for each output using a continuously differentiable function like the Sigmoid activation function. The Sigmoid squeezes the output of a neuron to a value between 0 and 1 [17], [18], which can represent the probability of a decision.

A. Classification using class-separated Softmax activation

Probabilities can also be calculated using the Softmax function (4), which takes advantage of the fact that the labels within a class are mutually exclusive:

y_i = e^(x_i) / Σ_{j=1}^{n} e^(x_j)    (4)

where y_i is the probability that the input being classified belongs to the i-th label, x_i is the output of the neuron corresponding to the i-th label, and n is the number of labels within the class. Also,

Σ_{i=1}^{n} y_i = 1    (5)

In this work, there is a separate Softmax activation for each class. Hence, the loss function is calculated separately for each class, and the per-class losses are added together, as in (6):

loss = Σ_{k=1}^{m} Σ_{i=1}^{n_k} xentropy(y_i^k, ŷ_i^k)    (6)

where ŷ_i^k is the neural network output corresponding to the i-th label in the k-th class, y_i^k is the actual output corresponding to the i-th label in the k-th class, m is the number of classes, and n_k is the total number of labels in the k-th class.
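A possible TensorFlow rendering of the class-separated Softmax loss of Equations (4)-(6) is sketched below. The class sizes (3, 3, 2) follow the label layout of Fig. 5 later in this paper; the function name and the use of a batch mean are our own choices, not specified in the original:

```python
import tensorflow as tf

def class_separated_softmax_loss(logits, labels, class_sizes=(3, 3, 2)):
    """Apply a separate Softmax per class (Eq. 4) and sum the
    per-class cross-entropy losses over all classes (Eq. 6)."""
    loss, start = 0.0, 0
    for n_k in class_sizes:
        # Slice out the n_k labels belonging to class k and compute
        # softmax cross-entropy over that slice only.
        loss += tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                labels=labels[:, start:start + n_k],
                logits=logits[:, start:start + n_k]))
        start += n_k
    return loss
```

Because the Softmax is applied per slice, the probabilities within each class sum to 1 as required by Equation (5), while labels in different classes remain independent.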

Both Softmax and Sigmoid activation functions are used separately in this work, and their performances are then compared.

IV. ELECTRIC POWER SYSTEM DATA STREAMS

In an electric power system, the sum of power generated, power consumed, power losses, and energy stored must be equal to zero at every time instant.

A. Training data for the CNN

This paper uses wind power generation and load data from the Midcontinent ISO (MISO) for analysis by a CNN. Each data sample corresponds to the 24-hr wind power generation and actual load for one day; hence, one sample of input data consists of 48 unique generation and load values. In this work, as an example, the CNN is trained to extract 3 different features from the 24-hr data and predict their labels (classification). The features are the mean wind power, the standard deviation of the wind power, and the fraction of total load that is served by wind power generators. The vector corresponding to an output data sample is illustrated in Fig. 5.

Figure 5. Sample of training output: a one-hot label vector spanning three classes, wind power strength (LOW/MED/HIGH), wind power variability (LOW/MED/HIGH), and wind power load share (LOW/HIGH), with MW and percentage thresholds separating the labels within each class.

V. CNN APPLICATION TO POWER SYSTEM DATA STREAM

As with pixel data from images, the CNN processes a 2-D array containing power data without flattening that data. To do this, the data is arranged in the form of a stack of 2-D arrays. The width of the array corresponds to the number of time blocks (24 in this case), and the height of the array corresponds to the number of data sources (wind generation and load in this case). This concept is illustrated in Fig. 6. The size of the 2-D array grows with the number of data sources and time blocks in the problem.

Figure 6. Stacking 2-D arrays of power system generation and load data.

The operations within the CNN that are used to process the 24-hr power data arranged as stacks of 2-D arrays are explained below. The 1st convolutional layer in the CNN processes a 2-D array of size 2 x 24 using a filter of size 2 x CF_W. Some number n of such filters may be used, and each filter produces a feature map of size 1 x FM_W. A non-linear activation function like ReLU is applied to each element of the feature map [19]. A second convolutional layer may be added to work on the feature maps produced by the previous convolutional layer, and there may be m such convolutional layers in total. Using a larger number of layers results in a finer representation of the input data features; however, it also increases the computational time as well as the memory required. The feature maps output by the last convolutional layer are flattened and given as input to an FFNN. Finally, in the output layer, the probabilities of the different classes are calculated using a Softmax or Sigmoid activation.

A. Implementing the CNN computational graph using TensorFlow

In order to train the CNN using the power system data, a computational graph was developed in Google's TensorFlow machine learning library [20], [21]. One of the many advantages of using TensorFlow is the possibility of visualizing the computational graph of the algorithm after coding has been performed. The computational graph used to implement the CNN of this work is shown in Fig. 7, and the details of the layers are given in Table I and Table II. A cross-entropy function [15] is used to calculate the loss, or error, between the CNN output and the actual labels during the training process. The Adam optimization algorithm [22] was used in this work to train the CNN weights. A stochastic gradient descent method was implemented by changing the size of the training batch in each epoch. For comparison, the same data set was also processed using a standard feed forward network (FFNN) with a single hidden layer. The number of neurons in the hidden layer was chosen such that the number of parameters in the FFNN would be roughly equal to the number of parameters in the CNN. The details of the layers in the FFNN are given in Table III.

Figure 7. Computational graph for the CNN, generated using TensorFlow.
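For illustration, a modern tf.keras re-sketch of the computational graph of Fig. 7 is given below, using the layer sizes as reconstructed in Tables I and II. The original work used the TensorFlow 1.x graph API, the ReLU on the fully connected layer is an assumption, and the sigmoid-style binary cross-entropy shown here is only one of the two output configurations the paper compares:

```python
import tensorflow as tf

# Input: one day of data as a 2 x 24 array (sources x hourly blocks),
# treated as a single-channel "image".
inputs = tf.keras.Input(shape=(2, 24, 1))
# Conv layer 1: four 2 x 6 filters -> four 1 x 19 feature maps (Table I).
x = tf.keras.layers.Conv2D(4, (2, 6), activation="relu")(inputs)
# Conv layer 2: four 1 x 6 filters -> four 1 x 14 feature maps.
x = tf.keras.layers.Conv2D(4, (1, 6), activation="relu")(x)
x = tf.keras.layers.Flatten()(x)        # 4 x 14 = 56 inputs (Table II)
x = tf.keras.layers.Dense(8, activation="relu")(x)
outputs = tf.keras.layers.Dense(8)(x)   # logits for the 8 output labels
model = tf.keras.Model(inputs, outputs)

# Sigmoid-output variant; the learning rate follows the value reported
# in Section VI.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-6),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
```

With these layer sizes, model.count_params() matches the roughly 680-parameter budget discussed with Tables I-III below.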

TABLE I. HYPERPARAMETERS - CONVOLUTIONAL LAYERS IN CNN

Layer               | Conv layer 1    | Conv layer 2
Filter size         | 2 x 6           | 1 x 6
Number of filters   | 4               | 4
Feature map size    | 1 x 19          | 1 x 14
Weights             | 2 x 6 x 4 = 48  | 6 x 4 x 4 = 96
Biases              | 4               | 4

TABLE II. HYPERPARAMETERS - FULLY CONNECTED LAYERS IN CNN

Layer                     | Fully connected layer | Output layer
Number of inputs          | 4 x 14 = 56           | 8
Number of neurons/outputs | 8                     | 8
Weights                   | 56 x 8 = 448          | 8 x 8 = 64
Biases                    | 8                     | 8

TABLE III. HYPERPARAMETERS FOR FFNN

Layer                     | First layer    | Output layer
Number of inputs          | 48             | 12
Number of neurons/outputs | 12             | 8
Weights                   | 48 x 12 = 576  | 12 x 8 = 96
Biases                    | 12             | 8

Tables I-III indicate that the CNN has four layers while the FFNN has only two, even though the number of unique parameters (weights and biases) in both is nearly the same (680 for the CNN and 692 for the FFNN). If another power data source were added as an input, the number of weights would increase by 288 for the FFNN; for the CNN, only 24 additional weights would be required.

VI. RESULTS

The training inputs of the CNN were obtained from MISO data from 2015, 2016, and 2017 [23]. Each day of wind power generation and load data yields one sample containing 48 values. The majority of the days were used as training data, with the remaining days set aside as validation data. The learning rate was set at 10^-6 for all epochs. Training and inference were done on an Intel Core i7 PC with 8 GB of RAM. The training time was about 4 hours for the CNN and about 2.5 hours for the standard FFNN. The change in cross-entropy loss with respect to the training epochs is shown in Fig. 8. The improvement in average classification accuracy during training is shown in Fig. 9, and the classification accuracy for each class is given in Fig. 10.

Figure 8. Cross-entropy loss w.r.t. epochs for the CNN, for training and validation data with Sigmoid and Softmax output layers.

Figure 9. Average accuracy w.r.t. epochs for the CNN, for training and validation data with Sigmoid and Softmax output layers.

Figure 10. Accuracy for each of the three classes using Softmax activation, w.r.t. epochs.

Changes in the values of the weights in the 1st convolutional layer as training progresses can be visualized as a histogram, as shown in Fig. 11, using the TensorFlow visualization tool known as TensorBoard. Here the Y-axis represents the epochs, the X-axis is the spread of weight values, and the Z-axis is the number of parameters taking a particular value. The change in the weight values over the epochs indicates that the CNN is learning during the training process.

Figure 11. Evolution of the weights in the 1st convolutional layer w.r.t. epochs, visualized as histograms for Softmax (left) and Sigmoid (right) output layers.

The FFNN was also trained with the same data set and optimization algorithm as the CNN. The changes in cross-entropy loss and average classification accuracy as training progressed are shown in Fig. 12 and Fig. 13, respectively.

Figure 12. Cross-entropy loss w.r.t. epochs for the FFNN.

Figure 13. Average accuracy w.r.t. epochs for the FFNN.
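The histogram logging behind Fig. 11 can be reproduced with TensorBoard's summary API. The sketch below uses the TensorFlow 2.x interface rather than the 1.x summaries available when this paper was written, and the layer indexing assumes the model sketch given earlier:

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/cnn")

def log_conv_weights(model, epoch):
    """Record the 1st convolutional layer's weights once per epoch so that
    TensorBoard can render per-epoch histograms like those in Fig. 11."""
    conv1 = model.layers[1]  # layers[0] is the Input layer in the earlier sketch
    with writer.as_default():
        tf.summary.histogram("conv1/weights", conv1.weights[0], step=epoch)
```

Calling log_conv_weights(model, epoch) at the end of each training epoch and then running tensorboard pointed at the logs directory produces the stacked-histogram view described above.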

TABLE IV. AVERAGE CLASSIFICATION ACCURACY ON VALIDATION DATA FOR THE CNN AND FFNN

Neural network type    | Validation accuracy
CNN (Softmax output)   | 79%
CNN (Sigmoid output)   | 77%
FFNN (Softmax output)  | 78%
FFNN (Sigmoid output)  | 72%

A. Discussion of results

The final classification accuracies obtained are given in Table IV. From the results, it can be observed that training a CNN using 24-hr power data sets is feasible. Since mini-batch gradient descent is used, the plots corresponding to the training data are noisy. The cross-entropy loss as well as the classification accuracy improves for both the training and validation data sets, which indicates that the CNN is generalizing and not just memorizing. For the output layers, the Softmax can be said to be marginally better than the Sigmoid in terms of average classification accuracy. It can also be seen that the trends for loss and accuracy have not entirely plateaued in the case of the CNN. For the FFNN, there is a noticeable difference in performance on the validation data, which would indicate that the FFNN is less effective at generalizing.

VII. CONCLUSION

This work has proposed the use of Machine Learning algorithms such as Convolutional Neural Networks (CNNs) to develop an ISO AI decision system that can aid, or even replace, human operators in efficiently controlling the complex power grids of tomorrow. The operation of the CNN and the concept of feeding the CNN with power data in the form of stacks of 2-D arrays were introduced. The CNN was trained using power data from MISO to perform multi-class, multi-label classification. The utility of TensorFlow for training and analyzing neural networks was also discussed. Up to 90% accuracy was obtained on the training data set, and 79% accuracy on the validation data set was observed using a Softmax classifier. The results underscore the feasibility of using CNNs for big data analysis in power systems. There is still significant scope for improvement by using bigger data sets, experimenting with activation functions such as the Sigmoid or the Exponential Linear Unit (ELU), and fine-tuning the CNN hyperparameters.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097-1105.
[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," Sep. 2014.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," Dec. 2013.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
[5] S. Byford, "Google's AlphaGo AI beats Lee Se-dol again to win Go series 4-1," The Verge. [Online].
[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[7] R. Evans and J. Gao, "DeepMind AI reduces energy used for cooling Google data centers by 40%," Google Green Blog, 2016. [Online].
[8] "Role of ISOs and RTOs," ISO/RTO Resource Council. [Online]. [Accessed: Mar-2017].
[9] M. Sarwar and B. Asad, "A review on future power systems; technologies and research for smart grids," in 2016 International Conference on Emerging Technologies (ICET), 2016.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[11] R. Rizzo, A. Fiannaca, M. La Rosa, and A. Urso, "A deep learning approach to DNA sequence classification," in 12th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, 2016.
[12] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, "Convolutional neural network architectures for predicting DNA-protein binding," Bioinformatics, vol. 32, no. 12, pp. i121-i127, Jun. 2016.
[13] A. Gensler, J. Henze, B. Sick, and N. Raabe, "Deep learning for solar power forecasting: an approach using AutoEncoder and LSTM neural networks," in 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2016.
[14] S. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall, 1999.
[15] P. Golik, P. Doetsch, and H. Ney, "Cross-entropy vs. squared error training: a theoretical and experimental comparison," in Interspeech, 2013.
[16] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 1st ed. O'Reilly Media, Inc., 2017.
[17] "The power of approximating: a comparison of activation functions," in Proceedings of the 5th International Conference on Neural Information Processing Systems, S. J. Hanson, J. D. Cowan, and C. L. Giles, Eds., 1992.
[18] H. N. Mhaskar and C. A. Micchelli, "How to choose an activation function," Morgan Kaufmann Publishers Inc., 1993.
[19] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," May 2015.
[20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," Mar. 2016.
[21] J. Dean and R. Monga, "TensorFlow - Google's latest machine learning system, open sourced for everyone," Google Research Blog, 2015. [Online].
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv, Dec. 2014.
[23] "Market Reports," MISO. [Online]. Available: orts.aspx. [Accessed: 7-Apr-2017].
