Premature ventricular contraction beat detection with deep neural networks


IEEE International Conference on Machine Learning and Applications

Tae Joon Jun, Hyun Ji Park, Nguyen Hoang Minh, Daeyoung Kim
School of Computing, KAIST, Daejeon, Korea
{taejoon89, hyunjip, minhhoang,

Young-Hak Kim
Department of Medicine, University of Ulsan College of Medicine, Ulsan, Korea

Abstract: A deep neural network is proposed for the classification of premature ventricular contraction (PVC) beats, which are irregular heartbeats initiated by the Purkinje fibers rather than by the sinoatrial node. Several machine learning approaches have been proposed for the detection of PVC beats, but they either achieve low classification accuracy or use only a limited portion of the data in existing electrocardiography (ECG) databases. In this paper, we propose an optimized deep neural network for PVC beat classification. Our method is implemented and evaluated with TensorFlow, an open source machine learning platform initially developed by Google. It achieved an overall accuracy of 99.41% and a sensitivity of 96.08% on a total of 80,835 ECG beats, including normal and PVC beats, from the MIT-BIH Arrhythmia Database.

Keywords: Arrhythmia; Premature ventricular contraction; Deep neural network; Machine learning; TensorFlow

I. INTRODUCTION
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, and in developed countries in particular. According to the World Health Organization (WHO), over 17.5 million people died from CVDs in 2012, which is more than 30% of all deaths in the world. However, people with CVDs, or at high risk of CVDs, can be managed through early detection of the disease and appropriate treatment. The detection of CVDs for prevention and treatment is therefore a significant task in the medical health domain. There are several types of CVDs, including heart valve problems, arrhythmia, and coronary artery disease.
Detection of premature ventricular contractions (PVCs) is of particular interest among arrhythmias. A PVC, also known as a ventricular premature beat (VPB), is an irregular heartbeat that originates in the ventricles, whereas a normal beat is initiated by the sinoatrial node. Single PVC beats are commonly found in healthy people and usually pass without any symptoms. However, consecutive PVC beats occasionally develop into ventricular tachycardia (VT), a potentially fatal arrhythmia. Even worse, PVC beats can develop into ventricular fibrillation (VF), which can lead to immediate death. Detection of PVC beats is therefore important for patients with CVDs.

Various techniques have been proposed for PVC beat detection over the past several years. Machine learning algorithms such as hidden Markov models [1], [2], principal component analysis [3], support vector machines [4], [5], [6], K-nearest neighbor classification [7], [8], Bayesian filters [9], and neural networks [10], [11] have been applied to the classification of arrhythmia beats, including PVCs. Although these approaches have produced promising results, they have several limitations. First, the vast majority of them were applied only to a small subset of an ECG database. The MIT-BIH Arrhythmia database contains more than 80,000 normal and PVC beats, yet most previous studies used only a small subset of these data. Understandably, such methods reached extremely high accuracy, often over 99%. We recognize that they are appropriate for patient-specific PVC detection, for example with Holter recordings. However, we do not believe that they suit the current direction of machine learning, especially Big data analysis. Second, some methods did use the entire MIT-BIH database for classification, but they obtained relatively poor accuracy and sensitivity.
In this paper, we propose a highly accurate strategy for classifying PVC beats using a deep neural network that contains multiple hidden layers and takes 6 input features extracted from ECG signals. The proposed network is carefully optimized with respect to the selection of the pre-trained dataset, the input feature combination, the K value in K-fold cross validation, the hidden layer architecture, the regularization methods, the number of training steps, the learning rate, the activation function, and the optimizer function. The proposed strategy achieved an overall accuracy of 99.41% and a sensitivity of 96.08% on a total of 80,835 ECG beats, including normal and PVC beats, from the MIT-BIH Arrhythmia database. As far as we know, our PVC detection method is the most accurate strategy among those that use the data of all patients.

The rest of this paper is organized as follows. Section 2 presents a detailed explanation of the proposed classification strategy. The experimental evaluation and results are shown in Sections 3 and 4. The paper closes with a discussion and a conclusion.

II. METHODOLOGY
Our PVC detection strategy is composed of two main phases: feature extraction and beat classification. Fig. 1 presents the flow diagram of the proposed PVC classification.

Fig. 1. PVC detection flow diagram

The purpose of the feature extraction phase is to detect the pattern of a single heartbeat in the sequential ECG signal. In general, a combination of three characteristic waves (Q, R, and S), called the QRS complex, is detected. The QRS complex contains important features that can be used to separate different heartbeat types. In our detection method, we adopted the well-known Pan-Tompkins algorithm [12] to detect 4 features: R-peak amplitude (R-peak), RR interval time (RR), QRS duration time (QRS), and ventricular activation time (VAT). In addition, we extracted two more significant features, the Q-peak and S-peak amplitudes. Fig. 2 shows the 6 features extracted from the ECG signal.

Fig. 2. Extracted features from the ECG signal

The 6 features extracted in the first phase are used as inputs to the beat classification phase. We built a deep neural network with an input layer of 6 nodes, multiple hidden layers, and an output layer of 2 nodes.

A. Feature Extraction
Before the features are extracted from the sequential ECG signal, the Pan-Tompkins algorithm performs 4 major QRS complex detection steps: band-pass filtering, five-point derivation, squaring, and moving window integration. We added a 5th step, peak-point detection, after the moving window integration step to detect the Q-peak and S-peak.

1) Band-pass filter: The sequential ECG signal contains several kinds of noise, such as muscle noise, power line interference, and baseline wander. Band-pass filtering removes this noise with a combination of a low-pass filter and a high-pass filter. The transfer function of the low-pass filter is:

$$H(z) = \frac{(1 - z^{-6})^2}{(1 - z^{-1})^2} \tag{1}$$

With T the sampling period of the ECG signal, the amplitude response of the low-pass filter is:

$$|H(\omega T)| = \frac{\sin^2(3\omega T)}{\sin^2(\omega T/2)} \tag{2}$$

The difference equation that removes frequencies higher than 11 Hz is:

$$y(nT) = 2y(nT - T) - y(nT - 2T) + x(nT) - 2x(nT - 6T) + x(nT - 12T) \tag{3}$$

After the ECG signal passes through the low-pass filter, it enters the high-pass filter, whose transfer function is:

$$H(z) = \frac{-1 + 32z^{-16} + z^{-32}}{1 + z^{-1}} \tag{4}$$

As in equation (2), the amplitude response of the high-pass filter is:

$$|H(\omega T)| = \frac{[256 + \sin^2(16\omega T)]^{1/2}}{\cos(\omega T/2)} \tag{5}$$

The difference equation that removes frequencies lower than 5 Hz is:

$$y(nT) = 32x(nT - 16T) - [y(nT - T) + x(nT) - x(nT - 32T)] \tag{6}$$

2) Five-point derivation: Subsequently, the five-point derivation is carried out to calculate the gradient of the filtered ECG signal. The transfer function of the five-point derivation is:

$$H(z) = \frac{1}{8T}\left(-z^{-2} - 2z^{-1} + 2z^{1} + z^{2}\right) \tag{7}$$

The amplitude response and the difference equation of the derivation are:

$$|H(\omega T)| = \frac{1}{4T}\left[\sin(2\omega T) + 2\sin(\omega T)\right] \tag{8}$$

$$y(nT) = \frac{1}{8T}\left[-x(nT - 2T) - 2x(nT - T) + 2x(nT + T) + x(nT + 2T)\right] \tag{9}$$
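The low-pass difference equation (3) and the five-point derivative (9) can be tried out directly. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, zero initial conditions are assumed, samples outside the record are treated as zero, and the high-pass stage is left out.

```python
def low_pass(x):
    """Low-pass difference equation (3); zero initial conditions are assumed."""
    y = []
    for n in range(len(x)):
        y_1 = y[n - 1] if n >= 1 else 0.0
        y_2 = y[n - 2] if n >= 2 else 0.0
        x_6 = x[n - 6] if n >= 6 else 0.0
        x_12 = x[n - 12] if n >= 12 else 0.0
        y.append(2.0 * y_1 - y_2 + x[n] - 2.0 * x_6 + x_12)
    return y


def five_point_derivative(x, T=1.0):
    """Five-point derivative (9); samples outside the record count as zero (assumption)."""
    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return [
        (-at(n - 2) - 2.0 * at(n - 1) + 2.0 * at(n + 1) + at(n + 2)) / (8.0 * T)
        for n in range(len(x))
    ]


# Impulse response of the low-pass stage: a triangle of 11 non-zero samples.
impulse = [1.0] + [0.0] * 15
print(low_pass(impulse))

# On a unit-slope ramp the derivative is 1 away from the record edges.
ramp = [float(n) for n in range(7)]
print(five_point_derivative(ramp)[2:5])  # [1.0, 1.0, 1.0]
```

The triangular impulse response follows from equation (1), which is the square of a six-sample moving sum; the edge samples of the derivative are distorted by the zero-padding assumption and would normally be discarded.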

3) Squaring: After the five-point derivation, the QRS waves of the ECG signal are roughly detected. In the squaring phase, every sample of the signal becomes non-negative and the QRS wave is accentuated, since its frequencies are higher than those of the rest of the signal. The difference equation for squaring is:

$$y(nT) = [x(nT)]^2 \tag{10}$$

4) Moving window integration: Moving window integration is the last phase of the Pan-Tompkins algorithm. Through the integration, we can detect the starting point of the Q wave, the peak point of the R wave, and the end point of the S wave. With N the number of samples in the moving window, the integration is:

$$y(nT) = \frac{1}{N}\left[x(nT - (N-1)T) + x(nT - (N-2)T) + \cdots + x(nT)\right] \tag{11}$$

5) Peak-point detection: After the moving window integration phase, three points of the QRS complex have been detected: the Q wave starting time, the R wave peak time, and the S wave end time. Detection of the Q-peak and S-peak uses these points as time stamps. When the window reaches the Q wave starting time during the moving window integration step, we set the Q-flag to true. From the Q wave starting point to the R wave peak point, the minimum amplitude of the ECG signal and its time are stored. Note that this amplitude is taken from the filtered ECG signal, not from the raw signal. When the window arrives at the R wave peak time, we retrieve the stored values, which are the Q-peak amplitude and time. After the R wave peak time, we set the Q-flag to false and reset the stored minimum value and time. Repeating the same procedure, we obtain the S-peak amplitude and time between the R wave peak point and the S wave end point.

B. Beat Classification
Deep learning and deep neural networks are the main machine learning techniques considered for the proposed PVC beat classification.
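Before turning to the classifier, the squaring, moving window integration, and peak-point search steps of the feature extraction phase can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, samples before the start of the record count as zero, and the search interval is taken to include both end points.

```python
def square(x):
    """Squaring step, equation (10)."""
    return [v * v for v in x]


def moving_window_integration(x, N):
    """Equation (11): mean of the current sample and the N - 1 preceding ones."""
    y = []
    acc = 0.0
    for n, v in enumerate(x):
        acc += v
        if n >= N:
            acc -= x[n - N]  # drop the sample that left the window
        y.append(acc / N)
    return y


def min_between(filtered, start, stop):
    """Minimum amplitude and its index in filtered[start..stop], end points included."""
    idx = min(range(start, stop + 1), key=lambda i: filtered[i])
    return filtered[idx], idx


print(moving_window_integration([3.0, 3.0, 3.0, 3.0], N=3))  # [1.0, 2.0, 3.0, 3.0]

# Toy filtered beat with hypothetical fiducial points: Q onset 1, R peak 4, S end 5.
beat = [0.0, -0.2, -0.5, 0.1, 1.0, -0.7]
print(min_between(beat, 1, 4))  # Q-peak: (-0.5, 2)
print(min_between(beat, 4, 5))  # S-peak: (-0.7, 5)
```

The same `min_between` helper serves both peaks because the text describes the S-peak search as a repetition of the Q-peak search over a different interval.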
The main drawback of traditional neural networks was that, as the number of hidden layers and nodes increased, the model easily fell into a local minimum. In 2006, Hinton et al. largely alleviated this problem by pre-training each layer with unsupervised learning [13]. After Hinton's approach, deep neural networks came to prominence together with the Big data era and general-purpose computing on graphics processing units (GPGPU). The optimization of the proposed deep neural network considered several parameters: the K value in K-fold cross validation, the combination of the 6 features, the hidden layer architecture, the Xavier weight initialization method, the regularization methods, the number of training steps, the learning rate, the activation function type, and the optimizer function type.

1) Data normalization: Before starting the optimization itself, we normalized the extracted feature data with Min-Max normalization as a pre-processing step:

$$x' = \frac{x - \min}{\max - \min} \tag{12}$$

2) Feature combination: The 6 extracted features represent the key characteristics of the ECG beat type. To find the best feature set, we formed candidate input sets of four to six features drawn from the 6 extracted features. This is the most important step in the optimization, since domain experts classify ECG beat types by checking distinguishing points of the signal, which are exactly these features. As the evaluation results show later, the choice of input features decisively affects the overall PVC detection accuracy.

3) K-fold cross validation: K-fold cross validation is a model validation technique that gives a more reliable estimate of accuracy when the data set is relatively small. A data set of N samples is divided into K subsets. We hold out the first subset as the test set and train the model on the remaining K - 1 subsets.
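The hold-one-subset-out split just described can be sketched as below. This is a minimal sketch: the function name is hypothetical, and assigning sample i to subset i mod K is an assumption made here for illustration.

```python
def k_fold_splits(n_samples, k):
    """Return one (train_indices, test_indices) pair per fold.

    Assumption for this sketch: sample i belongs to subset i % k.
    """
    splits = []
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        splits.append((train, test))
    return splits


splits = k_fold_splits(n_samples=10, k=3)
train, test = splits[0]  # the first subset is held out as the test set
print(test)   # [0, 3, 6, 9]
print(train)  # [1, 2, 4, 5, 7, 8]
```

Every sample appears in exactly one test set, so each beat is used for testing once and for training K - 1 times.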
After this step is repeated with each of the K subsets as the test set, the overall accuracy of the model is calculated as the average accuracy over the K test sets. Fig. 3 describes the overall procedure of K-fold cross validation. The optimization required in this step is to find the K value that minimizes over-fitting while still giving a reasonable average accuracy.

Fig. 3. K-fold cross validation

4) Hidden layer architecture: After the input features and the K value are decided, the hidden layer architecture must be constructed. In this step, we consider the number of hidden layers and the number of nodes in each layer. Starting from a single hidden layer, we kept increasing and modifying the number of hidden layers and nodes until the overall accuracy saturated. As a result, we constructed 6 hidden layers with L2 regularization and dropout with a dropout rate of 0.9 [14]. Fig. 4 briefly shows our 6 hidden layer architecture.

Fig. 4. Hidden layer DNN architecture

5) Xavier initialization: Glorot and Bengio introduced an empirical initialization method for weight matrices, commonly called Xavier initialization [15]. It suggests that, for the sigmoid and tanh activation functions, a model can achieve better accuracy and faster convergence if the weights are initialized uniformly at random in the range [-x, x], where

$$x = \sqrt{\frac{6}{In + Out}} \tag{13}$$

Here In is the number of input units of the weight matrix and Out is the number of output units.

6) Regularization: The purpose of regularization in neural networks is to prevent over-fitting by limiting the capacity of the network. We applied L2 regularization to the weight vectors but not to the bias vectors. When the number of hidden layers reached 5, our model fell badly into local minima. To prevent both over-fitting and the local minimum problem, we applied the dropout technique. The selection of the dropout rate and of the layers to which dropout is applied was important in this phase. In particular, dropout should not be applied to the first hidden layer, since our input set consists of at most 6 features. When units were dropped in the first layer, the overall accuracy of PVC detection decreased dramatically.

7) Training steps: In general, the number of training steps affects over-fitting in neural networks. If the number of training steps is too small, training terminates before the model has learned enough, which results in poor overall accuracy. In contrast, if the number of training steps is exceedingly large, the model over-fits, which also results in relatively low accuracy. A popular way to find the best number of training steps is to use a validation set: when the accuracy on the validation set starts to decrease, training is stopped and the model is applied to the test set. However, our empirical results suggest that a well-optimized neural network does not easily over-fit even as the number of training steps increases. Therefore, we selected several typical numbers of training steps in the evaluation to check the relationship between classification performance and the number of training steps.

8) Learning rate: We initialized the learning rate to 0.01, since we observed empirically that this value works reasonably well.
However, we also concluded that the initial learning rate itself does not strongly affect performance: even with a large initial learning rate, the cost minimization can be managed by applying an exponential decay function that decreases the learning rate as training continues. As a result, we set the initial learning rate to 0.01 and the exponential decay rate to 0.95, applied every 10,000 training steps.

9) Activation function: The activation function defines the output of a node for given inputs. Unlike traditional neural networks, deep learning models normally use non-linear activation functions to keep the inputs of nodes from growing without bound. We compared several non-linear activation functions in this work.

a) Sigmoid: The sigmoid function is a typical non-linear activation function in neural networks and is a special case of the logistic function.

$$\mathrm{sigmoid}(t) = \frac{1}{1 + e^{-t}} \tag{14}$$

b) Hyperbolic tangent: The hyperbolic tangent (tanh) function is frequently used when negative outputs of the activation function are required. Unlike the sigmoid function, the hyperbolic tangent maps its inputs to values between -1 and 1.

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{1 - e^{-2x}}{1 + e^{-2x}} \tag{15}$$

c) Rectified linear unit: The rectified linear unit (ReLU) is the most popular activation function in deep neural networks, especially in convolutional neural networks (CNNs).

$$\mathrm{relu}(x) = \max(0, x) \tag{16}$$

d) Exponential linear unit: Clevert et al. introduced the exponential linear unit (ELU) in [16]. ELU reduces the bias shift effect by allowing negative values, while performing well compared with ReLU.

$$\mathrm{elu}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(\exp(x) - 1) & \text{if } x \le 0 \end{cases} \tag{17}$$

e) Softplus: The softplus function is a smooth approximation of the ReLU. Since the ReLU is not differentiable at zero, softplus follows the shape of the ReLU while having a derivative at every point.
$$\mathrm{softplus}(x) = \ln(1 + e^{x}) \tag{18}$$

f) Softsign: The softsign function belongs to the same family as the sigmoid and hyperbolic tangent functions. It keeps the mean activation near zero while saturating less easily than those functions.

$$\mathrm{softsign}(x) = \frac{x}{1 + |x|} \tag{19}$$

10) Optimizer function: The optimizer function minimizes the cost, that is, the difference between the actual label values and the estimated values. After calculating the softmax cross entropy between the estimated values and the labels, we compared the accuracy obtained with different optimizer functions: gradient descent, Adam [17], Adagrad [18], and RMSProp [19].
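The scalar formulas of this section are easy to check numerically. The sketch below writes out equations (13) to (19) and the learning-rate schedule in plain Python. It is illustrative only: the function names are hypothetical, alpha = 1.0 is an assumed default for ELU, and the decay is assumed to be applied in discrete jumps every 10,000 steps (a smooth decay would use true division instead of integer division).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))                            # eq. (14)


def tanh(x):
    return math.tanh(x)                                          # eq. (15)


def relu(x):
    return max(0.0, x)                                           # eq. (16)


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)           # eq. (17)


def softplus(x):
    return math.log1p(math.exp(x))                               # eq. (18)


def softsign(x):
    return x / (1.0 + abs(x))                                    # eq. (19)


def xavier_limit(n_in, n_out):
    """Bound x of the uniform range [-x, x] in eq. (13)."""
    return math.sqrt(6.0 / (n_in + n_out))


def decayed_learning_rate(step, initial=0.01, decay_rate=0.95, decay_steps=10000):
    """Staircase exponential decay (assumption: one decay per full 10,000 steps)."""
    return initial * decay_rate ** (step // decay_steps)


for f in (sigmoid, tanh, relu, elu, softplus, softsign):
    print(f.__name__, [round(f(v), 4) for v in (-2.0, 0.0, 2.0)])

print(round(xavier_limit(6, 10), 4))     # 6 inputs feeding a 10-node layer
print(decayed_learning_rate(100000))     # learning rate after 100,000 steps
```

Softsign and tanh both map to (-1, 1), but softsign approaches its limits polynomially rather than exponentially, which is the slower saturation mentioned above.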

III. EVALUATION
The proposed PVC classification was evaluated to validate its overall accuracy and sensitivity under the various parameters introduced in Section II. The measures used to describe accuracy and sensitivity were:
a) True Positive (TP): correctly detected as PVC
b) True Negative (TN): correctly detected as normal
c) False Positive (FP): incorrectly detected as PVC
d) False Negative (FN): incorrectly detected as normal

Using these measures, the overall accuracy and sensitivity are calculated as:

$$\mathrm{Accuracy}(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{20}$$

$$\mathrm{Sensitivity}(\%) = \frac{TP}{TP + FN} \times 100 \tag{21}$$

The raw ECG signal was obtained from the MIT-BIH Arrhythmia database, which contains 109,494 ECG beats in total. During the feature extraction phase, however, several beats were lost or not recognized as ECG beats. We detected 108,022 ECG beats from the database using the Pan-Tompkins algorithm, which corresponds to a QRS complex detection accuracy of 98.66%. After the 6 features were extracted, we excluded the ECG beats that are neither normal nor PVC. As a result, we retrieved 80,835 ECG beats, comprising 74,157 normal beats and 6,678 PVC beats. Next, we separated the entire data set into a training set and a test set. The data were distributed uniformly between the sets by using the divisor and remainder. The deep neural network was implemented in TensorFlow [20] with GPGPU support. Since we applied K-fold cross validation, the total training time is about K times that of a single training run. Therefore, we used 4 NVIDIA K20m GPUs to accelerate training.

IV. RESULT
This section provides the accuracy and sensitivity results for the optimization steps described in Section II. The experiments proceeded in four steps:

A. Experiment 1
In this experiment, we optimized a simple deep neural network with 3 hidden layers, using K values from 3 to 20 and different feature combinations, to find the best feature set and K value.
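The inputs compared in this experiment are the full six-feature set and its six five-feature subsets, each normalized with equation (12). A minimal sketch of that input preparation is given below. The helper names are hypothetical, the order in which the subsets are produced is simply the one `itertools.combinations` yields, and mapping a constant column to zeros is an assumption made here to avoid division by zero.

```python
from itertools import combinations

FEATURES = ["Rpeak", "RR", "QRS", "VAT", "Qpeak", "Speak"]


def min_max_normalize(column):
    """Equation (12), applied to one feature column."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # assumption: constant columns map to 0
    return [(v - lo) / (hi - lo) for v in column]


def feature_combinations(features, sizes=(6, 5)):
    """All feature subsets of the given sizes, largest size first."""
    combos = []
    for size in sizes:
        combos.extend(combinations(features, size))
    return combos


def select_features(rows, chosen, features=FEATURES):
    """Keep only the chosen feature columns of each row."""
    cols = [features.index(name) for name in chosen]
    return [[row[c] for c in cols] for row in rows]


combos = feature_combinations(FEATURES)
print(len(combos))                          # 7
print(min_max_normalize([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
print(select_features([[1, 2, 3, 4, 5, 6]], ("RR", "Speak")))  # [[2, 6]]
```

Passing `sizes=(6, 5, 4)` would extend the search to the four-feature sets mentioned in the methodology, at the cost of 15 additional candidates.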
The set with all 6 features is named FC1, and the sets with 5 features are named FC2 to FC7. Since the overall accuracy was around 99% in every experiment, we present only the sensitivity results for Experiment 1.

TABLE I. FEATURE COMBINATIONS
Name   Features
FC1    Rpeak, RR, QRS, VAT, Qpeak, Speak
FC2    Rpeak, RR, QRS, VAT, Qpeak
FC3    Rpeak, QRS, VAT, Qpeak, Speak
FC4    Rpeak, RR, VAT, Qpeak, Speak
FC5    Rpeak, RR, QRS, Qpeak, Speak
FC6    Rpeak, RR, QRS, VAT, Speak
FC7    RR, QRS, VAT, Qpeak, Speak

[TABLE II. EXPERIMENT 1 RESULT: SENSITIVITY (%), by K value and feature combination FC1 to FC7]

B. Experiment 2
In the previous experiment, K = 8 with FC1 gave the best result. In this experiment, we optimized the deep neural network with different numbers of hidden layers and different numbers of nodes in each layer. Starting from a single hidden layer, we increased the number of layers until it reached 4. Although the accuracy and sensitivity kept increasing, training often fell into a local minimum. From 5 hidden layers onward, most training runs ended in a local minimum, which results in a sensitivity near zero. Therefore, starting from 5 hidden layers, we applied L2 regularization to the weight matrices and dropout with a rate of 0.9 to the nodes. For the 3 hidden layer case in particular, we evaluated three different hidden layer architectures with the same total number of nodes: a diamond shape, an increasing triangle shape, and a decreasing triangle shape. Fig. 5 shows these architectural shapes. Based on the experimental results, we concluded that the diamond shape works better than the other two.

[TABLE III. EXPERIMENT 2 RESULT: Acc (%) and Se (%) for architectures L1 to L7]

Fig. 5. Hidden Layer Architectural Shapes
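The Acc (%) and Se (%) values reported in the result tables are computed from the four confusion counts defined in the evaluation section. A minimal sketch follows, with sensitivity in its standard form TP / (TP + FN). The confusion counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy in percent."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Sensitivity in percent: share of true PVC beats that were detected."""
    return 100.0 * tp / (tp + fn)


# Hypothetical confusion counts, for illustration only.
tp, tn, fp, fn = 90, 880, 20, 10
print(accuracy(tp, tn, fp, fn))  # 97.0
print(sensitivity(tp, fn))       # 90.0

# QRS detection rate quoted in the evaluation: 108,022 of 109,494 beats.
print(round(100.0 * 108022 / 109494, 2))  # 98.66
```

Because PVC beats make up less than a tenth of the data set, accuracy alone can stay high even when many PVC beats are missed, which is why sensitivity is reported alongside it.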

C. Experiment 3
In Experiment 2, the best result was obtained with 6 hidden layers, using L2 regularization and a dropout rate of 0.9 on every hidden layer except the first. The numbers of nodes in the 6 hidden layers are 10 x 30 x 50 x 100 x 40 x 20. In the third experiment, we compared the results for different numbers of training steps to find the minimum number of training steps required to obtain reasonable results. Fig. 6 displays the accuracy and sensitivity for different numbers of training steps.

Fig. 6. Experiment 3 Result

D. Experiment 4
In this experiment, we compared the results of different activation and optimizer functions to tune the neural network obtained from the previous experiments.

[TABLE IV. EXPERIMENT 4 RESULT, ACTIVATION: Acc (%) and Se (%) for Softplus, ReLU, ELU, Tanh, Softsign, Sigmoid]

[TABLE V. EXPERIMENT 4 RESULT, OPTIMIZER: Acc (%) and Se (%) for Adam, Adagrad, RMSProp, GradientDescent]

V. DISCUSSION
From the experiments, we achieved an overall accuracy of 99.41% and a sensitivity of 96.08%. To achieve this, we used Min-Max normalization, L2 regularization, dropout with a rate of 0.9, a 6 hidden layer DNN with 100,000 training steps, an initial learning rate of 0.01 with an exponential decay of 0.95 every 10,000 training steps, Xavier weight initialization, the Softsign activation function, and the Adam optimizer function. As far as we know, our proposed classification method provides the highest accuracy and sensitivity on the MIT-BIH Arrhythmia database. Although some related studies achieved higher results with a partial set of the database or with a patient-specific approach, we believe that using the entire data set for training and testing is more suitable for recent machine learning applications, since they require very large amounts of training data.

VI. CONCLUSION
In this paper, we proposed an optimized deep neural network for PVC beat classification. Our method achieved an overall accuracy of 99.41% and a sensitivity of 96.08%.
We believe our deep neural network model can be applied to other medical health databases, and we look forward to evaluating it on them in future work.

REFERENCES
[1] W. T. Cheng and K. L. Chan, "Classification of electrocardiogram using hidden Markov models," in Proc. 20th Annu. Int. Conf. IEEE EMBS, 1998, vol. 20.
[2] D. A. Coast, R. M. Stern, G. G. Cano, and S. A. Briller, "An approach to cardiac arrhythmia analysis using hidden Markov models," IEEE Trans. Biomed. Eng., vol. 37, no. 9, Sep.
[3] G. B. Moody and R. G. Mark, "QRS morphology representation and noise estimation using the Karhunen-Loève transform," in Proc. Comput. Cardiol., 1989.
[4] F. Melgani and Y. Bazi, "Classification of electrocardiogram signals with support vector machines and particle swarm optimization," IEEE Transactions on Information Technology in Biomedicine, vol. 12, no. 5, 2008.
[5] S. Faziludeen and P. V. Sabiq, "ECG beat classification using wavelets and SVM," in Information & Communication Technologies (ICT), 2013 IEEE Conference on. IEEE.
[6] N. Nuryani, I. Yahya, and A. Lestari, "Premature ventricular contraction detection using swarm-based support vector machine and QRS wave features," International Journal of Biomedical Engineering and Technology, vol. 16, no. 4, 2014.
[7] I. Christov, I. Jekova, and G. Bortolan, "Premature ventricular contraction classification by the Kth nearest-neighbours rule," Physiol. Meas., vol. 26.
[8] J. Park, K. Lee, and K. Kang, "Arrhythmia detection from heartbeat using k-nearest neighbor classifier," in Bioinformatics and Biomedicine (BIBM), 2013 IEEE International Conference on. IEEE.
[9] O. Sayadi, M. B. Shamsollahi, and G. D. Clifford, "Robust detection of premature ventricular contractions using a wave-based Bayesian framework," IEEE Transactions on Biomedical Engineering, vol. 57, no. 2, 2010.
[10] J.-S. Wang et al., "ECG arrhythmia classification using a probabilistic neural network with a feature reduction method," Neurocomputing, vol. 116, 2013.
[11] M. Javadi et al., "Classification of ECG arrhythmia by a modular neural network based on mixture of experts and negatively correlated learning," Biomedical Signal Processing and Control, vol. 8, no. 3, 2013.
[12] J. Pan and W. J. Tompkins, "A real-time QRS detection algorithm," IEEE Transactions on Biomedical Engineering, no. 3, 1985.
[13] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, 2006.
[14] N. Srivastava et al., "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[15] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS.
[16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint, 2015.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[18] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, Jul. 2011.
[19] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
[20] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
