Bagging-Based Logistic Regression With Spark: A Medical Data Mining Method


2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016)

Bagging-Based Logistic Regression With Spark: A Medical Data Mining Method

Jian Pan 1,a,*, Yiang Hua 2,b, Xingtian Liu 3,c, Zhiqiang Chen 3,d, Zhaofeng Yan 2,e

1 Zhijiang College of Zhejiang University of Technology, Shaoxing 312030, China
2 Jianxing Honors College, Zhejiang University of Technology, Hangzhou 310023, China
3 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

a pj@zjut.edu.cn, b hya994@qq.com, c lt47208@qq.com, d frudechen@qq.com, e yanzf27@163.com, * corresponding author

Keywords: Medical Data Mining; Bagging; Logistic Regression; Spark

Abstract. Medical data in various organizational forms is voluminous and heterogeneous, so it is important to utilize efficient data mining techniques to explore the development rules of diverse diseases. However, many single-node data analysis tools lack sufficient memory and computing power; distributed and parallel computing is therefore in great demand. In this paper, we propose a comprehensive medical data mining method consisting of data preprocessing and bagging-based logistic regression with Spark (the BLR algorithm), which is improved for better compatibility with Spark, a fast parallel computing framework. Experimental results indicated that although the BLR algorithm took slightly longer than logistic regression (LR), it was 2.2% higher than LR in accuracy and outperformed LR on the other common evaluation indexes.

Introduction

With the rapid development of modern medicine, the real data in diverse organizational forms collected from patients is increasing. Various structured and unstructured data is voluminous and heterogeneous [1]. It is highly meaningful and valuable for disease prognosis, diagnosis and treatment to utilize a variety of data mining techniques to explore the development rules and the correlations of diseases and to discover the actual effect of treatments.
There are many data analysis tools such as WEKA and SPSS; their biggest weakness, however, is that they can only run on a single node. When the dataset is large enough, this is a challenge for single-node tools with limited memory and computing power. Therefore, distributed and parallel computing has been frequently used. MapReduce and Spark are two of the most popular parallel computing frameworks. In comparison with MapReduce, Spark not only raises processing speed and real-time performance but also achieves high fault tolerance and high scalability based on in-memory computing [2]. In this paper, we propose a comprehensive medical data mining method comprising mainly two steps: 1) data preprocessing; 2) bagging-based logistic regression with Spark (the BLR algorithm). Initially, the raw data was normalized by z-score standardization in view of the heterogeneity of medical data. The raw data in CSV format was subsequently converted into LIBSVM format for loading into an RDD (Resilient Distributed Dataset), the memory-based data object in Spark. The BLR algorithm is based on bagging and logistic regression and is improved for better compatibility with Spark. To elaborate, the predictions produced by each logistic regression classifier were integrated into the final prediction. Experimental results indicated that although the BLR algorithm took slightly longer than logistic regression (LR), it outperformed LR on common evaluation indexes.

© 2016. The authors - Published by Atlantis Press
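The preprocessing pipeline described above (z-score standardization followed by conversion from CSV rows to LIBSVM format) can be sketched in plain Python. This is an illustrative, simplified version, not the code used in the paper, and the sample values are made up:

```python
# Sketch of the preprocessing step: standardize each feature column to
# zero mean and unit variance, then render each instance as a LIBSVM line
# ("label index:value ..."), the input format Spark MLlib loaders expect.

def z_score(column):
    """Standardize a list of values to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance column
    return [(v - mean) / std for v in column]

def to_libsvm(label, features):
    """Render one instance as a LIBSVM line; feature indices are 1-based."""
    pairs = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"

rows = [(1, [14.2, 20.1]), (0, [11.8, 15.3]), (1, [19.6, 25.0])]
cols = list(zip(*(feats for _, feats in rows)))   # column-wise view
std_cols = [z_score(list(c)) for c in cols]
std_rows = list(zip(*std_cols))                   # back to row-wise
lines = [to_libsvm(lbl, feats) for (lbl, _), feats in zip(rows, std_rows)]
```

In Spark itself the resulting file would then be read with an MLlib utility such as loadLibSVMFile into an RDD of labeled points.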

Related Work

A number of medical data mining methods have been proposed in past decades. Iavindrasana et al. summarized different types of data mining algorithms, including frequent itemset mining, classification and clustering, and noted that the data analysis methodology must be suitable for the corresponding medical data [3]. Harper compared four classification algorithms, comprising discriminant analysis, regression models (multiple and logistic), decision trees and artificial neural networks, and concluded that there is not necessarily a single best algorithm; the best performing algorithm depends on the features of the dataset [4]. Taking the dataset we used into consideration, all values of the features are real and continuous, so logistic regression is more suitable than the other methods. All the above works were completed in a single-node environment. However, as one of the fastest parallel computing frameworks, Spark implements common machine learning algorithms in its MLlib [5]. Qiu et al. proposed a parallel frequent itemset mining algorithm with Spark (YAFIM); their experimental results indicated that YAFIM outperformed its MapReduce counterpart by around 25 times in a real-world medical application [6]. Compared to the above methods, the BLR algorithm we propose has two distinctions: 1) it achieves parallel computing over RDDs and takes advantage of memory for iterative computations; 2) it is based on bagging and logistic regression and is improved for better compatibility with Spark. Experimental results proved that the BLR algorithm outperformed LR on common evaluation indexes.

Data Preprocessing

We used the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which is available from the UCI machine learning repository [7]. The dataset has 32 attributes (1 class label and 30 features; the ID number is excluded) and 569 instances. It has two class labels (Malignant, Benign) for binary classification.
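The WDBC records just described can be turned into (label, features) pairs with a small parser. The sketch below is illustrative (not the paper's code); the sample row is truncated to three features for brevity, while real WDBC rows carry all 30:

```python
# Hypothetical parser for one WDBC CSV record (ID, diagnosis, 30 features).
# The ID is dropped, as in the paper, and the diagnosis is encoded as a
# binary label: M (malignant) -> 1, B (benign) -> 0.

def parse_wdbc(line):
    fields = line.strip().split(",")
    label = 1 if fields[1] == "M" else 0        # diagnosis column
    features = [float(v) for v in fields[2:]]   # fields[0] (ID) is excluded
    return label, features

label, feats = parse_wdbc("842302,M,17.99,10.38,122.8")  # truncated sample row
```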
All values of the features are real and continuous, which indicates that logistic regression is more appropriate than a decision tree or the Apriori algorithm, which prefer discrete values. Fallahi et al. excluded the instances containing missing data and used SMOTE to correct data unbalancing [8]. Different datasets should be preprocessed in different ways on the basis of their data features. On the one hand, the WDBC dataset has no missing values, so there is no need to exclude any instance. On the other hand, the positive class (212 Malignant) and the negative class (357 Benign) are in a relatively balanced state.

Bagging-Based Logistic Regression With Spark

We propose bagging-based logistic regression with Spark (the BLR algorithm) on the basis of bagging and logistic regression. Spark MLlib has implemented logistic regression, and its experimental results proved that logistic regression in Spark MLlib is up to 100 times faster than its MapReduce counterpart [5].

Logistic Regression in Spark MLlib. The logistic regression model can be expressed as

p_i = P(y_i = 1 \mid x_{1i}, x_{2i}, \ldots, x_{Mi}) = \frac{\exp(\alpha + \sum_{m=1}^{M} \beta_m x_{mi})}{1 + \exp(\alpha + \sum_{m=1}^{M} \beta_m x_{mi})}    (1)

where p_i is the probability of the ith event given the values of the independent variables x_{1i}, x_{2i}, ..., x_{Mi}. If p_i is larger than the threshold we set, the ith event is classified into the positive class label, and vice versa.
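A minimal numeric sketch of Eq. (1), with made-up coefficient values for illustration:

```python
import math

def logistic_prob(alpha, beta, x):
    """Eq. (1): P(y = 1 | x) for intercept alpha and coefficients beta."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical fitted values, not taken from the paper:
p = logistic_prob(alpha=-1.0, beta=[0.8, -0.5], x=[2.0, 1.0])
# classify as the positive class when p exceeds the chosen threshold (e.g. 0.5)
```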

For computing the classification prediction, the coefficients β_m must be estimated through optimization. Spark MLlib provides two optimizers: Stochastic Gradient Descent (SGD) and limited-memory BFGS (L-BFGS). Ngiam et al. held that SGD has two weaknesses: 1) it needs a lot of manual tuning of optimization parameters, e.g., convergence criteria and learning rates; 2) it is hard to parallelize on clusters. L-BFGS is able to simplify and significantly speed up the training of deep learning algorithms [9]. The main step of the L-BFGS method can be expressed as

\beta_{k+1} = \beta_k - \alpha_k H_k g_k    (2)

where β_k and β_{k+1} are the estimated coefficients at iterations k and k+1, α_k is the step size, g_k is the gradient, and H_k is the approximation of the inverse Hessian matrix. L-BFGS only stores the pairs {y_j, s_j} in memory to update the Hessian approximation [10]. It is precisely because L-BFGS requires little memory and performs well under limited memory that we chose it as our optimizer. It should also be noted that we used L2 regularization to prevent overfitting.

The BLR Algorithm. Bagging-based logistic regression with Spark builds on bagging and logistic regression. Bagging is an effective method to improve classification accuracy. By making bootstrap replicates of the training set with replacement, it generates multiple versions of a classifier with equal weight and aggregates them into one predictor through a plurality vote [11]. In the implementation with Spark, the training set was randomly sampled five times with replacement, so five versions of logistic regression classifiers were generated. The aggregated predictor computes the final prediction of one instance by integrating all predictions from the five versions of logistic regression classifiers.
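The two ingredients just described — bootstrap resampling with replacement and the vote-based aggregation of the five classifiers' probabilities — can be sketched in plain Python. This is not the authors' Spark/Scala implementation, only one plausible reading of the aggregation rule given in the following equations:

```python
import random

def bootstrap_replicates(training_set, k=5, seed=42):
    """Draw k bootstrap samples with replacement, each of the original size."""
    rng = random.Random(seed)
    n = len(training_set)
    return [[training_set[rng.randrange(n)] for _ in range(n)]
            for _ in range(k)]

def aggregate(preds, threshold=0.5):
    """Combine the five per-classifier probabilities into one prediction p."""
    n1 = sum(1 for p in preds if p >= threshold)      # positive-class votes
    half_avg = sum(preds) / (2 * len(preds))          # lies in [0, 0.5]
    return 0.5 + half_avg if n1 >= 3 else half_avg    # plurality vote

data = list(range(100))                  # stand-in for (label, features) tuples
replicates = bootstrap_replicates(data)  # five bags, one per base classifier
p = aggregate([0.9, 0.8, 0.7, 0.2, 0.4])  # three positive votes
```

Shifting the halved average probability by 0.5 when the positive votes win guarantees that the final p lands on the same side of the 0.5 threshold as the plurality vote.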
For one test tuple, let p_i denote the prediction of the ith classifier, n_1 the number of positive class labels from the five classifiers, and n_0 the number of negative class labels; thus

0 \le p_i \le 1.0    (3)

Assume that the threshold is 0.5: if p_i \ge 0.5, n_1 increases by one; otherwise, n_0 increases by one. The plurality vote is accomplished when either n_1 or n_0 reaches 3. Consider that

0 \le \frac{1}{2} \cdot \frac{\sum_{i=1}^{5} p_i}{5} \le 0.5    (4)

The final prediction p can then be expressed as

p = \begin{cases} 0.5 + \frac{1}{2} \cdot \frac{\sum_{i=1}^{5} p_i}{5}, & n_1 > n_0 \\ \frac{1}{2} \cdot \frac{\sum_{i=1}^{5} p_i}{5}, & n_1 < n_0 \end{cases}    (5)

A test tuple is classified as the positive class if and only if p \ge 0.5, and vice versa. Spark improves its fault tolerance by Lineage, which records the coarse-grained transformations applied to an RDD. Once some partitions of an RDD are lost, Spark can get enough information through Lineage to redo the operations and recover the lost partitions. The Lineage graph for the RDDs in the BLR algorithm is shown in Figure 1. The training set was stored as an RDD and was converted into bootstrap samples through the operation randomSplit(weights). Five logistic regression models were then generated on the basis of the bootstrap samples. Finally, the models output the predictions and the labels for every test tuple. All data stored as RDDs can be recovered by Lineage.

Performance Evaluation

Experimental Environment. The BLR algorithm and LR were implemented in Spark 1.2.0 with Scala 2.10.4. The Hadoop version was 2.2. We made our experiments on a cluster

consisting of 5 nodes, as shown in Table 1. The computing nodes all ran Ubuntu 14.04 and JDK 1.8.

Experiments and Performance Evaluation Results. In order to obtain more reliable results, for both the training set and the test set we used bootstrap to sample the dataset randomly with replacement.

Figure 1 Lineage Graph for the RDDs in BLR (Training Set → randomSplit(weights) → Bootstrap → LogisticRegressionWithLBFGS().run(Bootstrap) → LogisticRegressionModel → LogisticRegressionModel.predict(Bootstrap.map(_.features)) → Predictions and Labels)

Table 1 The Status of the Cluster

Node      Memory  Cores
Master    2G      1
Worker 1  1G      1
Worker 2  1G      1
Worker 3  1G      1
Worker 4  1G      1

It should be noted that the numbers of instances in different test sets may not be equal but keep a relatively constant proportion. Whenever a tuple is chosen, it may be selected again, which guarantees the hybridity of the data and the reliability of the results. Using the two aforementioned algorithms, for each training set two classification models were learned automatically in Spark. Once a classification model was produced, each test tuple was classified based on its feature values. The whole procedure was repeated 5 times for the two algorithms, with 100 iterations in L-BFGS. Table 2 shows the classification results, i.e., the confusion matrixes.

Table 2 Confusion Matrixes of Two Algorithms by Bootstrap (one confusion matrix of LR and one of BLR for each of the 5 bootstrap runs)

As shown in the confusion matrixes, for either LR or BLR, the sum of TP and TN is always far greater than the sum of FP and FN, which qualitatively indicates that both algorithms have high classification accuracy. More to the point, the quantity of FP and FN of BLR is much less than that of LR. To compare the two algorithms quantitatively, we computed the averages of 5 common evaluation indexes, including accuracy, sensitivity, specificity, precision and recall, based on the confusion matrixes and present them in Figure 2.

Figure 2 Common Evaluation Indexes of LR and BLR

As illustrated in Figure 2, although the evaluation indexes of both algorithms are over 95%, BLR has higher values than LR on all five indexes. To be more specific, BLR is 2.2%, 3.5%, 1.4% and 2.4% higher than LR in accuracy, sensitivity (recall), specificity and precision respectively. Another indicator for evaluating classification performance is the Receiver Operating Characteristic (ROC) curve. The accuracy of a classification algorithm is higher when the area under the curve (AUC) is closer to 1.0. We used Spark MLlib to obtain the discrete points (FPR, TPR) and fit the ROC curve. Since the AUC is so close to 1.0, we focus on the key part of the ROC curve, shown in Figure 3. The AUC of BLR is approximately 0.998 and that of LR is about 0.9732. The ROC curve of BLR is clearly above that of LR. The results show better performance for the BLR algorithm than for LR.

Figure 3 The Key Part of the ROC Curve of LR and BLR
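The five indexes reported in Figure 2 are all derived from a confusion matrix. A minimal sketch of how they are computed (the counts below are made up for illustration, not taken from Table 2):

```python
def indexes(tp, fn, fp, tn):
    """Five common evaluation indexes from one binary confusion matrix."""
    return {
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),   # identical to recall
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
    }

m = indexes(tp=50, fn=2, fp=3, tn=60)   # hypothetical counts
```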

We also recorded the average duration of the two algorithms. The average durations of LR and BLR increase simultaneously with the number of iterations. By comparison, for each number of iterations, the average duration of BLR is slightly higher than that of LR.

Conclusion

Medical data in various organizational forms is voluminous and heterogeneous, so it is meaningful and significant to utilize efficient data mining techniques to explore the development rules and the correlations of diverse diseases and to discover the actual effect of treatments. However, this is a challenge for single-node data analysis tools with limited memory and computing power; distributed and parallel computing is therefore in great demand. In this paper, we propose a comprehensive medical data mining method consisting of mainly two steps: 1) data preprocessing; 2) bagging-based logistic regression with Spark (the BLR algorithm). Initially, the raw data was normalized by z-score standardization in view of the heterogeneity of medical data. The raw data in CSV format was subsequently converted into LIBSVM format for loading into an RDD. The BLR algorithm is based on bagging and logistic regression and is improved for better compatibility with Spark. To elaborate, by making bootstrap replicates of the training set with replacement, five versions of logistic regression classifiers were generated. The aggregated predictor computes the final prediction of one instance by integrating all predictions from the five classifiers. Experimental results indicated that although the BLR algorithm took slightly longer than logistic regression (LR), it was 2.2%, 3.5%, 1.4% and 2.4% higher than LR in accuracy, sensitivity (recall), specificity and precision respectively.

Acknowledgements

This work is supported by the Department of Science and Technology, Zhejiang Provincial People's Government (No. 2016C33073) and the Zhejiang Xinmiao Talent Grants (No. 2015R403040).

References

[1] Cios K J, Moore G W.
Uniqueness of medical data mining[J]. Artificial Intelligence in Medicine, 2002, 26(1): 1-24.

[2] Spark[OL]. http://spark.apache.org/

[3] Iavindrasana J, Cohen G, Depeursinge A, et al. Clinical data mining: a review[J]. Yearb Med Inform, 2009: 121-133.

[4] Harper P R. A review and comparison of classification algorithms for medical decision making[J]. Health Policy, 2005, 71(3): 315-331.

[5] Spark MLlib[OL]. http://spark.apache.org/mllib/

[6] Qiu H, Gu R, Yuan C, et al. YAFIM: A parallel frequent itemset mining algorithm with Spark[C]. Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE, 2014: 1664-1671.

[7] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[8] Fallahi A, Jafari S. An expert system for detection of breast cancer using data preprocessing and Bayesian network[J]. Int J Adv Sci Technol, 2011, 34: 65-70.

[9] Ngiam J, Coates A, Lahiri A, et al. On optimization methods for deep learning[C]. Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011: 265-272.

[10] Liu D C, Nocedal J. On the limited memory BFGS method for large scale optimization[J]. Mathematical Programming, 1989, 45(1-3): 503-528.

[11] Breiman L. Bagging predictors[J]. Machine Learning, 1996, 24(2): 123-140.